Benchmarking (velot.benchmark)

class velot.benchmark.BenchmarkTimer

Bases: object

Context manager to time pipeline stages.

start(stage)

Parameters:
  • stage (str)

stop()

property total: float

summary()

Return type:

Dict[str, float]
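The interface above (start, stop, total, summary) can be illustrated with a small stand-in class. This is only a sketch: StageTimer is a hypothetical name, and the behaviour shown (starting a stage closes the previous one, repeated stages accumulate) is an assumption rather than velot's actual implementation.

```python
import time
from typing import Dict, Optional


class StageTimer:
    """Hypothetical stand-in mirroring the documented BenchmarkTimer interface."""

    def __init__(self) -> None:
        self._durations: Dict[str, float] = {}
        self._current: Optional[str] = None
        self._t0 = 0.0

    def __enter__(self) -> "StageTimer":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Close any stage that is still running when the block exits.
        self.stop()

    def start(self, stage: str) -> None:
        # Assumption: starting a new stage implicitly ends the previous one.
        self.stop()
        self._current = stage
        self._t0 = time.perf_counter()

    def stop(self) -> None:
        if self._current is None:
            return
        elapsed = time.perf_counter() - self._t0
        # Assumption: repeated stage names accumulate their durations.
        self._durations[self._current] = self._durations.get(self._current, 0.0) + elapsed
        self._current = None

    @property
    def total(self) -> float:
        # Total time is the sum over all recorded stages.
        return sum(self._durations.values())

    def summary(self) -> Dict[str, float]:
        # Stage name -> seconds, in the order the stages were first started.
        return dict(self._durations)


with StageTimer() as timer:
    timer.start("preprocess")
    sum(range(10_000))  # placeholder workload
    timer.start("fit")
    sum(range(10_000))  # placeholder workload

print(timer.summary())
```

With the real class, the same pattern would presumably apply: call start once per pipeline stage, then pass the timer to save_benchmark.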

velot.benchmark.save_benchmark(adata, results, timer, model_name, dataset_name, output_dir='benchmark_results', extra_info=None)

Save benchmark results for one (model, dataset) run.

Parameters:
  • results (Dict[str, Any]) – Output of velot.metrics.summary().

  • timer (BenchmarkTimer) – BenchmarkTimer with timing information.

  • model_name (str) – Name of the model (e.g., "scvelo_dynamical", "velot").

  • dataset_name (str) – Name of the dataset (e.g., "pancreas").

  • output_dir (str) – Directory to save results.

  • extra_info (Optional[Dict]) – Any additional metadata.

  • adata (AnnData)

Returns:

Path to the saved JSON file.
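To make the "one JSON file per (model, dataset) run" idea concrete, here is a rough sketch using only the standard library. The helper name write_run_record, the record keys, and the file naming scheme are all assumptions made for illustration; the actual layout written by save_benchmark may differ, and the adata argument is left out entirely.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional


def write_run_record(
    results: Dict[str, Any],
    timings: Dict[str, float],
    model_name: str,
    dataset_name: str,
    output_dir: str = "benchmark_results",
    extra_info: Optional[Dict[str, Any]] = None,
) -> Path:
    """Hypothetical helper: write one JSON file for a (model, dataset) run."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Record keys are assumptions for this sketch, not velot's schema.
    record = {
        "model": model_name,
        "dataset": dataset_name,
        "timing": timings,
        "results": results,
        "extra_info": extra_info or {},
    }
    # File naming is also an assumption.
    path = out_dir / f"{model_name}__{dataset_name}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = write_run_record(
        results={"cbdir_mean": 0.5},  # made-up value
        timings={"fit": 1.2},         # made-up value
        model_name="velot",
        dataset_name="pancreas",
        output_dir=tmp,
    )
    saved_name = saved.name
    loaded = json.loads(saved.read_text(encoding="utf-8"))

print(saved_name)
```

Keeping each run in its own file means a directory of results can later be filtered by model or dataset without rereading unrelated runs.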

velot.benchmark.load_benchmarks(output_dir='benchmark_results', models=None, datasets=None)

Load benchmark summaries: one row per (model, dataset).

Extracts scalar metrics (*_mean) and timing information.

Returns:

DataFrame with columns: model, dataset, time_*, metric_mean, …

velot.benchmark.load_benchmarks_per_group(output_dir='benchmark_results', models=None, datasets=None, metric=None)

Load per-cell metric values for boxplots.

Parses the nested structure in the JSON where each metric (e.g., "cbdir", "iccoh") contains a dict of group → array-of-values. Groups can be edges ("Fev+ Alpha") or clusters ("Alpha").

Parameters:
  • output_dir (str) – Directory containing JSON result files.

  • models (Optional[List[str]]) – Filter to specific models. None for all.

  • datasets (Optional[List[str]]) – Filter to specific datasets. None for all.

  • metric (Optional[str]) – Specific metric to load (e.g., "cbdir"). If None, loads all metrics that contain per-cell arrays.

Return type:

DataFrame

Returns:

Long-format DataFrame with columns: model, dataset, metric, group, value. Each row is one cell’s value for one (model, dataset, metric, group).
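The flattening described above (metric → group → array of values, turned into one row per cell) might look roughly like the following. The function name to_long_rows and the sample values are hypothetical, and the sketch works on an already-loaded dict rather than on velot's JSON files.

```python
from typing import Any, Dict, List, Sequence


def to_long_rows(
    model: str,
    dataset: str,
    per_group: Dict[str, Dict[str, Sequence[float]]],
) -> List[Dict[str, Any]]:
    """Hypothetical helper: expand metric -> group -> values into long-format rows."""
    rows: List[Dict[str, Any]] = []
    for metric, groups in per_group.items():
        for group, values in groups.items():
            for value in values:
                # One row per cell value for this (model, dataset, metric, group).
                rows.append(
                    {
                        "model": model,
                        "dataset": dataset,
                        "metric": metric,
                        "group": group,
                        "value": float(value),
                    }
                )
    return rows


# Made-up values: "cbdir" keyed by an edge, "iccoh" keyed by a cluster.
per_group = {
    "cbdir": {"Fev+ Alpha": [0.8, 0.6]},
    "iccoh": {"Alpha": [0.9, 0.7, 0.5]},
}
rows = to_long_rows("velot", "pancreas", per_group)
print(len(rows))
```

A list of dicts like this can be passed directly to pandas.DataFrame to obtain the long-format table used for boxplots.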

velot.benchmark.load_benchmarks_per_group_summary(output_dir='benchmark_results', models=None, datasets=None)

Load one row per (model, dataset, metric, group) with summary stats.

Useful for bar charts or compact comparisons where per-cell resolution is not needed.

Returns:

DataFrame with columns: model, dataset, metric, group, mean, median, std, n
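As a rough illustration of the per-group summary, the sketch below collapses long-format rows into one record per (model, dataset, metric, group). The helper name summarize_groups is hypothetical, the input values are made up, and using the sample standard deviation is an assumption; the library may compute std differently.

```python
import statistics
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def summarize_groups(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: one summary record per (model, dataset, metric, group)."""
    buckets: Dict[Tuple[str, str, str, str], List[float]] = defaultdict(list)
    for r in rows:
        key = (r["model"], r["dataset"], r["metric"], r["group"])
        buckets[key].append(r["value"])

    out = []
    for (model, dataset, metric, group), values in buckets.items():
        out.append(
            {
                "model": model,
                "dataset": dataset,
                "metric": metric,
                "group": group,
                "mean": statistics.fmean(values),
                "median": statistics.median(values),
                # Assumption: sample standard deviation; NaN for a single value.
                "std": statistics.stdev(values) if len(values) > 1 else float("nan"),
                "n": len(values),
            }
        )
    return out


# Made-up per-cell values for a single group.
rows = [
    {"model": "velot", "dataset": "pancreas", "metric": "cbdir", "group": "Fev+ Alpha", "value": v}
    for v in (0.2, 0.4, 0.6)
]
summary = summarize_groups(rows)
print(summary[0]["n"], round(summary[0]["mean"], 3))
```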

velot.benchmark.benchmark_comparison(output_dir='benchmark_results', models=None, models_order=None, datasets=None, datasets_order=None, metrics=None, detail='aggregated', reference_model='velot', show_significance=True, show_timing=True, figsize_per_panel=(5, 4), show=True, save=None)

Compare benchmark results across models and datasets.

Parameters:
  • output_dir (str) – Directory with saved JSON benchmark files.

  • models (Optional[List[str]]) – Which models to include. None for all.

  • datasets (Optional[List[str]]) – Which datasets to include. None for all.

  • metrics (Optional[List[str]]) – Which metrics to plot (e.g., ["cbdir", "iccoh"]). None to auto-detect all metrics with per-cell data.

  • detail (str) –

    Level of detail for the metric panels:

    • "aggregated" (default): one boxplot per model, pooling all groups and all datasets.

    • "per_dataset": one boxplot per (model, dataset), grouped by dataset, colored by model.

    • "per_group": one boxplot per (model, dataset, group), every edge/cluster visible, organized by dataset.

  • reference_model (Optional[str]) – Model to compare others against for significance testing. Set to None to disable significance brackets entirely.

  • show_significance (bool) – Whether to draw significance brackets. Only applies when reference_model is not None.

  • show_timing (bool) – Whether to include a timing comparison panel.

  • figsize_per_panel (tuple) – Size of each subplot.

  • show (bool) – Display the plot.

  • save (Optional[str]) – Path to save.

  • models_order (List[str] | None)

  • datasets_order (List[str] | None)

Return type:

matplotlib Figure.

Examples

Default — pooled, with significance vs velot:

velot.pl.benchmark_comparison("benchmark_results")

Without significance:

velot.pl.benchmark_comparison(
    "benchmark_results", show_significance=False,
)

Compare against a different reference:

velot.pl.benchmark_comparison(
    "benchmark_results", reference_model="scvelo_dynamical",
)

Per dataset:

velot.pl.benchmark_comparison(
    "benchmark_results", detail="per_dataset",
)

Full detail:

velot.pl.benchmark_comparison(
    "benchmark_results", detail="per_group",
    figsize_per_panel=(12, 4),
)

velot.benchmark.benchmark_comparison_individual(output_dir='benchmark_results', models=None, models_order=None, datasets=None, datasets_order=None, metrics=None, detail='aggregated', reference_model='velot', show_significance=True, show_timing=True, figsize=(6, 4), ylim_timing=None, save=False, save_path=None, save_prefix='benchmark_boxplots', save_legend=False, show=True)

velot.benchmark.benchmark_summary_table(output_dir='benchmark_results', models=None, datasets=None, metrics=None, reference_model='velot', detail='aggregated', save_csv=None)

Build a summary DataFrame with statistics and p-values.

Parameters:
  • output_dir (str) – Directory containing JSON result files.

  • models (Optional[List[str]]) – Filter to specific models. None for all.

  • datasets (Optional[List[str]]) – Filter to specific datasets. None for all.

  • metrics (Optional[List[str]]) – Which metrics to include. None for all.

  • reference_model (str) – Model to compare others against. P-values are computed between this model and each other model.

  • detail (str) –

    Aggregation level:

    • "aggregated": one row per (model, metric), pooling all datasets and groups.

    • "per_dataset": one row per (model, dataset, metric), pooling groups within each dataset.

    • "per_group": one row per (model, dataset, group, metric), no aggregation.

  • save_csv (Optional[str]) – Path to save the DataFrame as CSV. None to skip saving.

Return type:

DataFrame

Returns:

DataFrame with columns including mean, median, std, n, and p-value vs the reference model.

Examples

Quick overview:

df = velot.benchmark.benchmark_summary_table("benchmark_results")
print(df)

Per dataset with CSV export:

df = velot.benchmark.benchmark_summary_table(
    "benchmark_results",
    detail="per_dataset",
    save_csv="benchmark_results/summary_per_dataset.csv",
)

Full detail:

df = velot.benchmark.benchmark_summary_table(
    "benchmark_results",
    detail="per_group",
)

velot.benchmark.benchmark_heatmaps(output_dir='benchmark_results', models=None, datasets=None, metrics=None, stat='mean', annot_fmt='.3f', cmap='viridis', figsize_per_panel=(5, 3), show=True, save=None)

Heatmap of model × dataset for each metric.

Parameters:
  • output_dir (str) – Directory with saved JSON benchmark files.

  • models (Optional[List[str]]) – Which models to include. None for all.

  • datasets (Optional[List[str]]) – Which datasets to include. None for all.

  • metrics (Optional[List[str]]) – Which metrics to plot. None for all.

  • stat (str) –

    Statistic to display in each cell:

    • "mean": mean of per-cell values.

    • "median": median of per-cell values.

    • "rank": rank of each model (1 = best), averaged across all groups (edges/clusters) within each dataset.

  • annot_fmt (str) – Format string for cell annotations.

  • cmap (str) – Colormap name. For rank, this is automatically reversed (lower rank = better = greener).

  • figsize_per_panel (tuple) – Size per metric panel.

  • show (bool) – Display the plot.

  • save (Optional[str]) – Path to save.

Return type:

matplotlib Figure.

Examples

Mean performance heatmap:

velot.pl.benchmark_heatmaps("benchmark_results", stat="mean")

Rank heatmap:

velot.pl.benchmark_heatmaps("benchmark_results", stat="rank")

Specific metrics:

velot.pl.benchmark_heatmaps(
    "benchmark_results", metrics=["cbdir"], stat="median",
)

velot.benchmark.benchmark_ranking(output_dir='benchmark_results', models=None, models_order=None, datasets=None, metrics=None, include_timing=True, figsize=(8, 5), show=True, save=None)

Overall ranking summary: average rank across all datasets and groups.

For each (dataset, group, metric) combination, models are ranked 1 to N (1 = best). Ranks are then averaged to produce a single score per model. Lower is better.

Optionally includes execution time as a ranked criterion.

Parameters:
  • output_dir (str) – Directory with saved JSON benchmark files.

  • models (Optional[List[str]]) – Which models to include. None for all.

  • datasets (Optional[List[str]]) – Which datasets to include. None for all.

  • metrics (Optional[List[str]]) – Which metrics to include. None for all.

  • include_timing (bool) – Whether to include execution time as a ranking criterion.

  • figsize (tuple) – Figure size.

  • show (bool) – Display the plot.

  • save (Optional[str]) – Path to save.

  • models_order (List[str] | None)

Return type:

matplotlib Figure.

Examples

velot.pl.benchmark_ranking("benchmark_results")
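
The ranking rule described for this function (rank models 1 to N within each (dataset, group, metric) combination, then average the ranks per model) can be sketched in plain Python. The helper name average_ranks and the scores are made up, and the tie handling (ties broken by input order) is an assumption; velot may treat ties differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (dataset, group, metric)


def average_ranks(
    scores: Dict[Key, Dict[str, float]],
    higher_is_better: bool = True,
) -> Dict[str, float]:
    """Hypothetical helper: mean rank per model across all combinations."""
    collected: Dict[str, List[int]] = defaultdict(list)
    for per_model in scores.values():
        # Best model first; assumption: ties are broken by input order.
        ordered = sorted(per_model, key=per_model.get, reverse=higher_is_better)
        for rank, model in enumerate(ordered, start=1):
            collected[model].append(rank)
    # Lower average rank is better.
    return {model: sum(ranks) / len(ranks) for model, ranks in collected.items()}


# Made-up scores for two models over three combinations.
scores = {
    ("pancreas", "Alpha", "iccoh"): {"velot": 0.9, "scvelo_dynamical": 0.7},
    ("pancreas", "Fev+ Alpha", "cbdir"): {"velot": 0.5, "scvelo_dynamical": 0.6},
    ("pancreas", "Beta", "iccoh"): {"velot": 0.8, "scvelo_dynamical": 0.4},
}
avg = average_ranks(scores)
print(avg)
```

In this toy input, "velot" is ranked 1, 2, 1 and "scvelo_dynamical" is ranked 2, 1, 2, so the averages are 4/3 and 5/3 respectively.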

velot.benchmark.benchmark_dotplot(output_dir='benchmark_results', models=None, datasets=None, datasets_order=None, metrics=None, include_timing=True, normalize_mode='global', stat='mean', metric_colors=None, metric_labels=None, model_labels=None, higher_is_better=None, figsize=None, legend=True, summary=True, show=True, save=None)

Integrated dotplot benchmark summary.

Circle area reflects per-column normalised score (for visual comparison only). Displayed values match the heatmap values for the chosen stat.

Parameters:
  • output_dir (str) – Directory with saved JSON benchmark files.

  • models (Optional[List[str]]) – Which models to include. None for all.

  • datasets (Optional[List[str]]) – Which datasets to include. None for all.

  • metrics (Optional[List[str]]) – Which metrics to include. None for all found.

  • include_timing (bool) – Whether to include execution time as a metric group.

  • stat (str) –

    Aggregation statistic — same as in benchmark_heatmaps:

    • "mean": mean of per-cell values per (model, dataset).

    • "median": median of per-cell values.

    • "rank": average rank across groups within each dataset (1 = best). Consistent with heatmap stat="rank".

  • metric_colors (Optional[Dict[str, str]]) – Dict mapping metric name → color.

  • metric_labels (Optional[Dict[str, str]]) – Dict mapping metric name → display label.

  • model_labels (Optional[Dict[str, str]]) – Dict mapping model name → display label.

  • higher_is_better (Optional[Dict[str, bool]]) – Dict mapping metric name → bool. Only used for normalisation direction and stat="rank". Default True for accuracy metrics, False for time.

  • figsize (Optional[tuple]) – Figure size. Auto-computed if None.

  • show (bool) – Display the plot.

  • save (Optional[str]) – Path to save.

  • datasets_order (List[str] | None)

  • normalize_mode (str)

  • legend (bool)

  • summary (bool)

Return type:

matplotlib Figure.
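
One way to picture the per-column normalised score that drives circle area, and the role of higher_is_better, is a min-max rescaling of each column that is flipped when lower values are better. This is only a sketch under that assumption: the function name normalise_column, the model names, and the values are hypothetical, and the documentation does not state which normalisation velot applies or what normalize_mode changes.

```python
from typing import Dict


def normalise_column(
    values: Dict[str, float],
    higher_is_better: bool = True,
) -> Dict[str, float]:
    """Hypothetical helper: min-max scale one column to [0, 1], 1 = best."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All models tie in this column; assumption: give everyone full size.
        return {model: 1.0 for model in values}
    scaled = {model: (v - lo) / (hi - lo) for model, v in values.items()}
    if not higher_is_better:
        # Flip so that the smallest raw value (e.g. shortest runtime) scores 1.
        scaled = {model: 1.0 - s for model, s in scaled.items()}
    return scaled


# Made-up columns: an accuracy-like metric and a runtime in seconds.
accuracy = normalise_column({"model_a": 0.8, "model_b": 0.6, "model_c": 0.7})
runtime = normalise_column(
    {"model_a": 10.0, "model_b": 30.0, "model_c": 20.0},
    higher_is_better=False,
)
print(accuracy, runtime)
```

Under this scheme the normalised score only affects dot size, while the value printed for each dot would remain the raw statistic, consistent with the note that circle area is for visual comparison only.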