Benchmarking (`velot.benchmark`)

class velot.benchmark.BenchmarkTimer[source]

Bases: object

Context manager to time pipeline stages.

start(stage)[source]

Parameters: stage (str)

stop()[source]

property total: float

summary()[source]

Return type: Dict[str, float]
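The interface above (start, stop, total, summary) can be illustrated with a small stand-in. This is a minimal sketch, not velot's implementation: the class name StageTimer, the implicit stop when a new stage starts, and the behaviour on leaving the with-block are all assumptions made for illustration.

```python
import time
from typing import Dict, Optional


class StageTimer:
    """Hypothetical stand-in that mirrors the documented BenchmarkTimer interface."""

    def __init__(self) -> None:
        self._durations: Dict[str, float] = {}
        self._stage: Optional[str] = None
        self._t0: float = 0.0

    def __enter__(self) -> "StageTimer":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Assumption: leaving the block closes whichever stage is still running.
        self.stop()

    def start(self, stage: str) -> None:
        # Assumption: starting a new stage implicitly stops the previous one.
        if self._stage is not None:
            self.stop()
        self._stage = stage
        self._t0 = time.perf_counter()

    def stop(self) -> None:
        if self._stage is None:
            return
        elapsed = time.perf_counter() - self._t0
        # Accumulate so the same stage can be timed in several bursts.
        self._durations[self._stage] = self._durations.get(self._stage, 0.0) + elapsed
        self._stage = None

    @property
    def total(self) -> float:
        return sum(self._durations.values())

    def summary(self) -> Dict[str, float]:
        return dict(self._durations)


with StageTimer() as timer:
    timer.start("preprocess")
    time.sleep(0.01)
    timer.start("fit")
    time.sleep(0.01)

stage_times = timer.summary()
```

Here stage_times maps each stage name to its elapsed seconds, and timer.total is their sum.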

velot.benchmark.save_benchmark(adata, results, timer, model_name, dataset_name, output_dir='benchmark_results', extra_info=None)[source]

Save benchmark results for one (model, dataset) run.

Parameters:

results (Dict[str, Any]) – Output of velot.metrics.summary().
timer (BenchmarkTimer) – BenchmarkTimer with timing information.
model_name (str) – Name of the model (e.g., "scvelo_dynamical", "velot").
dataset_name (str) – Name of the dataset (e.g., "pancreas").
output_dir (str) – Directory to save results.
extra_info (Optional[Dict]) – Any additional metadata.
adata (AnnData)

Returns:

Path to the saved JSON file.

velot.benchmark.load_benchmarks(output_dir='benchmark_results', models=None, datasets=None)[source]

Load benchmark summaries: one row per (model, dataset).

Extracts scalar metrics (*_mean) and timing information.

Returns:

DataFrame with columns model, dataset, time_*, metric_mean, …

Return type:

DataFrame

Parameters:

output_dir (str)
models (List[str] | None)
datasets (List[str] | None)

velot.benchmark.load_benchmarks_per_group(output_dir='benchmark_results', models=None, datasets=None, metric=None)[source]

Load per-cell metric values for boxplots.

Parses the nested structure in the JSON where each metric (e.g., "cbdir", "iccoh") contains a dict of group → array-of-values. Groups can be edges ("Fev+ → Alpha") or clusters ("Alpha").

Parameters:

output_dir (str) – Directory containing JSON result files.
models (Optional[List[str]]) – Filter to specific models. None for all.
datasets (Optional[List[str]]) – Filter to specific datasets. None for all.
metric (Optional[str]) – Specific metric to load (e.g., "cbdir"). If None, loads all metrics that contain per-cell arrays.

Return type:

DataFrame

Returns:

Long-format DataFrame with columns model, dataset, metric, group, value. Each row is one cell’s value for one (model, dataset, metric, group).
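The flattening described above, from a nested metric → group → array structure to long-format rows, can be sketched with the standard library. The JSON field names ("model", "dataset", "metrics") and the helper to_long_rows are hypothetical; the actual file schema written by save_benchmark is not documented here.

```python
import json
from typing import Any, Dict, List

# Hypothetical payload: field names are illustrative, not velot's schema.
raw = json.loads("""
{
  "model": "velot",
  "dataset": "pancreas",
  "metrics": {
    "cbdir": {"Fev+ -> Alpha": [0.8, 0.6], "Fev+ -> Beta": [0.7]},
    "iccoh": {"Alpha": [0.9, 0.95]},
    "cbdir_mean": 0.7
  }
}
""")


def to_long_rows(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emit one row per cell value for each (metric, group) pair."""
    rows: List[Dict[str, Any]] = []
    for metric, groups in result["metrics"].items():
        if not isinstance(groups, dict):
            continue  # scalar entries such as *_mean carry no per-cell values
        for group, values in groups.items():
            for value in values:
                rows.append({
                    "model": result["model"],
                    "dataset": result["dataset"],
                    "metric": metric,
                    "group": group,
                    "value": value,
                })
    return rows


rows = to_long_rows(raw)
```

A list of dicts in this shape can be passed directly to pandas.DataFrame to obtain the long-format table.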

velot.benchmark.load_benchmarks_per_group_summary(output_dir='benchmark_results', models=None, datasets=None)[source]

Load one row per (model, dataset, metric, group) with summary stats.

Useful for bar charts or compact comparisons where per-cell resolution is not needed.

Returns:

DataFrame with columns model, dataset, metric, group, mean, median, std, n

Return type:

DataFrame

Parameters:

output_dir (str)
models (List[str] | None)
datasets (List[str] | None)
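The summary columns (mean, median, std, n) can be derived from long-format rows as in the sketch below. The input rows are made up for illustration, and the choice of the sample standard deviation is an assumption; the documentation does not say which estimator the function uses.

```python
import statistics
from collections import defaultdict

# Hypothetical long-format rows (model, dataset, metric, group, value).
long_rows = [
    {"model": "velot", "dataset": "pancreas", "metric": "cbdir", "group": "Alpha", "value": 0.8},
    {"model": "velot", "dataset": "pancreas", "metric": "cbdir", "group": "Alpha", "value": 0.6},
    {"model": "velot", "dataset": "pancreas", "metric": "cbdir", "group": "Alpha", "value": 0.7},
    {"model": "velot", "dataset": "pancreas", "metric": "cbdir", "group": "Beta", "value": 0.5},
]

# Collect per-cell values under their (model, dataset, metric, group) key.
grouped = defaultdict(list)
for row in long_rows:
    key = (row["model"], row["dataset"], row["metric"], row["group"])
    grouped[key].append(row["value"])

summary_rows = []
for (model, dataset, metric, group), values in grouped.items():
    summary_rows.append({
        "model": model,
        "dataset": dataset,
        "metric": metric,
        "group": group,
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Assumption: sample standard deviation; undefined for a single value.
        "std": statistics.stdev(values) if len(values) > 1 else float("nan"),
        "n": len(values),
    })
```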

velot.benchmark.benchmark_comparison(output_dir='benchmark_results', models=None, models_order=None, datasets=None, datasets_order=None, metrics=None, detail='aggregated', reference_model='velot', show_significance=True, show_timing=True, figsize_per_panel=(5, 4), show=True, save=None)[source]

Compare benchmark results across models and datasets.

Parameters:

output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to plot (e.g., ["cbdir", "iccoh"]). None to auto-detect all metrics with per-cell data.
detail (str) –
Level of detail for the metric panels:
- "aggregated" (default): one boxplot per model, pooling all groups and all datasets.
- "per_dataset": one boxplot per (model, dataset), grouped by dataset, colored by model.
- "per_group": one boxplot per (model, dataset, group), every edge/cluster visible, organized by dataset.
reference_model (Optional[str]) – Model to compare others against for significance testing. Set to None to disable significance brackets entirely.
show_significance (bool) – Whether to draw significance brackets. Only applies when reference_model is not None.
show_timing (bool) – Whether to include a timing comparison panel.
figsize_per_panel (tuple) – Size of each subplot.
show (bool) – Display the plot.
save (Optional[str]) – Path to save.
models_order (List[str] | None)
datasets_order (List[str] | None)

Returns:

matplotlib Figure.

Examples

Default — pooled, with significance vs velot:

velot.pl.benchmark_comparison("benchmark_results")

Without significance:

velot.pl.benchmark_comparison(
    "benchmark_results", show_significance=False,
)

Compare against a different reference:

velot.pl.benchmark_comparison(
    "benchmark_results", reference_model="scvelo_dynamical",
)

Per dataset:

velot.pl.benchmark_comparison(
    "benchmark_results", detail="per_dataset",
)

Full detail:

velot.pl.benchmark_comparison(
    "benchmark_results", detail="per_group",
    figsize_per_panel=(12, 4),
)

velot.benchmark.benchmark_comparison_individual(output_dir='benchmark_results', models=None, models_order=None, datasets=None, datasets_order=None, metrics=None, detail='aggregated', reference_model='velot', show_significance=True, show_timing=True, figsize=(6, 4), ylim_timing=None, save=False, save_path=None, save_prefix='benchmark_boxplots', save_legend=False, show=True)[source]

Parameters:

output_dir (str)
models (List[str] | None)
models_order (List[str] | None)
datasets (List[str] | None)
datasets_order (List[str] | None)
metrics (List[str] | None)
detail (str)
reference_model (str | None)
show_significance (bool)
show_timing (bool)
figsize (tuple)
ylim_timing (int | None)
save (bool)
save_path (str | None)
save_prefix (str)
save_legend (bool)
show (bool)

velot.benchmark.benchmark_summary_table(output_dir='benchmark_results', models=None, datasets=None, metrics=None, reference_model='velot', detail='aggregated', save_csv=None)[source]

Build a summary DataFrame with statistics and p-values.

Parameters:

output_dir (str) – Directory containing JSON result files.
models (Optional[List[str]]) – Filter to specific models. None for all.
datasets (Optional[List[str]]) – Filter to specific datasets. None for all.
metrics (Optional[List[str]]) – Which metrics to include. None for all.
reference_model (str) – Model to compare others against. P-values are computed between this model and each other model.
detail (str) –
Aggregation level:
- "aggregated": one row per (model, metric), pooling all datasets and groups.
- "per_dataset": one row per (model, dataset, metric), pooling groups within each dataset.
- "per_group": one row per (model, dataset, group, metric), no aggregation.
save_csv (Optional[str]) – Path to save the DataFrame as CSV. None to skip saving.

Return type:

DataFrame

Returns:

DataFrame with columns including mean, median, std, n, and p-value vs the reference model.

Examples

Quick overview:

df = velot.benchmark.benchmark_summary_table("benchmark_results")
print(df)

Per dataset with CSV export:

df = velot.benchmark.benchmark_summary_table(
    "benchmark_results",
    detail="per_dataset",
    save_csv="benchmark_results/summary_per_dataset.csv",
)

Full detail:

df = velot.benchmark.benchmark_summary_table(
    "benchmark_results",
    detail="per_group",
)

velot.benchmark.benchmark_heatmaps(output_dir='benchmark_results', models=None, datasets=None, metrics=None, stat='mean', annot_fmt='.3f', cmap='viridis', figsize_per_panel=(5, 3), show=True, save=None)[source]

Heatmap of model × dataset for each metric.

Parameters:

output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to plot. None for all.
stat (str) –
Statistic to display in each cell:
- "mean": mean of per-cell values.
- "median": median of per-cell values.
- "rank": average rank across groups within each dataset. Rank 1 = best. Averaged across all edges/clusters.
annot_fmt (str) – Format string for cell annotations.
cmap (str) – Colormap name. For rank, this is automatically reversed (lower rank = better = greener).
figsize_per_panel (tuple) – Size per metric panel.
show (bool) – Display the plot.
save (Optional[str]) – Path to save.

Returns:

matplotlib Figure.

Examples

Mean performance heatmap:

velot.pl.benchmark_heatmaps("benchmark_results", stat="mean")

Rank heatmap:

velot.pl.benchmark_heatmaps("benchmark_results", stat="rank")

Specific metrics:

velot.pl.benchmark_heatmaps(
    "benchmark_results", metrics=["cbdir"], stat="median",
)

velot.benchmark.benchmark_ranking(output_dir='benchmark_results', models=None, models_order=None, datasets=None, metrics=None, include_timing=True, figsize=(8, 5), show=True, save=None)[source]

Overall ranking summary: average rank across all datasets and groups.

For each (dataset, group, metric) combination, models are ranked 1 to N (1 = best). Ranks are then averaged to produce a single score per model. Lower is better.

Optionally includes execution time as a ranked criterion.

Parameters:

output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to include. None for all.
include_timing (bool) – Whether to include execution time as a ranking criterion.
figsize (tuple) – Figure size.
show (bool) – Display the plot.
save (Optional[str]) – Path to save.
models_order (List[str] | None)

Returns:

matplotlib Figure.

Examples

velot.pl.benchmark_ranking("benchmark_results")
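The ranking scheme described for benchmark_ranking (rank models 1 to N within each (dataset, group, metric) combination, then average the ranks per model) can be sketched as follows. The scores, the model name "baseline", and the tie-breaking rule are illustrative assumptions rather than velot's actual behaviour.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict

# Hypothetical scores keyed by (dataset, group, metric) -> {model: score}.
scores = {
    ("pancreas", "Alpha", "cbdir"): {"velot": 0.9, "scvelo_dynamical": 0.7, "baseline": 0.5},
    ("pancreas", "Alpha", "iccoh"): {"velot": 0.8, "scvelo_dynamical": 0.85, "baseline": 0.6},
    ("pancreas", "Beta", "cbdir"): {"velot": 0.6, "scvelo_dynamical": 0.4, "baseline": 0.5},
}


def rank_models(by_model: Dict[str, float], higher_is_better: bool = True) -> Dict[str, int]:
    """Assign rank 1 to the best model; ties fall back to input order (assumption)."""
    ordered = sorted(by_model, key=by_model.get, reverse=higher_is_better)
    return {model: position for position, model in enumerate(ordered, start=1)}


# Gather every rank a model received, then average them.
ranks_per_model = defaultdict(list)
for combination, by_model in scores.items():
    for model, rank in rank_models(by_model).items():
        ranks_per_model[model].append(rank)

average_rank = {model: fmean(ranks) for model, ranks in ranks_per_model.items()}
best_model = min(average_rank, key=average_rank.get)
```

A lower average rank is better, so best_model is the model with the smallest value in average_rank.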

velot.benchmark.benchmark_dotplot(output_dir='benchmark_results', models=None, datasets=None, datasets_order=None, metrics=None, include_timing=True, normalize_mode='global', stat='mean', metric_colors=None, metric_labels=None, model_labels=None, higher_is_better=None, figsize=None, legend=True, summary=True, show=True, save=None)[source]

Integrated dotplot benchmark summary.

Circle area reflects per-column normalised score (for visual comparison only). Displayed values match the heatmap values for the chosen stat.

Parameters:

output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to include. None for all found.
include_timing (bool) – Whether to include execution time as a metric group.
stat (str) –
Aggregation statistic — same as in benchmark_heatmaps:
- "mean": mean of per-cell values per (model, dataset).
- "median": median of per-cell values.
- "rank": average rank across groups within each dataset (1 = best). Consistent with heatmap stat="rank".
metric_colors (Optional[Dict[str, str]]) – Dict mapping metric name → color.
metric_labels (Optional[Dict[str, str]]) – Dict mapping metric name → display label.
model_labels (Optional[Dict[str, str]]) – Dict mapping model name → display label.
higher_is_better (Optional[Dict[str, bool]]) – Dict mapping metric name → bool. Only used for normalisation direction and stat="rank". Default True for accuracy metrics, False for time.
figsize (Optional[tuple]) – Figure size. Auto-computed if None.
show (bool) – Display the plot.
save (Optional[str]) – Path to save.
datasets_order (List[str] | None)
normalize_mode (str)
legend (bool)
summary (bool)

Returns:

matplotlib Figure.
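One plausible reading of "per-column normalised score" with a higher_is_better direction is min-max scaling within each column, flipped for metrics where lower is better. The sketch below assumes exactly that; the formula behind normalize_mode is not documented here, and the helper normalise_column and all values are hypothetical.

```python
from typing import Dict


def normalise_column(values: Dict[str, float], higher_is_better: bool = True) -> Dict[str, float]:
    """Min-max scale one column to [0, 1] so that 1.0 always marks the best model."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # Assumption: identical scores all map to the maximum.
        return {model: 1.0 for model in values}
    scaled = {model: (v - lo) / (hi - lo) for model, v in values.items()}
    if not higher_is_better:
        scaled = {model: 1.0 - s for model, s in scaled.items()}
    return scaled


# Hypothetical columns: one accuracy metric and one timing column (seconds).
cbdir = {"velot": 0.9, "scvelo_dynamical": 0.7, "baseline": 0.5}
runtime = {"velot": 120.0, "scvelo_dynamical": 300.0, "baseline": 60.0}

cbdir_norm = normalise_column(cbdir, higher_is_better=True)
runtime_norm = normalise_column(runtime, higher_is_better=False)
```

Under this assumption the normalised value only drives the circle size, while the number printed next to each dot stays the raw statistic.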
