Benchmarking (velot.benchmark)
- class velot.benchmark.BenchmarkTimer[source]
Bases: object
Context manager to time pipeline stages.
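The interface of BenchmarkTimer is not documented here, so the following is only a minimal sketch of the general pattern: a context manager that records wall-clock time per pipeline stage. The class `StageTimer`, its `stage()` method, and the `timings` attribute are hypothetical names for illustration and are not part of velot.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Hypothetical stand-in for a stage timer; not velot's BenchmarkTimer."""

    def __init__(self):
        self.timings = {}  # stage name -> elapsed seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield self
        finally:
            # Record the elapsed time even if the stage raises.
            self.timings[name] = time.perf_counter() - start


timer = StageTimer()
with timer.stage("preprocess"):
    sum(range(100_000))  # placeholder workload
with timer.stage("fit"):
    sorted(range(50_000), reverse=True)  # placeholder workload

total_seconds = sum(timer.timings.values())
```

Recording the time in a `finally` clause keeps the timing entry even when a stage fails, which may be useful when comparing models that do not all converge.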
- velot.benchmark.save_benchmark(adata, results, timer, model_name, dataset_name, output_dir='benchmark_results', extra_info=None)[source]
Save benchmark results for one (model, dataset) run.
- Parameters:
results (Dict[str, Any]) – Output of velot.metrics.summary().
timer (BenchmarkTimer) – BenchmarkTimer with timing information.
model_name (str) – Name of the model (e.g., "scvelo_dynamical", "velot").
dataset_name (str) – Name of the dataset (e.g., "pancreas").
output_dir (str) – Directory to save results.
adata (AnnData)
- Returns:
Path to the saved JSON file.
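As an illustration of the "one JSON file per (model, dataset) run" idea, here is a minimal sketch using only the standard library. The helper `save_run`, the payload keys, and the `<model>__<dataset>.json` file name are assumptions made for this example; the actual layout written by save_benchmark may differ.

```python
import json
import tempfile
from pathlib import Path


def save_run(results, timings, model_name, dataset_name, output_dir):
    """Write one JSON file per (model, dataset) run and return its path."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "model": model_name,
        "dataset": dataset_name,
        "metrics": results,
        "timing": timings,
    }
    # Assumed naming scheme: <model>__<dataset>.json
    path = out_dir / f"{model_name}__{dataset_name}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_run(
        results={"cbdir_mean": 0.5},  # illustrative value only
        timings={"fit": 1.5},  # illustrative value only
        model_name="velot",
        dataset_name="pancreas",
        output_dir=tmp,
    )
    saved_name = saved.name
    reloaded = json.loads(saved.read_text())
```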
- velot.benchmark.load_benchmarks(output_dir='benchmark_results', models=None, datasets=None)[source]
Load benchmark summaries: one row per (model, dataset).
Extracts scalar metrics (*_mean) and timing information.
- velot.benchmark.load_benchmarks_per_group(output_dir='benchmark_results', models=None, datasets=None, metric=None)[source]
Load per-cell metric values for boxplots.
Parses the nested structure in the JSON where each metric (e.g., "cbdir", "iccoh") contains a dict of group → array-of-values. Groups can be edges ("Fev+ → Alpha") or clusters ("Alpha").
- Parameters:
output_dir (str) – Directory containing JSON result files.
models (Optional[List[str]]) – Filter to specific models. None for all.
datasets (Optional[List[str]]) – Filter to specific datasets. None for all.
metric (Optional[str]) – Specific metric to load (e.g., "cbdir"). If None, loads all metrics that contain per-cell arrays.
- Return type:
DataFrame
- Returns:
Long-format DataFrame with columns model, dataset, metric, group, value. Each row is one cell's value for one (model, dataset, metric, group).
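The flattening from a nested metric → group → values structure into long-format rows can be sketched as follows. The helper `to_long_rows` and the example values are hypothetical; the sketch only assumes the nesting described above and skips scalar entries such as `*_mean`.

```python
def to_long_rows(model, dataset, metrics):
    """Flatten {metric: {group: [values]}} into one row per cell value."""
    rows = []
    for metric, groups in metrics.items():
        if not isinstance(groups, dict):
            continue  # skip scalar entries such as "cbdir_mean"
        for group, values in groups.items():
            for value in values:
                rows.append(
                    {
                        "model": model,
                        "dataset": dataset,
                        "metric": metric,
                        "group": group,
                        "value": value,
                    }
                )
    return rows


# Illustrative values only; not real benchmark results.
run = {
    "cbdir": {"Fev+ → Alpha": [0.5, 0.25], "Fev+ → Beta": [0.75]},
    "iccoh": {"Alpha": [1.0, 0.5]},
    "cbdir_mean": 0.5,
}
rows = to_long_rows("velot", "pancreas", run)
```

A list of dicts in this shape can be passed directly to `pandas.DataFrame(rows)` to obtain the long-format table.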
- velot.benchmark.load_benchmarks_per_group_summary(output_dir='benchmark_results', models=None, datasets=None)[source]
Load one row per (model, dataset, metric, group) with summary stats.
Useful for bar charts or compact comparisons where per-cell resolution is not needed.
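To make the "one row per (model, dataset, metric, group)" shape concrete, the sketch below collapses per-cell values into summary statistics with the standard library. The exact statistics returned by load_benchmarks_per_group_summary are not listed here, so mean, median, sample standard deviation, and n are assumptions, as is the helper name `summarise`.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarise(cells):
    """cells: iterable of (model, dataset, metric, group, value) tuples."""
    grouped = defaultdict(list)
    for model, dataset, metric, group, value in cells:
        grouped[(model, dataset, metric, group)].append(value)
    summary = {}
    for key, values in grouped.items():
        summary[key] = {
            "mean": mean(values),
            "median": median(values),
            # Sample standard deviation is undefined for a single value.
            "std": stdev(values) if len(values) > 1 else float("nan"),
            "n": len(values),
        }
    return summary


# Illustrative values only; not real benchmark results.
cells = [
    ("velot", "pancreas", "cbdir", "Alpha", 0.25),
    ("velot", "pancreas", "cbdir", "Alpha", 0.5),
    ("velot", "pancreas", "cbdir", "Alpha", 0.75),
    ("velot", "pancreas", "cbdir", "Beta", 1.0),
]
summary = summarise(cells)
alpha = summary[("velot", "pancreas", "cbdir", "Alpha")]
beta = summary[("velot", "pancreas", "cbdir", "Beta")]
```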
- velot.benchmark.benchmark_comparison(output_dir='benchmark_results', models=None, models_order=None, datasets=None, datasets_order=None, metrics=None, detail='aggregated', reference_model='velot', show_significance=True, show_timing=True, figsize_per_panel=(5, 4), show=True, save=None)[source]
Compare benchmark results across models and datasets.
- Parameters:
output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to plot (e.g., ["cbdir", "iccoh"]). None to auto-detect all metrics with per-cell data.
detail (str) – Level of detail for the metric panels:
  "aggregated" (default): one boxplot per model, pooling all groups and all datasets.
  "per_dataset": one boxplot per (model, dataset), grouped by dataset, colored by model.
  "per_group": one boxplot per (model, dataset, group), every edge/cluster visible, organized by dataset.
reference_model (Optional[str]) – Model to compare others against for significance testing. Set to None to disable significance brackets entirely.
show_significance (bool) – Whether to draw significance brackets. Only applies when reference_model is not None.
show_timing (bool) – Whether to include a timing comparison panel.
figsize_per_panel (tuple) – Size of each subplot.
show (bool) – Display the plot.
- Return type:
matplotlib Figure.
Examples
Default (pooled, with significance vs velot):
velot.pl.benchmark_comparison("benchmark_results")
Without significance:
velot.pl.benchmark_comparison("benchmark_results", show_significance=False)
Compare against a different reference:
velot.pl.benchmark_comparison("benchmark_results", reference_model="scvelo_dynamical")
Per dataset:
velot.pl.benchmark_comparison("benchmark_results", detail="per_dataset")
Full detail:
velot.pl.benchmark_comparison("benchmark_results", detail="per_group", figsize_per_panel=(12, 4))
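The three `detail` levels differ only in which keys the per-cell values are pooled by before a boxplot is drawn. A minimal sketch for a single metric, assuming long-format rows with model, dataset, group, and value fields; `DETAIL_KEYS`, `pool`, and the dataset name "dataset_b" are hypothetical and do not reflect how velot implements the pooling.

```python
from collections import defaultdict

# Grouping keys per detail level, following the parameter description above.
DETAIL_KEYS = {
    "aggregated": ("model",),
    "per_dataset": ("model", "dataset"),
    "per_group": ("model", "dataset", "group"),
}


def pool(rows, detail):
    """Return {grouping key: [values]}; each key corresponds to one boxplot."""
    keys = DETAIL_KEYS[detail]
    pooled = defaultdict(list)
    for row in rows:
        pooled[tuple(row[k] for k in keys)].append(row["value"])
    return dict(pooled)


# Illustrative values only; not real benchmark results.
rows = [
    {"model": "velot", "dataset": "pancreas", "group": "Alpha", "value": 0.5},
    {"model": "velot", "dataset": "pancreas", "group": "Beta", "value": 0.75},
    {"model": "velot", "dataset": "dataset_b", "group": "Alpha", "value": 0.25},
    {"model": "scvelo_dynamical", "dataset": "pancreas", "group": "Alpha", "value": 0.4},
]

boxes_aggregated = pool(rows, "aggregated")
boxes_per_dataset = pool(rows, "per_dataset")
boxes_per_group = pool(rows, "per_group")
```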
- velot.benchmark.benchmark_comparison_individual(output_dir='benchmark_results', models=None, models_order=None, datasets=None, datasets_order=None, metrics=None, detail='aggregated', reference_model='velot', show_significance=True, show_timing=True, figsize=(6, 4), ylim_timing=None, save=False, save_path=None, save_prefix='benchmark_boxplots', save_legend=False, show=True)[source]
- velot.benchmark.benchmark_summary_table(output_dir='benchmark_results', models=None, datasets=None, metrics=None, reference_model='velot', detail='aggregated', save_csv=None)[source]
Build a summary DataFrame with statistics and p-values.
- Parameters:
output_dir (str) – Directory containing JSON result files.
models (Optional[List[str]]) – Filter to specific models. None for all.
datasets (Optional[List[str]]) – Filter to specific datasets. None for all.
metrics (Optional[List[str]]) – Which metrics to include. None for all.
reference_model (str) – Model to compare others against. P-values are computed between this model and each other model.
detail (str) – Aggregation level:
  "aggregated": one row per (model, metric), pooling all datasets and groups.
  "per_dataset": one row per (model, dataset, metric), pooling groups within each dataset.
  "per_group": one row per (model, dataset, group, metric), no aggregation.
save_csv (Optional[str]) – Path to save the DataFrame as CSV. None to skip saving.
- Return type:
DataFrame
- Returns:
DataFrame with columns including mean, median, std, n, and p-value vs the reference model.
Examples
Quick overview:
df = velot.benchmark.benchmark_summary_table("benchmark_results")
print(df)
Per dataset with CSV export:
df = velot.benchmark.benchmark_summary_table("benchmark_results", detail="per_dataset", save_csv="benchmark_results/summary_per_dataset.csv")
Full detail:
df = velot.benchmark.benchmark_summary_table("benchmark_results", detail="per_group")
- velot.benchmark.benchmark_heatmaps(output_dir='benchmark_results', models=None, datasets=None, metrics=None, stat='mean', annot_fmt='.3f', cmap='viridis', figsize_per_panel=(5, 3), show=True, save=None)[source]
Heatmap of model × dataset for each metric.
- Parameters:
output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to plot. None for all.
stat (str) – Statistic to display in each cell:
  "mean": mean of per-cell values.
  "median": median of per-cell values.
  "rank": average rank across groups within each dataset. Rank 1 = best. Averaged across all edges/clusters.
annot_fmt (str) – Format string for cell annotations.
cmap (str) – Colormap name. For rank, this is automatically reversed (lower rank = better = greener).
figsize_per_panel (tuple) – Size per metric panel.
show (bool) – Display the plot.
- Return type:
matplotlib Figure.
Examples
Mean performance heatmap:
velot.pl.benchmark_heatmaps("benchmark_results", stat="mean")
Rank heatmap:
velot.pl.benchmark_heatmaps("benchmark_results", stat="rank")
Specific metrics:
velot.pl.benchmark_heatmaps("benchmark_results", metrics=["cbdir"], stat="median")
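For the "mean" and "median" options, the model × dataset matrix behind one heatmap panel can be sketched as below. The helper `heatmap_matrix`, the alphabetical row and column ordering, and the NaN fill for missing (model, dataset) pairs are assumptions for illustration, not velot's behaviour.

```python
from statistics import mean, median

STATS = {"mean": mean, "median": median}


def heatmap_matrix(values, stat="mean"):
    """values: {(model, dataset): [per-cell values]} for a single metric."""
    fn = STATS[stat]
    models = sorted({m for m, _ in values})
    datasets = sorted({d for _, d in values})
    matrix = [
        [fn(values[(m, d)]) if (m, d) in values else float("nan") for d in datasets]
        for m in models
    ]
    return models, datasets, matrix


# Illustrative values only; not real benchmark results.
values = {
    ("velot", "pancreas"): [0.5, 1.0],
    ("velot", "dataset_b"): [0.25, 0.75, 0.5],
    ("scvelo_dynamical", "pancreas"): [0.25, 0.75],
}
models, datasets, matrix = heatmap_matrix(values, stat="mean")
```

Each row of `matrix` corresponds to one model and each column to one dataset, which is the layout a heatmap routine such as matplotlib's `imshow` expects.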
- velot.benchmark.benchmark_ranking(output_dir='benchmark_results', models=None, models_order=None, datasets=None, metrics=None, include_timing=True, figsize=(8, 5), show=True, save=None)[source]
Overall ranking summary: average rank across all datasets and groups.
For each (dataset, group, metric) combination, models are ranked 1 to N (1 = best). Ranks are then averaged to produce a single score per model. Lower is better.
Optionally includes execution time as a ranked criterion.
- Parameters:
output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to include. None for all.
include_timing (bool) – Whether to include execution time as a ranking criterion.
figsize (tuple) – Figure size.
show (bool) – Display the plot.
- Return type:
matplotlib Figure.
Examples
velot.pl.benchmark_ranking("benchmark_results")
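The ranking scheme described above (rank models 1 to N within each (dataset, group, metric) combination, then average the ranks per model) can be sketched as follows. The helper `average_ranks` is hypothetical, the model name "model_c" is a placeholder, and ties are broken by input order here, which is a simplification; how velot handles ties is not stated.

```python
from collections import defaultdict


def average_ranks(scores, higher_is_better=True):
    """scores: {(dataset, group, metric): {model: score}} -> {model: mean rank}."""
    ranks = defaultdict(list)
    for per_model in scores.values():
        # Best model first; rank 1 = best.
        ordered = sorted(per_model, key=per_model.get, reverse=higher_is_better)
        for position, model in enumerate(ordered, start=1):
            ranks[model].append(position)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Illustrative values only; not real benchmark results.
scores = {
    ("pancreas", "Alpha", "cbdir"): {
        "velot": 0.9,
        "scvelo_dynamical": 0.7,
        "model_c": 0.5,
    },
    ("pancreas", "Beta", "cbdir"): {
        "velot": 0.6,
        "scvelo_dynamical": 0.8,
        "model_c": 0.4,
    },
}
mean_rank = average_ranks(scores)
```

For a criterion where lower is better, such as execution time, the same function can be called with `higher_is_better=False`.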
- velot.benchmark.benchmark_dotplot(output_dir='benchmark_results', models=None, datasets=None, datasets_order=None, metrics=None, include_timing=True, normalize_mode='global', stat='mean', metric_colors=None, metric_labels=None, model_labels=None, higher_is_better=None, figsize=None, legend=True, summary=True, show=True, save=None)[source]
Integrated dotplot benchmark summary.
Circle area reflects per-column normalised score (for visual comparison only). Displayed values match the heatmap values for the chosen stat.
- Parameters:
output_dir (str) – Directory with saved JSON benchmark files.
models (Optional[List[str]]) – Which models to include. None for all.
datasets (Optional[List[str]]) – Which datasets to include. None for all.
metrics (Optional[List[str]]) – Which metrics to include. None for all found.
include_timing (bool) – Whether to include execution time as a metric group.
stat (str) – Aggregation statistic, same as in benchmark_heatmaps:
  "mean": mean of per-cell values per (model, dataset).
  "median": median of per-cell values.
  "rank": average rank across groups within each dataset (1 = best). Consistent with heatmap stat="rank".
metric_colors (Optional[Dict[str, str]]) – Dict mapping metric name → color.
metric_labels (Optional[Dict[str, str]]) – Dict mapping metric name → display label.
model_labels (Optional[Dict[str, str]]) – Dict mapping model name → display label.
higher_is_better (Optional[Dict[str, bool]]) – Dict mapping metric name → bool. Only used for normalisation direction and stat="rank". Default True for accuracy metrics, False for time.
figsize (Optional[tuple]) – Figure size. Auto-computed if None.
show (bool) – Display the plot.
normalize_mode (str)
legend (bool)
summary (bool)
- Return type:
matplotlib Figure.
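The normalisation used for circle area is not specified beyond "per-column" and the `higher_is_better` direction. The sketch below assumes a plain min-max scaling within one column, flipped when lower values are better; the helper `normalise_column` and the choice of min-max are assumptions for illustration only.

```python
def normalise_column(values, higher_is_better=True):
    """Min-max scale {model: value} to [0, 1], where 1 is the best model."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        # All models are equal in this column; give them the same size.
        return {model: 1.0 for model in values}
    scaled = {model: (v - lo) / (hi - lo) for model, v in values.items()}
    if not higher_is_better:
        # Flip the scale so the smallest raw value gets the largest score.
        scaled = {model: 1.0 - s for model, s in scaled.items()}
    return scaled


# Illustrative values only; not real benchmark results.
cbdir = {"velot": 0.75, "scvelo_dynamical": 0.25, "model_c": 0.5}
seconds = {"velot": 30.0, "scvelo_dynamical": 10.0, "model_c": 20.0}

cbdir_score = normalise_column(cbdir, higher_is_better=True)
time_score = normalise_column(seconds, higher_is_better=False)
```

Scaling each column independently keeps metrics with different ranges visually comparable, while the annotated values can still show the raw statistic.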