Datasets (velot.datasets)
Real datasets (require scVelo):
adata = velot.datasets.pancreas()
adata = velot.datasets.erythroid()
adata = velot.datasets.dentategyrus()
adata = velot.datasets.bonemarrow()
Synthetic datasets (no dependencies):
adata = velot.datasets.synthetic_linear()
adata = velot.datasets.synthetic_bifurcation()
adata = velot.datasets.synthetic_tree(topology)
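Because the real datasets depend on scVelo while the synthetic ones do not, it can help to check for the optional dependency before choosing a loader. The following is a minimal sketch using only the standard library; `has_module` is a hypothetical helper for illustration, not part of VelOT.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found.

    Hypothetical helper (not part of VelOT): it only checks that the
    module is discoverable and does not import it.
    """
    return importlib.util.find_spec(name) is not None


# Pick a dataset family depending on whether scVelo is installed.
dataset_family = "real" if has_module("scvelo") else "synthetic"
print(f"Using {dataset_family} datasets")
```

With scVelo missing, the sketch falls back to the synthetic family, which this page describes as having no dependencies.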
- velot.datasets.pancreas()
Pancreas endocrinogenesis (day 15.5) — scVelo.
~3,700 cells, 4 main lineages branching from endocrine progenitors. Temporally mixed — progenitors and differentiated cells coexist spatially. Good stress test for VelOT’s spatial-temporal windowing.
Clusters: Ductal, Ngn3 low EP, Ngn3 high EP, Pre-endocrine, Alpha, Beta, Delta, Epsilon.
Reference: Bastidas-Ponce et al. (2019).
- Return type:
AnnData
- velot.datasets.erythroid()
Gastrulation erythroid lineage — scVelo.
Erythroid cells from mouse gastrulation. Well-ordered linear trajectory where simpler velocity methods already perform well. Good positive control.
Reference: Pijuan-Sala et al. (2019).
- Return type:
AnnData
- velot.datasets.dentategyrus()
Dentate gyrus neurogenesis — scVelo.
~2,900 cells from the hippocampal dentate gyrus with neurogenic trajectory. Contains both cycling progenitors and mature neurons.
Reference: Hochgerner et al. (2018).
- Return type:
AnnData
- velot.datasets.bonemarrow()
Human bone marrow — scVelo.
~5,780 cells from human bone marrow with multiple hematopoietic lineages. Complex branching structure.
Reference: Setty et al. (2019).
- Return type:
AnnData
- velot.datasets.synthetic_linear(densities=(100, 100), positions=(1.0, 2.0), noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)
Synthetic linear trajectory.
Generates cells along a straight line with optional noise and extra dimensions. Each segment gets a unique cell type label.
- Parameters:
densities (Sequence[int]) – Number of cells in each segment.
positions (Sequence[float]) – End coordinate of each segment (first starts at 0).
noise_level (float) – Standard deviation of Gaussian noise added to coordinates. Set to 0 for a perfectly clean line.
extra_dimensions (int) – Number of additional noisy dimensions beyond the 2D plane.
n_neighbors (int) – Number of neighbors for the KNN graph.
seed (int) – Random seed.
- Returns:
AnnData with:
adata.obs['celltype']: segment labels
adata.obs['true_pseudotime']: ground truth pseudotime in [0,1]
adata.obsm['X_pca']: PCA coordinates
adata.obsm['X_umap']: UMAP coordinates
- Return type:
AnnData
Example
adata = velot.datasets.synthetic_linear(
    densities=[200, 200, 200],
    positions=[1, 2, 3],
    noise_level=0.05,
)
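To make the roles of `densities`, `positions`, and `noise_level` concrete, here is a minimal standard-library sketch of how such a line could be sampled. It is not VelOT's implementation. Uniform sampling within each segment, the `Segment_i` labels, and pseudotime defined as x divided by the final position are all assumptions made for illustration.

```python
import random


def sketch_linear(densities=(100, 100), positions=(1.0, 2.0),
                  noise_level=0.05, seed=42):
    """Return (coords, pseudotime, labels) for a noisy straight line.

    Illustrative only: the sampling scheme and label names are assumptions.
    """
    rng = random.Random(seed)
    starts = [0.0] + list(positions[:-1])   # first segment starts at 0
    total = float(positions[-1])            # end of the last segment
    coords, pseudotime, labels = [], [], []
    for i, (n, lo, hi) in enumerate(zip(densities, starts, positions), start=1):
        for _ in range(n):
            x = rng.uniform(lo, hi)         # assumed: uniform within a segment
            pseudotime.append(x / total)    # rescaled to [0, 1]
            coords.append((x + rng.gauss(0.0, noise_level),
                           rng.gauss(0.0, noise_level)))
            labels.append(f"Segment_{i}")   # hypothetical label scheme
    return coords, pseudotime, labels


coords, pseudotime, labels = sketch_linear(
    densities=[200, 200, 200], positions=[1, 2, 3], noise_level=0.05
)
print(len(coords), sorted(set(labels)))
```

Setting `noise_level=0` in this sketch places every point exactly on the x-axis, matching the "perfectly clean line" described in the parameter list.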
- velot.datasets.synthetic_bifurcation(root_density=200, root_position=1.0, branch_densities=(100, 100), branch_positions=(2.0, 2.0), branch_slopes=(2.0, -2.0), noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)
Synthetic bifurcating trajectory.
A root trunk splits into multiple branches diverging in the y-direction. This is the classic test case for velocity methods at branching points.
- Parameters:
root_density (int) – Number of cells in the root trunk.
root_position (float) – End coordinate of the root (starts at 0).
branch_densities (Sequence[int]) – Number of cells per branch.
branch_positions (Sequence[float]) – End coordinate of each branch.
branch_slopes (Sequence[float]) – Y-axis slope of each branch (controls separation).
noise_level (float) – Gaussian noise standard deviation.
extra_dimensions (int) – Extra noisy dimensions.
n_neighbors (int) – KNN neighbors.
seed (int) – Random seed.
- Return type:
AnnData with celltype labels and ground truth pseudotime.
Example
adata = velot.datasets.synthetic_bifurcation(
    root_density=300,
    branch_densities=[200, 200],
    branch_slopes=[2, -2],
)
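The geometry described above (a trunk along the x-axis, then branches that diverge in y according to their slopes) can be sketched with the standard library as follows. This is an illustration under stated assumptions, not VelOT's code: each branch is assumed to start at the end of the root, pseudotime is assumed to be x divided by the largest branch end position, and the `Branch_i` labels are hypothetical.

```python
import random


def sketch_bifurcation(root_density=200, root_position=1.0,
                       branch_densities=(100, 100),
                       branch_positions=(2.0, 2.0),
                       branch_slopes=(2.0, -2.0),
                       noise_level=0.05, seed=42):
    """Return a list of (x, y, pseudotime, label) tuples.

    Illustrative only: the layout and pseudotime definition are assumptions.
    """
    rng = random.Random(seed)
    t_max = float(max(branch_positions))
    cells = []

    def add_segment(n, lo, hi, slope, label):
        for _ in range(n):
            x = rng.uniform(lo, hi)
            y = slope * (x - root_position)   # branches leave the trunk end
            cells.append((x + rng.gauss(0.0, noise_level),
                          y + rng.gauss(0.0, noise_level),
                          x / t_max, label))

    add_segment(root_density, 0.0, root_position, 0.0, "Root")
    for i, (n, end, slope) in enumerate(
            zip(branch_densities, branch_positions, branch_slopes), start=1):
        add_segment(n, root_position, end, slope, f"Branch_{i}")
    return cells


cells = sketch_bifurcation(root_density=300, branch_densities=[200, 200],
                           branch_slopes=[2, -2])
print(len(cells))
```

Larger absolute slopes separate the branches faster after the split, which is what "controls separation" refers to in the parameter list.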
- velot.datasets.synthetic_tree(topology, noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)
Synthetic tree-structured trajectory.
Build arbitrarily complex branching topologies by specifying a list of branch definitions. Each branch starts where its parent ended.
- Parameters:
topology – List of dicts, each with keys:
name (str): branch label
parent (str or None): parent branch name
n_cells (int): number of cells
length (float): pseudotime duration
slope (float): y-axis slope
noise_level (float) – Gaussian noise standard deviation.
extra_dimensions (int) – Extra noisy dimensions.
n_neighbors (int) – KNN neighbors.
seed (int) – Random seed.
- Return type:
AnnData with celltype labels and ground truth pseudotime.
Example
topology = [
    {"name": "Root", "parent": None, "n_cells": 400, "length": 0.5, "slope": 0.0},
    {"name": "Branch_A", "parent": "Root", "n_cells": 200, "length": 0.5, "slope": 2.0},
    {"name": "Branch_B", "parent": "Root", "n_cells": 200, "length": 0.5, "slope": -2.0},
    {"name": "Branch_C", "parent": "Branch_B", "n_cells": 150, "length": 0.3, "slope": 0.0},
]
adata = velot.datasets.synthetic_tree(topology)
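The rule that each branch starts where its parent ended can be traced by hand with a small helper. The sketch below is not part of VelOT. It assumes the x-axis is pseudotime, the root starts at (0, 0), and parents are listed before their children; `branch_endpoints` is a hypothetical name.

```python
def branch_endpoints(topology):
    """Map each branch name to its ((x0, y0), (x1, y1)) start and end points.

    Hypothetical helper: assumes x is pseudotime, the root starts at
    (0, 0), and every parent appears in the list before its children.
    """
    ends = {}
    spans = {}
    for branch in topology:
        parent = branch["parent"]
        if parent is None:
            start = (0.0, 0.0)
        elif parent in ends:
            start = ends[parent]              # start where the parent ended
        else:
            raise ValueError(f"unknown or out-of-order parent: {parent!r}")
        end = (start[0] + branch["length"],
               start[1] + branch["slope"] * branch["length"])
        ends[branch["name"]] = end
        spans[branch["name"]] = (start, end)
    return spans


topology = [
    {"name": "Root", "parent": None, "n_cells": 400, "length": 0.5, "slope": 0.0},
    {"name": "Branch_A", "parent": "Root", "n_cells": 200, "length": 0.5, "slope": 2.0},
    {"name": "Branch_B", "parent": "Root", "n_cells": 200, "length": 0.5, "slope": -2.0},
    {"name": "Branch_C", "parent": "Branch_B", "n_cells": 150, "length": 0.3, "slope": 0.0},
]
for name, (start, end) in branch_endpoints(topology).items():
    print(name, start, end)
```

Under these assumptions, Branch_C begins at the end of Branch_B and continues flat, because its slope is 0.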
- velot.datasets.symmetric_bifurcation(n_root=400, n_branch=200, noise_level=0.05, extra_dimensions=0, seed=42)
Convenience: symmetric Y-shaped bifurcation.
Root ──┬── Branch_A (up)
       └── Branch_B (down)
- velot.datasets.multifurcation(n_root=400, n_per_branch=150, n_branches=3, noise_level=0.05, extra_dimensions=0, seed=42)
Convenience: root splitting into multiple branches.
Branches are evenly spaced in angle.
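One way to picture "evenly spaced in angle" is to build an equivalent topology list in the format documented for `synthetic_tree`. The angular range used by `multifurcation` is not stated here, so this sketch assumes a symmetric fan between -60° and +60°. `fan_topology`, `max_angle_deg`, and `length` are hypothetical names introduced for illustration only.

```python
import math


def fan_topology(n_branches=3, n_root=400, n_per_branch=150,
                 max_angle_deg=60.0, length=0.5):
    """Build a synthetic_tree-style topology with evenly spaced branch angles.

    Hypothetical helper: the +/- max_angle_deg fan and the shared branch
    length are assumptions, not documented VelOT behavior.
    """
    topology = [{"name": "Root", "parent": None, "n_cells": n_root,
                 "length": length, "slope": 0.0}]
    for i in range(n_branches):
        if n_branches == 1:
            angle = 0.0
        else:
            # Evenly spaced from -max_angle_deg to +max_angle_deg, inclusive.
            angle = -max_angle_deg + i * (2.0 * max_angle_deg) / (n_branches - 1)
        topology.append({
            "name": f"Branch_{i + 1}",
            "parent": "Root",
            "n_cells": n_per_branch,
            "length": length,
            "slope": math.tan(math.radians(angle)),  # angle -> y-axis slope
        })
    return topology


for branch in fan_topology():
    print(branch["name"], round(branch["slope"], 3))
```

The resulting list uses the documented topology keys, so it could be passed to `synthetic_tree` to compare against the convenience function.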
- velot.datasets.linear_with_cycle(n_linear=300, n_cycle=200, noise_level=0.05, seed=42)
Convenience: linear trajectory that loops back (challenging case).
This creates a dataset where pseudotime doesn’t correspond to a single direction — cells at the end are spatially near the start. Tests whether VelOT can handle cyclic structures.
- velot.datasets.synthetic_cycle(densities=(100, 100, 100, 100), spans=None, names=None, n_rotations=1.0, x_amplitude=1.0, y_amplitude=1.0, z_drift=0.0, noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)
Synthetic cyclic / helicoidal trajectory.
Cells follow a circular or helical path parameterized by pseudotime. This creates a challenging scenario where cells at the end are spatially near the start, and pseudotime does not correspond to a single spatial direction.
The trajectory is divided into segments (cell types), each with controllable density and pseudotime span.
- Parameters:
densities (Sequence[int]) – Number of cells per segment. Length determines the number of cell types.
spans (Optional[Sequence[float]]) – Pseudotime span of each segment. The values may sum to any positive total; they are normalized internally. If None, all segments get equal spans.
names (Optional[Sequence[str]]) – Labels for each segment. If None, generates “Phase_1”, “Phase_2”, etc.
n_rotations (float) – Number of full 2π rotations over the entire pseudotime range. 1.0 = one full circle, 2.0 = two loops, 0.5 = half circle, etc.
x_amplitude (float) – Semi-axis length in x direction (controls elliptical shape).
y_amplitude (float) – Semi-axis length in y direction.
z_drift (float) – If > 0, adds a z-coordinate that increases linearly with pseudotime, turning the circle into a helix. The value controls the total height of the helix.
noise_level (float) – Gaussian noise standard deviation.
extra_dimensions (int) – Additional noisy dimensions.
n_neighbors (int) – KNN neighbors.
seed (int) – Random seed.
- Returns:
AnnData with:
adata.obs['celltype']: phase/segment labels
adata.obs['true_pseudotime']: ground truth pseudotime in [0,1]
adata.obsm['X_pca']: PCA coordinates
adata.obsm['X_umap']: UMAP coordinates
- Return type:
AnnData
Examples
Simple circle with 4 equal phases:
adata = velot.datasets.synthetic_cycle()
Elliptical orbit with unequal phases:
adata = velot.datasets.synthetic_cycle(
    densities=[200, 50, 200, 50],
    spans=[0.4, 0.1, 0.4, 0.1],
    names=["G1", "S", "G2", "M"],
    x_amplitude=2.0,
    y_amplitude=1.0,
)
Helix with two full turns and vertical drift:
adata = velot.datasets.synthetic_cycle(
    densities=[150, 150, 150, 150],
    n_rotations=2.0,
    z_drift=3.0,
)
Dense start, sparse end:
adata = velot.datasets.synthetic_cycle(
    densities=[400, 200, 100, 50],
    names=["Early", "Mid", "Late", "Terminal"],
)
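The parameterization described for `synthetic_cycle` (span normalization, angle proportional to pseudotime, elliptical semi-axes, linear z drift) can be written out as a short standard-library sketch. This is an illustration under stated assumptions rather than VelOT's implementation: pseudotime is assumed to be sampled uniformly within each segment, and the angle is assumed to be 2π · n_rotations · t starting from the positive x-axis.

```python
import math
import random


def sketch_cycle(densities=(100, 100, 100, 100), spans=None,
                 n_rotations=1.0, x_amplitude=1.0, y_amplitude=1.0,
                 z_drift=0.0, noise_level=0.05, seed=42):
    """Return a list of (x, y, z, pseudotime, label) tuples.

    Illustrative only: the sampling scheme and starting angle are assumptions.
    """
    rng = random.Random(seed)
    if spans is None:
        spans = [1.0] * len(densities)        # equal spans by default
    # Normalize spans so the segment edges cover [0, 1] exactly.
    running, cumulative = 0.0, []
    for s in spans:
        running += s
        cumulative.append(running)
    edges = [0.0] + [c / running for c in cumulative]

    cells = []
    for i, n in enumerate(densities):
        for _ in range(n):
            t = rng.uniform(edges[i], edges[i + 1])
            theta = 2.0 * math.pi * n_rotations * t
            x = x_amplitude * math.cos(theta) + rng.gauss(0.0, noise_level)
            y = y_amplitude * math.sin(theta) + rng.gauss(0.0, noise_level)
            z = z_drift * t + rng.gauss(0.0, noise_level)  # helix height
            cells.append((x, y, z, t, f"Phase_{i + 1}"))
    return cells


cells = sketch_cycle(densities=[150, 150, 150, 150], n_rotations=2.0, z_drift=3.0)
print(len(cells), max(c[2] for c in cells))
```

With `n_rotations=1.0`, cells with pseudotime near 0 and near 1 land at almost the same (x, y) position, which is the wrap-around that makes this case challenging.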