Datasets (velot.datasets)

Real datasets (require scVelo):

adata = velot.datasets.pancreas()
adata = velot.datasets.erythroid()
adata = velot.datasets.dentategyrus()
adata = velot.datasets.bonemarrow()

Synthetic datasets (no dependencies):

adata = velot.datasets.synthetic_linear()
adata = velot.datasets.synthetic_bifurcation()
adata = velot.datasets.synthetic_tree(topology)

velot.datasets.pancreas()

Pancreas endocrinogenesis (day 15.5) — scVelo.

~3,700 cells, 4 main lineages branching from endocrine progenitors. Temporally mixed — progenitors and differentiated cells coexist spatially. Good stress test for VelOT’s spatial-temporal windowing.

Clusters: Ductal, Ngn3 low EP, Ngn3 high EP, Pre-endocrine, Alpha, Beta, Delta, Epsilon.

Reference: Bastidas-Ponce et al. (2019).

Return type:

AnnData

velot.datasets.erythroid()

Gastrulation erythroid lineage — scVelo.

Erythroid cells from mouse gastrulation. Well-ordered linear trajectory where simpler velocity methods already perform well. Good positive control.

Reference: Pijuan-Sala et al. (2019).

Return type:

AnnData

velot.datasets.dentategyrus()

Dentate gyrus neurogenesis — scVelo.

~2,900 cells from the hippocampal dentate gyrus with neurogenic trajectory. Contains both cycling progenitors and mature neurons.

Reference: Hochgerner et al. (2018).

Return type:

AnnData

velot.datasets.bonemarrow()

Human bone marrow — scVelo.

~5,780 cells from human bone marrow with multiple hematopoietic lineages. Complex branching structure.

Reference: Setty et al. (2019).

Return type:

AnnData

velot.datasets.synthetic_linear(densities=(100, 100), positions=(1.0, 2.0), noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)

Synthetic linear trajectory.

Generates cells along a straight line with optional noise and extra dimensions. Each segment gets a unique cell type label.

Parameters:
  • densities (Sequence[int]) – Number of cells in each segment.

  • positions (Sequence[float]) – End coordinate of each segment (first starts at 0).

  • noise_level (float) – Standard deviation of Gaussian noise added to coordinates. Set to 0 for a perfectly clean line.

  • extra_dimensions (int) – Number of additional noisy dimensions beyond the 2D plane.

  • n_neighbors (int) – Number of neighbors for the KNN graph.

  • seed (int) – Random seed.

Returns:

AnnData with:

  • adata.obs['celltype'] : segment labels

  • adata.obs['true_pseudotime'] : ground truth pseudotime in [0,1]

  • adata.obsm['X_pca'] : PCA coordinates

  • adata.obsm['X_umap'] : UMAP coordinates

Return type:

AnnData

Example

adata = velot.datasets.synthetic_linear(
    densities=[200, 200, 200],
    positions=[1, 2, 3],
    noise_level=0.05,
)
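
To make the roles of densities, positions and noise_level concrete, here is a minimal standalone sketch of how such a segmented line could be sampled. It is not VelOT's implementation: the function name sketch_linear, the uniform sampling within each segment, the Segment_N labels and the normalization by the last position are all assumptions made for illustration.

```python
import random


def sketch_linear(densities=(100, 100), positions=(1.0, 2.0),
                  noise_level=0.05, seed=42):
    """Hypothetical stand-in for a segmented linear trajectory (not velot code)."""
    rng = random.Random(seed)
    coords, labels, pseudotime = [], [], []
    start = 0.0
    total = float(positions[-1])  # assumed: last end coordinate spans the full line
    for i, (n_cells, end) in enumerate(zip(densities, positions)):
        for _ in range(n_cells):
            x = rng.uniform(start, end)  # noise-free position along the line
            coords.append((x + rng.gauss(0.0, noise_level),
                           rng.gauss(0.0, noise_level)))
            labels.append(f"Segment_{i + 1}")  # assumed label format
            pseudotime.append(x / total)       # ground truth scaled to [0, 1]
        start = end  # the next segment begins where this one ended
    return coords, labels, pseudotime


coords, labels, pseudotime = sketch_linear(
    densities=[200, 200, 200],
    positions=[1, 2, 3],
)
```

In this sketch the ground-truth pseudotime is taken from the noise-free position, so it stays in [0, 1] regardless of noise_level.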

velot.datasets.synthetic_bifurcation(root_density=200, root_position=1.0, branch_densities=(100, 100), branch_positions=(2.0, 2.0), branch_slopes=(2.0, -2.0), noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)

Synthetic bifurcating trajectory.

A root trunk splits into multiple branches diverging in the y-direction. This is the classic test case for velocity methods at branching points.

Parameters:
  • root_density (int) – Number of cells in the root trunk.

  • root_position (float) – End coordinate of the root (starts at 0).

  • branch_densities (Sequence[int]) – Number of cells per branch.

  • branch_positions (Sequence[float]) – End coordinate of each branch.

  • branch_slopes (Sequence[float]) – Y-axis slope of each branch (controls separation).

  • noise_level (float) – Gaussian noise standard deviation.

  • extra_dimensions (int) – Extra noisy dimensions.

  • n_neighbors (int) – KNN neighbors.

  • seed (int) – Random seed.

Returns:

AnnData with celltype labels and ground truth pseudotime.

Return type:

AnnData

Example

adata = velot.datasets.synthetic_bifurcation(
    root_density=300,
    branch_densities=[200, 200],
    branch_slopes=[2, -2],
)
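
As a rough worked example of how branch_slopes controls separation, the sketch below computes the noise-free tip of each branch. It assumes each branch follows y = slope * (x - root_position) from the end of the root. That geometry and the helper name branch_tips are illustrative assumptions, not taken from VelOT's source.

```python
def branch_tips(root_position=1.0, branch_positions=(2.0, 2.0),
                branch_slopes=(2.0, -2.0)):
    """Noise-free (x, y) tip of each branch under the assumed straight-line geometry."""
    return [
        (end, slope * (end - root_position))
        for end, slope in zip(branch_positions, branch_slopes)
    ]


tips = branch_tips()
# Vertical distance between the two branch tips.
separation = abs(tips[0][1] - tips[1][1])
```

Under these assumptions the default arguments put the tips at y = 2 and y = -2, so steeper slopes or longer branches widen the gap proportionally.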

velot.datasets.synthetic_tree(topology, noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)

Synthetic tree-structured trajectory.

Build arbitrarily complex branching topologies by specifying a list of branch definitions. Each branch starts where its parent ended.

Parameters:
  • topology (Sequence[dict]) –

    List of dicts, each with keys:
    • name (str): branch label

    • parent (str or None): parent branch name

    • n_cells (int): number of cells

    • length (float): pseudotime duration

    • slope (float): y-axis slope

  • noise_level (float) – Gaussian noise standard deviation.

  • extra_dimensions (int) – Extra noisy dimensions.

  • n_neighbors (int) – KNN neighbors.

  • seed (int) – Random seed.

Returns:

AnnData with celltype labels and ground truth pseudotime.

Return type:

AnnData

Example

topology = [
    {"name": "Root",     "parent": None,   "n_cells": 400,
     "length": 0.5, "slope": 0.0},
    {"name": "Branch_A", "parent": "Root", "n_cells": 200,
     "length": 0.5, "slope": 2.0},
    {"name": "Branch_B", "parent": "Root", "n_cells": 200,
     "length": 0.5, "slope": -2.0},
    {"name": "Branch_C", "parent": "Branch_B", "n_cells": 150,
     "length": 0.3, "slope": 0.0},
]
adata = velot.datasets.synthetic_tree(topology)
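
The rule that each branch starts where its parent ended can be sketched as a single pass over the topology list. The helper below is hypothetical (resolve_topology is not part of VelOT). It assumes parents are listed before their children, that x and pseudotime both advance by length, and that y advances by slope * length.

```python
def resolve_topology(topology):
    """Map each branch name to (start_xy, end_xy, t_start, t_end).

    Illustrative geometry only; the library may place branches differently.
    """
    tips = {}      # branch name -> (x, y, t) at the end of that branch
    resolved = {}
    for branch in topology:
        parent = branch["parent"]
        if parent is None:
            x0, y0, t0 = 0.0, 0.0, 0.0  # the root starts at the origin
        elif parent in tips:
            x0, y0, t0 = tips[parent]   # start where the parent ended
        else:
            raise ValueError(
                f"unknown parent {parent!r} for branch {branch['name']!r}"
            )
        x1 = x0 + branch["length"]
        y1 = y0 + branch["slope"] * branch["length"]
        t1 = t0 + branch["length"]
        tips[branch["name"]] = (x1, y1, t1)
        resolved[branch["name"]] = ((x0, y0), (x1, y1), t0, t1)
    return resolved


topology = [
    {"name": "Root", "parent": None, "n_cells": 400, "length": 0.5, "slope": 0.0},
    {"name": "Branch_A", "parent": "Root", "n_cells": 200, "length": 0.5, "slope": 2.0},
    {"name": "Branch_B", "parent": "Root", "n_cells": 200, "length": 0.5, "slope": -2.0},
    {"name": "Branch_C", "parent": "Branch_B", "n_cells": 150, "length": 0.3, "slope": 0.0},
]
resolved = resolve_topology(topology)
```

A check like this is also a cheap way to catch a misspelled parent name before generating any cells.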

velot.datasets.symmetric_bifurcation(n_root=400, n_branch=200, noise_level=0.05, extra_dimensions=0, seed=42)

Convenience: symmetric Y-shaped bifurcation.

Root ──┬── Branch_A (up)
       └── Branch_B (down)

Return type:

AnnData

Parameters:
  • n_root (int)

  • n_branch (int)

  • noise_level (float)

  • extra_dimensions (int)

  • seed (int)

velot.datasets.multifurcation(n_root=400, n_per_branch=150, n_branches=3, noise_level=0.05, extra_dimensions=0, seed=42)

Convenience: root splitting into multiple branches.

Branches are evenly spaced in angle.

Return type:

AnnData

Parameters:
  • n_root (int)

  • n_per_branch (int)

  • n_branches (int)

  • noise_level (float)

  • extra_dimensions (int)

  • seed (int)
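
One way to picture "evenly spaced in angle" is to convert a fan of angles into slopes and express the result in the topology format documented for synthetic_tree. The sketch below is hypothetical. The helper name fan_topology, the ±60° fan, the branch names and the 0.5 lengths are assumptions chosen for illustration, and nothing here implies that multifurcation is implemented this way.

```python
import math


def fan_topology(n_root=400, n_per_branch=150, n_branches=3,
                 max_angle_deg=60.0):
    """Build a root plus n_branches children with evenly spaced angles (assumed fan)."""
    topology = [{"name": "Root", "parent": None, "n_cells": n_root,
                 "length": 0.5, "slope": 0.0}]
    for i in range(n_branches):
        if n_branches == 1:
            angle = 0.0
        else:
            # Step evenly from +max_angle_deg down to -max_angle_deg.
            angle = max_angle_deg - i * (2.0 * max_angle_deg) / (n_branches - 1)
        topology.append({
            "name": f"Branch_{i + 1}",
            "parent": "Root",
            "n_cells": n_per_branch,
            "length": 0.5,
            "slope": math.tan(math.radians(angle)),  # slope = tan(angle)
        })
    return topology


fan = fan_topology()
slopes = [branch["slope"] for branch in fan[1:]]
```

With three branches this gives one branch going up, one straight ahead and one going down, with the outer two mirrored about the x-axis.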

velot.datasets.linear_with_cycle(n_linear=300, n_cycle=200, noise_level=0.05, seed=42)

Convenience: linear trajectory that loops back (challenging case).

This creates a dataset where pseudotime doesn’t correspond to a single direction — cells at the end are spatially near the start. Tests whether VelOT can handle cyclic structures.

Return type:

AnnData

Parameters:
  • n_linear (int)

  • n_cycle (int)

  • noise_level (float)

  • seed (int)

velot.datasets.synthetic_cycle(densities=(100, 100, 100, 100), spans=None, names=None, n_rotations=1.0, x_amplitude=1.0, y_amplitude=1.0, z_drift=0.0, noise_level=0.05, extra_dimensions=0, n_neighbors=50, seed=42)

Synthetic cyclic / helicoidal trajectory.

Cells follow a circular or helical path parameterized by pseudotime. This creates a challenging scenario where cells at the end are spatially near the start, and pseudotime does not correspond to a single spatial direction.

The trajectory is divided into segments (cell types), each with controllable density and pseudotime span.

Parameters:
  • densities (Sequence[int]) – Number of cells per segment. Length determines the number of cell types.

  • spans (Optional[Sequence[float]]) – Pseudotime span of each segment. The values may sum to any positive number; they are normalized internally. If None, all segments get equal spans.

  • names (Optional[Sequence[str]]) – Labels for each segment. If None, generates “Phase_1”, “Phase_2”, etc.

  • n_rotations (float) – Number of full 2π rotations over the entire pseudotime range. 1.0 = one full circle, 2.0 = two loops, 0.5 = half circle, etc.

  • x_amplitude (float) – Semi-axis length in x direction (controls elliptical shape).

  • y_amplitude (float) – Semi-axis length in y direction.

  • z_drift (float) – If > 0, adds a z-coordinate that increases linearly with pseudotime, turning the circle into a helix. The value controls the total height of the helix.

  • noise_level (float) – Gaussian noise standard deviation.

  • extra_dimensions (int) – Additional noisy dimensions.

  • n_neighbors (int) – KNN neighbors.

  • seed (int) – Random seed.

Returns:

AnnData with:

  • adata.obs['celltype'] : phase/segment labels

  • adata.obs['true_pseudotime'] : ground truth pseudotime in [0,1]

  • adata.obsm['X_pca'] : PCA coordinates

  • adata.obsm['X_umap'] : UMAP coordinates

Return type:

AnnData

Examples

Simple circle with 4 equal phases:

adata = velot.datasets.synthetic_cycle()

Elliptical orbit with unequal phases:

adata = velot.datasets.synthetic_cycle(
    densities=[200, 50, 200, 50],
    spans=[0.4, 0.1, 0.4, 0.1],
    names=["G1", "S", "G2", "M"],
    x_amplitude=2.0,
    y_amplitude=1.0,
)

Helix (two full turns with vertical drift):

adata = velot.datasets.synthetic_cycle(
    densities=[150, 150, 150, 150],
    n_rotations=2.0,
    z_drift=3.0,
)

Dense start, sparse end:

adata = velot.datasets.synthetic_cycle(
    densities=[400, 200, 100, 50],
    names=["Early", "Mid", "Late", "Terminal"],
)
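
The parameter descriptions for n_rotations, x_amplitude, y_amplitude, z_drift and spans suggest a simple parameterization by pseudotime t in [0, 1]. The sketch below writes it out under the assumption that the angle is 2π · n_rotations · t and that z rises linearly to z_drift. The helper names cycle_point and segment_bounds are hypothetical, and the library's actual formulas may differ.

```python
import math


def cycle_point(t, n_rotations=1.0, x_amplitude=1.0, y_amplitude=1.0,
                z_drift=0.0):
    """Noise-free (x, y, z) at pseudotime t under the assumed parameterization."""
    theta = 2.0 * math.pi * n_rotations * t
    return (x_amplitude * math.cos(theta),
            y_amplitude * math.sin(theta),
            z_drift * t)  # reaches z_drift at t = 1


def segment_bounds(densities, spans=None):
    """Pseudotime interval of each segment after normalizing spans to sum to 1."""
    if spans is None:
        spans = [1.0] * len(densities)  # equal spans by default
    total = float(sum(spans))
    bounds, start = [], 0.0
    for span in spans:
        end = start + span / total
        bounds.append((start, end))
        start = end
    return bounds


bounds = segment_bounds([200, 50, 200, 50], spans=[4, 1, 4, 1])
start_point = cycle_point(0.0)
end_point = cycle_point(1.0, n_rotations=2.0, z_drift=3.0)
```

Here spans of [4, 1, 4, 1] and [0.4, 0.1, 0.4, 0.1] produce the same intervals, which is what normalizing internally implies. With a whole number of rotations the end point returns to the starting x and y, offset only by the drift in z.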