scATAC clustering, annotation and gene activity

scATAC clustering, annotation and gene activity#

The first tutorial took a raw fragments file to a tile matrix. Here we pick up from a larger, published and annotated dataset — the 10x PBMC 5k scATAC reference — to cover the downstream half of a single-cell ATAC analysis with ov.epi:

load an annotated tile-matrix AnnData (ov.epi.datasets.atac_pbmc5k(annotated=True))
compute an ArchR-style iterative-LSI embedding and cluster it
visualise the published cell-type labels on UMAP
compute gene-activity scores from tile accessibility (ov.epi.tl.add_gene_score_matrix) and use canonical PBMC marker genes to confirm the annotation

Gene-activity scores are the scATAC analogue of gene expression: they aggregate accessibility over each gene body + promoter, giving an interpretable, gene-level view of an otherwise peak-level assay. Everything downloads on demand and runs on CPU.

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import omicverse as ov
ov.epi.pl.plot_set()
print('omicverse', ov.__version__)

└─ 🔬 Starting plot initialization...
  ├─ Apply Scanpy/matplotlib settings
  ├─ Custom font setup
  ├─ Suppress warnings
  ├─ 
___________      .__                      
\_   _____/_____ |__| ____   ____   ____  
 |    __)_\____ \|  |/  _ \ /    \_/ __ \ 
 |        \  |_> >  (  <_> )   |  \  ___/ 
/_______  /   __/|__|\____/|___|  /\___  >
        \/|__|                  \/     \/ 

  ├─ 🔖 Version: 0.0.1rc1   📚 Tutorials: https://epione.readthedocs.io/
└─ ✅ plot_set complete.

omicverse 2.2.1rc1

1 · Load the annotated PBMC 5k dataset#

The annotated build ships a 500 bp tile matrix, per-cell QC, a cell_type column with the published labels, and a precomputed spectral embedding. We start from the tiles and rebuild the embedding ourselves with ov.epi.

adata = ov.epi.datasets.atac_pbmc5k(annotated=True)
print(adata)
adata.obs['cell_type'].value_counts()

AnnData object with n_obs × n_vars = 4437 × 6062095
    obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse', 'doublet_probability', 'doublet_score', 'cell_type'
    var: 'count', 'selected'
    uns: 'cell_type_colors', 'doublet_rate', 'frag_size_distr', 'leiden_colors', 'reference_sequences', 'scrublet_sim_doublet_score', 'spectral_eigenvalue'
    obsm: 'X_spectral', 'X_umap', 'fragment_paired'
    obsp: 'distances'

cell_type
cDC           842
CD14 Mono     658
CD4 Memory    599
CD8 Memory    526
CD4 Naive     444
Memory B      276
CD8 Naive     262
MAIT          246
NK            236
CD16 Mono     168
Naive B       146
pDC            34
Name: count, dtype: int64

2 · Iterative LSI, neighbors and UMAP#

ov.epi.tl.iterative_lsi recomputes the ArchR-style embedding on the selected tile features. We then build the neighbor graph and a UMAP — the same verbs as the RNA pipeline, just on accessibility.

ov.epi.tl.iterative_lsi(adata, n_components=30, iterations=2)
ov.epi.pp.neighbors(adata, use_rep='X_iterative_lsi')
ov.epi.tl.umap(adata)
print('embeddings:', list(adata.obsm.keys()))

└─ [iterative_lsi] Initial feature set: 500,000 / 6,062,095
└─ [iterative_lsi] Iter 1/2 | fit on 4,437 cells x 500,000 features
computing neighbors
finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:30)
running Leiden clustering
finished: found 7 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
└─ [iterative_lsi]   -> 7 clusters; selected 25,000 variable features for next round
└─ [iterative_lsi] Iter 2/2 | fit on 4,437 cells x 25,000 features
└─ [iterative_lsi] Done. Stored embedding (4,437 x 29) in adata.obsm['X_iterative_lsi']
Computing neighbors with n_neighbors=15
Finished computing neighbors
    Added to .uns['neighbors']
    .obsp['distances'], distances for each pair of neighbors
    .obsp['connectivities'], weighted adjacency matrix
Computing UMAP embedding...
Finished computing UMAP
    Added:
        'X_umap', UMAP coordinates (adata.obsm)
        'umap', UMAP parameters (adata.uns)
embeddings: ['X_spectral', 'X_umap', 'fragment_paired', 'X_iterative_lsi']

3 · Cell types on the UMAP#

Colouring the fresh UMAP by the published cell_type labels shows the major PBMC lineages — T cells, B cells, monocytes, NK and dendritic cells — separating cleanly on accessibility alone.

# omicverse embedding plotter, coloured by the published cell types.
ov.pl.embedding(adata, basis='X_umap', color='cell_type', frameon='small',
                title='PBMC 5k scATAC - cell types', show=False)
plt.show()

../_images/23375467fda578b55580aab55df0acbe8ae3e434f08b332c2a6837f1042f248b.png

4 · Gene-activity scores#

ov.epi.tl.add_gene_score_matrix builds a cell × gene activity matrix by aggregating tile accessibility over each gene (distance-weighted, ArchR-style). The result lands in adata.obsm['gene_score'] with gene names in adata.uns['gene_score_gene_names'].

genome = ov.epi.data.hg38
genes = ov.epi.io.get_gene_annotation(genome)
ov.epi.tl.add_gene_score_matrix(adata, genes)

gene_score = np.asarray(adata.obsm['gene_score'])
gene_names = np.asarray(adata.uns['gene_score_gene_names'])
print('gene activity matrix:', gene_score.shape)

└─ [gene_score] genes: 19,923 on 23 chroms (after excluding ['chrY', 'chrM'])
└─ [gene_score] input: adata.X (already a tile matrix)
└─ [gene_score] source=adata.X  cells=4,437  tiles=6,062,095  nnz=81M
└─ [gene_score] tile→gene W: (6062095, 19923)  nnz=5.1M
└─ [gene_score] computing gene_score = X_tile @ W
└─ [gene_score] done. obsm['gene_score'] = (4437, 19923) float32 (uns[gene_score_gene_names]: 19,923 genes)
gene activity matrix: (4437, 19923)

Marker gene activity on the UMAP#

Canonical PBMC markers should light up in the expected lineages: CD3D (T cells), MS4A1 / CD79A (B cells), CD14 / LYZ (monocytes), NKG7 (NK), FCGR3A (CD16 monocytes / NK).

# Wrap the gene-activity scores for the canonical markers into a gene-space
# AnnData so omicverse's plotters can render them directly.
import anndata as ad
markers = ['CD3D', 'CD8A', 'MS4A1', 'CD79A', 'CD14', 'LYZ', 'NKG7', 'FCGR3A']
markers = [g for g in markers if g in set(gene_names)]
name_to_col = {g: i for i, g in enumerate(gene_names)}
ga = ad.AnnData(np.stack([gene_score[:, name_to_col[g]] for g in markers], axis=1).astype('float32'))
ga.var_names = markers
ga.obs_names = list(adata.obs_names)
ga.obs['cell_type'] = adata.obs['cell_type'].values
ga.obsm['X_umap'] = np.asarray(adata.obsm['X_umap'])

# Marker gene-activity on the UMAP (one panel per marker).
ov.pl.embedding(ga, basis='X_umap', color=markers, frameon='small', ncols=4,
                cmap='magma', show=False)
plt.show()

../_images/37cbf765481b67ccb3ca4c472f7b54a1ff013c87228c902a97477afe367d19ba.png

Mean activity per cell type#

An omicverse dotplot of mean marker gene-activity per cell type confirms each marker is highest (and most broadly detected) in its expected lineage.

# Per-cell-type marker gene-activity as an omicverse dotplot.
ov.pl.dotplot(ga, var_names=markers, groupby='cell_type', use_raw=False,
              standard_scale='var', show=False)
plt.show()

../_images/58be6c99cac671c36950fec8dcef0640f02bf37846dbf5fa4197b8933c0e9308.png

Summary#

stage	function
load annotated tiles	`ov.epi.datasets.atac_pbmc5k(annotated=True)`
embed	`ov.epi.tl.iterative_lsi`, `ov.epi.pp.neighbors`, `ov.epi.tl.umap`
gene activity	`ov.epi.tl.add_gene_score_matrix` → `obsm['gene_score']`

Gene-activity scores connect the accessibility landscape back to genes, letting you reuse RNA-style markers to interpret scATAC clusters. The next tutorial scores transcription-factor motif activity per cell with chromVAR.