分析 Visium HD 数据

分析 Visium HD 数据#

Visium HD 是 10x Genomics 推出的高分辨率空间转录组方案，用于在组织切片上直接测量基因表达。与更早期、分辨率较低的 spot 级方案相比，Visium HD 在分析上的实际变化是：我们往往要从更密集的空间单元出发，再在同一块组织的多个视图之间切换，例如规则 bin 视图和分割后的细胞视图。因此，分析重点不再只是表达矩阵本身，还包括坐标、组织图像和分割结果之间如何彼此对应。

这个教程演示了如何在 OmicVerse 中完成一条典型的 Visium HD 工作流程，同时使用 bin-level output 和 cell segmentation output 两种数据视图。整个示例覆盖四项任务：

将 Visium HD 数据读入 AnnData；
在组织图像上可视化基因表达；
在局部区域和全切片范围内识别空间变异基因；
构建低维表示，用于聚类和空间解释。

下面的代码默认你已经把 Visium HD 输出文件下载并整理到 data/visium_hd/ 目录下。

环境设置#

我们先导入 OmicVerse，并启用两个 notebook 中常用的便利功能：

ov.style(font_path='Arial')：为后续图形设置统一的绘图风格；
%load_ext autoreload 和 %autoreload 2：在开发或调试时自动重载修改过的 Python 模块，无需反复重启内核。

import omicverse as ov
ov.style(font_path='Arial')

# 启用自动重载，便于开发和调试
%load_ext autoreload
%autoreload 2

🔬 Starting plot initialization...
Using already downloaded Arial font from: /tmp/omicverse_arial.ttf
Registered as: Arial
🧬 Detecting GPU devices…
✅ NVIDIA CUDA GPUs detected: 1
    • [CUDA 0] NVIDIA A100-SXM4-40GB
      Memory: 39.4 GB | Compute: 8.0

   ____            _     _    __                  
  / __ \____ ___  (_)___| |  / /__  _____________ 
 / / / / __ `__ \/ / ___/ | / / _ \/ ___/ ___/ _ \ 
/ /_/ / / / / / / / /__ | |/ /  __/ /  (__  )  __/ 
\____/_/ /_/ /_/_/\___/ |___/\___/_/  /____/\___/                                              

🔖 Version: 1.7.10rc1   📚 Tutorials: https://omicverse.readthedocs.io/
✅ plot_set complete.

ov.settings.cpu_gpu_mixed_init() 会以 CPU/GPU 混合模式初始化 OmicVerse。对于一部分仍然偏 CPU 的预处理步骤，以及一部分可以从 GPU 加速中受益的数值计算步骤，这种模式通常更稳妥。

ov.settings.cpu_gpu_mixed_init()

CPU-GPU mixed mode activated
Available GPU accelerators: CUDA

加载 Visium HD 数据集#

Visium HD 在非常精细的空间分辨率下提供表达测量结果。实际分析时，同一份数据通常有两种视图都很有价值：

bin-level representation：表达计数存放在规则网格上；
cell-segmentation representation：将 bin 重新分配到分割后的细胞。

本教程会同时演示这两种视图。bin-level 对象更适合检查原始空间信号和组织图像对齐情况，而 segmentation-level 对象则更适合后续以细胞为中心的分析、空间特征发现和聚类。

在正式加载矩阵前，先查看目录结构通常会更稳妥。下面的辅助函数会打印文件树，并跳过 analysis 目录，让输出更聚焦于数据读入真正需要的文件。

这里使用的是 10x Genomics 提供的公开 Visium HD 人前列腺癌 FFPE 示例数据集，计数矩阵和组织图像可从以下页面获取： https://www.10xgenomics.com/datasets/visium-hd-cytassist-gene-expression-libraries-human-prostate-cancer-ffpe

from pathlib import Path

ov.utils.print_tree(
    Path("data/visium_hd/binned_outputs/square_016um"),
    skip_dirs={"analysis"}
)

square_016um/
    spatial/
        aligned_fiducials.jpg
        detected_tissue_image.jpg
        tissue_positions.parquet
        aligned_tissue_image.jpg
        tissue_lowres_image.png
        cytassist_image.tiff
        tissue_hires_image.png
        scalefactors_json.json
    raw_feature_bc_matrix.h5
    filtered_feature_bc_matrix/
        barcodes.tsv.gz
        features.tsv.gz
        matrix.mtx.gz
    filtered_feature_bc_matrix.h5
    raw_feature_bc_matrix/
        barcodes.tsv.gz
        features.tsv.gz
        matrix.mtx.gz

读取 bin-level 输出#

第一个对象来自 square_016um 目录，这里存放的是基于规则网格分箱后的 Visium HD 数据。当你想先检查原始高分辨率空间信号，而不是立刻进入细胞层面解释时，这个视图尤其有用。

ov.io.read_visium_hd() 中有几个参数值得特别关注：

path：所选 Visium HD 输出目录的根路径；
data_type='bin'：告诉 OmicVerse 将输入解释为 bin-level 矩阵，而不是分割后的细胞；
cell_matrix_h5_path 和 count_mtx_dir：分别指向 HDF5 格式和目录格式的过滤后计数矩阵；
tissue_positions_path：提供每个 bin 的空间坐标；
hires_image_path、lowres_image_path 和 scalefactors_path：把分子数据与组织图像关联起来，供后续绘图使用。

adata_hd = ov.io.read_visium_hd(
    path="data/visium_hd/binned_outputs/square_016um",
    data_type="bin",
    cell_matrix_h5_path="filtered_feature_bc_matrix.h5",
    count_mtx_dir='filtered_feature_bc_matrix',
    tissue_positions_path = "spatial/tissue_positions.parquet",
    # if figure and scalefactor stored in  outs/spatial
    hires_image_path="spatial/tissue_hires_image.png",
    lowres_image_path="spatial/tissue_lowres_image.png",
    scalefactors_path="spatial/scalefactors_json.json",
)

[VisiumHD][INFO] read_visium_hd entry (data_type='bin')
[VisiumHD][START] Reading bin-level data from: /scratch/users/steorra/analysis/26_omic_protocol/data/visium_hd/binned_outputs/square_016um
[VisiumHD][INFO] Sample key: square_016um
[VisiumHD][STEP] Loading count matrix (h5='filtered_feature_bc_matrix.h5', mtx='filtered_feature_bc_matrix')
[VisiumHD][STEP] Loading tissue positions: /scratch/users/steorra/analysis/26_omic_protocol/data/visium_hd/binned_outputs/square_016um/spatial/tissue_positions.parquet
[VisiumHD][STEP] Loading images and scale factors
[VisiumHD][OK] Done (n_obs=139446, n_vars=18132)

返回结果是一个标准的 AnnData 对象。此时快速检查一下观测数、基因数，以及 obs、var、obsm 和 uns 的内容，会很有帮助。

adata_hd

AnnData object with n_obs × n_vars = 139446 × 18132
    obs: 'in_tissue', 'array_row', 'array_col', 'pxl_row_in_fullres', 'pxl_col_in_fullres'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatial'
    obsm: 'spatial'

第一张图把表达信号叠加在组织图像上。

几个关键绘图参数：

color=[None, "EPCAM"]：同时显示纯组织图像和 EPCAM 的表达；
size：每个空间单元的点大小；
linewidth=0：移除点边框，让高密度绘图更干净；
cmap='Reds'：对非负表达值使用连续红色配色。

ov.pl.spatial(
    adata_hd, color=[None,"EPCAM"],
    size=3, linewidth=0,
    legend_fontsize=13, frameon=None,
    cmap='Reds'
)

../_images/8e660cb3aa681c1280c0afa5dde56cfbd66268a75ff2a1dcd25e44aec358f177.png

读取细胞分割输出#

接下来切换到 segmentation-level 输出，此时每个观测值对应的是分割后的细胞，而不再是规则 bin。这个表示通常更适合细胞层面的可视化、空间特征发现和聚类分析。

相较于 bin-level 导入，主要差别在于这里额外引入了分割文件：

data_type='cellseg'：启用细胞分割解析模式；
cell_segmentations_path：指向包含细胞 polygon 边界的 geojson 文件。

这两个对象实际上提供的是同一块组织的互补视图：bin-level 更适合检查原始空间结构，segmentation-level 更适合后续细胞层面的分析。

adata_seg = ov.io.read_visium_hd(
    path="data/visium_hd/segmented_outputs",
    data_type="cellseg",
    cell_matrix_h5_path="filtered_feature_cell_matrix.h5",
    cell_segmentations_path="graphclust_annotated_cell_segmentations.geojson",
    # if figure and scalefactor stored in  outs/spatial
    hires_image_path="spatial/tissue_hires_image.png",
    lowres_image_path="spatial/tissue_lowres_image.png",
    scalefactors_path="spatial/scalefactors_json.json",
)

[VisiumHD][INFO] read_visium_hd entry (data_type='cellseg')
[VisiumHD][START] Reading cell-segmentation data from: /scratch/users/steorra/analysis/26_omic_protocol/data/visium_hd/segmented_outputs
[VisiumHD][INFO] Sample key: segmented_outputs
[VisiumHD][STEP] Loading segmentation geometry: /scratch/users/steorra/analysis/26_omic_protocol/data/visium_hd/segmented_outputs/graphclust_annotated_cell_segmentations.geojson
[VisiumHD][STEP] Loading count matrix: /scratch/users/steorra/analysis/26_omic_protocol/data/visium_hd/segmented_outputs/filtered_feature_cell_matrix.h5
[VisiumHD][STEP] Loading images and scale factors
[VisiumHD][OK] Done (n_obs=227257, n_vars=18132)

这张图在 segmentation-level 对象上展示同一个 marker EPCAM。由于每个观测现在对应一个分割细胞，因此空间模式会更容易从“细胞中心”的角度来解释。

ov.pl.spatial(
    adata_seg, color=[None,"EPCAM"],
    size=6, linewidth=0,
    legend_fontsize=13, frameon=None,
    cmap='Reds'
)

../_images/2621186facc3d7587346b31f24b4a3a86d91cad3aeb302cfee491abf9c43e969.png

聚焦局部感兴趣区域#

全切片视图适合看整体背景，但很多时候，局部窗口更方便做细节检查。下面的辅助函数会根据空间坐标对子集进行筛选。

参数说明：

xlim 和 ylim：定义保留区域的矩形范围；
adata.obsm["spatial"]：存储用于空间筛选的 x/y 坐标；
obs_names_make_unique() 和 var_names_make_unique()：避免子集化后出现重复标识符。

对于高密度的 Visium HD 数据来说，这种局部放大并不只是为了好看，而是判断图像结构、分割几何和基因表达是否一致的最快方式之一。

def subset_data(adata, xlim=(24500, 26000), ylim=(5000, 6000)):
    x, y = adata.obsm["spatial"].T
    bdata = adata[(xlim[0] <= x) & (x <= xlim[1]) & (ylim[0] <= y) & (y <= ylim[1])].copy()
    bdata.obs_names_make_unique()
    bdata.var_names_make_unique()
    return bdata
bdata = subset_data(adata_seg)

接下来的几个面板会比较同一块局部分割区域的不同渲染方式。虽然这一步主要属于可视化调整，但它很适合帮助你判断最终图里应该保留多少分割边界信息。

这里的分割 polygon 没有显示可见边框（edges_width=0）。这种设置更强调表达信号本身，在视野很拥挤时通常是一个不错的默认方案。

ov.pl.spatialseg(
    bdata,
    color="EPCAM",
    edges_color='white',
    edges_width=0,
    figsize=(6, 4),
    alpha_img=1,
    alpha=1,
    legend_fontsize=13,
    #cmap='Reds',
    #img_key=False,
    #alpha=1,
)

../_images/65c0c3cd7b71e10a1fb6f0ebc508b03801d964dc7c6c2f5d9700cd071c9b4c14.png

<Axes: title={'center': 'EPCAM'}, xlabel='spatial 1', ylabel='spatial 2'>

给每个 polygon 增加一条很细的边界（edges_width=0.1）有助于区分相邻细胞。当相邻 segment 的表达值比较接近时，这通常会让图更容易读。

ov.pl.spatialseg(
    bdata,
    color="EPCAM",
    edges_color='white',
    edges_width=0.1,
    figsize=(6, 4),
    alpha_img=1,
    alpha=1,
    legend_fontsize=13,
    #cmap='Reds',
    #img_key=False,
)

../_images/37772863cf286343f87db729dc448c213fa2ea26c17deb6e00b2d43985d946a8.png

<Axes: title={'center': 'EPCAM'}, xlabel='spatial 1', ylabel='spatial 2'>

seg_contourpx 用像素单位控制分割轮廓的粗细。适当增大这个值能让边界更明显，尤其适合汇报图或背景组织纹理较复杂的情况。

ov.pl.spatialseg(
    bdata,
    color="EPCAM",
    edges_color='white',
    edges_width=0,
    figsize=(6, 4),
    #library_id='1',
    alpha_img=1,
    seg_contourpx=1.5,
    alpha=1,
    legend_fontsize=13,
)

../_images/9b91bd90688b77b0b2f4263a1dfa58991d84040de73178d0d93b33f755d01667.png

<Axes: title={'center': 'EPCAM'}, xlabel='spatial 1', ylabel='spatial 2'>

识别局部区域的空间变异基因#

接下来，我们在子集区域上运行 ov.space.svg()。

几个重要参数：

mode='moranI'：使用 Moran’s I 按空间自相关对基因排序；
n_svgs=3000：保留前 3000 个候选空间变异基因；
n_perms=100：通过置换估计统计显著性；
n_jobs=8：在 CPU worker 间并行计算。

bdata=ov.space.svg(
    bdata,mode='moranI',
    n_svgs=3000,
    n_perms=100,n_jobs=8,
)
bdata

Spatial neighbors: 961 cells, 5582 connections (avg 5.8 neighbors/cell).
Stored in adata.obsp['spatial_connectivities'] and adata.obsp['spatial_distances'].
Stored 18132 gene results in adata.uns['moranI'].

AnnData object with n_obs × n_vars = 961 × 18132
    obs: 'geometry'
    var: 'gene_ids', 'feature_types', 'genome', 'moranI', 'moranI_pval', 'pval_adj', 'space_variable_features', 'highly_variable'
    uns: 'spatial', 'omicverse_io', 'spatial_neighbors', 'REFERENCE_MANU', 'moranI'
    obsm: 'spatial'
    obsp: 'spatial_connectivities', 'spatial_distances'

下表可以快速查看排名靠前的基因。moranI 表示空间自相关强度，moranI_pval 记录显著性估计，space_variable_features 标记该基因是否被保留为空间特征。

bdata.var[['moranI','moranI_pval','space_variable_features']].sort_values('moranI',ascending=False)

	moranI	moranI_pval	space_variable_features
CHGA	0.422765	9.294550e-111	True
NPY	0.412106	2.276425e-105	True
PLA2G2A	0.387953	1.158365e-93	True
COL1A1	0.343667	5.067257e-74	True
KLK3	0.341580	3.733330e-73	True
...	...	...	...
SNU13	-0.033147	9.546428e-01	False
TRIOBP	-0.034662	9.617609e-01	False
SETD5	-0.036784	9.701712e-01	False
HNRNPL	-0.041451	9.833822e-01	False
TXNIP	-0.043999	9.881957e-01	False

18132 rows × 3 columns

这里以 NPY 为例展示一个具有局部空间结构的基因。使用 segmentation 渲染后，更容易判断它的信号究竟是在相邻细胞群中连续分布，还是只表现为零散点位。

ov.pl.spatialseg(
    bdata,
    color="NPY",
    edges_color='white',
    edges_width=0.1,
    figsize=(6, 4),
    alpha_img=0.8,
    alpha=1,
    legend_fontsize=13,
)

../_images/b21328ba12e27588575b7657f0ed83739a3d524bc3ada4a9877acd8fe7099575.png

<Axes: title={'center': 'NPY'}, xlabel='spatial 1', ylabel='spatial 2'>

在归一化之前查看 X 的最大值，是一个简单但有用的 sanity check。它能帮助你快速感知当前子集对象中原始计数的大致量级。

bdata.X.max()

21.0

对局部子集做归一化和对数变换#

这里的标准预处理步骤是：

ov.pp.normalize_total(bdata)：把每个观测归一化到可比较的文库大小；
ov.pp.log1p(bdata)：做 log(1+x) 变换，压缩动态范围。

这样处理后，空间表达图通常更容易在不同细胞之间进行比较。

ov.pp.normalize_total(bdata)
ov.pp.log1p(bdata)

🔍 Count Normalization:
   Target sum: median
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 961 cells × 18,132 genes
   ✓ Runtime: 0.00s

完成归一化和 log 变换后，数值范围应明显小于原始矩阵，这说明变换已经生效。

bdata.X.max()

4.4248466

在预处理之后重新绘制 NPY，有助于判断归一化如何改变信号的视觉对比度。

ov.pl.spatialseg(
    bdata,
    color="NPY",
    edges_color='white',
    edges_width=0.1,
    figsize=(6, 4),
    alpha_img=0.8,
    alpha=1,
    legend_fontsize=13,
)

../_images/8ce3b1016c4fd06ba3e2252478af67073dd254c94958e9f775a812f1224e5f13.png

<Axes: title={'center': 'NPY'}, xlabel='spatial 1', ylabel='spatial 2'>

同一个基因也可以用 ov.pl.spatial() 展示，此时观测值会被渲染为点，而不是填充的分割 polygon。

这里有两个参数尤其有用：

size=1.5：控制每个观测的点大小；
vmax='p99.2'：把颜色上限裁剪到 99.2 百分位，减少极端值对视觉动态范围的影响。

fig, ax = ov.plt.subplots(figsize=(6, 4))
ov.pl.spatial(
    bdata, color="NPY",
    size=1.5, linewidth=0,
    legend_fontsize=13, frameon=True,
    cmap='Reds',vmax='p99.2',
    ax=ax,
)

../_images/d53262f22aa17138def0ec7a3d0e65a991775861a7b70f6d64fd589bca4a668d.png

在完整 segmentation-level 数据集上计算空间变异基因#

在局部区域探索之后，我们对完整 segmentation 对象重复进行 SVG 检测。这里 notebook 使用 mode='pearsonr'，它与 Moran’s I 不同，是另一种识别空间结构的标准。

adata=ov.space.svg(
    adata_seg,mode='pearsonr',
    n_svgs=3000,
)
adata

🔍 [2026-03-08 04:48:27] Running preprocessing in 'cpu-gpu-mixed' mode...
Begin robust gene identification
    After filtration, 18109/18132 genes are kept.
    Among 18109 genes, 13307 genes are robust.
✅ Robust gene identification completed successfully.
Begin size normalization: shiftlog and HVGs selection pearson

🔍 Count Normalization:
   Target sum: 500000.0
   Exclude highly expressed: True
   Max fraction threshold: 0.2
   ⚠️ Excluding 1,410 highly-expressed genes from normalization computation
   ⚠️ Warning: 620 cells have zero counts

✅ Count Normalization Completed Successfully!
   ✓ Processed: 227,257 cells × 13,307 genes
   ✓ Runtime: 7.77s

🔍 Highly Variable Genes Selection (Experimental):
   Method: pearson_residuals
   Target genes: 3,000
   Theta (overdispersion): 100

✅ Experimental HVG Selection Completed Successfully!
   ✓ Selected: 3,000 highly variable genes out of 13,307 total (22.5%)
   ✓ Results added to AnnData object:
     • 'highly_variable': Boolean vector (adata.var)
     • 'highly_variable_rank': Float vector (adata.var)
     • 'highly_variable_nbatches': Int vector (adata.var)
     • 'highly_variable_intersection': Boolean vector (adata.var)
     • 'means': Float vector (adata.var)
     • 'variances': Float vector (adata.var)
     • 'residual_variances': Float vector (adata.var)
    Time to analyze data in cpu: 17.67 seconds.
✅ Preprocessing completed successfully.
    Added:
        'highly_variable_features', boolean vector (adata.var)
        'means', float vector (adata.var)
        'variances', float vector (adata.var)
        'residual_variances', float vector (adata.var)
        'counts', raw counts layer (adata.layers)
    End of size normalization: shiftlog and HVGs selection pearson

╭─ SUMMARY: preprocess ──────────────────────────────────────────────╮
│  Duration: 18.605s                                                 │
│  Shape:    227,257 x 18,132 -> 227,257 x 13,307                    │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● VAR    │ ✚ highly_variable (bool)                               │
│           │ ✚ highly_variable_features (bool)                      │
│           │ ✚ highly_variable_rank (float)                         │
│           │ ✚ means (float)                                        │
│           │ ✚ n_cells (int)                                        │
│           │ ✚ percent_cells (float)                                │
│           │ ✚ residual_variances (float)                           │
│           │ ✚ robust (bool)                                        │
│           │ ✚ variances (float)                                    │
│                                                                    │
│  ● UNS    │ ✚ REFERENCE_MANU                                       │
│           │ ✚ history_log                                          │
│           │ ✚ hvg                                                  │
│           │ ✚ log1p                                                │
│           │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ counts (sparse matrix, 227257x13307)                 │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯

AnnData object with n_obs × n_vars = 227257 × 18132
    obs: 'geometry'
    var: 'gene_ids', 'feature_types', 'genome', 'space_variable_features', 'highly_variable'
    uns: 'spatial', 'omicverse_io', 'REFERENCE_MANU'
    obsm: 'spatial'

同样地，在对完整数据集做预处理之前，先检查一下当前数值范围会更稳妥。

adata.X.max()

332.0

完整数据集会以与局部子集相同的方式完成归一化和 log 变换，这样后续降维和聚类就在可比较的尺度上进行。

ov.pp.normalize_total(adata)
ov.pp.log1p(adata)

🔍 Count Normalization:
   Target sum: median
   Exclude highly expressed: False

✅ Count Normalization Completed Successfully!
   ✓ Processed: 227,257 cells × 18,132 genes
   ✓ Runtime: 0.10s

将矩阵限制在空间变异基因上#

通过 adata.raw = adata 可以保留过滤前的完整对象，方便后续参考。下一步再把矩阵限制到 adata.var.space_variable_features 标记出的基因上，从而降低噪声，并让下游分析更聚焦于空间信息更强的特征。

%%time
adata.raw = adata
adata = adata[:, adata.var.space_variable_features]
adata

CPU times: user 13 ms, sys: 70.1 ms, total: 83.1 ms
Wall time: 82.1 ms

View of AnnData object with n_obs × n_vars = 227257 × 3000
    obs: 'geometry'
    var: 'gene_ids', 'feature_types', 'genome', 'space_variable_features', 'highly_variable'
    uns: 'spatial', 'omicverse_io', 'REFERENCE_MANU', 'log1p'
    obsm: 'spatial'

构建邻接图和 UMAP 嵌入#

这一单元遵循常见的流程：缩放、PCA、构图，再计算 UMAP 嵌入。

参数说明：

PCA 中的 n_pcs=50：保留前 50 个主成分；
n_neighbors=15：定义构图时局部邻域的大小；
use_rep='scaled|original|X_pca'：指定 ov.pp.neighbors() 在这一工作流约定下应使用哪种处理后的表示。实际执行时，这里依赖的是上一步刚生成的 PCA 表示。

%%time
ov.pp.scale(adata)
ov.pp.pca(adata,layer='scaled',n_pcs=50)
ov.pp.neighbors(adata, n_neighbors=15, n_pcs=50,
               use_rep='scaled|original|X_pca')
ov.pp.umap(adata)

    Converting scaled data to csr_matrix format...

╭─ SUMMARY: scale ───────────────────────────────────────────────────╮
│  Duration: 29.9081s                                                │
│  Shape:    227,257 x 3,000 (Unchanged)                             │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ status                                               │
│           │ ✚ status_args                                          │
│                                                                    │
│  ● LAYERS │ ✚ scaled (sparse matrix, 227257x3000)                  │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🚀 Using GPU to calculate PCA...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] NVIDIA A100-SXM4-40GB
    ------------------------------ 4/40960 MiB (0.0%)
computing PCA🔍
    with n_comps=50

Using CUDA device: NVIDIA A100-SXM4-40GB
✅ Using built-in torch_pca for GPU-accelerated PCA
   🚀 Using torch_pca PCA for CUDA GPU acceleration
   🚀 torch_pca PCA backend: CUDA GPU acceleration (supports sparse matrices)
   📊 PCA input data type: SparseCSRView, shape: (227257, 3000), dtype: float64
   📊 Sparse matrix density: 100.00%
    finished✅ (2227.14s)

╭─ SUMMARY: pca ─────────────────────────────────────────────────────╮
│  Duration: 2227.5208s                                              │
│  Shape:    227,257 x 3,000 (Unchanged)                             │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ pca                                                  │
│           │ └─ params: {'zero_center': True, 'use_highly_variable': Tr...│
│           │ ✚ scaled|original|cum_sum_eigenvalues                  │
│           │ ✚ scaled|original|pca_var_ratios                       │
│                                                                    │
│  ● OBSM   │ ✚ X_pca (array, 227257x50)                             │
│           │ ✚ scaled|original|X_pca (array, 227257x50)             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🚀 Using torch CPU/GPU mixed mode to calculate neighbors...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] NVIDIA A100-SXM4-40GB
    ------------------------------ 425/40960 MiB (1.0%)

🔍 K-Nearest Neighbors Graph Construction:
   Mode: cpu-gpu-mixed
   Neighbors: 15
   Method: torch
   Metric: euclidean
   Representation: scaled|original|X_pca
   PCs used: 50
   🔍 Computing neighbor distances...
   🔍 Computing connectivity matrix...
   💡 Using UMAP-style connectivity
   ✓ Graph is fully connected

✅ KNN Graph Construction Completed Successfully!
   ✓ Processed: 227,257 cells with 15 neighbors each
   ✓ Results added to AnnData object:
     • 'neighbors': Neighbors metadata (adata.uns)
     • 'distances': Distance matrix (adata.obsp)
     • 'connectivities': Connectivity matrix (adata.obsp)

╭─ SUMMARY: neighbors ───────────────────────────────────────────────╮
│  Duration: 52.9185s                                                │
│  Shape:    227,257 x 3,000 (Unchanged)                             │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ neighbors                                            │
│           │ └─ params: {'n_neighbors': 15, 'method': 'torch', 'random_...│
│                                                                    │
│  ● OBSP   │ ✚ connectivities (sparse matrix, 227257x227257)        │
│           │ ✚ distances (sparse matrix, 227257x227257)             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
🔍 [2026-03-08 05:27:26] Running UMAP in 'cpu-gpu-mixed' mode...
🚀 Using torch GPU to calculate UMAP...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] NVIDIA A100-SXM4-40GB
    ------------------------------ 425/40960 MiB (1.0%)

🔍 UMAP Dimensionality Reduction:
   Mode: cpu-gpu-mixed
   Method: pumap
   Components: 2
   Min distance: 0.5
{'n_neighbors': 15, 'method': 'torch', 'random_state': 0, 'metric': 'euclidean', 'use_rep': 'scaled|original|X_pca', 'n_pcs': 50}
   ⚠️ Connectivities matrix was not computed with UMAP method
   🔍 Computing UMAP parameters...
   🔍 Computing UMAP embedding (Parametric PyTorch method)...
   Using device: cuda
   Dataset: 227257 samples × 50 features
   Batch size: 512
   Learning rate: 0.001
   Training parametric UMAP model...
============================================================
🚀 Parametric UMAP Training
============================================================
📊 Device: cuda
📈 Data shape: torch.Size([227257, 50])
🔗 Building UMAP graph...
   🚀 Using PyTorch Geometric KNN (faster)

🏋️  Starting Training...
────────────────────────────────────────────────────────────
✓ New best loss: 0.2845
✓ New best loss: 0.2564
✓ New best loss: 0.2523
✓ New best loss: 0.2500
✓ New best loss: 0.2483
✓ New best loss: 0.2468
✓ New best loss: 0.2455
✓ New best loss: 0.2445
✓ New best loss: 0.2434
✓ New best loss: 0.2425
✓ New best loss: 0.2416
✓ New best loss: 0.2409
✓ New best loss: 0.2402
✓ New best loss: 0.2396
✓ New best loss: 0.2389
✓ New best loss: 0.2386
✓ New best loss: 0.2380
✓ New best loss: 0.2375
✓ New best loss: 0.2369
✓ New best loss: 0.2366

────────────────────────────────────────────────────────────
✅ Training Completed!
📉 Final best loss: 0.2366
============================================================

   💡 Using Parametric UMAP (PyTorch) on cuda

✅ UMAP Dimensionality Reduction Completed Successfully!
   ✓ Embedding shape: 227,257 cells × 2 dimensions
   ✓ Results added to AnnData object:
     • 'X_umap': UMAP coordinates (adata.obsm)
     • 'umap': UMAP parameters (adata.uns)
✅ UMAP completed successfully.

╭─ SUMMARY: umap ────────────────────────────────────────────────────╮
│  Duration: 142.413s                                                │
│  Shape:    227,257 x 3,000 (Unchanged)                             │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● UNS    │ ✚ umap                                                 │
│           │ └─ params: {'a': 0.583030019901822, 'b': 1.3341669931033755}│
│                                                                    │
│  ● OBSM   │ ✚ X_umap (array, 227257x2)                             │
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
⚙️ Using torch CPU/GPU mixed mode to calculate Leiden...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] NVIDIA A100-SXM4-40GB
    ------------------------------ 575/40960 MiB (1.4%)
Using batch size `n_batches` calculated from sqrt(n_obs): 476
Running GPU Leiden (batched)
Device: cpu
done: 54 clusters (0:04:36)

╭─ SUMMARY: leiden ──────────────────────────────────────────────────╮
│  Duration: 276.2389s                                               │
│  Shape:    227,257 x 3,000 (Unchanged)                             │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
│  ● OBS    │ ✚ leiden (category)                                    │
│                                                                    │
│  ● UNS    │ ✚ leiden                                               │
│           │ └─ params: {'resolution': 1.0, 'random_state': 0, 'local_i...│
│                                                                    │
╰────────────────────────────────────────────────────────────────────╯
CPU times: user 1h 49min 38s, sys: 27min 27s, total: 2h 17min 5s
Wall time: 45min 29s

Leiden 聚类会把图结构划分成多个社区。resolution 参数控制聚类粒度：值越低，簇越少且更粗；值越高，划分越细。

ov.pp.leiden(adata,resolution=0.3)

⚙️ Using torch CPU/GPU mixed mode to calculate Leiden...
NVIDIA CUDA GPUs detected:
📊 [CUDA 0] NVIDIA A100-SXM4-40GB
    ------------------------------ 575/40960 MiB (1.4%)
Using batch size `n_batches` calculated from sqrt(n_obs): 476
Running GPU Leiden (batched)
Device: cpu
done: 13 clusters (0:05:32)

╭─ SUMMARY: leiden ──────────────────────────────────────────────────╮
│  Duration: 332.9071s                                               │
│  Shape:    227,257 x 3,000 (Unchanged)                             │
│                                                                    │
│  CHANGES DETECTED                                                  │
│  ────────────────                                                  │
╰────────────────────────────────────────────────────────────────────╯

直接在组织空间中绘制 leiden，是判断转录组簇是否与连贯的解剖或组织学区域对应的快捷方式。

ov.pl.spatial(
    adata, color=["leiden"],
    size=2, linewidth=0,
    legend_fontsize=13, frameon=None,
    cmap='Reds'
)

../_images/dab581a3b0fd5931451a2d88ee8738f8af72c8d63fe4a4d82e9486e314aebc4b.png

在同一局部区域中放大查看聚类结果#

最后，我们把已经聚类的数据对象裁剪到前面相同的空间窗口中。这样更方便把聚类结果与原始 marker 表达和分割边界做直接比较。

def subset_data(adata, xlim=(24500, 26000), ylim=(5000, 6000)):
    x, y = adata.obsm["spatial"].T
    bdata = adata[(xlim[0] <= x) & (x <= xlim[1]) & (ylim[0] <= y) & (y <= ylim[1])].copy()
    bdata.obs_names_make_unique()
    bdata.var_names_make_unique()
    return bdata
cdata = subset_data(adata)

这个视图把聚类标签以部分透明的方式叠加在 segmentation polygon 上，适合检查簇与簇之间的过渡是否贴合可见的组织结构。

ov.pl.spatialseg(
    cdata,
    color="leiden",
    edges_color='white',
    edges_width=0.1,
    figsize=(6, 4),
    alpha_img=0.8,
    alpha=0.8,
    legend_fontsize=13,
    palette=ov.pl.sc_color
)

../_images/cd39567fad57b3718e50a6da5161289b068985f0df9a14f186c6ffc9c37554c4.png

<Axes: title={'center': 'leiden'}, xlabel='spatial 1', ylabel='spatial 2'>

增大 seg_contourpx 会让 polygon 边界更锐利，这有助于观察相邻簇之间更细的分界。

ov.pl.spatialseg(
    cdata,
    color="leiden",
    edges_color='white',
    edges_width=0.1,
    figsize=(6, 4),
    alpha_img=0.8,
    alpha=1,
    legend_fontsize=13,
    palette=ov.pl.sc_color,
    seg_contourpx=1,
)

../_images/42cf422c627c1a4eaa321484a304d4ed03a9bed515dc93e1d5656dbe9750d211.png

<Axes: title={'center': 'leiden'}, xlabel='spatial 1', ylabel='spatial 2'>

总结#

在这个 notebook 中，我们：

同时导入了 bin-level 和 segmentation-level 的 Visium HD 数据；
比较了不同的空间可视化方式；
在局部和全局两个尺度上识别了空间变异基因；
对数据完成归一化，并把分析限制到空间特征上；
构建了低维表示，并对分割细胞进行聚类。

综合来看，这套流程为 OmicVerse 中的 Visium HD 分析提供了一个很实用的起点：先做图像感知的质量检查，再做空间特征筛选，最后建立细胞层面的低维表示以支撑后续解释。