scPlantLLM: Foundation Model Tutorial#
scPlantLLM: a plant-specific single-cell model that supports polyploidy and plant gene naming conventions
| Property | Value |
|---|---|
| Tasks | embed, integrate |
| Species | plant |
| Gene ID | symbol |
| GPU required | Yes |
| Minimum VRAM | 16 GB |
| Embedding dimension | 512 |
| Code repository | |
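Important: scPlantLLM is designed specifically for plant species (Arabidopsis, rice, maize, and others) and is not compatible with human or mouse data.
This tutorial demonstrates how to use scPlantLLM through the unified ov.fm API.
Citation: Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. Nature Communications, 15(1), 5983.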
import omicverse as ov
import scanpy as sc
import os
import warnings
warnings.filterwarnings('ignore')
ov.plot_set()
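The properties table lists a GPU with at least 16 GB of memory as a requirement. The sketch below is one way to check this before going further. It assumes PyTorch is the backend, which this tutorial does not state, and `gpu_memory_gb` and `meets_requirement` are hypothetical helper names, not part of ov.fm. Drivers usually report slightly less than the nominal card size, so the comparison allows some slack.

```python
def gpu_memory_gb(device_index=0):
    """Return total memory of a CUDA device in GB (1e9 bytes), or None.

    Assumption: PyTorch is installed and is the backend in use.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / 1e9


def meets_requirement(available_gb, required_gb=16.0, slack_gb=1.0):
    """True when a GPU was found and its memory is within slack of the minimum."""
    return available_gb is not None and available_gb >= required_gb - slack_gb


available = gpu_memory_gb()
if meets_requirement(available):
    print(f"GPU memory: {available:.1f} GB, enough for the 16 GB minimum")
else:
    print("No GPU with enough memory detected")
```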
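Tips for Plant Single-Cell Analysis#
When using scPlantLLM on plant data:
- Polyploidy: scPlantLLM natively supports polyploid genomes (common in crops)
- Gene naming: use plant gene naming conventions (for example, AT1G01010 in Arabidopsis)
- Tissue types: roots, leaves, flowers, seeds, and meristems are supported
- Developmental stages: captures plant-specific developmental transitions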
# Arabidopsis root data example
result = ov.fm.run(
    task='embed', model_name='scplantllm',
    adata_path='arabidopsis_root.h5ad',
    output_path='arabidopsis_scplantllm.h5ad',
)
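The gene naming tip above can be turned into a quick check on a dataset's gene names. The following is a minimal sketch, not part of ov.fm: `fraction_agi_ids` is a hypothetical helper that measures how many names match the Arabidopsis AGI locus pattern (such as AT1G01010). Other plant species use different identifier schemes, so the pattern applies to Arabidopsis only.

```python
import re

# Arabidopsis AGI locus identifiers look like AT1G01010:
# "AT", a chromosome code (1-5, C for chloroplast, M for mitochondrion),
# "G", then five digits.
AGI_PATTERN = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)


def fraction_agi_ids(gene_names):
    """Return the fraction of names that match the AGI locus pattern."""
    names = list(gene_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if AGI_PATTERN.match(name))
    return hits / len(names)


plant_like = ["AT1G01010", "AT3G18780", "ATCG00490", "AT5G10140"]
human_like = ["CD3E", "MS4A1", "NKG7", "LYZ"]
print(fraction_agi_ids(plant_like))  # 1.0
print(fraction_agi_ids(human_like))  # 0.0
```

With an AnnData object, the same helper could be called as `fraction_agi_ids(adata.var_names)`. A value near zero on data believed to be Arabidopsis would suggest the identifiers need mapping before inference.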
Step 1: Inspect the Model Specification#
Use ov.fm.describe_model() to retrieve the full specification of scPlantLLM.
info = ov.fm.describe_model("scplantllm")
print("=== Model Info ===")
print(f"Name: {info['model']['name']}")
print(f"Version: {info['model']['version']}")
print(f"Tasks: {info['model']['tasks']}")
print(f"Species: {info['model']['species']}")
print(f"Embedding dim: {info['model']['embedding_dim']}")
print(f"Differentiator: {info['model']['differentiator']}")
print("\n=== Input Contract ===")
print(f"Gene ID scheme: {info['input_contract']['gene_id_scheme']}")
print(f"Preprocessing: {info['input_contract']['preprocessing']}")
print("\n=== Output Contract ===")
print(f"Embedding key: {info['output_contract']['embedding_key']}")
print(f"Embedding dim: {info['output_contract']['embedding_dim']}")
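The specification can also be used programmatically, for example to confirm that a task is supported before building a pipeline around it. The sketch below works on a mock dictionary shaped like the keys printed above. `model_supports` is a hypothetical helper, and the assumption that `tasks` is a list of strings (or a comma-separated string) is not confirmed by this tutorial.

```python
def model_supports(info, task):
    """Check whether a describe_model()-style dict lists the given task."""
    tasks = info.get("model", {}).get("tasks", [])
    if isinstance(tasks, str):  # tolerate a comma-separated string
        tasks = [t.strip() for t in tasks.split(",")]
    return task in tasks


# Mock data using the values from the properties table; the real return
# value of ov.fm.describe_model() may be structured differently.
mock_info = {
    "model": {
        "name": "scplantllm",
        "tasks": ["embed", "integrate"],
        "species": ["plant"],
    }
}
print(model_supports(mock_info, "embed"))     # True
print(model_supports(mock_info, "annotate"))  # False
```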
Step 2: Prepare the Data#
Load a dataset and save it for use by the ov.fm workflow. Most foundation models require raw counts (non-negative values).
# scPlantLLM requires plant scRNA-seq data.
# Replace with your own plant dataset:
# adata = sc.read_h5ad('arabidopsis_root.h5ad')
#
# Supported species: Arabidopsis thaliana, Oryza sativa (rice),
# Zea mays (maize), and other plant species.
# For demonstration, we show the API pattern with PBMC RNA data.
# The validation step will correctly flag the species mismatch.
adata = sc.datasets.pbmc3k()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata.write_h5ad('pbmc3k_scplantllm.h5ad')
print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')
print('Note: This is human data — scPlantLLM will flag incompatibility.')
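Because most foundation models expect raw counts, a quick sanity check before writing the file can catch data that was already normalized or log-transformed. The following is a heuristic sketch only: `looks_like_raw_counts` is a hypothetical helper, and it simply tests that a sample of values are non-negative whole numbers.

```python
def looks_like_raw_counts(values, sample_size=10_000):
    """Heuristic: raw counts are non-negative whole numbers.

    Inspects at most sample_size values and returns False on the first
    negative, fractional, NaN, or infinite value.
    """
    checked = 0
    for value in values:
        value = float(value)
        if value < 0 or not value.is_integer():
            return False
        checked += 1
        if checked >= sample_size:
            break
    return checked > 0


raw = [0, 3, 1, 0, 12]
log_normalized = [0.0, 1.386, 0.693, 2.565]
print(looks_like_raw_counts(raw))             # True
print(looks_like_raw_counts(log_normalized))  # False
```

For an AnnData object with a SciPy sparse matrix, the stored values in `adata.X.data` could be passed in; for a dense array, `adata.X.ravel()` would serve the same purpose.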
Step 3: Profile the Data and Validate Compatibility#
Before running inference, check whether your data is compatible with scPlantLLM.
profile = ov.fm.profile_data("pbmc3k_scplantllm.h5ad")
print("=== Data Profile ===")
print(f"Species: {profile['species']}")
print(f"Gene scheme: {profile['gene_scheme']}")
print(f"Modality: {profile['modality']}")
print(f"Cells: {profile['n_cells']:,}")
print(f"Genes: {profile['n_genes']:,}")
# Validate compatibility
validation = ov.fm.preprocess_validate("pbmc3k_scplantllm.h5ad", "scplantllm", "embed")
print(f"\n=== Validation: {validation['status']} ===")
for d in validation.get("diagnostics", []):
    print(f" [{d['severity']}] {d['message']}")
if validation.get("auto_fixes"):
    print("\nSuggested fixes:")
    for fix in validation["auto_fixes"]:
        print(f" - {fix}")
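In a script, it may be useful to stop before inference when validation reports a serious problem, such as the species mismatch expected for the PBMC demo data. The sketch below runs on a mock result. `blocking_diagnostics` is a hypothetical helper, and the severity label `"error"`, the status value, and the messages are assumptions, since this tutorial does not list the values ov.fm actually returns.

```python
def blocking_diagnostics(validation, blocking=("error",)):
    """Return messages of diagnostics whose severity is treated as blocking."""
    return [
        d.get("message", "")
        for d in validation.get("diagnostics", [])
        if str(d.get("severity", "")).lower() in blocking
    ]


# Mock validation result; status, severities, and messages are illustrative.
mock_validation = {
    "status": "incompatible",
    "diagnostics": [
        {"severity": "error", "message": "Species mismatch: human data, plant model"},
        {"severity": "warning", "message": "Some genes are not in the model vocabulary"},
    ],
}

problems = blocking_diagnostics(mock_validation)
if problems:
    print("Resolve these issues before running inference:")
    for message in problems:
        print(f" - {message}")
```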
Step 4: Run scPlantLLM Inference#
Run scPlantLLM through ov.fm.run(). This function handles preprocessing, model loading, inference, and writing the output.
result = ov.fm.run(
    task="embed",
    model_name="scplantllm",
    adata_path="pbmc3k_scplantllm.h5ad",
    output_path="pbmc3k_scplantllm_out.h5ad",
    device="auto",
)
if "error" in result:
    print(f"Error: {result['error']}")
    if "suggestion" in result:
        print(f"Suggestion: {result['suggestion']}")
else:
    print(f"Status: {result['status']}")
    print(f"Output keys: {result.get('output_keys', [])}")
    print(f"Cells processed: {result.get('n_cells', 0)}")
Step 5: Visualize and Interpret the Results#
Load the output, compute a UMAP from the scPlantLLM embedding, and assess quality.
if os.path.exists("pbmc3k_scplantllm_out.h5ad"):
    adata_out = sc.read_h5ad("pbmc3k_scplantllm_out.h5ad")
    emb_key = "X_scplantllm"
    if emb_key in adata_out.obsm:
        print(f"Embedding shape: {adata_out.obsm[emb_key].shape}")
        # UMAP visualization
        sc.pp.neighbors(adata_out, use_rep=emb_key)
        sc.tl.umap(adata_out)
        sc.tl.leiden(adata_out, resolution=0.5)
        sc.pl.umap(adata_out, color=["leiden"],
                   title="scPlantLLM Embedding (PBMC 3k)")
        # QA metrics
        interpretation = ov.fm.interpret_results("pbmc3k_scplantllm_out.h5ad", task="embed")
        if "embeddings" in interpretation["metrics"]:
            for k, v in interpretation["metrics"]["embeddings"].items():
                print(f"\n{k}: dim={v['dim']}", end="")
                if "silhouette" in v:
                    print(f", silhouette={v['silhouette']:.4f}", end="")
            print()
    else:
        print(f"Embedding key {emb_key} not found.")
        print(f"Available keys: {list(adata_out.obsm.keys())}")
else:
    print("Output file not found — check model installation and adapter status.")
    print("See the Guide page for installation instructions.")
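To help read the silhouette value in the QA metrics, here is a small pure-Python sketch of the standard silhouette coefficient with Euclidean distance. Whether ov.fm.interpret_results() computes it in exactly this way is an assumption; the function below is illustrative and is not taken from the library. Values near 1 indicate compact, well-separated clusters, and values near 0 or below indicate overlapping ones.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance (pure Python)."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from the point to itself is 0, so divide by n - 1)
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D "embedding" with two obvious groups.
embedding = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(embedding, ["a", "a", "b", "b"])
bad = silhouette_score(embedding, ["a", "b", "a", "b"])
print(f"well separated: {good:.3f}, mixed: {bad:.3f}")
```

The first labeling follows the two groups and scores close to 1, while the second splits each group across clusters and scores below 0. For real embeddings with thousands of cells, a vectorized library implementation would be a better fit than this quadratic loop.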
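Summary#
| Step | Function | Description |
|---|---|---|
| 1 | ov.fm.describe_model() | Inspect the model specification and input/output contracts |
| 2 | adata.write_h5ad() | Prepare the input data |
| 3 | ov.fm.profile_data(), ov.fm.preprocess_validate() | Check compatibility |
| 4 | ov.fm.run() | Run scPlantLLM inference |
| 5 | ov.fm.interpret_results() | Assess embedding quality |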
For the full model catalog, see ov.fm.list_models() or the ov.fm API overview.
For the detailed scPlantLLM specification, see the scPlantLLM guide.