GenePT: Foundation Model Tutorial

GenePT provides API-based GPT-3.5 gene embeddings (1536 dimensions). It needs no local GPU and produces gene-level, not cell-level, embeddings.

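| Property | Value |
|---|---|
| Task | embed |
| Species | human |
| Gene ID | symbol |
| GPU required | No (CPU is sufficient) |
| Minimum VRAM | 0 GB |
| Embedding dimension | 1536 |
| Code repository | yiqunchen/GenePT |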

Note: GenePT generates gene-level (not cell-level) embeddings through the OpenAI API. No local GPU is needed, but an OpenAI API key is required.

This tutorial shows how to use GenePT through the unified ov.fm API.

Citation: Zeng, Z. et al. (2024). OmicVerse: a framework for bridging and deepening insights across bulk and single-cell sequencing. Nature Communications, 15(1), 5983.

import omicverse as ov
import scanpy as sc
import os
import warnings
warnings.filterwarnings('ignore')

ov.plot_set()
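
Because GenePT depends on the OpenAI API, it can help to confirm that a key is configured before running anything else. The snippet below is a minimal sketch. `openai_key_available` is a hypothetical helper (not part of ov.fm), and it assumes the key is supplied through the `OPENAI_API_KEY` environment variable, as the setup comments in this tutorial suggest.

```python
import os

def openai_key_available(env=None):
    """Return True if a non-empty OPENAI_API_KEY exists in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not openai_key_available():
    print("OPENAI_API_KEY is not set; calls that reach the OpenAI API will likely fail.")
```

Passing an explicit mapping (for example `openai_key_available({"OPENAI_API_KEY": "..."})`) makes the check easy to test without touching the real environment.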

Gene-Level vs. Cell-Level Embeddings

GenePT differs fundamentally from the other models in ov.fm:

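| Aspect | Cell-level models (scGPT, etc.) | GenePT |
|---|---|---|
| Unit | One embedding per cell | One embedding per gene |
| Dimension | 200-1280 | 1536 |
| Source | Model inference | OpenAI API (GPT-3.5) |
| GPU | Required by most | Not required |
| Cost | Compute resources | API fees |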

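Gene embeddings can be used for:

  • Gene functional similarity analysis

  • Gene set enrichment via semantic matching

  • Building cell embeddings by weighted aggregation of gene embeddings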
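
To make the first and last of these uses concrete, here is a minimal NumPy sketch. It uses small random vectors as stand-ins for real 1536-dimensional GenePT embeddings, and the function names `cosine_similarity_matrix` and `aggregate_cell_embeddings` are hypothetical. The weighting shown is a plain expression-normalized average, which is an assumption and may differ from what GenePT or ov.fm does internally.

```python
import numpy as np

def cosine_similarity_matrix(gene_emb):
    """Pairwise cosine similarity between rows of a (n_genes, dim) array."""
    norms = np.linalg.norm(gene_emb, axis=1, keepdims=True)
    unit = gene_emb / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T

def aggregate_cell_embeddings(expr, gene_emb):
    """Expression-weighted average of gene embeddings, one row per cell.

    expr:     (n_cells, n_genes) non-negative expression matrix
    gene_emb: (n_genes, dim) gene embedding matrix
    """
    totals = expr.sum(axis=1, keepdims=True)
    weights = expr / np.clip(totals, 1e-12, None)  # each row sums to 1 (or 0)
    return weights @ gene_emb

rng = np.random.default_rng(0)
gene_emb = rng.normal(size=(5, 8))                  # toy stand-in embeddings
expr = rng.poisson(2.0, size=(3, 5)).astype(float)  # toy count matrix

sim = cosine_similarity_matrix(gene_emb)
cell_emb = aggregate_cell_embeddings(expr, gene_emb)
print(sim.shape, cell_emb.shape)
```

With this weighting, a cell that expresses a single gene receives exactly that gene's embedding, and a cell with no counts receives a zero vector.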

Step 1: Inspect the Model Spec

Use ov.fm.describe_model() to retrieve the full specification for GenePT.

info = ov.fm.describe_model("genept")

print("=== Model Info ===")
print(f"Name: {info['model']['name']}")
print(f"Version: {info['model']['version']}")
print(f"Tasks: {info['model']['tasks']}")
print(f"Species: {info['model']['species']}")
print(f"Embedding dim: {info['model']['embedding_dim']}")
print(f"Differentiator: {info['model']['differentiator']}")

print("\n=== Input Contract ===")
print(f"Gene ID scheme: {info['input_contract']['gene_id_scheme']}")
print(f"Preprocessing: {info['input_contract']['preprocessing']}")

print("\n=== Output Contract ===")
print(f"Embedding key: {info['output_contract']['embedding_key']}")
print(f"Embedding dim: {info['output_contract']['embedding_dim']}")

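Step 2: Prepare the Data

Load a dataset and save it to disk so the ov.fm workflow can read it. Most foundation models expect raw counts (non-negative values).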

# GenePT uses the OpenAI API to generate gene-level embeddings.
# No local GPU required, but you need an OpenAI API key:
# os.environ['OPENAI_API_KEY'] = 'your-key-here'
#
# Note: GenePT produces GENE embeddings (1536-dim per gene),
# not CELL embeddings. Cell embeddings are derived by aggregating
# gene embeddings weighted by expression.

adata = sc.datasets.pbmc3k()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
print(f'Dataset: {adata.n_obs} cells x {adata.n_vars} genes')

adata.write_h5ad('pbmc3k_genept.h5ad')
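
Since most foundation models expect raw counts, a quick sanity check before saving can catch an accidentally normalized or log-transformed matrix. The following is a minimal sketch of such a heuristic; `looks_like_raw_counts` is a hypothetical helper and not part of ov.fm or scanpy.

```python
import numpy as np

def looks_like_raw_counts(values):
    """Heuristic: True if all values are finite, non-negative integers."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        return False
    return bool(
        np.all(np.isfinite(arr))
        and np.all(arr >= 0)
        and np.all(arr == np.round(arr))
    )

print(looks_like_raw_counts([[0, 1, 5], [2, 0, 3]]))   # integer counts
print(looks_like_raw_counts([[0.0, 1.38, 2.71]]))      # log-normalized values
```

For a SciPy sparse `adata.X`, passing `adata.X.data` (the stored non-zero values) avoids densifying the whole matrix.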

Step 3: Profile the Data and Validate Compatibility

Before running inference, check that your data is compatible with GenePT.

profile = ov.fm.profile_data("pbmc3k_genept.h5ad")

print("=== Data Profile ===")
print(f"Species: {profile['species']}")
print(f"Gene scheme: {profile['gene_scheme']}")
print(f"Modality: {profile['modality']}")
print(f"Cells: {profile['n_cells']:,}")
print(f"Genes: {profile['n_genes']:,}")

# Validate compatibility
validation = ov.fm.preprocess_validate("pbmc3k_genept.h5ad", "genept", "embed")
print(f"\n=== Validation: {validation['status']} ===")
for d in validation.get("diagnostics", []):
    print(f"  [{d['severity']}] {d['message']}")
if validation.get("auto_fixes"):
    print("\nSuggested fixes:")
    for fix in validation["auto_fixes"]:
        print(f"  - {fix}")
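
GenePT expects gene symbols, so the gene ID scheme is the compatibility check most likely to matter here. As an illustration of what such a check involves, the sketch below guesses the scheme from a list of gene names. It is a hypothetical heuristic written for this tutorial, not the logic used by ov.fm.profile_data(), and the regular expression covers human Ensembl gene IDs only.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optionally with a ".version" suffix.
ENSEMBL_HUMAN = re.compile(r"^ENSG\d{11}(\.\d+)?$")

def guess_gene_scheme(gene_names, threshold=0.9):
    """Return 'ensembl', 'symbol', 'mixed', or 'unknown' for a list of gene names."""
    names = [str(n) for n in gene_names]
    if not names:
        return "unknown"
    frac = sum(bool(ENSEMBL_HUMAN.match(n)) for n in names) / len(names)
    if frac >= threshold:
        return "ensembl"
    if frac <= 1 - threshold:
        return "symbol"
    return "mixed"

print(guess_gene_scheme(["CD3E", "MS4A1", "NKG7"]))
print(guess_gene_scheme(["ENSG00000141510", "ENSG00000012048.23"]))
```

In an AnnData workflow the input would typically be `adata.var_names`. A result of "ensembl" or "mixed" suggests the names need mapping to symbols before GenePT can use them.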

Step 4: Run GenePT Inference

Run GenePT with ov.fm.run(). The function handles preprocessing, model loading, inference, and writing the output.

result = ov.fm.run(
    task="embed",
    model_name="genept",
    adata_path="pbmc3k_genept.h5ad",
    output_path="pbmc3k_genept_out.h5ad",
    device="auto",
)

if "error" in result:
    print(f"Error: {result['error']}")
    if "suggestion" in result:
        print(f"Suggestion: {result['suggestion']}")
else:
    print(f"Status: {result['status']}")
    print(f"Output keys: {result.get('output_keys', [])}")
    print(f"Cells processed: {result.get('n_cells', 0)}")

Step 5: Visualize and Interpret the Results

Load the output, compute a UMAP from the GenePT embeddings, and assess embedding quality.

if os.path.exists("pbmc3k_genept_out.h5ad"):
    adata_out = sc.read_h5ad("pbmc3k_genept_out.h5ad")
    emb_key = "X_genept"
    
    if emb_key in adata_out.obsm:
        print(f"Embedding shape: {adata_out.obsm[emb_key].shape}")
        
        # UMAP visualization
        sc.pp.neighbors(adata_out, use_rep=emb_key)
        sc.tl.umap(adata_out)
        sc.tl.leiden(adata_out, resolution=0.5)
        sc.pl.umap(adata_out, color=["leiden"],
                   title="GenePT Embedding (PBMC 3k)")
        
        # QA metrics
        interpretation = ov.fm.interpret_results("pbmc3k_genept_out.h5ad", task="embed")
        if "embeddings" in interpretation["metrics"]:
            for k, v in interpretation["metrics"]["embeddings"].items():
                print(f"\n{k}: dim={v['dim']}", end="")
                if "silhouette" in v:
                    print(f", silhouette={v['silhouette']:.4f}", end="")
                print()
    else:
        print(f"Embedding key {emb_key} not found.")
        print(f"Available keys: {list(adata_out.obsm.keys())}")
else:
    print("Output file not found — check model installation and adapter status.")
    print("See the Guide page for installation instructions.")
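
The silhouette value reported above summarizes how well separated the clusters are in embedding space. For readers who want to see what that number measures, here is a minimal NumPy sketch of the standard mean silhouette score with Euclidean distance. `mean_silhouette` is a hypothetical helper, and whether ov.fm.interpret_results() uses the same distance metric or labels is an assumption. The sketch builds a full pairwise distance matrix, so it is only suitable for small inputs.

```python
import numpy as np

def mean_silhouette(emb, labels):
    """Mean silhouette score over all points (Euclidean distance)."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    # Full pairwise distance matrix: O(n^2 * dim) memory, small inputs only.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    scores = np.zeros(len(emb))
    for i in range(len(emb)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # singleton cluster: score is defined as 0
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
print(mean_silhouette(points, [0, 0, 1, 1]))  # well-separated labels
print(mean_silhouette(points, [0, 1, 0, 1]))  # labels that cut across the groups
```

Scores close to 1 indicate compact, well-separated clusters, and negative scores indicate points that sit closer to another cluster than to their own. On real data, scikit-learn's `silhouette_score` is the more practical choice.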

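Summary

| Step | Function | Description |
|---|---|---|
| 1 | ov.fm.describe_model("genept") | Inspect the model spec and input/output contracts |
| 2 | sc.datasets.pbmc3k() | Prepare the input data |
| 3 | ov.fm.profile_data() + preprocess_validate() | Check compatibility |
| 4 | ov.fm.run() | Run GenePT inference |
| 5 | ov.fm.interpret_results() | Assess embedding quality |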

For the full model catalog, see ov.fm.list_models() or the ov.fm API overview. For the detailed GenePT specification, see the GenePT guide.