MarkTechPost • 81일 전

Scanpy를 활용한 단일세포 RNA-seq 분석 파이프라인 구축

IMP

7/10

핵심 요약

본 튜토리얼은 생물정보학의 핵심 기술인 단일세포 RNA 시퀀싱(scRNA-seq) 데이터를 분석하는 전체 파이프라인을 다룹니다. Python의 대표적인 라이브러리인 Scanpy를 활용하여 PBMC-3k 벤치마크 데이터셋을 전처리하고, Leiden 알고리즘으로 세포를 클러스터링하며, 세포 주기를 교정한 뒤 주요 마커를 통해 세포 유형을 주석(annotation) 달아냅니다. 나아가 PAGA와 확산 의사시간(diffusion pseudotime)을 통해 세포의 발달 궤적을 탐색하는 고급 실무 기법까지 제공하므로 데이터 사이언티스트와 생물학 연구자에게 매우 유용합니다.

번역된 본문

인공지능 애플리케이션 기술, 빅데이터, 데이터 사이언스, 에디터 추천, 소프트웨어 엔지니어링, 스태프 튜토리얼.

이 튜토리얼에서는 PBMC-3k 벤치마크 데이터셋을 사용하여 Scanpy를 통한 고급 단일세포 RNA-seq(scRNA-seq) 분석 워크플로우를 수행합니다. 먼저 데이터셋을 불러오고 그 구조를 살펴본 뒤, 유전자 수, 총 카운트, 미토콘드리아 함량, 리보솜 유전자 신호를 평가하기 위한 품질 관리(QC) 검사를 적용합니다. 그런 다음 저품질의 세포와 유전자를 필터링하고, Scrublet을 사용해 잠재적인 더블릿(doublet, 두 개 이상의 세포가 하나로 측정된 현상)을 탐지하며, 데이터를 정규화(normalize)하고 로그 변환을 적용한 뒤, 다운스트림(후속) 분석을 위해 고변동 유전자(highly variable genes)를 식별합니다. 또한 세포 주기 단계를 평가하고, 원치 않는 기술적 편차를 회귀 분석으로 제거(regress out)하며, 데이터를 스케일링(scaling)하고 PCA, UMAP, t-SNE를 사용하여 차원을 축소합니다. Leiden 알고리즘으로 세포를 클러스터링하고 마커 유전자를 식별한 후, PBMC의 표준 마커를 사용하여 세포 집단에 주석(annotation)을 달고, PAGA 및 확산 의사시간(diffusion pseudotime)으로 궤적 구조를 탐색하며, 사용자 정의 인터페론 반응 점수(interferon-response score)를 계산한 후 마지막으로 완전히 분석된 AnnData 객체를 향후 사용을 위해 저장합니다.

[코드 복사 및 실행 예시] !pip install -q scanpy leidenalg python-igraph scrublet import scanpy as sc import numpy as np import pandas as pd import matplotlib.pyplot as plt import warnings

warnings.filterwarnings("ignore") sc.settings.verbosity = 3 sc.settings.set_figure_params(dpi=80, facecolor="white", figsize=(5, 5)) sc.logging.print_header()

adata = sc.datasets.pbmc3k() adata.var_names_make_unique() print(adata)

adata.var["mt"] = adata.var_names.str.startswith("MT-") adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL")) sc.pp.calculate_qc_metrics( adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True )

sc.pl.violin( adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True, ) sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt") sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

필요한 단일세포 분석 라이브러리를 설치하고 Scanpy, NumPy, Pandas, Matplotlib 및 경고 제어 모듈을 가져옵니다(import). PBMC-3k 벤치마크 데이터셋을 로드하고, 유전자 이름을 고유하게 만들고, AnnData 객체 구조를 확인합니다. 그런 다음 미토콘드리아 및 리보솜 유전자에 대한 품질 관리 지표를 계산하고 바이올린 플롯(violin plot)과 산점도(scatter plot)를 사용하여 카운트 수준의 품질 패턴을 시각화합니다.

[코드 복사 및 실행 예시] sc.pp.filter_cells(adata, min_genes=200) sc.pp.filter_genes(adata, min_cells=3) adata = adata[adata.obs.n_genes_by_counts < 2500, :].copy() adata = adata[adata.obs.pct_counts_mt < 5, :].copy()

sc.pp.scrublet(adata) print("Predicted doublets:", int(adata.obs["predicted_doublet"].sum())) adata = adata[~adata.obs["predicted_doublet"], :].copy()

adata.layers["counts"] = adata.X.copy() sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata)

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5) sc.pl.highly_variable_genes(adata)

adata.raw = adata adata = adata[:, adata.var.highly_variable].copy()

데이터셋의 신뢰도를 높이기 위해 저품질 세포와 거의 감지되지 않는 희소 유전자를 필터링합니다. Scanpy에 내장된 Scrublet을 사용하여 예측되는 더블릿을 식별하고 본격적인 분석 전에 이를 제거합니다. 그런 다음 원시 카운트 값을 보존하고, 발현 값을 정규화하며, 로그 변환을 적용하고, 고변동 유전자를 선택하여 가장 유용한 특징(feature)들만 남깁니다.

[코드 복사 및 실행 예시] s_genes = ["MCM5","PCNA","TYMS","FEN1","MCM2","MCM4","RRM1","UNG","GINS2", "MCM6","CDCA7","DTL","PRIM1","UHRF1","HELLS","RFC2","NASP", "RAD51AP1","GMNN","WDR76","SLBP","CCNE2","UBR7","POLD3","MSH2", "ATAD2","RAD51","RRM2","CDC45","CDC6","EXO1","TIPIN","DSCC1", "BLM","CASP8AP2","USP1","CLSPN","POLA1","CHAF1B","E2F8"] g2m_genes = ["HMGB2","CDK1","NUSAP1","UBE2C","BIRC5","TPX2","TOP2A","NDC80", "CKS2","NUF2","CKS1B","MKI67","TMPO","CENPF","TACC3","SMC4", "CCNB2","CKAP2L","CKAP2","AURKB","BUB1","KIF11","ANP32E", "TUBB4B","GTSE1","KIF20B","HJURP","CDCA3","CDC20","TTK", "CDC25...

원문 보기

원문 보기 (영어)

Artificial Intelligence Applications Technology Big Data Data Science Editors Pick Software Engineering Staff Tutorials In this tutorial, we perform an advanced single-cell RNA-seq analysis workflow using Scanpy on the PBMC-3k benchmark dataset. We start by loading the dataset, inspecting its structure, and applying quality control checks to evaluate gene counts, total counts, mitochondrial content, and ribosomal gene signals. We then filter low-quality cells and genes, detect potential doublets with Scrublet, normalize the data, apply log transformation, and identify highly variable genes for downstream analysis. Also, we score cell-cycle phases, regress out unwanted technical variation, scale the data, and reduce dimensionality using PCA, UMAP, and t-SNE. We also cluster cells with the Leiden algorithm, identify marker genes, annotate cell populations using canonical PBMC markers, explore trajectory structure with PAGA and diffusion pseudotime, calculate a custom interferon-response score, and finally save the fully analyzed AnnData object for future use. Copy Code Copied Use a different Browser !pip install -q scanpy leidenalg python-igraph scrublet import scanpy as sc import numpy as np import pandas as pd import matplotlib.pyplot as plt import warnings warnings.filterwarnings("ignore") sc.settings.verbosity = 3 sc.settings.set_figure_params(dpi=80, facecolor="white", figsize=(5, 5)) sc.logging.print_header() adata = sc.datasets.pbmc3k() adata.var_names_make_unique() print(adata) adata.var["mt"] = adata.var_names.str.startswith("MT-") adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL")) sc.pp.calculate_qc_metrics( adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True ) sc.pl.violin( adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True, ) sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt") sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts") We install the required single-cell analysis libraries and import Scanpy, NumPy, Pandas, Matplotlib, and warning controls. We load the PBMC-3k benchmark dataset, make gene names unique, and inspect the AnnData object structure. We then calculate quality control metrics for mitochondrial and ribosomal genes and visualize count-level quality patterns using violin and scatter plots. Copy Code Copied Use a different Browser sc.pp.filter_cells(adata, min_genes=200) sc.pp.filter_genes(adata, min_cells=3) adata = adata[adata.obs.n_genes_by_counts < 2500, :].copy() adata = adata[adata.obs.pct_counts_mt < 5, :].copy() sc.pp.scrublet(adata) print("Predicted doublets:", int(adata.obs["predicted_doublet"].sum())) adata = adata[~adata.obs["predicted_doublet"], :].copy() adata.layers["counts"] = adata.X.copy() sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5) sc.pl.highly_variable_genes(adata) adata.raw = adata adata = adata[:, adata.var.highly_variable].copy() We filter out low-quality cells and rarely detected genes to improve the reliability of the dataset. We use Scrublet through Scanpy to identify predicted doublets and remove them before deeper analysis. We then preserve raw counts, normalize expression values, apply log transformation, select highly variable genes, and keep only the most informative features. Copy Code Copied Use a different Browser s_genes = ["MCM5","PCNA","TYMS","FEN1","MCM2","MCM4","RRM1","UNG","GINS2", "MCM6","CDCA7","DTL","PRIM1","UHRF1","HELLS","RFC2","NASP", "RAD51AP1","GMNN","WDR76","SLBP","CCNE2","UBR7","POLD3","MSH2", "ATAD2","RAD51","RRM2","CDC45","CDC6","EXO1","TIPIN","DSCC1", "BLM","CASP8AP2","USP1","CLSPN","POLA1","CHAF1B","E2F8"] g2m_genes = ["HMGB2","CDK1","NUSAP1","UBE2C","BIRC5","TPX2","TOP2A","NDC80", "CKS2","NUF2","CKS1B","MKI67","TMPO","CENPF","TACC3","SMC4", "CCNB2","CKAP2L","CKAP2","AURKB","BUB1","KIF11","ANP32E", "TUBB4B","GTSE1","KIF20B","HJURP","CDCA3","CDC20","TTK", "CDC25C","KIF2C","RANGAP1","NCAPD2","DLGAP5","CDCA2","CDCA8", "ECT2","KIF23","HMMR","AURKA","PSRC1","ANLN","LBR","CKAP5", "CENPE","NEK2","G2E3","CBX5","CENPA"] s_genes = [g for g in s_genes if g in adata.var_names] g2m_genes = [g for g in g2m_genes if g in adata.var_names] sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes) sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"]) sc.pp.scale(adata, max_value=10) sc.tl.pca(adata, svd_solver="arpack") sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50) sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40) sc.tl.umap(adata) sc.tl.tsne(adata, n_pcs=40) We define S-phase and G2/M-phase marker genes and retain only those present in the dataset. We score each cell for cell-cycle phase, regress out unwanted variation from total counts and mitochondrial percentage, and scale the data for downstream modeling. We then run PCA, inspect explained variance, construct the neighborhood graph, and generate UMAP and t-SNE embeddings. Copy Code Copied Use a different Browser sc.tl.leiden(adata, resolution=0.5, flavor="igraph", n_iterations=2, directed=False) sc.pl.umap(adata, color="leiden", legend_loc="on data", title="Leiden clusters") sc.pl.tsne(adata, color="leiden", legend_loc="on data", title="t-SNE clusters") sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon") sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False) result = adata.uns["rank_genes_groups"] groups = result["names"].dtype.names top_df = pd.DataFrame({g: result["names"][g][:10] for g in groups}) print("\nTop 10 markers per cluster:\n", top_df) marker_genes = { "B-cell": ["CD79A", "MS4A1"], "CD8 T-cell": ["CD8A", "CD8B"], "CD4 T-cell": ["IL7R", "CD4"], "NK": ["GNLY", "NKG7"], "CD14 Monocyte": ["CD14", "LYZ"], "FCGR3A Monocyte": ["FCGR3A", "MS4A7"], "Dendritic": ["FCER1A", "CST3"], "Megakaryocyte": ["PPBP"], } sc.pl.dotplot(adata, marker_genes, groupby="leiden", standard_scale="var") sc.pl.stacked_violin(adata, marker_genes, groupby="leiden", swap_axes=True) We apply Leiden clustering to group cells based on the neighborhood graph and visualize the clusters on UMAP and t-SNE plots. We perform differential expression analysis using the Wilcoxon test to identify the top marker genes for each cluster. We then use canonical PBMC marker genes to support cell-type annotation through dot plots and stacked violin plots. Copy Code Copied Use a different Browser sc.tl.paga(adata, groups="leiden") sc.pl.paga(adata, color="leiden", threshold=0.1) sc.tl.umap(adata, init_pos="paga") sc.pl.umap(adata, color="leiden", legend_loc="on data") sc.tl.diffmap(adata) sc.pp.neighbors(adata, n_neighbors=10, use_rep="X_diffmap") adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == adata.obs["leiden"].cat.categories[0])[0] sc.tl.dpt(adata) sc.pl.umap(adata, color=["leiden", "dpt_pseudotime"], legend_loc="on data") ifn_genes = ["ISG15", "IFI6", "IFIT1", "IFIT3", "MX1", "OAS1", "STAT1", "IRF7"] ifn_genes = [g for g in ifn_genes if g in adata.raw.var_names] sc.tl.score_genes(adata, gene_list=ifn_genes, score_name="IFN_score") sc.pl.umap(adata, color="IFN_score", cmap="viridis") adata.write("pbmc3k_analyzed.h5ad") print("\n✅ Analysis complete — saved to pbmc3k_analyzed.h5ad") print(adata) We run PAGA to model connectivity between Leiden clusters and reinitialize UMAP using the PAGA graph to obtain a clearer trajectory structure. We compute diffusion maps and diffusion pseudotime to explore possible progression patterns across cell states. We also calculate an interferon-response gene-set score, visualize it on UMAP, and save the final analyzed object as an .h5ad file. In conclusion, we built an end-to-end Scanpy pipeline for single-cell RNA-seq analysis, transforming raw PBMC data into interpretable biological insights. We cleaned and preprocessed the dataset, removed noisy cells and doublets, selected informative genes, and generated meaningful embeddings to visualize cellular structure. We then used Leiden clustering and differe

데이터-사이언스 생물정보학 파이썬 유전체-분석 빅데이터