文章复现子宫腺肌病在位内膜和异位病灶的单细胞转录组分析

Single‑cell transcriptomic analysis of eutopic endometrium and ectopic lesions of adenomyosis

发表杂志：Cell & Bioscience（IF=5.026）

目标：利用scRNA-seq鉴定了AM中在位内膜（Eutopic endometrium，EC）和异位病灶（Ectopic lesions，EM）的基因表达模式，并探索了AM的潜在发病机制

Single‑cell transcriptomic analysis of eutopic endometrium and ectopic lesions of adenomyosis1、上游处理1.1 下载病例fastq数据1.2 定量1.2 cellranger count1.3 cellranger aggr2、Seurat整合分析2.1 创建Seurat对象2.2 质控和过滤2.3 归一化：LogNormalize2.4 寻找高可变基因（HVGs）2.5 标准化并删除不必要的变异源2.6 细胞分群-PCA聚类分析、t-SNE二维可视化分群结果2.7 差异表达分析FindMarkers2.8 细胞群注释3、利用DEGs进行EM和CTRL的GO和KEGG分析4、inferCNV5、monocle35.1 从Seurat结果中导入monocle3数据5.2、数据预处理5.3、降维及可视化5.4、聚类分析5.5、拟时分析6、细胞通讯分析7、细胞速率分析（未完）1、从cellranger得到loom文件2、运行velocyto.R改进

分析流程图(绘图工具-AI)

1、上游处理

1.1 下载病例fastq数据

存放地址：~/AM/AM_EM

~/AM/AM_EC

~/AM/AM_CTRL

共3例样本：

病例组：46岁，女性，AM患者，分别取其在位子宫内膜（AM_EM）-SRR12791872和异位病灶（AM_EC）SRR12791871样本各1例

对照组：50岁，女性，子宫肌瘤患者，排除子宫内膜异位症和AM，取在位内膜1例（AM_CTRL）SRR12791873

1.2 定量

所用软件 cellranger v3.0.0

cellranger原理：

cellranger是10X genomics公司为单细胞RNA测序分析量身打造的数据分析软件，可以直接输入Illumina 原始数据（raw base call ，BCL)输出表达定量矩阵、降维（pca），聚类（Graph-based& K-Means）以及可视化（t-SNE）结果，结合配套的Loupe Cell Browser给予研究者更多探索单细胞数据的机会。cellranger的高度集成化，使得单细胞测序数据探索变得更加简单，研究者有更多的时间来做生物学意义的挖掘

1、去官网注册后，点右下角选择以前的版本，得到v3.0.0的下载代码

存放地址 ~/AM

curl -o cellranger-3.0.2.tar.gz "/releases/cell-exp/cellranger-3.0.2.tar.gz?Expires=1620759081&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci0zLjAuMi50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2MjA3NTkwODF9fX1dfQ__&Signature=fLmJsmKcHxmq2mIz-MgoVoJojnFe1LmZDYEqH9h7H2L2d0~IJUiVhntEIJDsJKv~EfeG14vFcUhVkJcF~UTDmfAMoPlbmvHWPsPPCZledVr0nlum49GoBJ64LGttViW0aFqS7w4Xw0vMMh-jT6HYDCQ4cYxV9UB3Hrs0mV-HXpX7PdkWPiGzzZ4ea9ntXzzdGwnGXHZXWtAEwSUCOaVnfIwBi03vdUv4SNNt7QcgBHK~OCPfKh73cPoc9PJvB9KoOND9kDHb3Ef91VaB4SAGBp-DIC2AuIW4skuv3MV9sIq8vG67ZPNJidwYu61FNe3~-L6TP9FfUqRnNVLlGgbUJA__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"

2、解压

tar -xzvf cellranger-3.0.0.tar.gz

3、添加至环境变量

export PATH="/public/workspace/stu18230129/anaconda3/bin:$PATH"source ~/.bashrc

4、测试能否使用

1.2 cellranger count

1、按照cellranger count要求的名称给样本改名，S1前是可识别的样本名

#命名很重要！！！！！！这里的mv是重命名cd ~/AM/datamv SRR12791871_1.fastq.gz AM_EC_S1_L0001_R1_001.fastq.gzmv SRR12791871_2.fastq.gz AM_EC_S1_L0001_R2_001.fastq.gzmv SRR12791872_1.fastq.gz AM_EM_S1_L0001_R1_001.fastq.gzmv SRR12791872_2.fastq.gz AM_EM_S1_L0001_R2_001.fastq.gzmv SRR12791873_1.fastq.gz AM_CRTL_S1_L0001_R1_001.fastq.gzmv SRR12791873_2.fastq.gz AM_CRTL_S1_L0001_R2_001.fastq.gz##放入不同的文件夹，这里的mv是移动文件,就写一个样本，其余两个相同操作cd ~/AMmkdir AM_ECmv AM_EC_S1_L0001_R1_001.fastq.gz AM_ECmv AM_EC_S1_L0001_R2_001.fastq.gz AM_EC

2、查询cellranger count --help

id:对你运行的项目起个名字，可以任意取名（输出结果在建文件夹时以这个名字命名）

fastqs:包含fastq文件的路径

sample：如果上述路径中包含的文件不只一个样本的，则需要指定该参数，该参数是根据fastq文件名的前缀对文件进行识别的，可以用来区分不同的样本

transcriptome：用来保存参考基因组的路径

cd ~/AMcellranger count --id=AM_EC_count --fastqs=./AM_EC --sample=AM_EC --transcriptome=/public/workspace/lincs/cellranger/ref/refdata-gex-GRCh38--A cellranger count --id=AM_EM_count --fastqs=./AM_EM --sample=AM_EM --transcriptome=/public/workspace/lincs/cellranger/ref/refdata-gex-GRCh38--Acellranger count --id=run_count_CRTL --fastqs=../../data/ --sample=AM_CRTL --transcriptome=/public/workspace/lincs/cellranger/ref/refdata-gex-GRCh38--A

1.3 cellranger aggr

用cellranger aggr合并三个样本，但由于占用学校服务器内存太多，所以**只使用一个病例组（AM_EM）和一个对照组（AM_CTRL）**进行下面的分析。

cellranger aggr --id=Tumor --csv=Tumor_libraries.csv --normalize=mappedcat Tumor_libraries.csv

2、Seurat整合分析

scRNA-seq揭示AM在位内膜和异位内膜的细胞异质性

主要是展现细胞分群和marker基因可视化

2.1 创建Seurat对象

library(Seurat)library(tidyverse)library(Matrix)##Read10xseurat_data <- Read10X("/public/workspace/stu18230129/AM/Tumor/outs/count/filtered_feature_bc_matrix")sa

2.2 质控和过滤

# The [[ operator can add columns to object metadata. This is a great place to stash QC statsseurat_obj[["percent.mt"]] <- PercentageFeatureSet(seurat_obj, pattern = "^MT-")##1、根据nUMI和nGene的分布特征，保留mean ± 2*SD范围内的细胞；##2、去除线粒体基因比例>25%的细胞；seurat_obj <- subset(seurat_obj, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 25)##nFeature_RNA代表每个细胞测到的基因数目，nCount代表每个细胞测到所有基因的表达量之和，percent.mt代表测到的线粒体基因的比例。

过滤前后对比，细胞数从27593过滤到了21615

3、使用Scrublet鉴定潜在的doublets，设定估计的doublet rate为10%

2.3 归一化：LogNormalize

seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)seurat_obj <- NormalizeData(seurat_obj)

2.4 寻找高可变基因（HVGs）

文章中提到，分别计算每个基因的平均表达量和离散程度，得到HVGs，估计应该是用selection.method=“mvp”

#这一步的目的是鉴定出细胞与细胞之间表达量相差很大的基因，用于后续鉴定细胞类型seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 2000)# Identify the 10 most highly variable genes,找出前10个高变基因top10 <- head(VariableFeatures(seurat_obj), 10)# plot variable features with and without labels，展示高变基因plot1 <- VariableFeaturePlot(seurat_obj)plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)CombinePlots(plots = list(plot1, plot2))

2.5 标准化并删除不必要的变异源

all.genes <- rownames(seurat_obj)seurat_obj <- ScaleData(seurat_obj, features = all.genes)

2.6 细胞分群-PCA聚类分析、t-SNE二维可视化分群结果

seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(object =seurat_obj))##查看PCA结果DimHeatmap(seurat_obj, dims = 1, cells = 500, balanced = TRUE)##图聚类seurat_obj <- FindNeighbors(seurat_obj,dims=1:10)seurat_obj <- FindClusters(seurat_obj,resolution=0.5)#tSNE可视化#install.packages("ggsci")library(ggsci)seurat_obj<-RunTSNE(seurat_obj,dims=1:10)DimPlot(seurat_obj,reduction="tsne",label=TRUE)+scale_color_npg()

2.7 差异表达分析FindMarkers

markers_df <- FindMarkers(object=seurat_obj,ident.1=0,logfc.threshold=0.58,test.use = "negbinom")print(x = head(markers_df))markers_genes = rownames(head(x = markers_df, n = 5))VlnPlot(object = seurat_obj, features =markers_genes,log =T )FeaturePlot(object =seurat_obj, features=markers_genes )

2.8 细胞群注释

##使用SingleRlibrary(SingleR)hpca.se <- readRDS("/public/workspace/lincs/lab7/singlecell/SingleR/hpca.se.rds")library(scRNAseq)library(Seurat)hESCs <- seurat_obj#SingleR运行要求输入test是SingleCellExperiment，所以转成hESCs <- as.SingleCellExperiment(hESCs)#BiocManager::install(scater)library(scater)hESCs <-logNormCounts(hESCs)pred.hesc<-SingleR(test=hESCs,ref=hpca.se,labels=hpca.se$label.main)plotScoreHeatmap(pred.hesc)查看新注释的标签与seurat分类的结果的交叠情况

#绘制带cell label的tsnenew.cluster.ids <-pred.hesc$pruned.labelsnames(new.cluster.ids) <- levels(seurat_obj)levels(seurat_obj)# 0 1 2 3 4 5 6 7 8 9 10 11 12 13seurat_obj<- RenameIdents(seurat_obj, new.cluster.ids)levels(seurat_obj)# "Endothelial_cells" "Fibroblasts" "Smooth_muscle_cells" "Tissue_stem_cells" "Epithelial_cells" "Macrophage"pdf(file="TSNE_lable.pdf",width=6.5,height=6)TSNEPlot(object = seurat_obj, pt.size = 0.5, label = TRUE) dev.off()pdf(file="10.uMAP_label.pdf",width=6.5,height=6)DimPlot(scRNA1, reduction = "umap",pt.size=0.5,label = TRUE)dev.off()

3、利用DEGs进行EM和CTRL的GO和KEGG分析

GO和KEGG原理：

GO数据库，全称是Gene Ontology(基因本体)，他们把基因的功能分成了三个部分分别是：细胞组分（cellular component, CC）、分子功能（molecular function, MF）、生物过程（biological process, BP）。

KEGG数据库：除了对基因本身功能的注释，我们也知道基因会参与人体的各个通路，基于人体通路而形成的数据库就是通路相关的数据库。而KEGG就是通路相关的数据库的一种。其实通路数据库有很多，类似于wikipathway，reactome都是相关的通路数据库。只是因为KEGG比较被人熟知，所以基本上都做这个分析的。

AM_EM/AM_CTRL

gene_name <- rownames(markers_df)#将基因名转换为entrez_idid = bitr(gene_name, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")ego <- enrichGO(OrgDb="org.Hs.eg.db", gene = id$ENTREZID, ont = "MF", pvalueCutoff = 0.01, readable= TRUE) #GO富集分析dotplot(ego,showCategory=10,title="Enrichment GO Top10") #泡泡图barplot(ego, showCategory=20,title="EnrichmentGO") #柱状图library(clusterProfile)plotGOgraph(ego) #GO图，看不清楚可以尝试左上角另存为pdfekk <- enrichKEGG(gene= id$ENTREZID,organism = 'hsa', qvalueCutoff = 0.05) #KEGG富集分析dotplot(ekk,font.size=8)# 画气泡图browseKEGG(ekk,'mmu01100')

左边是EM，右边是CTRL。

GO:EMvsCTRL

KEGG:EMvsCTRL

与CTRL组相比，EM组上调表达包括MMP1、ESM1、ANGPT2和CYP1B1在内的基因。在单独对内皮细胞的比较分析中，EM组上调的基因与多种血管生成、癌症、细胞迁移、NF-κB和PI3K-Akt信号通路等有关。作者由此推断异位病灶中内皮细胞的血管生成功能增强，并可能表现出某些恶性肿瘤类似的特征。

4、inferCNV

inferCNV原理：用与探索肿瘤单细胞RNA-seq数据，分析其中的体细胞大规模染色体拷贝数变化(copy number alterations, CNA), 例如整条染色体或大片段染色体的增加或丢失(gain or deletions)。工作原理是，以一组"正常"细胞作为参考，分析肿瘤基因组上各个位置的基因表达量强度变化. 通过热图的形式展示每条染色体上的基因相对表达量，相对于正常细胞，肿瘤基因组总会过表达或者低表达。

1、单细胞RNA-seq表达量的原始矩阵

dfcount = as.data.frame(seurat_obj@assays$RNA@counts)

2、注释文件，记录肿瘤和正常细胞

“/public/workspace/lincs/lab7/singlecell/inferCNV/extdata/oligodendroglioma_annotations_downsampled.txt”

3、基因或染色体位置文件

library(AnnoProbe)

geneInfor=annoGene(rownames(dfcount),“SYMBOL”,‘human’)

geneInfor=geneInfor[with(geneInfor, order(chr, start)),c(1,4:6)]

geneInfor=geneInfor[!duplicated(geneInfor[,1]),]

## 这里可以去除性染色体# 也可以把染色体排序方式改变lsdfcount =dfcount [match( geneInfor[,1], rownames(dfcount) ),] # 输出expFile='expFile.txt'setwd("/public/workspace/stu18230129/AM")write.table(dfcount ,file = expFile,sep = '\t',quote = F)anno=data.frame(cellId=rownames(pred.hesc))anno$cellType=pred.hesc$pruned.labels#anno$cellType="wt"groupFiles="annotations.txt"write.table(anno,file=groupFiles,sep="\t",quote=F,col.names = F,row.names = F)head(geneInfor)geneFile='geneFile.txt'write.table(geneInfor,file = geneFile,sep = '\t',quote = F,col.names = F,row.names = F)rm(list=ls())options(stringsAsFactors = F)library(Seurat)library(ggplot2)library(infercnv)expFile='expFile.txt' groupFiles='anno.txt' geneFile='geneFile.txt'infercnv_obj = CreateInfercnvObject(raw_counts_matrix=expFile,annotations_file=groupFiles,gene_order_file= geneFile,ref_group_names=NULL) # 如果有正常细胞的话，把正常细胞的分组填进去future::plan("multiprocess",workers=12)# 多核并行处理infercnv_obj = infercnv::run(infercnv_obj,cutoff=0.1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomicsout_dir='infercnv_out/', cluster_by_groups=TRUE, denoise=TRUE,HMM=TRUE)# HMM添加后需要跑很长的时间，这里要注意一一下### 可以添加 denoise=TRUE以及 HMM=TRUE

使用inferCNV包，将基因根据染色体位置排序。选取T/NK细胞、巨噬细胞和肥大细胞作为正常细胞，剩余细胞作为恶性细胞计算CNV。

5、monocle3

monocle3原理：

Monocle介绍了利用RNA-Seq进行单细胞轨迹分析的策略。Monocle不是通过实验将细胞纯化成离散状态，而是使用一种算法来学习每个细胞必须经历的基因表达变化序列，作为动态生物学过程的一部分。一旦它了解了基因表达变化的整体“轨迹”，Monocle就可以将每个细胞置于轨迹中的适当位置。

5.1 从Seurat结果中导入monocle3数据

##1、重新创建seurat对象rm(list=ls())library(Seurat)library(tidyverse)library(Matrix)##Read10xseurat_data <- Read10X("/public/workspace/stu18230129/AM/Tumor/outs/count/filtered_feature_bc_matrix")seurat_obj<-CreateSeuratObject(counts=seurat_data)pbmc <-seurat_obj##2.拆分成3个文件expression_matrix<-as(as.matrix(pbmc@assays$RNA@counts),"sparseMatrix")cell_metadata<-pbmc@meta.datagene_annotation<-data.frame(gene_short_name=row.names(expression_matrix), row.names=row.names(expression_matrix))

5.2、数据预处理

library(monocle3)##本步为数据的标准化，以及预降维，为之后的降维做准备cds <- new_cell_data_set(expression_matrix,cell_metadata = cell_metadata,gene_metadata = gene_annotation)cds = preprocess_cds(cds,norm_method="log",method="PCA",num_dim=100)

5.3、降维及可视化

#通过降维，可以再二维图中观察细胞之间的关系cds <- reduce_dimension(cds,preprocess_method="UMAP")plot_celss(cds,label_groups_by_cluster=FALSE)#查看特定基因的表达情况ciliated_genes <- c("EPCAM","PECAM1","F2R","CDH1","VWF","FLT1","KRT7","CDH5","KDR")

5.4、聚类分析

#为了找出哪些细胞在一个发育轨迹上cds <- cluster_cells(cds,preprocess_method = "UMAP")plot_cells(cds, reduction_method="UMAP")

5.5、拟时分析

#在上一步分群的基础上，找出每个群内的细胞发育轨迹（无监督，没有基于生物学背景知识）cds <- learn_graph(cds)plot_cells(cds,label_groups_by_cluster=FALSE,label_leaves=FALSE,label_branch_points=FALSE)

附作者作图

作者将Cluster1细胞、上皮细胞和内皮细胞提取出来做拟时间分析，发现Cluster1细胞（亚群1）呈现由上皮细胞（亚群7、17）特征向内皮细胞特征（亚群2、9、10、15）分化。在拟时间轴上，marker基因也呈现出从上皮细胞（EPCAM、CDH1、KRT7）向内皮细胞（PECAM1、VWF、CDH5）的转化。作者进一步发现该轨迹的末端出现了血管生成拟态（VM）相关的基因（F2R、FLT1、KDR）上调表达。

6、细胞通讯分析

CellPhoneDB是包含配体、受体及其相互作用的数据库，可以对细胞间的通讯分子进行全面、系统的分析，研究不同细胞类型之间的相互交流及通讯网络。

conda activate cellphonedbpy36cd ~/AM/cellphonedbwrite.table(as.matrix(seurat_obj@assays$RNA@data), 'cellphonedb_count.txt', sep='\t', quote=F)meta_data <- cbind(rownames(seurat_obj@meta.data), pred.hesc$pruned.labels) meta_data <- as.matrix(meta_data)meta_data[is.na(meta_data)] = "Unkown" # 细胞类型中不能有NAwrite.table(meta_data, 'cellphonedb_meta.txt', sep='\t', quote=F, row.names=F)cellphonedb method statistical_analysis cellphonedb_meta.txt cellphonedb_count.txt--counts-data=gene_name#如果我们count的基因是基因名格式，需要添加参数--counts-data=gene_name，如果行名为ensemble名称的话，可以不添加这个参数，使用默认值即可。

相关细胞通讯体现了上皮细胞的间质转化和迁移能力

7、细胞速率分析（未完）

通过对unspliced(nascent) 和spliced(mature) mRNA丰度的评估在时间维度上揭示转录本动态变化的一个指标

1、从cellranger得到loom文件

conda activate velocytopy37cd ~/AM/velocytovelocyto run10x -m /public/workspace/lincs/lab7/singlecell/velocyto/GRCh38_rmsk.gtf ~/AM/Tumor/ /public/workspace/lincs/cellranger/ref/refdata-gex-GRCh38--A/genes/genes.gtf##报错于是 pip install --upgrade numpy 就成功了## pip install --upgrade scikit-image

2、运行velocyto.R

library(velocyto.R)ldat<-read.loom.matrices("run_count_1kpbmcs.loom")emat<-ldat$splicedemat<-emat[,colSums(emat)>=1e3]library(pagoda2)rownames(emat)<-make.unique(rownames(emat))r <- Pagoda2$new(emat,modelType='plain',trim=10,log.scale=T)r$calculatePcaReduction(nPcs=100,n.odgenes=3e3,maxit=300)r$makeKnnGraph(k=30,type='PCA',center=T,distance='cosine')r$getKnnClusters(method=munity,type='PCA',name='multilevel')r$getEmbedding(type='PCA',embeddingType='tSNE',perplexity=50,verbose=T)par(mfrow=c(1,2))r$plotEmbedding(type='PCA',embeddingType='tSNE',show.legend=F,mark.clusters=T,min.group.size=10,shuffle.colors=F,mark.cluster.cex=1,alpha=0.3,main='cell clusters')r$plotEmbedding(type='PCA',embeddingType='tSNE',colors=r$depth,main='depth') #忽略跨剪切位点的数据emat <- ldat$splicednmat <- ldat$unspliced#通过p2对细胞进行过滤emat <- emat[,rownames(r$counts)]nmat <- nmat[,rownames(r$counts)]#对分类数据进行标记cluster.label <- r$clusters$PCA$multilevel # take the cluster factor that was calculated by p2cell.colors <- pagoda2:::fac2col(cluster.label)# take embedding form p2emb <- r$embeddings$PCA$tSNEcell.dist <- as.dist(1-armaCor(t(r$reductions$PCA)))emat <- filter.genes.by.cluster.expression(emat,cluster.label,min.max.cluster.average = 0.2)nmat <- filter.genes.by.cluster.expression(nmat,cluster.label,min.max.cluster.average = 0.05)length(intersect(rownames(emat),rownames(nmat)))fit.quantile <- 0.02rvel.cd <- gene.relative.velocity.estimates(emat,nmat,deltaT=1,kCells=25,cell.dist=cell.dist,fit.quantile=fit.quantile)show.velocity.on.embedding.cor(emb,rvel.cd,n=200,scale='sqrt',cell.colors=ac(cell.colors,alpha=0.5),cex=0.8,arrow.scale=3,show.grid.flow=TRUE,min.grid.cell.mass=0.5,grid.n=40,arrow.lwd=1,do.par=F,cell.border.alpha = 0.1)