富集分析非常常见,用于判断抽样的结果是否显著。
例子1:一个工厂总共有N件产品,其中M件次品,现在从中抽取n件做检查,抽到k件次品的概率分布服从超几何分布。
例子2:一个细胞有N个基因,其中在pathway A里面有M个基因,现在从中抽取n个基因,抽到k个pathway A里基因的概率分布服从超几何分布。
最靠谱的富集分析当属clusterProfiler,里面的enrichGO可以做富集分析。
现在我有个性化的需求,所以要自己做,想借鉴一下里面的代码。
查看enrichGO代码,发现是里面的enricher_internal在做这件事,去GitHub查,发现enricher_internal是DOSE的函数,继续去GitHub查,定位到enricher_internal。
代码一览无余:
termID2ExtID <- termID2ExtID[idx]
qTermID2ExtID <- qTermID2ExtID[idx]
qTermID <- unique(names(qTermID2ExtID))
## prepare parameter for hypergeometric test
k <- sapply(qTermID2ExtID, length)
k <- k[qTermID]
M <- sapply(termID2ExtID, length)
M <- M[qTermID]
N <- rep(length(extID), length(M))
## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop.
n <- rep(length(qExtID2TermID), length(M))
args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
numW=M, ## White balls
numB=N-M, ## Black balls
numDrawn=n) ## balls drawn
## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
)
## gene ratio and background ratio
GeneRatio <- apply(data.frame(a=k, b=n), 1, function(x)
paste(x[1], "/", x[2], sep="", collapse="")
)
BgRatio <- apply(data.frame(a=M, b=N), 1, function(x)
paste(x[1], "/", x[2], sep="", collapse="")
)
Over <- data.frame(ID = as.character(qTermID),
GeneRatio = GeneRatio,
BgRatio = BgRatio,
pvalue = pvalues,
stringsAsFactors = FALSE)
p.adj <- p.adjust(Over$pvalue, method=pAdjustMethod)
qobj <- tryCatch(qvalue(p=Over$pvalue, lambda=0.05, pi0.method="bootstrap"), error=function(e) NULL)
if (class(qobj) == "qvalue") {
qvalues <- qobj$qvalues
} else {
qvalues <- NA
}
geneID <- sapply(qTermID2ExtID, function(i) paste(i, collapse="/"))
geneID <- geneID[qTermID]
Over <- data.frame(Over,
p.adjust = p.adj,
qvalue = qvalues,
geneID = geneID,
Count = k,
stringsAsFactors = FALSE)
Description <- TERM2NAME(qTermID, USER_DATA)
if (length(qTermID) != length(Description)) {
idx <- qTermID %in% names(Description)
Over <- Over[idx,]
}
Over$Description <- Description
nc <- ncol(Over)
Over <- Over[, c(1,nc, 2:(nc-1))]
Over <- Over[order(pvalues),]
Over$ID <- as.character(Over$ID)
Over$Description <- as.character(Over$Description)
row.names(Over) <- as.character(Over$ID)
x <- new("enrichResult",
result = Over,
pvalueCutoff = pvalueCutoff,
pAdjustMethod = pAdjustMethod,
qvalueCutoff = qvalueCutoff,
gene = as.character(gene),
universe = extID,
geneSets = geneSets,
organism = "UNKNOWN",
keytype = "UNKNOWN",
ontology = "UNKNOWN",
readable = FALSE
)
return (x)
核心就是这个代码了:
args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
numW=M, ## White balls
numB=N-M, ## Black balls
numDrawn=n) ## balls drawn
## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
)
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
其中:
第一个参数:q
vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.
第二个参数:m
the number of white balls in the urn.
第三个参数:n
the number of black balls in the urn.
第四个参数:k
the number of balls drawn from the urn.
ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25
关键的只有两个数:
GeneRatio:我输入的基因数854(n),其中在pathway A里的有68个(k)
BgRatio:总共背景(有注释)基因数23210(N),其中pathway A里的基因数364个(M)
phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)
带入数字:
phyper(67, 364, 23210-364, 854, lower.tail = TRUE, log.p = FALSE) = 6.07163831922482e-29
结果一致。
进阶:
- lower.tail是什么
- 为什么第一个参数要减1
参考: