zoukankan html css js c++ java

手动计算富集分析

富集分析非常常见，用于判断抽样的结果是否显著。

例子1：一个工厂总共有N件产品，其中M件次品，现在从中抽取n件做检查，抽到k件次品的概率分布服从超几何分布。

例子2：一个细胞有N个基因，其中在pathway A里面有M个基因，现在从中抽取n个基因，抽到k个pathway A里基因的概率分布服从超几何分布。

最靠谱的富集分析当属clusterProfiler，里面的enrichGO可以做富集分析。

现在我有个性化的需求，所以要自己做，想借鉴一下里面的代码。

查看enrichGO代码，发现是里面的enricher_internal在做这件事，去GitHub查，发现enricher_internal是DOSE的函数，继续去GitHub查，定位到enricher_internal。

代码一览无余：

termID2ExtID <- termID2ExtID[idx]
qTermID2ExtID <- qTermID2ExtID[idx]
qTermID <- unique(names(qTermID2ExtID))

## prepare parameter for hypergeometric test
k <- sapply(qTermID2ExtID, length)
k <- k[qTermID]
M <- sapply(termID2ExtID, length)
M <- M[qTermID]

N <- rep(length(extID), length(M))
## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop.
n <- rep(length(qExtID2TermID), length(M))
args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                      numW=M,        ## White balls
                      numB=N-M,      ## Black balls
                      numDrawn=n)    ## balls drawn


## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
                 phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
                 )

## gene ratio and background ratio
GeneRatio <- apply(data.frame(a=k, b=n), 1, function(x)
                   paste(x[1], "/", x[2], sep="", collapse="")
                   )
BgRatio <- apply(data.frame(a=M, b=N), 1, function(x)
                 paste(x[1], "/", x[2], sep="", collapse="")
                 )


Over <- data.frame(ID = as.character(qTermID),
                   GeneRatio = GeneRatio,
                   BgRatio = BgRatio,
                   pvalue = pvalues,
                   stringsAsFactors = FALSE)

p.adj <- p.adjust(Over$pvalue, method=pAdjustMethod)
qobj <- tryCatch(qvalue(p=Over$pvalue, lambda=0.05, pi0.method="bootstrap"), error=function(e) NULL)

if (class(qobj) == "qvalue") {
    qvalues <- qobj$qvalues
} else {
    qvalues <- NA
}

geneID <- sapply(qTermID2ExtID, function(i) paste(i, collapse="/"))
geneID <- geneID[qTermID]
Over <- data.frame(Over,
                   p.adjust = p.adj,
                   qvalue = qvalues,
                   geneID = geneID,
                   Count = k,
                   stringsAsFactors = FALSE)

Description <- TERM2NAME(qTermID, USER_DATA)

if (length(qTermID) != length(Description)) {
    idx <- qTermID %in% names(Description)
    Over <- Over[idx,]
}
Over$Description <- Description
nc <- ncol(Over)
Over <- Over[, c(1,nc, 2:(nc-1))]


Over <- Over[order(pvalues),]


Over$ID <- as.character(Over$ID)
Over$Description <- as.character(Over$Description)

row.names(Over) <- as.character(Over$ID)

x <- new("enrichResult",
         result         = Over,
         pvalueCutoff   = pvalueCutoff,
         pAdjustMethod  = pAdjustMethod,
         qvalueCutoff   = qvalueCutoff,
         gene           = as.character(gene),
         universe       = extID,
         geneSets       = geneSets,
         organism       = "UNKNOWN",
         keytype        = "UNKNOWN",
         ontology       = "UNKNOWN",
         readable       = FALSE
         )
return (x)

核心就是这个代码了：

args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                  numW=M,        ## White balls
                  numB=N-M,      ## Black balls
                  numDrawn=n)    ## balls drawn


## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
             phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
             )

phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

其中：

第一个参数：q
vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.

第二个参数：m
the number of white balls in the urn.

第三个参数：n
the number of black balls in the urn.

第四个参数：k
the number of balls drawn from the urn.

ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25

关键的只有两个数：

GeneRatio：我输入的基因数854（n），其中在pathway A里的有68个（k）

BgRatio：总共背景（有注释）基因数23210（N），其中pathway A里的基因数364个（M）

phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)

带入数字：

phyper(67, 364, 23210-364, 854, lower.tail = TRUE, log.p = FALSE) = 6.07163831922482e-29

结果一致。

进阶：

lower.tail是什么
为什么第一个参数要减1

参考：

超几何分布检验（hypergeometric test）

https://github.com/YuLab-SMU/DOSE/blob/5be8e21c56242d58b5576c20412b3457bc61dae7/R/enricher_internal.R

不靠谱的富集软件太多了！

查看全文

相关阅读:
到底什么级别才算是高并发？
阿里大佬教你，如何写好 Java 代码！
Java 13 发布了！
年轻人的第一个自定义 Spring Boot Starter！
懵圈了，面试官问一个 TCP 连接可发多少个 HTTP 请求？
Java 和操作系统交互，你猜会发生什么？
不用找了，基于 Redis 的分布式锁实战来了！
中国剩余定理
 欧几里德与扩展欧几里德
 大数mod的技巧

原文地址：https://www.cnblogs.com/leezx/p/12510385.html