zoukankan      html  css  js  c++  java
  • 手动计算富集分析

    富集分析非常常见,用于判断抽样的结果是否显著。

    例子1:一个工厂总共有N件产品,其中M件次品,现在从中抽取n件做检查,抽到k件次品的概率分布服从超几何分布。

    例子2:一个细胞有N个基因,其中在pathway A里面有M个基因,现在从中抽取n个基因,抽到k个pathway A里基因的概率分布服从超几何分布。

    最靠谱的富集分析当属clusterProfiler,里面的enrichGO可以做富集分析。

    现在我有个性化的需求,所以要自己做,想借鉴一下里面的代码。

    查看enrichGO代码,发现是里面的enricher_internal在做这件事,去GitHub查,发现enricher_internal是DOSE的函数,继续去GitHub查,定位到enricher_internal。

    代码一览无余:

    termID2ExtID <- termID2ExtID[idx]
    qTermID2ExtID <- qTermID2ExtID[idx]
    qTermID <- unique(names(qTermID2ExtID))
    
    ## prepare parameter for hypergeometric test
    k <- sapply(qTermID2ExtID, length)
    k <- k[qTermID]
    M <- sapply(termID2ExtID, length)
    M <- M[qTermID]
    
    N <- rep(length(extID), length(M))
    ## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop.
    n <- rep(length(qExtID2TermID), length(M))
    args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                          numW=M,        ## White balls
                          numB=N-M,      ## Black balls
                          numDrawn=n)    ## balls drawn
    
    
    ## calcute pvalues based on hypergeometric model
    pvalues <- apply(args.df, 1, function(n)
                     phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
                     )
    
    ## gene ratio and background ratio
    GeneRatio <- apply(data.frame(a=k, b=n), 1, function(x)
                       paste(x[1], "/", x[2], sep="", collapse="")
                       )
    BgRatio <- apply(data.frame(a=M, b=N), 1, function(x)
                     paste(x[1], "/", x[2], sep="", collapse="")
                     )
    
    
    Over <- data.frame(ID = as.character(qTermID),
                       GeneRatio = GeneRatio,
                       BgRatio = BgRatio,
                       pvalue = pvalues,
                       stringsAsFactors = FALSE)
    
    p.adj <- p.adjust(Over$pvalue, method=pAdjustMethod)
    qobj <- tryCatch(qvalue(p=Over$pvalue, lambda=0.05, pi0.method="bootstrap"), error=function(e) NULL)
    
    if (class(qobj) == "qvalue") {
        qvalues <- qobj$qvalues
    } else {
        qvalues <- NA
    }
    
    geneID <- sapply(qTermID2ExtID, function(i) paste(i, collapse="/"))
    geneID <- geneID[qTermID]
    Over <- data.frame(Over,
                       p.adjust = p.adj,
                       qvalue = qvalues,
                       geneID = geneID,
                       Count = k,
                       stringsAsFactors = FALSE)
    
    Description <- TERM2NAME(qTermID, USER_DATA)
    
    if (length(qTermID) != length(Description)) {
        idx <- qTermID %in% names(Description)
        Over <- Over[idx,]
    }
    Over$Description <- Description
    nc <- ncol(Over)
    Over <- Over[, c(1,nc, 2:(nc-1))]
    
    
    Over <- Over[order(pvalues),]
    
    
    Over$ID <- as.character(Over$ID)
    Over$Description <- as.character(Over$Description)
    
    row.names(Over) <- as.character(Over$ID)
    
    x <- new("enrichResult",
             result         = Over,
             pvalueCutoff   = pvalueCutoff,
             pAdjustMethod  = pAdjustMethod,
             qvalueCutoff   = qvalueCutoff,
             gene           = as.character(gene),
             universe       = extID,
             geneSets       = geneSets,
             organism       = "UNKNOWN",
             keytype        = "UNKNOWN",
             ontology       = "UNKNOWN",
             readable       = FALSE
             )
    return (x)
    

      

    核心就是这个代码了:

    args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                      numW=M,        ## White balls
                      numB=N-M,      ## Black balls
                      numDrawn=n)    ## balls drawn
    
    
    ## calcute pvalues based on hypergeometric model
    pvalues <- apply(args.df, 1, function(n)
                 phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
                 )
    

      

    phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

    其中:

    第一个参数:q
    vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.

    第二个参数:m
    the number of white balls in the urn.

    第三个参数:n
    the number of black balls in the urn.

    第四个参数:k
    the number of balls drawn from the urn.

    ID Description GeneRatio BgRatio pvalue p.adjust qvalue
    GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25

    关键的只有两个数:

    GeneRatio:我输入的基因数854(n),其中在pathway A里的有68个(k)

    BgRatio:总共背景(有注释)基因数23210(N),其中pathway A里的基因数364个(M)

    phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)

    带入数字:

    phyper(67, 364, 23210-364, 854, lower.tail = TRUE, log.p = FALSE) = 6.07163831922482e-29

    结果一致。

    进阶:

    • lower.tail是什么
    • 为什么第一个参数要减1

    参考:

    超几何分布检验(hypergeometric test)

    https://github.com/YuLab-SMU/DOSE/blob/5be8e21c56242d58b5576c20412b3457bc61dae7/R/enricher_internal.R

    不靠谱的富集软件太多了!

  • 相关阅读:
    oracle数据表数据同步公用方法
    OSI参考模型详解
    DOM事件对象
    DOM事件流
    DOM之节点操作
    DOM之设置,获取,删除自定义的属性值
    JS修改元素的属性
    DOM获取元素以及绑定事件
    Web API
    JS中的简单数据类型和复杂数据类型
  • 原文地址:https://www.cnblogs.com/leezx/p/12510385.html
Copyright © 2011-2022 走看看