zoukankan      html  css  js  c++  java
  • 基于基因表达的打分模型

    Cell cycle analysis was conducted using the cyclone function in the scran R package (39). For every cell cluster, each cell was assigned into G1, G2M and S phases. 

    可以学习一下,来自:PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data

    根据基因表达来打分的需求有很多很多:

    • 细胞周期打分
    • 增殖指数 - 干细胞、癌细胞
    • 迁移指数 - 某些特定细胞,如ENCC
    • 代谢指数

    参考文章:

    Proliferation Index: A Continuous Model to Predict Prognosis in Patients with Tumours of the Ewing's Sarcoma Family

    Personalized Prediction of Proliferation Rates and Metabolic Liabilities in Cancer Biopsies

    midbrain那篇Cell

    LGE那篇Nature 

    Morphology-based prediction of cancer cell migration using an artificial neural network and a random decision forest - 不是基于基因表达的

    Quantifying rates of cell migration and cell proliferation in co-culture barrier assays reveals how skin and melanoma cells interact during melanoma spreading and invasion.

    Research Techniques Made Simple: Analysis of Collective Cell Migration Using the Wound Healing Assay

    Cell Migration in Development and Disease

    Increased cell proliferation and neurogenesis in the adult human Huntington's disease brain

    做的人很多,现在想做必须加入一些新特征,cell type区分等。

    cell cycle score

    Removal of cell cycle effect.

    The normalization method described above aims to reduce the effect of technical factors in scRNA-seq data (primarily, depth) from downstream analyses. However, heterogeneity in cell cycle stage, particularly among mitotic cells transitioning between S and G2/M phases, also can drive substantial transcriptomic variation that can mask biological signal. To mitigate this effect, we use a two-step approach:

    1) quantify cell cycle stage for each cell using supervised analyses with known stage-specific markers,

    2) regress the effect of cell cycle stage using the same negative binomial regression as outlined above.

    For the first step we use a previously published list of cell cycle dependent genes (43 S phase genes, 54 G2/M phase genes) for an enrichment analysis similar to that proposed in ref. 11.

    For each cell, we compare the sum of phase-specific gene expression (log10 transformed UMIs) to the distribution of 100 random background genes sets, where the number of background genes is identical to the phase gene set, and the background genes are drawn from the same expression bins. Expression bins are defined by 50 non-overlapping windows of the same range based on log10(mean UMI). The phase-specific enrichment score is the expression z-score relative to the mean and standard deviation of the background gene sets. Our final ‘cell cycle score’ (Extended Data Fig. 1) is the difference between S-phase score and G2/M-phase score.

    除了要normalize测序深度,还要考虑细胞周期的阶段(总共有四个细胞周期stage),尤其是S期和G2/M期。这里首先定量哪些已知的细胞周期的marker (43 S phase genes, 54 G2/M phase genes),其次,比较总和/100个随机的背景基因;背景基因来自相同的bin(50 non-overlapping windows of the same range),最后计算了一个z-score。

    note:  A z-score for a sample indicates the number of standard deviations away from the mean of expression in the reference. The formula is : z = (expression in tumor sample - mean expression in reference sample) / standard deviation of expression in reference sample.

    For a final normalized dataset with cell cycle effect removed, we perform negative binomial regression with technical factors and cell cycle score as predictors. Although the cell cycle activity was regressed out of the data for downstream analysis, we stored the computed cell cycle score before regression, enabling us to remember the mitotic phase of each individual cell. Notably, our regression strategy is tailored to mitigate the effect of transcriptional heterogeneity within mitotic cells in different phases, and should not affect global differences between mitotic and non-mitotic cells that may be biologically relevant.

    声称自己的normalization只去除了mitotic cells内部的差异。

    # input S markers, and N=100, and gene bin (all genes are divided to 33 bins)
    get.bg.lists <- function(goi, N, expr.bin) {
      res <- list()
      # check the marker local in which bin, and table the count
      goi.bin.tab <- table(expr.bin[goi])
      for (i in 1:N) {
      	# each time, random select 38 background genes and merge
        res[[i]] <- unlist(lapply(names(goi.bin.tab), function(b) {
          # find the genes in this bin & genes not in the markers
          sel <- which(expr.bin == as.numeric(b) & !(names(expr.bin) %in% goi))
          # random sample, count corespond to the number of marker
          sample(names(expr.bin)[sel], goi.bin.tab[b])
        }))
      }
      return(res)
    }
    
    # input log2 expr, S markers, 100 background genes
    enr.score <- function(expr, goi, bg.lst) {
      # S markers mean of each cell
      goi.mean <- apply(expr[goi, ], 2, mean)
      # get background genes mean by 100 samples
      bg.mean <- sapply(1:length(bg.lst), function(i) apply(expr[bg.lst[[i]], ], 2, mean))
      # return z-score
      return((goi.mean - apply(bg.mean, 1, mean)) / apply(bg.mean, 1, sd))
    }
    
    
    get.cc.score <- function(cm, N=100, seed=42) {
      set.seed(seed)
      cat('get.cc.score, ')
      cat('number of random background gene sets set to', N, '
    ')
      min.cells <- 5 # min detected gene number  
      cells.mols <- apply(cm, 2, sum) # total umi
      gene.cells <- apply(cm>0, 1, sum) # total detected gene number
      cm <- cm[gene.cells >= min.cells, ] # filter
      gene.mean <- apply(cm, 1, mean) # gene mean
      # divide all genes to 50 bins
      breaks <- unique(quantile(log10(gene.mean), probs = seq(0,1, length.out = 50)))
      gene.bin <- cut(log10(gene.mean), breaks = breaks, labels = FALSE) # table(gene.bin) to see count
      names(gene.bin) <- rownames(cm)
      gene.bin[is.na(gene.bin)] <- 0
      regev.s.genes <- read.table(file='./annotation/s_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1
      regev.g2m.genes <- read.table(file='./annotation/g2m_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1
      # list for go, mouse to human
      goi.lst <- list('S'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.s.genes))],
                      'G2M'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.g2m.genes))])
      
      n <- min(40, min(sapply(goi.lst, length))) # 38
      # sort the marker by their expression
      goi.lst <- lapply(goi.lst, function(x) x[order(gene.mean[x], decreasing = TRUE)[1:n]])
      # double list, background gene list
      bg.lst <- list('S'=get.bg.lists(goi.lst[['S']], N, gene.bin),
                     'G2M'=get.bg.lists(goi.lst[['G2M']], N, gene.bin))
      # merge all genes, markers  + background, sort by name
      all.genes <- sort(unique(c(unlist(goi.lst, use.names=FALSE), unlist(bg.lst, use.names=FALSE))))
      expr <- log10(cm[all.genes, ]+1)
      # 
      s.score <- enr.score(expr, goi.lst[['S']], bg.lst[['S']])
      g2m.score <- enr.score(expr, goi.lst[['G2M']], bg.lst[['G2M']])
      
      phase <- as.numeric(g2m.score > 2 & s.score <= 2)
      phase[g2m.score <= 2 & s.score > 2] <- -1
      
      return(data.frame(score=s.score-g2m.score, s.score, g2m.score, phase))
    }
    

      

    问题:

    1. 这确实是一种打分模型,但是如何评估模型是否准确呢?


    想构建一个打分模型,用于scRNA-seq的细胞类型打分,以及bulk RNA-seq的细胞组成成分预测。

    其实有篇文章已经做过这种工作了,做了第一部分,第二部分还没有评估。

     

     

  • 相关阅读:
    AndroidのActivity启动模式
    Android 垃圾回收,用软引用建立缓存
    如何在eclipse的配置文件里指定jdk路径
    Chrome浏览器扩展开发系列之二:Google Chrome浏览器扩展的调试
    Chrome浏览器扩展开发系列之三:Google Chrome浏览器扩展的架构
    Chrome浏览器扩展开发系列之四:Browser Action类型的Chrome浏览器扩展
    Chrome浏览器扩展开发系列之五:Page Action类型的Chrome浏览器扩展
    Chrome浏览器扩展开发系列之六:options 页面
    手把手教你开发Chrome扩展三:关于本地存储数据
    手把手教你开发Chrome扩展二:为html添加行为
  • 原文地址:https://www.cnblogs.com/leezx/p/9100306.html
Copyright © 2011-2022 走看看