zoukankan      html  css  js  c++  java
  • Chip-seq peak annontation

    Chip-seq peak annontation

    narrowPeak/boardPeak

    narrowPeak/boardPeakENCODE可提供下载的两种 Chip-seq 经过参考人类基因组mapping后的关于peak的数据.
    其他类型的seq数据储存个数可以参看FAQformat

    narrowPeak

    数据按照以下规则储存:

    1. string chrom: "Reference sequence chromosome or scaffold"
    2. uint   chromStart: "Start position in chromosome"
    3. uint   chromEnd: "End position in chromosome"
    4. string name: "Name given to a region (preferably unique). Use . if no name is assigned"
    5. uint   score: "Indicates how dark the peak will be displayed in the browser (0-1000) "
    6. char[1]  strand: "+ or - or . for unknown"
    7. float  signalValue: "Measurement of average enrichment for the region"
    8. float  pValue: "Statistical significance of signal value (-log10). Set to -1 if not used."
    9. float  qValue: "Statistical significance with multiple-test correction applied (FDR -log10). Set to -1 if n-ot used."
    10. int   peak: "Point-source called for this peak; 0-based offset from chromStart. Set to -1 if no point-sour-ce called."

    boardPeak

    数据按照以下规则储存:

    1. string chrom: "Reference sequence chromosome or scaffold"
    2. uint   chromStart: "Start position in chromosome"
    3. uint   chromEnd: "End position in chromosome"
    4. string name: "Name given to a region (preferably unique). Use . if no name is assigned"
    5. uint   score: "Indicates how dark the peak will be displayed in the browser (0-1000) "
    6. char[1]  strand: "+ or - or . for unknown"
    7. float  signalValue: "Measurement of average enrichment for the region"
    8. float  pValue: "Statistical significance of signal value (-log10). Set to -1 if<BR> not used."
    9. float  qValue: "Statistical significance with multiple-test correction applied (FDR -log10). Set to -1 if n-ot used."

    在接下去的peak annotation中,只演示narrowPeak.bed格式数据.

    示例数据

    演示数据来自ENCODE H3K9me3 的Chip-seq,样本ID为ENCFF199BLM.

    样本的基本信息:

    • Homo sapiens liver male adult (32 years)
      • Target: H3K9me3
      • Lab: Bing Ren, UCSD
      • Project: Roadmap

    narrowPeak 数据基本信息:

    ##   seqnames     start       end       name score strand signalValue  pValue
    ## 1    chr10 100134639 100134831  Peak_7974   178      .     4.41261 6.68330
    ## 2    chr10 100446376 100446664  Peak_4189   244      .     5.13248 8.41215
    ## 3    chr10 100779568 100779699 Peak_32369   101      .     3.34436 4.50936
    ## 4    chr10  10088147  10088346 Peak_28509   112      .     4.05830 4.84890
    ## 5    chr10 101149252 101149594  Peak_6146   211      .     4.89252 7.53305
    ## 6    chr10 101173156 101173424 Peak_32985    94      .     3.45278 4.33257
    ##    qValue peak
    ## 1 2.96490  132
    ## 2 4.37217  165
    ## 3 1.47374  105
    ## 4 1.69566   90
    ## 5 3.67439  174
    ## 6 1.30993  237

    构建注释文件

    在peak annotation中需要一个用于参考的注释文件,需要包括一下信息: 1. 染色体名 2. 转录本起始位点 3. 转录本终止位点 4. gene的名字 5. 正反链

    注释文件可以直接从外部导入,也可以利用biomaRt 生成

    library(ChIPpeakAnno)
    library(biomaRt)
    mart <- useMart("ensembl")
    datasets <- listDatasets(mart)
    mart <- useDataset("hsapiens_gene_ensembl",mart)
    # 需要筛选的特征
    props <- c("ensembl_gene_id", "external_gene_name", "transcript_biotype", "chromosome_name", "start_position", "end_position", "strand")
    
    # 筛选染色体号
    lincRNA <- subset(
      getBM(attributes=props, mart=mart, filters = "chromosome_name", values = c(1:22,"X","Y")),
      transcript_biotype == "lincRNA"
      )
    # 在染色体前加"chr"保持和narrowPeak数据一致
    lincRNA[,4] <- paste0("chr", lincRNA[,4])

    得到的数据包含以下信息:

    ##   ensembl_gene_id external_gene_name transcript_biotype chromosome_name
    ## 1 ENSG00000276255       RP5-881P19.7            lincRNA            chr1
    ## 2 ENSG00000234277          LINC01641            lincRNA            chr1
    ## 3 ENSG00000238107      RP11-495P10.5            lincRNA            chr1
    ## 4 ENSG00000274020          LINC01138            lincRNA            chr1
    ## 5 ENSG00000225620      RP11-569A11.2            lincRNA            chr1
    ## 6 ENSG00000237520       RP11-443B7.2            lincRNA            chr1
    ##   start_position end_position strand
    ## 1      228073909    228076550     -1
    ## 2      227393591    227431035      1
    ## 3      148295180    148297556      1
    ## 4      148290889    148519604     -1
    ## 5      202632428    202632911      1
    ## 6      234957231    234959989      1

    进行peak annotation

    有了这个注释文件以后,我们就可以根据我们想要的筛选规则对peak进行注释,主要用到的包是 ChIPpeakAnno.用到的函数为annotatePeakInBatch.

    # 将前面得到的注释文件转换为RangedData对象
    library(ChIPpeakAnno)
    myCustomAnno <- RangedData(
      IRanges(
        start=lincRNA[,"start_position"], 
        end=lincRNA[,"end_position"],
        names=lincRNA[,"ensembl_gene_id"]),
      space=lincRNA[,"chromosome_name"],
      strand=lincRNA[,"strand"])
    # 读入需要注释的narrowPeak数据
    bed_file <- read.table("ENCFF199BLM.bed", header = T, sep = "	", stringsAsFactors = F)
    # 将peak数据转换为GRanges
    peaks <- toGRanges(bed_file, format="narrowPeak", colNames = colnames(bed_file))
    # 根据需要进行筛选
    anno <- annotatePeakInBatch(peaks, AnnotationData=myCustomAnno, 
                                output="overlapping", 
                                FeatureLocForDistance="TSS",
                                bindingRegion=c(-2000, 2000))

    anno中提取我们需要的lincRNA的ID

    result_lincRNA <- anno@elementMetadata$feature
    head(result_lincRNA)
    ## [1] "ENSG00000237579" "ENSG00000235180" "ENSG00000232259" "ENSG00000204365"
    ## [5] "ENSG00000226578" "ENSG00000260137"
  • 相关阅读:
    POI做报表
    (一) DB2的备份和恢复:准备
    西天取经为节约成本该裁掉哪位?
    python中configpraser模块
    python中subprocess模块
    python中os模块
    python中random模块
    python中time模块和datetime模块
    python中序列化json模块和pickle模块
    迭代器生成器函数的递归调用与二分法
  • 原文地址:https://www.cnblogs.com/wwdPeRl/p/8541362.html
Copyright © 2011-2022 走看看