zoukankan      html  css  js  c++  java
  • R语言实现对基因组SNV进行注释

        很多时候,我们需要对取出的SNV进行注释,这个时候可能会在R上进行注释,通常注释文件都含有Chr(染色体)、Start(开始位点)、End(结束位点)、Description(描述),而我们的SNV文件通常是拥有Position(位置),因此我们可以先定位Chr,再用Postion去定位到Start和End之间,找到相对应的Description。为了加快速度,可以使用二分查找法。

     1 for (value in dt$value){
     2 #df:data.frame, V1 and V2 should be Start and End   value: Postition  used to find region  return:df row number where position locates  ,if no region return -1
     3     low=1
     4     high=nrow(df)
     5     mid=high %/% 2
     6     if (df[low,1] <= value & value <= df[low,2]) low
     7     else if (df[high,1] <= value & value <= df[high,2]) high
     8     else{
     9     while (value > df[mid,2] || value < df[mid,1]){
    10       if (value > df[mid,2]){
    11         low = mid+1
    12       } else if (value < df[mid,1]) {
    13         high = mid - 1
    14       } 
    15       if(high<low){
    16          mid=-1;break
    17       }
    18       mid=(low+high)%/%2
    19     }
    20       mid
    21 }
    22 }

    在R中使用for循环效率低,因此也可以用data.table包的foverlap函数,改进代码如下,对bed文件进行注释,如果要对snv进行注释,只需要将snv改成相应的start和end相等的bed文件即可。

     1 #!/bin/Rscript
     2 
     3 library(data.table)
     4 
     5 arg <- commandArgs(T)
     6 if (length(arg) != 3) {
     7     message("[usage]: BedAnnoGene.R bedfile gtffile outputfile")
     8     message("    bedfile format: chr start end information(Arbitrary but can not be lacked)")
     9     message("    GTFfile: gtf file downloaded from GENCODE")
    10     message("    outputfile: file to be writen out")
    11     message("    needed package: data.table 1.10.4")
    12     stop("Please check your arguments!")
    13 }
    14     
    15 bedfile <- arg[1]
    16 annofile <- arg[2]
    17 outfile <- arg[3]
    18 
    19 #read file 
    20 anno <- fread(annofile,sep="	",header=F)
    21 bed <- fread(bedfile,sep="	",header=F)
    22 setnames(anno,c("V1","V2","V3","V4","V5","V9"),c("Chr","Gene","Type","Start","End","Info"))
    23 anno <- anno[Type=="gene",.(Chr,Start,End,Gene=sapply(strsplit(tstrsplit(Info,";")[3][[1]],"""),function(x)x[2]))]
    24 setkey(anno,Chr,Start,End)
    25 setkey(bed,V1,V2,V3)
    26 
    27 #find overlaps by Chr
    28 lst <- list()
    29 for (ChrI in intersect(unique(bed$V1),unique(anno$Chr))){
    30   anno_reg <- anno[Chr == ChrI,.(Start,End)]
    31   bed_reg <- bed[V1 == ChrI,.(V2,V3)]
    32   setkey(anno_reg,Start,End)
    33   setkey(bed_reg,V2,V3)
    34   overl <- foverlaps(bed_reg,anno_reg,which=TRUE,nomatch = 0)
    35   if (nrow(overl) > 0){
    36     lst[[ChrI]] <- data.table(Chr=ChrI,bed[V1 == ChrI,][overl[["xid"]],.(V2,V3,V4)],anno[Chr == ChrI][overl[["yid"]],.(Gene)])
    37   }
    38 }
    39 merge_dt <- rbindlist(lst)
    40 setnames(merge_dt,c("V2","V3","V4"),c("Start","End","Name"))
    41 
    42 #if one region has more than one gene
    43 torm <- list()
    44 for (i in 1:(nrow(merge_dt)-1)){if(merge_dt[i,"Name"]==merge_dt[i+1,"Name"]){set(merge_dt,i+1L,ncol(merge_dt),paste(merge_dt[i,"Gene"],merge_dt[i+1,"Gene"],sep=";"));torm <- c(torm,list(i))}}
    45 torm <- unlist(torm)
    46 merge_dt <- merge_dt[-torm,]
    47 
    48 fwrite(merge_dt,file=outfile)

    使用帮助可以在我github看到   https://github.com/yiliu4234/BedAnnoGene

  • 相关阅读:
    动态规划——Best Time to Buy and Sell Stock IV
    动态规划——Split Array Largest Sum
    动态规划——Burst Ballons
    动态规划——Best Time to Buy and Sell Stock III
    动态规划——Edit Distance
    动态规划——Longest Valid Parentheses
    动态规划——Valid Permutations for DI Sequence
    构建之法阅读笔记05
    构建之法阅读笔记04
    构建之法阅读笔记03
  • 原文地址:https://www.cnblogs.com/ywliao/p/6679948.html
Copyright © 2011-2022 走看看