zoukankan      html  css  js  c++  java
  • 【基因组注释】ncRNA注释

    1. ncRNA

    非编码RNA(Non-coding RNA, ncRNA) 包括rRNA,tRNA,snRNA,snoRNA 和microRNA 等不编码蛋白质的RNA,它们转录后直接在RNA 水平上就能行使各自的生物学功能,并不需要翻译成蛋白质。

    2. 软件

    tRNA注释

    一般用tRNAscan-SE,老牌软件。

    rRNA注释

    可用RNAMMER注释。但需要edu邮箱下载。

    或者用barrnap,开源软件,也非常好用。

    其他ncRNA注释

    Infernal软件+Rfam数据库是注释其他复杂ncRNA的常用组合,它的结果同时包含了tRNA,rRNA,snRNA,snoRNA和miRNA等。

    对比以上两款软件的tRNA和rRNA结果,Infernal得到的结果差不太多。所以如果不重点研究ncRNA,用这个组合也可。

    3. 注释

    tRNA

    以tRNAscan为例:

    tRNAscan-SE genome.fa -o tRNA.out -f tRNA.ss -m tRNA.stats
    

    tRNA.stats文件中包含了tRNA预测的结果统计。
    image.png
    可通过脚本将tRNA.out整理为gff3。

    perl convert_tRNAScanSE_to_gff3.pl -i tRNA.out >tRNAscan.gff3
    

    PS:convert_tRNAScanSE_to_gff3.pl 可参考:github: https://github.com/jorvis/biocode/blob/master/gff/convert_tRNAScanSE_to_gff3.pl

    rRNA

    以barrnap为例:

    barrnap-master/bin/barrnap 
    	--kingdom euk    #euk bac arc mito
    	--threads 10 
    	--outseq rRNA.fa  #rRNA结果序列
    	genome.fa >rRNA.gff3  #rRNA gff结果
    

    PS:不建议用conda安装使用,会报错。如[barrnap] ERROR: nhmmer failed to run - Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.

    得到的结果可进一步统计各核糖体亚基数目。

    • 细菌bacteria (5S,23S,16S)
    • 古菌archaea (5S,5.8S,23S,16S)
    • 多细胞生物线粒体metazoan mitochondria (12S,16S)
    • 真核生物eukaryotes (5S,5.8S,28S,18S)

    snRNA、miRNA等

    首次下载安装infernal和Rfam数据库。infernal使用conda方便,Rfam数据库重点是Rfam.cm和Rfam.clanin的下载:

    # install the package with conda 
    conda install -c bioconda infernal
    
    #  download Rfam data 
    wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
    gunzip Rfam.cm.gz
    wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin
    
    # cmpress 索引
    cmpress Rfam.cm
    

    然后使用cmscan注释ncRNA:

    genome_total=400000000  #基因组大小
    CMnumber=`grep "ACC" /database/Rfam/Rfam.cm|wc -l`
    Z=`echo $genome_total*2*$CMnumber/1000000 |bc`
    #echo $Z
    
    cmscan -Z $Z --cut_ga --rfam --nohmmonly --tblout genome.tblout  
      --fmt 2 --cpu 30 --clanin /database/Rfam/Rfam.clanin 
      /database/Rfam/Rfam.cm genome.fa > genome.cmscan
    

    主要有两个结果文件genome.cmscan(比对结果)和genome.tblout(table格式)。要想得到进一步分类统计结果,需要做一些处理。

    4. snRNA、miRNA等结果的统计

    首先,提取必须的列和非重叠区域或重叠区域中得分高的区域。

    awk 'BEGIN{OFS="	";}{if(FNR==1) print "target_name	accession	query_name	query_start	query_end	strand	score	Evalue"; if(FNR>2 && $20!="=" && $0!~/^#/) print $2,$3,$4,$10,$11,$12,$17,$18; }' my-genome.tblout >genome.tblout.final.xls
    

    image.png

    然后,需要根据Rfam数据库的注释来对结果进行分类。Rfam的注释可从官网拷贝(命名tmp):https://rfam.xfam.org/search/type

    image.png
    image.png

    image.png

    # 拆分第三列
    less tmp | awk 'BEGIN {FS=OFS="	"}{split($3,x,";");class=x[2];print $1,$2,$3,$4,class}' > rfam_anno.txt
    

    最后统计各类ncRNA注释结果。

     awk 'BEGIN{OFS=FS="	"}ARGIND==1{a[$2]=$5;}ARGIND==2{type=a[$1]; if(type=="") type="Others"; count[type]+=1;}END{for(type in count) print type, count[type];}' rfam_anno.txt genome.tblout.final.xls
    

    统计结果如下:

     riboswitch	1
     tRNA	813
     ribozyme	1
    Others	217
     miRNA	1948
     antisense	7
     rRNA	1451
     sRNA	1
     snRNA	667
    

    也可将注释结果整理为gff3文件:

    perl infernal-tblout2gff.pl --cmscan --fmt2 genome.tblout >genome.infernal.ncRNA.gff3
    

    附infernal-tblout2gff.pl

    #!/usr/bin/env perl
    # infernal-tblout2gff.pl: convert cmsearch or cmscan tblout files to GFF format
    
    # an important point of above ref:
    # "Start is always less than or equal to end"
    # EPN, Fri Jun  7 11:07:38 2019
    # 
    #
    use strict;
    use warnings;
    use Getopt::Long;
    
    my $in_tblout  = "";   # name of input tblout file
    
    my $usage;
    $usage  = "infernal-tblout2gff.pl
    
    ";
    $usage .= "Usage:
    
    ";
    $usage .= "infernal-tblout2gff.pl [OPTIONS] <cmsearch tblout file>
    	OR
    ";
    $usage .= "infernal-tblout2gff.pl --cmscan [OPTIONS] <cmscan tblout file>
    	OR
    ";
    $usage .= "infernal-tblout2gff.pl --cmscan --fmt2 [OPTIONS] <cmscan --fmt 2 tblout file>
    
    ";
    $usage .= "	OPTIONS:
    ";
    $usage .= "		-T <n>       : minimum bit score to include is <n>
    ";
    $usage .= "		-E <x>       : maximum E-value to include is <x>
    ";
    $usage .= "		--cmscan     : tblout file was created by cmscan
    ";
    $usage .= "		--source <s> : specify 'source' field should be <s> (e.g. tRNAscan-SE)
    ";
    $usage .= "		--fmt2       : tblout file was created with cmscan --fmt 2 option
    ";
    $usage .= "		--all        : output all info in 'attributes' column   [default: E-value]
    ";
    $usage .= "		--none       : output no info in 'attributes' column    [default: E-value]
    ";
    $usage .= "		--desc       : output desc field in 'attributes' column [default: E-value]
    ";
    $usage .= "		--version <s>: append "-<s>" to 'source' column
    ";
    $usage .= "		--extra <s>  : append "<s>;" to 'attributes' column
    ";
    $usage .= "		--hidedesc   : do not includ "desc:" prior to desc value in 'attributes' column
    ";
    
    my $do_minscore = 0;       # set to '1' if -T used
    my $do_maxevalue = 0;      # set to '1' if -E used
    my $minscore   = undef;    # defined if if -T used
    my $maxevalue  = undef;    # defined if -E used
    my $do_cmscan  = 0;        # set to '1' if --cmscan used
    my $do_fmt2    = 0;        # set to '1' if --fmt
    my $do_all_attributes = 0; # set to '1' if --all
    my $do_no_attributes  = 0; # set to '1' if --none
    my $do_de_attributes  = 0; # set to '1' if --desc
    my $version = undef;       # defined if --version used
    my $extra = undef;         # defined if --extra used
    my $do_hidedesc = 0;       # set to 1 if --hidedesc used
    my $opt_source = undef;    # set to <s> if --source <s> used
    
    &GetOptions( "T=s"       => $minscore,
                 "E=s"       => $maxevalue,
                 "cmscan"    => $do_cmscan,
                 "source=s"  => $opt_source,
                 "fmt2"      => $do_fmt2,
                 "all"       => $do_all_attributes,
                 "none"      => $do_no_attributes,
                 "desc"      => $do_de_attributes,
                 "version=s" => $version, 
                 "extra=s"   => $extra,
                 "hidedesc"  => $do_hidedesc);
    
    if(scalar(@ARGV) != 1) { die $usage; }
    my ($tblout_file) = @ARGV;
    
    if(defined $minscore)  { $do_minscore  = 1; }
    if(defined $maxevalue) { $do_maxevalue = 1; }
    if($do_minscore && $do_maxevalue) { 
      die "ERROR, -T and -E cannot be used in combination. Pick one.";
    }
    if(($do_all_attributes) && ($do_no_attributes)) { 
      die "ERROR, --all and --none cannot be used in combination. Pick one.";
    }
    if(($do_all_attributes) && ($do_de_attributes)) { 
      die "ERROR, --all and --desc cannot be used in combination. Pick one.";
    }
    if(($do_no_attributes) && ($do_de_attributes)) { 
      die "ERROR, --none and --desc cannot be used in combination. Pick one.";
    }
    if(($do_fmt2) && (! $do_cmscan)) { 
      die "ERROR, --fmt2 only makes sense in combination with --cmscan"; 
    }
    if((defined $opt_source) && (defined $version)) { 
      die "ERROR, --source and --version are incompatible";
    }
    
    if(! -e $tblout_file) { die "ERROR tblout file $tblout_file does not exist"; }
    if(! -s $tblout_file) { die "ERROR tblout file $tblout_file is empty"; }
    
    my $source = ($do_cmscan) ? "cmscan" : "cmsearch";
    if(defined $version)    { $source .= "-" . $version; }
    if(defined $opt_source) { $source = $opt_source; }
    
    open(IN, $tblout_file) || die "ERROR unable to open $tblout_file for reading"; 
    my $line;
    my $i;
    while($line = <IN>) { 
      if($line !~ m/^#/) { 
        chomp $line;
        my @el_A = split(/s+/, $line);
        my $nfields = scalar(@el_A);
        if((! $do_fmt2) && ($nfields < 18)) { 
          die "ERROR expected at least 18 space delimited fields in tblout line (fmt 1, default) but got $nfields on line:
    $line
    "; 
        }
        if(($do_fmt2) && ($nfields < 27)) { 
          die "ERROR expected at least 27 space delimited fields in tblout line (fmt 2, --fmt2) but got $nfields on line:
    $line
    "; 
        }
        # ref Infernal 1.1.2 user guide, pages 59-61
        my $idx     = undef;
        my $seqname = undef;
        my $seqaccn = undef;
        my $mdlname = undef;
        my $mdlaccn = undef;
        my $clan    = undef;
        my $mdl     = undef;
        my $mdlfrom = undef;
        my $mdlto   = undef;
        my $seqfrom = undef;
        my $seqto   = undef;
        my $strand  = undef;
        my $trunc   = undef;
        my $pass    = undef;
        my $gc      = undef;
        my $bias    = undef;
        my $score   = undef;
        my $evalue  = undef;
        my $inc     = undef;
        my $olp     = undef;
        my $anyidx  = undef;
        my $anyfrct1= undef;
        my $anyfrct2= undef;
        my $winidx  = undef;
        my $winfrct1= undef;
        my $winfrct2= undef;
        my $desc    = undef;
    
        if($do_fmt2) { # 27 fields
          $idx     = $el_A[0];
          $seqname = ($do_cmscan) ? $el_A[3] : $el_A[1];
          $seqaccn = ($do_cmscan) ? $el_A[4] : $el_A[2];
          $mdlname = ($do_cmscan) ? $el_A[1] : $el_A[3];
          $mdlaccn = ($do_cmscan) ? $el_A[2] : $el_A[4];
          $clan    = $el_A[5];
          $mdl     = $el_A[6];
          $mdlfrom = $el_A[7];
          $mdlto   = $el_A[8];
          $seqfrom = $el_A[9];
          $seqto   = $el_A[10];
          $strand  = $el_A[11];
          $trunc   = $el_A[12];
          $pass    = $el_A[13];
          $gc      = $el_A[14];
          $bias    = $el_A[15];
          $score   = $el_A[16];
          $evalue  = $el_A[17];
          $inc     = $el_A[18];
          $olp     = $el_A[19];
          $anyidx  = $el_A[20];
          $anyfrct1= $el_A[21];
          $anyfrct2= $el_A[22];
          $winidx  = $el_A[23];
          $winfrct1= $el_A[24];
          $winfrct2= $el_A[25];
          $desc    = $el_A[26]; 
          for($i = 27; $i < $nfields; $i++) { $desc .= "_" . $el_A[$i]; }
        }
        else { # fmt 1, default
          $seqname = ($do_cmscan) ? $el_A[2] : $el_A[0];
          $seqaccn = ($do_cmscan) ? $el_A[3] : $el_A[1];
          $mdlname = ($do_cmscan) ? $el_A[0] : $el_A[2];
          $mdlaccn = ($do_cmscan) ? $el_A[1] : $el_A[3];
          $mdl     = $el_A[4];
          $mdlfrom = $el_A[5];
          $mdlto   = $el_A[6];
          $seqfrom = $el_A[7];
          $seqto   = $el_A[8];
          $strand  = $el_A[9];
          $trunc   = $el_A[10];
          $pass    = $el_A[11];
          $gc      = $el_A[12];
          $bias    = $el_A[13];
          $score   = $el_A[14];
          $evalue  = $el_A[15];
          $inc     = $el_A[16];
          $desc    = $el_A[17]; 
          for($i = 18; $i < $nfields; $i++) { $desc .= "_" . $el_A[$i]; }
        }
        # one sanity check, strand should make sense
        if(($strand ne "+") && ($strand ne "-")) { 
          if(($do_fmt2) && (($seqfrom eq "+") || ($seqfrom eq "-"))) { 
            die "ERROR problem parsing, you specified --fmt2 but tblout file appears to have NOT been created with --fmt 2, retry without --fmt2
    problematic line:
    $line
    "; 
          }
          if((! $do_fmt2) && (($pass eq "+") || ($pass eq "-"))) { 
            die "ERROR problem parsing, you did not specify --fmt2 but tblout file appears to have been created with --fmt 2, retry with --fmt2
    problematic line:
    $line
    "; 
          }
          die "ERROR unable to parse, problematic line:
    $line
    ";
        }
        if(($do_minscore) && ($score < $minscore)) { 
          ; # skip
        }
        elsif(($do_maxevalue) && ($evalue > $maxevalue)) { 
          ; # skip
        }
        else { 
          my $attributes = "evalue=" . $evalue; # default to just evalue
          if($do_all_attributes) { 
            if($do_fmt2) { 
              $attributes .= sprintf(";idx=$idx;seqaccn=$seqaccn;mdlaccn=$mdlaccn;clan=$clan;mdl=$mdl;mdlfrom=$mdlfrom;mdlto=$mdlto;trunc=$trunc;pass=$pass;gc=$gc;bias=$bias;inc=$inc;olp=$olp;anyidx=$anyidx;anyfrct1=$anyfrct1;anyfrct2=$anyfrct2;winidx=$winidx;winfrct1=$winfrct1;winfrct2=$winfrct2;%s$desc", ($do_hidedesc ? "" : "desc="));
            }
            else { 
              $attributes .= sprintf(";seqaccn=$seqaccn;mdlaccn=$mdlaccn;mdl=$mdl;mdlfrom=$mdlfrom;mdlto=$mdlto;trunc=$trunc;pass=$pass;gc=$gc;bias=$bias;inc=$inc;%s$desc", ($do_hidedesc ? "" : "desc="));
            }
          }
          elsif($do_no_attributes) { 
            $attributes = "-";
          }
          elsif($do_de_attributes) { 
            $attributes = $desc;
          }
          if(defined $extra) { 
            if($attributes eq "-")       { $attributes = ""; }
            elsif($attributes !~ m/;$/) { $attributes .= ";"; }
            $attributes .= $extra . ";";
          }
          printf("%s	%s	%s	%d	%d	%.1f	%s	%s	%s
    ", 
                 $seqname,                             # token 1: 'sequence' (sequence name)
                 $source,                              # token 2: 'source'
                 $mdlname,                             # token 3: 'feature' (model name) you may want to change this to 'ncRNA'
                 ($strand eq "+") ? $seqfrom : $seqto, # token 4: 'start' in coordinate space [1..seqlen], must be <= 'end'
                 ($strand eq "+") ? $seqto : $seqfrom, # token 5: 'end' in coordinate space [1..seqlen], must be >= 'start'
                 $score,                               # token 6: 'score' bit score
                 $strand,                              # token 7: 'strand' ('+' or '-')
                 ".",                                  # token 8: 'phase' irrelevant for noncoding RNAs
                 $attributes);                         # token 9: attributes, currently only E-value, unless --all, --none or --desc
        }
      }
    }
    

    附convert_tRNAScanSE_to_gff3.pl

    #!/usr/bin/env perl
    
    =head1 NAME
    github:jorvis/biocode
    tRNAScan_SE_to_gff3.pl - convert raw output of tRNAScan-SE to gff3
    =head1 SYNOPSIS
    USAGE: convert_tRNAScanSE_to_gff3.pl 
                --input=/path/to/some_file.out 
    =head1 OPTIONS
    B<--input,-i>
        The raw output from tRNAScan-SE:
        
        Sequence                        tRNA    Bounds  tRNA    Anti    Intron Bounds   Cove
        Name                    tRNA #  Begin   End     Type    Codon   Begin   End     Score
        --------                ------  ----    ------  ----    -----   -----   ----    ------
        tp.assembly.567468735.1         1       91820   91902   Tyr     GTA     91857   91866   66.58
        tp.assembly.567468735.1         2       171777  171849  Phe     GAA     0       0       70.28
        tp.assembly.567468735.1         3       172144  172215  His     GTG     0       0       64.04
        tp.assembly.567468735.1         4       852847  852919  Thr     AGT     0       0       75.69
        tp.assembly.567468735.1         5       877291  877362  Trp     CCA     0       0       68.97
        tp.assembly.567468735.1         6       1468229 1468300 Cys     GCA     0       0       72.10
        tp.assembly.567468735.1         7       2507459 2507530 Pro     AGG     0       0       62.33
        tp.assembly.567468735.1         8       2507198 2507127 Pro     CGG     0       0       65.73
        tp.assembly.567468735.1         9       2506317 2506246 Pro     TGG     0       0       66.60
        tp.assembly.567468735.1         10      2463785 2463713 Lys     TTT     0       0       79.47
        tp.assembly.567468735.1         11      2191149 2191069 Leu     CAG     0       0       57.47
        tp.assembly.567468735.1         12      1633307 1633237 Gly     CCC     0       0       65.52
        tp.assembly.567468735.1         13      1255051 1254968 Leu     CAA     0       0       60.46
        tp.assembly.567468735.1         14      251108  251037  Asp     GTC     0       0       59.48
        tp.assembly.567468735.1         15      250520  250449  Asp     GTC     0       0       59.48
    B<--log,-l> 
        Log file
    B<--ignoreIntrons,-g>
        Flag to ignore introns. tRNAs split by introns are usually flagged by NCBI's table2asn as too short.
    B<--help,-h>
        This help message
    =head1  DESCRIPTION
    File converter
    =head1  INPUT
    Input above.
    =head1  OUTPUT
    GFF3 to STDOUT
    =head1  CONTACT
        Kyle Tretina
        kyletretina@gmail.com
    =cut
    
    use warnings;
    use strict;
    use Getopt::Long qw(:config no_ignore_case no_auto_abbrev pass_through);
    use Pod::Usage;
    
    my %options = ();
    my $results = GetOptions (\%options, 
                              'input|i=s',
                              'log|l=s',
                              'ignoreIntrons|g',
                              'help|h') || pod2usage();
    
    ## display documentation
    if( $options{'help'} ){
        pod2usage( {-exitval => 0, -verbose => 2, -output => *STDERR} );
    }
    
    ## make sure everything passed was peachy
    &check_parameters(\%options);
    
    ## open the log if requested
    my $logfh;
    if (defined $options{log}) {
        open($logfh, ">$options{log}") || die "can't create log file: $!";
    }
    
    ## set ignoreIntrons flag
    my $ignoreIntrons = '';
    if (defined $options{ignoreIntrons}) {
        $ignoreIntrons = 1;
    }
    
    
    ## open the input file
    my $ifh;
    open($ifh, "<$options{input}") || die "can't open input file: $!";
    
    # all output needs the gff header
    print "##gff-version 3
    ";
    
    ## globals
    my $tRNAGenNum=1;
    my $tRNAExonNum=1;
    
    ## parse the file
    foreach my $line (<$ifh>){
    	my @cols = split /[	]/, $line;
    	chomp @cols;
    	my $contig = $cols[0];
    
        if ($contig =~ /^(.+?)s+$/) {
            $contig = $1;
        }
    
        ## skip the header lines
        next if $contig eq 'Sequence' || $contig eq 'Name' || $contig eq '--------';
        
    	my $start = $cols[2] + 0;
    	my $stop = $cols[3] + 0;
    	my $trna = $cols[4];
    	my $score = $cols[8];
        my $pseudo = "";
    	my $anticodon = $cols[5];
        my $intron_start = $cols[6]-1; ## offset intron boundaries
    	my $intron_end = $cols[7]+1;
    
        $pseudo = ";pseudogene=unknown" if $cols[9] =~ m/pseudo/;
        # Suppressor (sup) tRNAs are not recognized by NCBI's table2asn
        $trna = "Xxx" if $trna =~ m/Sup/ or $trna =~ m/Undet/;
    	
    	## Check if tRNAScan-SE predicted a possible intron (value in column 6 > 0)
    	if ($cols[6] > 0 and not $ignoreIntrons) {
       		if ($start > $stop){
    			my $intron_end = $cols[6]+1; ## offset intron boundaries
    			my $intron_start = $cols[7]-1;
                ($start, $stop) = ($stop, $start);
        	}
            print "$contig	tRNAScan-SE	gene	$start	$stop	$score	-	.	ID=tRNA-$trna$tRNAGenNum\_gene$pseudo
    ";
    		print "$contig	tRNAScan-SE	tRNA	$start	$intron_start	$score	-	.	ID=tRNA-$trna$tRNAExonNum\_tRNA;Parent=tRNA-$trna$tRNAGenNum\_gene;product=tRNA-$trna;anticodon=$anticodon;
    ";
            $tRNAExonNum++;
    		print "$contig	tRNAScan-SE	tRNA	$intron_end	$stop	$score	-	.	ID=tRNA-$trna$tRNAExonNum\_tRNA;Parent=tRNA-$trna$tRNAGenNum\_gene;product=tRNA-$trna;anticodon=$anticodon;
    ";
    		$tRNAGenNum++;
            $tRNAExonNum++;
    	} else{
    		($start, $stop) = ($stop, $start) if ($start > $stop);
            print "$contig	tRNAScan-SE	gene	$start	$stop	$score	-	.	ID=tRNA-$trna$tRNAGenNum\_gene$pseudo
    ";
    		print "$contig	tRNAScan-SE	tRNA	$start	$stop	$score	-	.	ID=tRNA-$trna$tRNAExonNum\_tRNA;product=tRNA-$trna;anticodon=$anticodon;Parent=tRNA-$trna$tRNAGenNum\_gene
    ";
    		$tRNAGenNum++;
            $tRNAExonNum++;
    	}
    }
    
    exit(0);
    
    
    sub _log {
        my $msg = shift;
        print $logfh "$msg
    " if $logfh;
    }
    
    sub check_parameters {
        my $options = shift;
        ## make sure required arguments were passed
        my @required = qw( input );
        for my $option ( @required ) {
            unless  ( defined $$options{$option} ) {
                die "--$option is a required option";
            }
        }
        ## handle some defaults
        $options{optional_argument2}   = 'foo'  unless ($options{optional_argument2});
    }
    
    

    https://www.jianshu.com/p/8f4061c38508
    http://blog.genesino.com/2017/06/Rfam/

  • 相关阅读:
    Installing OpenSSH from the Settings UI on Windows Server 2019 or Windows 10 1809
    [CB]Intel 2018架构日详解:新CPU&新GPU齐公布 牙膏时代有望明年结束
    [百科]数字孪生
    wifi 标准
    WM_CONCAT和LISTAGG 语法例子
    Informix 启动 Fatal error in shared memory initialization解决方法
    B站日志系统的前世今生
    nginx安全日志分析脚本的编写
    可视化web日志分析工具Logstalgia
    CDH-5.7.0:基于Parcels方式离线安装配置
  • 原文地址:https://www.cnblogs.com/jessepeng/p/15392809.html
Copyright © 2011-2022 走看看