zoukankan      html  css  js  c++  java
  • blast | diamond 输出结果选择和解析 | 比对

    之前的文章:构建NCBI本地BLAST数据库 (NR NT等) | blastx/diamond使用方法 | blast构建索引 | makeblastdb

    本地运行blast时,需要指定out format。

    常见的网页版blast结果可以参照:Blast结果的详细解析

    *** Formatting options
     -outfmt <String>
       alignment view options:
         0 = Pairwise,
         1 = Query-anchored showing identities,
         2 = Query-anchored no identities,
         3 = Flat query-anchored showing identities,
         4 = Flat query-anchored no identities,
         5 = BLAST XML,
         6 = Tabular,
         7 = Tabular with comment lines,
         8 = Seqalign (Text ASN.1),
         9 = Seqalign (Binary ASN.1),
        10 = Comma-separated values,
        11 = BLAST archive (ASN.1),
        12 = Seqalign (JSON),
        13 = Multiple-file BLAST JSON,
        14 = Multiple-file BLAST XML2,
        15 = Single-file BLAST JSON,
        16 = Single-file BLAST XML2,
        18 = Organism Report
    
       Options 6, 7 and 10 can be additionally configured to produce
       a custom format specified by space delimited format specifiers.
       The supported format specifiers are:
       	    qseqid means Query Seq-id
       	       qgi means Query GI
       	      qacc means Query accesion
       	   qaccver means Query accesion.version
       	      qlen means Query sequence length
       	    sseqid means Subject Seq-id
       	 sallseqid means All subject Seq-id(s), separated by a ';'
       	       sgi means Subject GI
       	    sallgi means All subject GIs
       	      sacc means Subject accession
       	   saccver means Subject accession.version
       	   sallacc means All subject accessions
       	      slen means Subject sequence length
       	    qstart means Start of alignment in query
       	      qend means End of alignment in query
       	    sstart means Start of alignment in subject
       	      send means End of alignment in subject
       	      qseq means Aligned part of query sequence
       	      sseq means Aligned part of subject sequence
       	    evalue means Expect value
       	  bitscore means Bit score
       	     score means Raw score
       	    length means Alignment length
       	    pident means Percentage of identical matches
       	    nident means Number of identical matches
       	  mismatch means Number of mismatches
       	  positive means Number of positive-scoring matches
       	   gapopen means Number of gap openings
       	      gaps means Total number of gaps
       	      ppos means Percentage of positive-scoring matches
       	    frames means Query and subject frames separated by a '/'
       	    qframe means Query frame
       	    sframe means Subject frame
       	      btop means Blast traceback operations (BTOP)
       	    staxid means Subject Taxonomy ID
       	  ssciname means Subject Scientific Name
       	  scomname means Subject Common Name
       	sblastname means Subject Blast Name
       	 sskingdom means Subject Super Kingdom
       	   staxids means unique Subject Taxonomy ID(s), separated by a ';'
       			 (in numerical order)
       	 sscinames means unique Subject Scientific Name(s), separated by a ';'
       	 scomnames means unique Subject Common Name(s), separated by a ';'
       	sblastnames means unique Subject Blast Name(s), separated by a ';'
       			 (in alphabetical order)
       	sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
       			 (in alphabetical order)
       	    stitle means Subject Title
       	salltitles means All Subject Title(s), separated by a '<>'
       	   sstrand means Subject Strand
       	     qcovs means Query Coverage Per Subject
       	   qcovhsp means Query Coverage Per HSP
       	    qcovus means Query Coverage Per Unique Subject (blastn only)
       When not provided, the default value is:
       'qaccver saccver pident length mismatch gapopen qstart qend sstart send
       evalue bitscore', which is equivalent to the keyword 'std'
       Default = `0'
    

    默认是0,也就是会输出比对的结果。

    但是这样的结果显然不适合批量处理,批量处理的文件格式显然必须是dataframe。

    所以网上有人推荐“outfmt 7 or 10 works perfect”,所以一般就选10吧。

    Diamond的输出格式:

    --outfmt (-f)          output format
    	0   = BLAST pairwise
    	5   = BLAST XML
    	6   = BLAST tabular
    	100 = DIAMOND alignment archive (DAA)
    	101 = SAM
    
    	Value 6 may be followed by a space-separated list of these keywords:
    
    	qseqid means Query Seq - id
    	qlen means Query sequence length
    	sseqid means Subject Seq - id
    	sallseqid means All subject Seq - id(s), separated by a ';'
    	slen means Subject sequence length
    	qstart means Start of alignment in query
    	qend means End of alignment in query
    	sstart means Start of alignment in subject
    	send means End of alignment in subject
    	qseq means Aligned part of query sequence
    	sseq means Aligned part of subject sequence
    	evalue means Expect value
    	bitscore means Bit score
    	score means Raw score
    	length means Alignment length
    	pident means Percentage of identical matches
    	nident means Number of identical matches
    	mismatch means Number of mismatches
    	positive means Number of positive - scoring matches
    	gapopen means Number of gap openings
    	gaps means Total number of gaps
    	ppos means Percentage of positive - scoring matches
    	qframe means Query frame
    	btop means Blast traceback operations(BTOP)
    	staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
    	stitle means Subject Title
    	salltitles means All Subject Title(s), separated by a '<>'
    	qcovhsp means Query Coverage Per HSP
    	qtitle means Query title
    
    	Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    

    diamond就选6吧,便于批量处理。

    diamond 比对转录本到Pfam库的部分结果,可以看到,格式6非常适合做批量处理。

    wgs.RNAseq_00000687     M1AFS7.1        60.9    179     65      2       337     861     102     279     8.1e-55 221.5
    wgs.RNAseq_00000687     A0A0B2PIQ0.1    61.0    177     66      2       337     861     319     494     4.0e-54 219.2
    wgs.RNAseq_00000687     A0A0L9UG27.1    61.2    178     65      3       337     861     315     491     4.0e-54 219.2
    wgs.RNAseq_00000687     A0A0L9UG27.1    63.2    38      14      0       6       119     250     287     1.0e-04 55.1
    wgs.RNAseq_00000688     A0A0D2QMP5.1    65.8    114     35      2       218     550     317     429     1.7e-34 153.3
    wgs.RNAseq_00000688     A0A0D2QMP5.1    53.2    62      26      1       36      221     256     314     8.7e-10 71.2
    wgs.RNAseq_00000688     A0A0D2TR34.1    65.8    114     35      2       218     550     322     434     1.7e-34 153.3
    wgs.RNAseq_00000688     A0A0D2TR34.1    53.2    62      26      1       36      221     261     319     8.7e-10 71.2
    wgs.RNAseq_00000688     F6H0G0.1        65.8    111     36      2       221     550     320     429     5.6e-33 148.3  

    每一列是什么呢? BLASTn output format 6

    Column headers:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

     1.  qseqid  query (e.g., gene) sequence id
     2.  sseqid  subject (e.g., reference genome) sequence id
     3.  pident  percentage of identical matches
     4.  length  alignment length
     5.  mismatch  number of mismatches
     6.  gapopen  number of gap openings
     7.  qstart  start of alignment in query
     8.  qend  end of alignment in query
     9.  sstart  start of alignment in subject
     10.  send  end of alignment in subject
     11.  evalue  expect value
     12.  bitscore  bit score

    比对结果的第三列和第四列非常有用,尤其是在鉴别软件stringtie等预测出来的转录本是否为有效转录本时,其实预测出来的转录本大部分都是没有意义的,但是又能部分hit到蛋白上,这时我们就只能选出比对最长的那个转录本,其余的可以看作是无效的转录本。

    操作比较简单,先对比对长度排个序,从长到短。

    cat extract_no_N_200.fasta.diamond.nr | sort -n -r -k4 > extract_no_N_200.fasta.diamond.nr.sort
    

    其次就是利用python的panda模块去冗余就好了。

    import pandas as pd
    
    infile = "extract_no_N_200.fasta.diamond.nr.sort"
    #infile = "test.sort"
    
    df = pd.read_csv(infile, sep="	", header=None)
    df.columns = ["l1","l2","l3","l4","l5","l6","l7","l8","l9","l10","l11","l12"]
    
    df1 = df.sort_values(['l4'],ascending=False)
    
    df2 = df1.drop_duplicates(subset=[df1.columns[1]], keep = 'first')
    
    df2.to_csv("rm_dup_protein_"+infile, sep="	", index=False, header=False)
    
    df3 = df2.sort_values(['l3'],ascending=False)
    
    df4 = df3.drop_duplicates(subset=[df3.columns[0]], keep = 'first')
    
    df4.to_csv("unique_protein_"+infile, sep="	", index=False, header=False)
    

      

    这样我们得到的转录本就基本上是无冗余的了。

    然后对每个转录本取identical match比例最高的蛋白就好了。其实信息是可以全部保留的。

    blast和diamond怎么只输出最优的hit?

    blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1
    

      

    待续~

  • 相关阅读:
    React倒计时功能实现——解耦通用
    如何在 UmiJs 中配置使用 Sass
    Redux入门实战——todo-list2.0实现
    每天学点JavaScript基础(2)——JavaScript里的分号,你加还是不加?
    每天学点JavaScript基础(1)—— null 和 undefined
    CSS画图
    React入门实战实例——ToDoList实现
    博客园样式管理总结(个人博客园装修指南)
    如何使用VSCode发布博客到博客园
    React入门
  • 原文地址:https://www.cnblogs.com/leezx/p/8603166.html
Copyright © 2011-2022 走看看