  • QIIME1: Picking OTUs

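    QIIME does not implement any clustering algorithm itself; it only wraps other OTU-picking software.
    Depending on the clustering strategy, OTU picking falls into three approaches:
    de novo:            pick_de_novo_otus.py
    closed-reference:   pick_closed_reference_otus.py
    open-reference:     pick_open_reference_otus.py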
     
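    Pros and cons of each approach:
    de novo:    pick_de_novo_otus.py
    Pros: all reads are clustered.
    Cons: no parallelization, so it is slow, and very slow once there are more than 10M reads.
    Use case: studying an uncommon marker gene.
     
    closed-reference: pick_closed_reference_otus.py
    Reads are aligned against a reference database, and reads that do not hit it are discarded. The reference sequences carry taxonomy annotations, so assigning taxonomy is straightforward.
    Pros: fully parallel and fast; better trees and taxonomy, because the OTUs in the reference database are already well defined.
    Cons: cannot detect taxa that are absent from the database.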
    Because reads that don’t hit the reference sequence collection are discarded, your analyses only focus on the diversity that you “already know about”
     
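    open-reference OTU: pick_open_reference_otus.py
    Reads are first aligned against the database; reads that do not hit it are then clustered de novo.
    Open-reference picking is the recommended OTU-picking strategy.
    Pros: all reads are clustered; partly parallel, so reasonably fast.
    Cons: slow when the data contain many novel taxa.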
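To make the difference between the three strategies concrete, here is a toy sketch in Python. It is not how QIIME or its wrapped tools work internally: the `identity` function is a hypothetical stand-in that only compares equal-length strings position by position, whereas real pickers use sequence alignment. The function names and the `New` prefix for de novo OTU IDs are also made up for illustration.

```python
def identity(a, b):
    """Fraction of matching positions; a crude stand-in for alignment identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, refs, threshold=0.97):
    """Assign each read to the first reference it matches; the rest are failures."""
    otus, failures = {}, []
    for read_id, seq in reads.items():
        hit = next((ref_id for ref_id, ref_seq in refs.items()
                    if identity(seq, ref_seq) >= threshold), None)
        if hit is None:
            failures.append(read_id)  # discarded in a pure closed-reference run
        else:
            otus.setdefault(hit, []).append(read_id)
    return otus, failures


def de_novo(reads, threshold=0.97, prefix="New"):
    """Greedy clustering: a read joins the first centroid it matches, else founds one."""
    centroids, otus = {}, {}
    for read_id, seq in reads.items():
        hit = next((otu_id for otu_id, centroid in centroids.items()
                    if identity(seq, centroid) >= threshold), None)
        if hit is None:
            hit = f"{prefix}{len(centroids)}"
            centroids[hit] = seq
        otus.setdefault(hit, []).append(read_id)
    return otus


def open_reference(reads, refs, threshold=0.97):
    """Closed-reference first, then de novo clustering of the failures."""
    otus, failures = closed_reference(reads, refs, threshold)
    otus.update(de_novo({rid: reads[rid] for rid in failures}, threshold))
    return otus
```

With a single reference and two reads that resemble nothing in it, `closed_reference` reports those two reads as failures, while `open_reference` keeps them in a new OTU.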
     
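    The strategy we use most often is open-reference OTU picking, implemented by pick_open_reference_otus.py.
    The script can be viewed as a pipeline of 6 steps: the first 4 pick OTUs, and the last 2 produce the final OTU map and representative set, then the OTU table and tree.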
     
    Step 1) Prefiltering and picking closed reference OTUs
    The first step is an optional prefiltering of the input fasta file to remove
    sequences that do not hit the reference database with a given sequence
    identity (PREFILTER_PERCENT_ID). This step can take a very long time, so is
    disabled by default. The prefilter parameters can be changed with the options:
    --prefilter_refseqs_fp
    --prefilter_percent_id
    This filtering is accomplished by picking closed reference OTUs at the specified
    prefilter percent id to produce:
    prefilter_otus/seqs_otus.log
    prefilter_otus/seqs_otus.txt
    prefilter_otus/seqs_failures.txt
    prefilter_otus/seqs_clusters.uc
    Next, the seqs_failures.txt file is used to remove these failed sequences from
    the original input fasta file to produce:
    prefilter_otus/prefiltered_seqs.fna
    This prefiltered_seqs.fna file is then considered to contain the reads
    of the marker gene of interest, rather than spurious reads such as host
    genomic sequence or sequencing artifacts
     
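    The input sequences are first prefiltered: given a percent identity, closed-reference OTU picking removes the input sequences that do not hit the reference database. This step is optional.
    If prefiltering runs, it produces prefilter_otus/prefiltered_seqs.fna; otherwise the input fasta file goes straight to the next stage.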
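The "remove the failed sequences from the input fasta" operation can be sketched in a few lines of Python. This is only an illustration of the idea, not QIIME's own code; `read_fasta` and `drop_failures` are hypothetical helper names, and the sketch assumes the sequence ID is the first whitespace-delimited token of each header line.

```python
def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Assumption: the ID is the first token after ">".
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def drop_failures(fasta_lines, failure_ids):
    """Return the records whose IDs are not listed as failures."""
    failed = set(failure_ids)
    return [(sid, seq) for sid, seq in read_fasta(fasta_lines) if sid not in failed]
```

In practice `fasta_lines` would be an open file handle on the input fasta and `failure_ids` the stripped lines of a failures list such as seqs_failures.txt.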
     
    If prefiltering is applied, this step progresses with the prefiltered_seqs.fna.
    Otherwise it progresses with the input file. The Step 1 closed reference OTU
    picking is done against the supplied reference database. This command produces:
    step1_otus/_clusters.uc
    step1_otus/_failures.txt
    step1_otus/_otus.log
    step1_otus/_otus.txt
     
    OTUs are then picked with the closed-reference method.
     
    A representative sequence for each of the Step 1 picked OTUs is selected to
    produce:
    step1_otus/step1_rep_set.fna
     
    Next, the sequences that failed to hit the reference database in Step 1 are
    filtered from the Step 1 input fasta file to produce:
    step1_otus/failures.fasta
     
    Then the failures.fasta file is randomly subsampled to PERCENT_SUBSAMPLE of
    the sequences to produce:
    step1_otus/subsampled_failures.fna.
    Modifying PERCENT_SUBSAMPLE can have a big effect on run time for this workflow,
    but will not alter the final OTUs.
     
    Reads that fail to hit the database are written to step1_otus/failures.fasta, and a random subset of them is written to step1_otus/subsampled_failures.fna.
    Adjusting PERCENT_SUBSAMPLE can shorten the run time.
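A minimal sketch of random subsampling follows. It assumes that each failed read is kept independently with probability PERCENT_SUBSAMPLE, expressed as a fraction between 0 and 1; the text above does not say how QIIME draws the subsample, so treat this as one plausible reading. `subsample_ids` and its `seed` argument are hypothetical names.

```python
import random


def subsample_ids(seq_ids, fraction, seed=None):
    """Keep each ID independently with probability `fraction` (0.0 to 1.0)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [sid for sid in seq_ids if rng.random() < fraction]
```

Under this scheme the subsample size is only approximately `fraction * len(seq_ids)`. A smaller fraction leaves fewer reads for the de novo clustering in Step 2, which is where the run-time saving comes from.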
     
     
     
    Step 2) The subsampled_failures.fna are next clustered de novo, and each cluster
    centroid is then chosen as a "new reference sequence" for use as the reference
    database in Step 3, to produce:
    step2_otus/subsampled_seqs_clusters.uc
    step2_otus/subsampled_seqs_otus.log
    step2_otus/subsampled_seqs_otus.txt
    step2_otus/step2_rep_set.fna
     
    The step1_otus/subsampled_failures.fna file from Step 1 is clustered de novo to produce new reference sequences.
     
    Step 3) Pick Closed Reference OTUs against Step 2 de novo OTUs
    Closed reference OTU picking is performed using the failures.fasta file created
    in Step 1 against the 'reference' de novo database created in Step 2 to produce:
    step3_otus/failures_seqs_clusters.uc
    step3_otus/failures_seqs_failures.txt
    step3_otus/failures_seqs_otus.log
    step3_otus/failures_seqs_otus.txt
     
    step1_otus/failures.fasta is aligned against step2_otus/step2_rep_set.fna.
     
    Assuming the user has NOT passed the --suppress_step4 flag:
    The sequences which failed to hit the reference database in Step 3 are removed
    from the Step 3 input fasta file to produce:
    step3_otus/failures_failures.fasta
     
    Sequences that still fail to hit are written to step3_otus/failures_failures.fasta.
     
     
    Step 4) Additional de novo OTU picking
    It is assumed by this point that the majority of sequences have been assigned
    to an OTU, and thus the sequence count of failures_failures.fasta is small
    enough that de novo OTU picking is computationally feasible. However, depending
    on the sequences being used, it might be that the failures_failures.fasta file
    is still prohibitively large for de novo clustering, and the jobs might take
    too long to finish. In this case it is likely that the user would want to pass
    the --suppress_step4 flag to avoid this additional de novo step.
     
    A final round of de novo OTU picking is done on the failures_failures.fasta file
    to produce:
    step4_otus/failures_failures_cluster.uc
    step4_otus/failures_failures_otus.log
    step4_otus/failures_failures_otus.txt
     
    The failures_failures.fasta file from Step 3 is clustered de novo once more.
     
    Step 5) Produce the final OTU map and rep set
    If Step 4 is completed, the OTU maps from Step 1, Step 3, and Step 4 are
    concatenated to produce:
    final_otu_map.txt
     
    If Step 4 ran, the OTU maps from Steps 1, 3, and 4 are merged into final_otu_map.txt.
     
    If Step 4 was not completed, the OTU maps from Steps 1 and Step 3 are
    concatenated together to produce:
    final_otu_map.txt
     
    If Step 4 did not run, the OTU maps from Steps 1 and 3 are merged into final_otu_map.txt.
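Merging the per-step OTU maps can be sketched as below. The sketch assumes each OTU map line is tab-separated, with the OTU ID in the first field and the member sequence IDs in the remaining fields; `parse_otu_map` and `merge_otu_maps` are hypothetical helper names, not QIIME functions.

```python
def parse_otu_map(lines):
    """Parse tab-separated lines of the form: otu_id, seq_id, seq_id, ..."""
    otu_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:  # skip blank lines
            otu_map.setdefault(fields[0], []).extend(fields[1:])
    return otu_map


def merge_otu_maps(*otu_maps):
    """Concatenate OTU maps; members of a repeated OTU ID are pooled."""
    merged = {}
    for otu_map in otu_maps:
        for otu_id, seq_ids in otu_map.items():
            merged.setdefault(otu_id, []).extend(seq_ids)
    return merged
```

Passing the maps from Steps 1 and 3 (plus Step 4, if it ran) to `merge_otu_maps` gives the content that would go into final_otu_map.txt.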
     
    Next, the minimum specified OTU size required to keep an OTU is specified with
    the --min_otu_size flag. For example, if the user left the --min_otu_size as the
    default value of 2, requiring each OTU to contain at least 2 sequences, then any
    OTUs which failed to meet this criterion would be removed from the
    final_otu_map.txt to produce:
    final_otu_map_mc2.txt
     
    If --min_otu_size 10 was passed, it would produce:
    final_otu_map_mc10.txt
     
    The final_otu_map_mc2.txt is used to build the final representative set:
    rep_set.fna
     
    --min_otu_size filters the OTUs, producing final_otu_map_mc2.txt and the corresponding representative sequences in rep_set.fna.
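The size filter is simple enough to show directly. In this sketch the OTU map is a plain dict from OTU ID to a list of sequence IDs; `filter_by_min_size` and `filtered_map_name` are hypothetical names, and the file name pattern is taken from the two examples given above (mc2, mc10).

```python
def filter_by_min_size(otu_map, min_otu_size=2):
    """Keep only OTUs that contain at least `min_otu_size` sequences."""
    return {otu_id: seq_ids for otu_id, seq_ids in otu_map.items()
            if len(seq_ids) >= min_otu_size}


def filtered_map_name(min_otu_size=2):
    """File name following the final_otu_map_mc<N>.txt pattern."""
    return f"final_otu_map_mc{min_otu_size}.txt"
```

With the default of 2, this drops singleton OTUs, that is, OTUs supported by a single read.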
     
    Step 6) Making the OTU tables and trees
    An OTU table is built using the final_otu_map_mc2.txt file to produce:
    otu_table_mc2.biom
     
    final_otu_map_mc2.txt is used to build the OTU table otu_table_mc2.biom.
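Conceptually, an OTU table is a count of reads per OTU per sample. The sketch below builds such counts from an OTU map held as a dict. It assumes sequence IDs have the form SampleID_N, so the sample ID is everything before the last underscore; that naming convention is an assumption here and is not stated in the text above. The result is a nested dict, not a BIOM file, and `build_otu_table` is a hypothetical name.

```python
from collections import Counter


def build_otu_table(otu_map):
    """Return {otu_id: {sample_id: read_count}} from {otu_id: [seq_id, ...]}."""
    table = {}
    for otu_id, seq_ids in otu_map.items():
        # Assumption: "SampleID_N" naming; strip the trailing "_N".
        samples = (seq_id.rsplit("_", 1)[0] for seq_id in seq_ids)
        table[otu_id] = dict(Counter(samples))
    return table
```

Samples with no reads in an OTU are simply absent from that OTU's inner dict, which corresponds to a zero in the table.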
     
    As long as the --suppress_taxonomy_assignment flag is NOT passed,
    then taxonomy will be assigned to each of the representative sequences
    in the final rep_set produced in Step 5, producing:
    rep_set_tax_assignments.log
    rep_set_tax_assignments.txt
    This taxonomic metadata is then added to the otu_table_mc2.biom to produce:
    otu_table_mc_w_tax.biom
     
    Taxonomy is assigned to the OTU representative sequences, producing otu_table_mc_w_tax.biom.
     
    As long as the --suppress_align_and_tree is NOT passed, then the rep_set.fna
    file will be used to align the sequences and build the phylogenetic tree,
    which includes the de novo OTUs. Any sequences that fail to align are
    omitted from the OTU table and tree to produce:
    otu_table_mc_no_pynast_failures.biom
    rep_set.tre
     
    The OTU representative sequences are aligned and a phylogenetic tree is built, producing rep_set.tre.
     
    If both --suppress_taxonomy_assignment and --suppress_align_and_tree are
    NOT passed, the script will produce:
    otu_table_mc_w_tax_no_pynast_failures.biom
     
    It is important to remember that with a large workflow script like this that
    the user can jump into intermediate steps. For example, imagine that for some
    reason the script was interrupted on Step 2, and the user did not want to go
    through the process of re-picking OTUs as was done in Step 1. They can simply
    rerun the script and pass in the:
    --step1_otu_map_fp
    --step1_failures_fasta_fp
    parameters, and the script will continue with Steps 2 - 4.
     
    A large workflow script like this lets you jump in at an intermediate step without rerunning the earlier ones.
     
    **Note:** If most or all of your sequences are failing to hit the reference
    during the prefiltering or closed-reference OTU picking steps, your sequences
    may be in the reverse orientation with respect to your reference database. To
    address this, you should add the following line to your parameters file
    (creating one, if necessary) and pass this file as -p:
     
    pick_otus:enable_rev_strand_match True
     
    Be aware that this doubles the amount of memory used in these steps of the
    workflow.
     
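    If a large fraction of the input sequences fail to hit the database, they may be reverse-complemented relative to the reference sequences; in that case add pick_otus:enable_rev_strand_match True to the parameters file.
    Note that this setting doubles memory usage.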
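To see why orientation matters, the sketch below computes a reverse complement and also shows one way to make sure the setting appears in the parameters file text exactly once. Both helpers are hypothetical names written for this illustration; only the parameter line itself comes from the note above.

```python
REV_STRAND_PARAM = "pick_otus:enable_rev_strand_match True"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def with_rev_strand_match(params_text=""):
    """Return parameters-file text containing the reverse-strand line exactly once."""
    lines = [line for line in params_text.splitlines() if line.strip()]
    if REV_STRAND_PARAM not in lines:
        lines.append(REV_STRAND_PARAM)
    return "\n".join(lines) + "\n"
```

A read sequenced from the opposite strand looks like `reverse_complement(reference)` rather than `reference`, so a forward-only search finds no hit for it. The returned text can be written to a file and passed to the script with -p.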
     
    Basic usage:
    pick_open_reference_otus.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss_sortmerna_sumaclust/ -p $PWD/ucrss_smr_suma_params.txt -m sortmerna_sumaclust
     
    -i : input sequences, in fasta format
    -r : reference sequences, in fasta format; the default is Greengenes: /usr/local/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
    -o : output directory
    -p : parameters file
    -m : OTU-picking method; one of 'uclust', 'usearch61', 'sortmerna_sumaclust' (default: uclust)
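When the script is launched from another Python program, the command can be assembled as an argument list. The sketch below only builds and prints the command; it uses just the five options listed above, and `build_command` is a hypothetical helper. Actually running it (for example with `subprocess.run`) assumes QIIME 1 is installed and the script is on the PATH.

```python
import shlex

METHODS = ("uclust", "usearch61", "sortmerna_sumaclust")


def build_command(input_fp, reference_fp, output_dir, params_fp=None, method="uclust"):
    """Assemble the pick_open_reference_otus.py argument list."""
    if method not in METHODS:
        raise ValueError(f"method must be one of {METHODS}")
    cmd = ["pick_open_reference_otus.py",
           "-i", input_fp, "-r", reference_fp, "-o", output_dir]
    if params_fp is not None:
        cmd += ["-p", params_fp]
    cmd += ["-m", method]
    return cmd


# Print the command as a shell-quoted string for inspection.
print(shlex.join(build_command("seqs1.fna", "refseqs.fna", "otus_out/")))
```

Building a list rather than a single string avoids quoting problems when a path contains spaces.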
  • Original post: https://www.cnblogs.com/xudongliang/p/7205190.html