tophat

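Reposted from: http://blog.sina.com.cn/s/blog_8808cae20101amqp.html

I. Introduction to TopHat

TopHat uses RNA-seq reads to find splice junctions. It calls Bowtie or Bowtie2 to align the reads to a reference genome, then analyzes the alignments to identify the junctions between exons.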

II. Installing TopHat

Download the precompiled binary for Linux x86_64 and unpack it; it is then ready to use.

$ wget http://tophat.cbcb.umd.edu/downloads/tophat-2.0.8b.Linux_x86_64.tar.gz
$ tar zxf tophat-2.0.8b.Linux_x86_64.tar.gz

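Bowtie, Bowtie2, SAMtools, the Boost C++ libraries and so on must of course be installed first.

III. TopHat usage and options

When running TopHat, bowtie2 (or bowtie; the same applies below), bowtie2-align, bowtie2-inspect, bowtie2-build and samtools must all be on the system PATH.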
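The PATH requirement can be checked before a long run. The following is a minimal sketch; missing_tools is a hypothetical helper name, not something shipped with TopHat, and it only tests that each program can be found, not that its version is suitable.

```shell
# Print every tool from the argument list that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of TopHat.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools bowtie2 bowtie2-align bowtie2-inspect bowtie2-build samtools)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All required tools found."
fi
```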

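1. Usage

$ tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

TopHat therefore requires two things: the index to align against and the reads to align. Several read files can be given as a comma-separated list, with the two mates of paired-end data given as two separate lists.

The index is given as its directory plus the common prefix of the index files. For example, if the index files are stored in the index folder under the current directory and are named hg19.*.*, the index argument is: ./index/hg19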
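A quick way to confirm that a prefix really points at an index is to look for the files behind it. This sketch assumes a standard (small) Bowtie2 index, whose six files end in .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2 and .rev.2.bt2; large indexes and Bowtie1 indexes use other extensions and are not covered. has_bt2_index is a hypothetical helper name.

```shell
# Return success only if all six files of a small Bowtie2 index exist
# for the given prefix. has_bt2_index is a hypothetical helper.
has_bt2_index() {
    prefix=$1
    for suffix in 1 2 3 4 rev.1 rev.2; do
        [ -f "$prefix.$suffix.bt2" ] || return 1
    done
    return 0
}

if has_bt2_index ./index/hg19; then
    echo "index found: pass ./index/hg19 to tophat"
else
    echo "no Bowtie2 index with prefix ./index/hg19"
fi
```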

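Points to note: the longest read TopHat can align is 1024 bp; it can align paired-end reads; different types of reads must not be mixed in a single run, because this gives poor results.

If several different types of reads need to be aligned, proceed as follows:

First, run TopHat on one type of reads with suitable parameters.
Next, use bed_to_juncs to convert junctions.bed from that run into the junction file expected by the -j option of the next run.
Finally, run TopHat again on the next type of reads, this time with the -j option.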
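The three steps above can be sketched as a small planner that prints the commands instead of running them, so it works without TopHat installed. The index prefix, read file names and output directory names are hypothetical placeholders, plan_two_pass is an illustrative helper, and the assumption that bed_to_juncs reads BED on standard input and writes junctions to standard output should be checked against the installed version.

```shell
# Print (do not execute) the commands for a two-pass run over two read types.
# All file and directory names here are placeholders for illustration.
plan_two_pass() {
    index=$1
    reads_a=$2
    reads_b=$3
    echo "tophat -o pass1_out $index $reads_a"
    echo "bed_to_juncs < pass1_out/junctions.bed > pass1.juncs"
    echo "tophat -o pass2_out -j pass1.juncs $index $reads_b"
}

plan_two_pass ./index/hg19 long_reads.fastq short_reads.fastq
```

Real runs would add the options appropriate to each read type to the first and third commands.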

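2. Common general options

-h | --help
-v | --version
-N | --read-mismatches    default: 2
    Discard alignments with more than this many mismatches.
--read-gap-length    default: 2
    Discard alignments whose total gap length exceeds this value.
--read-edit-dist    default: 2
    Discard alignments whose edit distance exceeds this value.
--read-realign-edit-dist    default: "read-edit-dist" + 1
    Reads that span several exons may be aligned incorrectly to the genome. TopHat aligns in several steps, and after each step the alignments carry an edit distance. This option makes TopHat realign reads whose edit distance is >= the given value. If it is set to 0, every read is aligned in every step, which greatly increases both accuracy and running time. The default is larger than --read-edit-dist, so by default reads already mapped in an earlier step are not realigned.
--bowtie1    default: bowtie2
    Use Bowtie1 instead of Bowtie2. This is needed for colorspace reads, which only Bowtie1 supports.
-o | --output-dir    default: ./tophat_out
    Path of the output directory.
-r | --mate-inner-dist    default: 50
    Expected mean inner distance between the two mates of a pair. For example, with 300 bp fragments and 50 bp reads the inner distance is 200 bp, so this should be set to 200.
--mate-std-dev    default: 20
    Standard deviation of the inner distance.
-a | --min-anchor-length    default: 8
    Anchor length of a read; the smallest allowed value is 3. A read is used to support a junction only if it covers at least this many bases on each side of the junction.
-m | --splice-mismatches    default: 0
    Maximum number of mismatches allowed in the anchor region of a spliced alignment.
-i | --min-intron-length    default: 70
    Minimum intron length. TopHat ignores donor/acceptor pairs closer than this and treats the region as exon.
-I | --max-intron-length    default: 500000
    Maximum intron length. TopHat ignores donor/acceptor pairs farther apart than this unless a long read supports them.
--max-insertion-length    default: 3
    Maximum insertion length.
--max-deletion-length    default: 3
    Maximum deletion length.
--solexa-quals
    The fastq files use Solexa base qualities.
--solexa1.3-quals | --phred64-quals
    The qualities follow Illumina GA pipeline version 1.3, that is, Phred64.
-Q | --quals
    Base qualities are supplied in separate quality files.
--integer-quals
    Base qualities are space-separated integers. This is the default when -C is used.
-C | --color
    Colorspace reads. The command for this kind of reads is:
    $ tophat --color --quals --bowtie1 [other options]*
-p | --num-threads    default: 1
    Number of threads used to align reads.
-g | --max-multihits    default: 20
    A read may have many alignments; TopHat keeps at most this many, chosen by alignment score. Without --report-secondary-alignments only the best-scoring alignments are reported, and if there are more of them than this value, only that many are reported, chosen at random. With --report-secondary-alignments, alignments are reported in order of score until this number is reached.
--report-secondary-alignments
    Also report additional or secondary alignments (determined from the alignment score AS).
--no-discordant
    For paired reads, report only concordant mappings.
--no-mixed
    For paired reads, report only pairs in which both mates map (concordantly or discordantly). By default all alignments are reported, including mates that map only on their own.
--no-coverage-search
    Disable the coverage-based search for junctions. This is the opposite of the next option and is the default.
--coverage-search
    Enable the coverage-based search for junctions. It increases sensitivity but uses a lot of memory and time.
--microexon-search
    The pipeline tries to find micro-exons. Only works for reads of 50 bp or longer.
--library-type
    Tells TopHat the library type so that strand-specific reads are handled correctly. Alignments then carry an XS tag. Standard Illumina data is usually fr-unstranded.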
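The value for -r follows from simple arithmetic: inner distance = fragment length - 2 x read length. A minimal sketch, assuming the fragment length excludes adapters and both mates have the same length (inner_dist is a hypothetical helper name):

```shell
# inner distance = fragment length - 2 * read length
# $1 = fragment length (bp, without adapters), $2 = read length (bp)
inner_dist() {
    echo $(( $1 - 2 * $2 ))
}

inner_dist 300 50   # the example from the -r description: prints 200
```

For a 200 bp fragment sequenced with 90 bp reads the same formula gives 20.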

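3. Advanced options

--keep-tmp
    Keep intermediate and temporary files; useful for debugging.
--keep-fasta-order
    Sort the alignments in the order of the genome FASTA file. This makes the output SAM/BAM files incompatible with those of TopHat 1.4.1 and earlier.
--no-sort-bam
    The output BAM file is not coordinate-sorted.
--no-convert-bam
    Do not convert to BAM; the output is in SAM format.
-R | --resume
    Continue a TopHat run from the last step that completed successfully. Usage:
    $ tophat -R tophat_out
-z | --zpacker    default: gzip
    Compression program used for temporary files.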

4. Bowtie2-specific options

With TopHat2, some options are passed through to Bowtie2; they all start with "--b2-". In practice the defaults are fine.

End-to-end mode presets (TopHat2 cannot use local alignment):
--b2-very-fast
--b2-fast
--b2-sensitive
--b2-very-sensitive

Alignment options:
--b2-N    default: 0
--b2-L    default: 20
--b2-i    default: S,1,1.25
--b2-n-ceil    default: L,0,0.15
--b2-gbar    default: 4

Scoring options:
--b2-mp    default: 6,2
--b2-np    default: 1
--b2-rdg    default: 5,3
--b2-rfg    default: 5,3
--b2-score-min    default: L,-0.6,-0.6

Effort options:
--b2-D    default: 15
--b2-R    default: 2

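5. Fusion transcript mapping

If --fusion-search is set, some reads can be aligned to potential fusion transcripts. The additional fusion information is written to fusions.out.

--fusion-search
    Enable alignment to fusion transcripts.
--fusion-anchor-length    default: 20
    A read spanning a fusion must match at least this many bases on each side of the fusion point.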

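6. Supplying transcript annotation

Note that the chromosome names in the supplied GTF file must match those in the Bowtie index. The names are case-sensitive.

-j | --raw-juncs <.juncs file>
    Supply a junctions file. It can be produced from TopHat's junctions.bed output with the bed_to_juncs program found in the same directory as tophat:
    $ bed_to_juncs < junctions.bed > new_list.juncs
    The junctions file is tab-delimited, one junction per line:
    <chrom> <left> <right> <+/->
    where left and right are the 0-based coordinates of the two ends of the junction.
--no-novel-juncs
    Only look for reads across the junctions given in the supplied GFF or junctions file. This option has no effect without -G or -j.
-G | --GTF <GTF/GFF3 file>
    Supply gene model annotation as a GTF 2.2 or GFF 3 file. With this option TopHat first extracts the transcript sequences and uses Bowtie2 to align the reads to this transcriptome; only reads that do not align there are then aligned to the genome; the transcriptome alignments are split up and converted to genomic mappings; these are finally merged with the new mappings and junctions to give the output.
    Note that the first column of the GTF/GFF file, which names the chromosome or contig, must match the reference sequence names in the Bowtie index. `$ bowtie-inspect --names your_index` lists the names in the index.
--transcriptome-index <dir/prefix>
    With -G, TopHat extracts the transcript sequences and then builds an index from them with bowtie2-build, which takes a fair amount of time. This option writes that index to the given location, so later runs that use the same index do not pay the cost again.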
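The name-matching requirement between the annotation and the index can be checked up front. This is a rough sketch, assuming the index names have already been saved one per line (for example from the inspect command's --names output) and that the annotation is tab-delimited with comment lines starting with #. unmatched_chroms and the file names are hypothetical.

```shell
# List sequence names used in column 1 of a GTF/GFF file that do not appear
# in a file of index sequence names (one per line). Comparison is
# case-sensitive. unmatched_chroms is a hypothetical helper.
unmatched_chroms() {
    gtf=$1
    names=$2
    tmp=$(mktemp)
    grep -v '^#' "$gtf" | cut -f1 | LC_ALL=C sort -u > "$tmp"
    # comm -23 keeps lines found only in the first (annotation) list
    LC_ALL=C sort -u "$names" | LC_ALL=C comm -23 "$tmp" -
    rm -f "$tmp"
}

# Example call, run only if both placeholder files are present.
if [ -f annotation.gtf ] && [ -f index_names.txt ]; then
    unmatched_chroms annotation.gtf index_names.txt
fi
```

An empty result means every name in the annotation was found in the list; any name printed would need to be renamed on one side or the other.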

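7. Supplying insertions/deletions

The following options use the RNA-seq data to validate known indels.

--insertions | --deletions <.juncs file>
    Example lines of a juncs file:
    chr1 20564 20567
        Coordinates are 0-based; this means the 2 bases 20565 and 20566 are deleted.
    chr1 17491 17491 CA
        This means the 2 bases CA are inserted at position 17491.
--no-novel-indels
    Only look for reads at the supplied indel sites.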
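Reading the deletion example literally, the deleted bases are those strictly between left and right, so the deletion length is right - left - 1. A small sketch of that arithmetic, under exactly that interpretation (deletion_lengths is a hypothetical helper name):

```shell
# For each deletion line "<chrom> <left> <right>" on standard input, print
# the chromosome and the number of deleted bases (right - left - 1).
deletion_lengths() {
    awk '{ print $1, $3 - $2 - 1 }'
}

printf 'chr1\t20564\t20567\n' | deletion_lengths   # prints: chr1 2
```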

IV. TopHat output

The main result files are:
1. accepted_hits.bam
2. junctions.bed, in UCSC BED format
3. insertions.bed and deletions.bed
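A first look at junctions.bed can be had with awk. This sketch assumes the file may start with a "track" header line and that BED column 5 (score) holds the number of alignments spanning each junction; junction_summary is a hypothetical helper name.

```shell
# Print "<number of junctions> <total supporting alignments>" for a
# junctions.bed file, skipping any leading "track" header line.
junction_summary() {
    awk '$1 != "track" { n++; reads += $5 } END { print n + 0, reads + 0 }' "$1"
}

# Example call, run only if the default output directory exists.
if [ -f tophat_out/junctions.bed ]; then
    junction_summary tophat_out/junctions.bed
fi
```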

V. Exercise

Two samples of one species, A and B, were each sequenced (transcriptome) on an Illumina HiSeq 2000, giving the files A.reads1_1.fastq, A.reads1_2.fastq, B.reads1_1.fastq and B.reads1_2.fastq. The library insert size is 200 bp and the read length is 90 bp. The genome file of the species is species.fasta. Use TopHat to find the splice junctions and indels of this species.

$ bowtie2-build species.fasta species    # build the index
$ tophat --read-realign-edit-dist 0 -o ./tophat_out -r 20 --mate-std-dev 20 --coverage-search --microexon-search -p 24 --library-type fr-unstranded species A.reads1_1.fastq,B.reads1_1.fastq A.reads1_2.fastq,B.reads1_2.fastq

Here -r 20 is the inner distance, 200 - 2 x 90 = 20.

Extracting uniquely mapped reads

grep "NH:i:1$"
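The grep pattern above only matches when NH:i:1 is the last field of a SAM line. A slightly more tolerant sketch checks every optional field instead; it assumes SAM text on standard input (for example from samtools view), and unique_hits is a hypothetical helper name.

```shell
# Keep SAM records whose NH tag equals 1, wherever the tag appears among the
# optional fields (column 12 onward). Header lines (@...) pass through.
unique_hits() {
    awk -F '\t' '/^@/ { print; next }
                 { for (i = 12; i <= NF; i++) if ($i == "NH:i:1") { print; break } }'
}

# Typical use, assuming samtools is installed and accepted_hits.bam exists:
#   samtools view -h accepted_hits.bam | unique_hits > unique.sam
```

Matching the whole field also avoids accidentally keeping reads tagged NH:i:10 and above.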

tophat: http://ccb.jhu.edu/software/tophat/index.shtml


Original article: https://www.cnblogs.com/zf723/p/4936454.html