Short-Read Sequence Assembly (Repost)


Reposted from: http://blog.sina.com.cn/s/blog_4af3f0d20100fq5i.html

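Short-read sequence assembly has been one of the hottest topics in next-generation sequencing in recent years. Put simply, a long genome is broken into fragments (shotgun sequencing), because we do not know how the full sequence is arranged (into a single strand that ultimately forms a chromosome) or grouped (which pieces belong to which chromosome), and we cannot yet sequence an entire long molecule in one pass (although single-molecule sequencing may offer new hope here). We then use algorithms and computers to assemble these short sequences into one complete, ordered sequence.
Take a sentence like this as an analogy: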

    it is just a hypothesis, so don't be seriously!

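    Suppose we do not know what this sentence says. Picture a box from which we draw a slip of paper and, without unfolding it, tear it into pieces. Something else changes along the way: all the spaces and punctuation vanish (magic!). We are left with: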

    itis ypo stah the sodo eriou siss ju ntbes sly……

    To raise coverage we sequence several times, since higher coverage gives us higher confidence in the result:

    itis ypo stah the sodo eriou siss ju ntbes sly tis yopth sodon beser beser ssod iti sju……
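As a small worked example of what "coverage" means here, the sketch below treats the fragments listed above as toy reads and divides their total length by the length of the sentence without spaces or punctuation. The variable names are illustrative, and the usual definition (total sequenced bases divided by genome length) is assumed.

```python
# The example sentence with spaces and punctuation removed plays the "genome".
sentence = "it is just a hypothesis, so don't be seriously"
genome = "".join(ch for ch in sentence if ch.isalpha())

# The fragments listed above, used as toy "reads".
reads = (
    "itis ypo stah the sodo eriou siss ju ntbes sly "
    "tis yopth sodon beser beser ssod iti sju"
).split()

# Average coverage = total letters sequenced / genome length.
total_bases = sum(len(read) for read in reads)
coverage = total_bases / len(genome)

# Reads that never occur verbatim in the genome behave like reads
# that carry a sequencing error.
unplaced = [read for read in reads if read not in genome]

print(f"{len(reads)} reads, {total_bases} letters, coverage = {coverage:.2f}x")
print("reads with no exact match:", unplaced)
```

With these 18 fragments the toy genome of 36 letters is covered a little under twice on average, and one fragment (`yopth`) has no exact match, which is the kind of read a real assembler has to tolerate or discard.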

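    In addition, we have invented a sequencing method called paired-ends: two reads of fixed length taken from the two ends of a fragment, separated by an insert of known size, like this: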

    iti*****ahyp sju*****pot the*****don sod*****ser bes*****sly ……
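To make the paired-end constraint concrete, here is a minimal sketch that checks each `left*****right` token against a candidate reconstruction: the right end must start exactly one gap length after the left end finishes. The function names are hypothetical, and the sketch follows the text analogy only; real paired-end data reads the second mate from the opposite strand and the insert size varies around a mean.

```python
import re

def parse_pair(token):
    """Split a token such as 'iti*****ahyp' into (left, gap_length, right)."""
    match = re.fullmatch(r"([a-z]+)(\*+)([a-z]+)", token)
    if match is None:
        raise ValueError(f"not a paired-end token: {token!r}")
    left, gap, right = match.groups()
    return left, len(gap), right

def pair_is_consistent(candidate, token):
    """True if some occurrence of the left end is followed, after the gap, by the right end."""
    left, gap, right = parse_pair(token)
    start = candidate.find(left)
    while start != -1:
        if candidate.startswith(right, start + len(left) + gap):
            return True
        start = candidate.find(left, start + 1)
    return False

# Candidate reconstruction: the example sentence without spaces or punctuation.
sentence = "it is just a hypothesis, so don't be seriously"
candidate = "".join(ch for ch in sentence if ch.isalpha())

pairs = "iti*****ahyp sju*****pot the*****don sod*****ser bes*****sly".split()
results = {token: pair_is_consistent(candidate, token) for token in pairs}
print(results)
```

A candidate that places the two ends at the wrong distance fails the check, which is how paired ends help order and orient fragments that share no direct overlap.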

    Working from these pieces, we can stitch the sentence back together:

     itisjustahypothesissodontbeseriously

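But this is not the final result. Using our existing knowledge of grammar, we add back the spaces (gaps) and punctuation (the key pieces that went missing), and we can restore the original sentence!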
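The stitching step can be sketched as a toy greedy overlap assembler: repeatedly merge the two pieces with the longest suffix-to-prefix overlap. The eight reads below are hypothetical (8 letters each, overlapping their neighbours by 4) and were chosen for illustration; they are not the fragments from the text, and this is not how MAQ or Velvet work internally.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=4):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge; remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Hypothetical shuffled reads (assumption: error-free, each overlapping the next by 4 letters).
reads = ["odontbes", "itisjust", "eriously", "ahypothe",
         "sissodon", "justahyp", "tbeserio", "othesiss"]

contigs = greedy_assemble(reads)
print(contigs)
```

The example works because no 4-letter substring repeats in the sentence, so every overlap of four or more letters is a true one. Repeated substrings are exactly what breaks this simple strategy on real genomes.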

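First: the assembly methods:
Method 1: for resequencing, assemble with MAQ: Map to reference genome
Method 2: for de novo sequencing of a new species, assemble with Velvet: De novo assembly
Second: the principle and workflow of assembly: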



The difference between Method 1 and Method 2 is whether a reference genome is available. Below is an example result with a reference genome:



Mapping short reads to a reference
Eland
aligner for Illumina data
alignment policies:
• allows up to 2 mismatches/alignment
• non-unique alignments are discarded
Maq
• quality aware - takes seq quality into account
• allows non-unique alignments
Index methods
• reference genome is loaded into active memory as k-mers
• very fast alignments
• SOAP
• Bowtie
SNP detection, paired-end mapping, RNA-seq, ChIP-seq, etc.
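The ideas on this slide (a k-mer index of the reference held in memory, at most 2 mismatches per alignment, non-unique alignments discarded) can be sketched as a toy seed-and-verify mapper. This is an illustration under stated assumptions, not the actual implementation of Eland, Maq, SOAP, or Bowtie; all function names are hypothetical, and the sentence from the analogy stands in for a reference genome.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, reference, index, k, max_mismatches=2):
    """Return all start positions where the read aligns with at most max_mismatches."""
    candidates = set()
    # Non-overlapping seeds: with 3 seeds and at most 2 mismatches,
    # at least one seed must match exactly (pigeonhole principle).
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    return [s for s in sorted(candidates)
            if hamming(read, reference[s:s + len(read)]) <= max_mismatches]

def unique_hit(hits):
    """Keep an alignment only if it is unique; otherwise discard it."""
    return hits[0] if len(hits) == 1 else None

reference = "itisjustahypothesissodontbeseriously"
index = build_kmer_index(reference, k=4)

# 'hypothesisso' with two substituted letters still maps to position 9.
hits = map_read("hypathesizso", reference, index, k=4)
print(hits, unique_hit(hits))

# A read from a repetitive reference maps to several places and is discarded.
repeat_ref = "acgtacgtacgt"
repeat_hits = map_read("acgt", repeat_ref, build_kmer_index(repeat_ref, 4), k=4)
print(repeat_hits, unique_hit(repeat_hits))
```

Discarding non-unique hits keeps the mapping unambiguous at the cost of losing reads from repeats; a quality-aware mapper that allows non-unique alignments makes the opposite trade-off.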



Analysis depends on application
Mapping to reference genome
• useful for interrogating the “known” genome
• RNA sequencing
• ChIP sequencing
• SNP detection (targeted and whole-genome)
• methyl-seq
• CNV detection (sometimes)
De novo assembly
• no genome sequence
• unbiased ascertainment of variation in known genome by whole-genome reseq

Third: short-read alignment with MAQ



Fourth: schematic of Velvet:



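    Either of the two methods above can assemble high-throughput short-read data, but in practice it is not simple: genomes contain large numbers of repeats, polymorphisms, and sequencing errors, and these three factors are the main sources of assembly errors.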
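To see why repeats cause trouble, here is a minimal sketch of a de Bruijn graph, the data structure Velvet is built on: each k-mer becomes an edge from its first k-1 letters to its last k-1 letters. A node with more than one successor is a point where the assembler cannot tell which way to continue. The function names are hypothetical, the example sentence stands in for a genome, and real assemblers add error correction and graph simplification that this sketch omits.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: node = (k-1)-mer, edge = k-mer seen in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one distinct successor, i.e. ambiguous continuations."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

text = "itisjustahypothesissodontbeseriously"

# With k = 3, short repeats such as 'is' (followed by 'j' in one place and
# by 's' in another) create branches in the graph.
small_k = branching_nodes(de_bruijn([text], k=3))
print("k=3 branching nodes:", small_k)

# With k = 5, every 4-letter node is unique in this text, so the path is unambiguous.
large_k = branching_nodes(de_bruijn([text], k=5))
print("k=5 branching nodes:", large_k)
```

A larger k resolves repeats shorter than k, but it also demands longer error-free stretches in the reads, so the choice of k is a trade-off between repeat resolution and tolerance to sequencing errors and polymorphisms.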

References: http://blog.sina.com.cn/s/blog_4860086b0100dnos.html

http://seqanswers.com/forums/showthread.php?t=1024

Original article: https://www.cnblogs.com/steamed-bread/p/5611058.html