Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm

Aleksey V. Zimin¹,², Daniela Puiu¹, Ming-Cheng Luo³, Tingting Zhu³, Sergey Koren⁴, Guillaume Marçais²,⁵, James A. Yorke²,⁶, Jan Dvořák³ and Steven L. Salzberg¹,⁷

Author Affiliations

¹Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland 21205, USA;

²Institute for Physical Sciences and Technology, University of Maryland, College Park, Maryland 20742, USA;

³Department of Plant Sciences, University of California, Davis, California 95616, USA;

⁴National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA;

⁵Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;

⁶Departments of Mathematics and Physics, University of Maryland, College Park, Maryland 20742, USA;

⁷Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, Maryland 21218, USA

Corresponding author: salzberg@jhu.edu
Abstract

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
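
For readers unfamiliar with the N50 statistic quoted above, the following is a minimal Python sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of all assembled bases. The function name `n50` and the toy contig lengths are illustrative assumptions; they are not part of MaSuRCA and are not taken from the Aegilops tauschii assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")

    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half
        # of the total defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly (made-up lengths, not real data).
toy_lengths = [100, 200, 300, 400, 500]
print(n50(toy_lengths))  # 500 + 400 = 900 >= 750, so this prints 400
```

Because N50 is weighted by bases rather than by contig count, a few very long contigs raise it far more than many short ones, which is why it is a standard measure of assembly contiguity.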


Original post: https://www.cnblogs.com/wangprince2017/p/13756608.html
