Abstract

Motivation: The recent emergence of nanopore sequencing technology has set a challenge for established assembly methods. In this work, we assessed how existing hybrid and non-hybrid de novo assembly methods perform on long and error-prone nanopore reads.

Results: We benchmarked five non-hybrid (in terms of both error correction and scaffolding) assembly pipelines as well as two hybrid assemblers which use third generation sequencing data to scaffold Illumina assemblies. Tests were performed on several publicly available MinION and Illumina datasets of Escherichia coli K-12, using several sequencing coverages of nanopore data (20×, 30×, 40× and 50×). We attempted to assess the assembly quality at each of these coverages, in order to estimate the requirements for closed bacterial genome assembly. For the purpose of the benchmark, an extensible genome assembly benchmarking framework was developed. Results show that hybrid methods are highly dependent on the quality of NGS data, but much less on the quality and coverage of nanopore data, and perform relatively well at lower nanopore coverages. All non-hybrid methods correctly assemble the E. coli genome when coverage is above 40×, even the non-hybrid method tailored for Pacific Biosciences reads. While that method requires higher coverage than a method designed specifically for nanopore reads, its running time is significantly lower.

Availability and Implementation: https://github.com/kkrizanovic/NanoMark

Contact: mile.sikic@fer.hr

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

During the last ten years, next generation sequencing (NGS) devices have dominated the genome sequencing market. In contrast to previously used Sanger sequencing, NGS is much cheaper, less time-consuming and less labor-intensive. Yet, when it comes to de novo assembly of longer genomes, many researchers are skeptical of using NGS reads. These devices produce reads a few hundred base pairs long, which is too short to unambiguously resolve repetitive regions even within relatively small microbial genomes (Nagarajan and Pop, 2013).

Although the use of paired-end and mate-pair technologies has improved the accuracy and completeness of assembled genomes, NGS still produces highly fragmented assemblies due to long repetitive regions. These incomplete genomes have to be finished using a more laborious approach that includes Sanger sequencing and specially tailored assembly methods. Owing to NGS, many efficient algorithms have been developed to optimize running time and memory footprint in the sequence assembly, alignment and downstream analysis steps.

The need for technologies that would produce longer reads, which could solve the problem of repeating regions, has resulted in the advent of new sequencing approaches, the so-called ‘third generation sequencing technologies’. The first among them was a single-molecule sequencing technology developed by Pacific Biosciences (PacBio). Although PacBio sequencers produce much longer reads (up to several tens of thousands of base pairs), their reads have a significantly higher error rate (∼10 to 15%) than NGS reads (∼1%) (Schirmer et al., 2015). Existing assembly and alignment algorithms were not capable of handling such high error rates. This prompted the development of read error correction methods. At first, hybrid correction was performed using complementary NGS (Illumina) data (Koren et al., 2012). Later, self-correction of PacBio-only reads was developed (Chin et al., 2013), which required higher coverage (>50×). The development of new, more sensitive aligners (BLASR (Chaisson and Tesler, 2012)) and the optimization of existing ones (BWA-MEM (Li, 2013)) were also required.

In 2014, Oxford Nanopore Technologies (ONT) presented their tiny MinION sequencer, about the size of a harmonica. The MinION can produce reads up to a few hundred thousand base pairs long. 1D reads from the MinION sequencer (with the latest R7.3 chemistry) have raw base accuracy less than 75%, while higher quality 2D reads (80–88% accuracy) comprise only a fraction of all 2D reads (Ip et al., 2015; Laver et al., 2015). This again spurred the need for development of even more sensitive algorithms for mapping and realignment, such as GraphMap (Sović et al., 2016) and marginAlign (Jain et al., 2015). Any doubt about the possibility of using MinION reads for de novo assembly was resolved in 2015, when Loman et al. demonstrated that the assembly of a bacterial genome (Escherichia coli K-12) using solely ONT reads is possible even with high error rates (Loman et al., 2015). Thanks to the extremely long reads and the affordability and availability of nanopore sequencing technology, these results might cause a revolution in de novo sequence analysis in the near future.

Following up on the results from Loman et al. (2015) and Liao et al. (2015), we explored the applicability of existing hybrid and non-hybrid de novo assembly tools that support third generation sequencing data and assessed their ability to cope with nanopore error profiles. In our study, we compared seven assembly tools/pipelines, which include five long-read assemblers: the pipeline published by Loman et al. (hereafter, the LQS pipeline), PBcR (Koren et al., 2012), Falcon (https://github.com/PacificBiosciences/FALCON), Miniasm (Li, 2016) and Canu (https://github.com/marbl/canu); and two hybrid assemblers: ALLPATHS-LG (Gnerre et al., 2011) and SPAdes (Bankevich et al., 2012). These tools were tested on real, publicly available datasets of a well-known clonal sample of E. coli K-12 MG1655. All of the tools/pipelines were evaluated on all test datasets up to the draft assembly level, not including the polishing phase. Draft assemblies containing one ‘big contig’ of at least 4 Mbp were then polished using Nanopolish and compared. For the purpose of our analyses, we developed a benchmarking framework, NanoMark, which automates the running of different assemblers and the processing of the results. The framework provides wrappers with uniform interfaces for each assembly tool, simplifying their usage for the end user.

2 Background

The majority of algorithms for de novo assembly follow either the de Bruijn graph (DBG) or the Overlap-Layout-Consensus (OLC) paradigm (Pop, 2009). OLC assemblers predate the DBG and were widely used in the Sanger sequencing era. A major representative of the OLC class is Celera, which was developed and maintained until very recently. The DBG approach attempted to solve the problem of ever-growing sequencing throughput brought on by the NGS technologies. Unlike OLC, in which overlaps between reads have to be calculated explicitly, DBG splits the reads into k-mers and constructs the overlap graph implicitly, e.g. through a hash table lookup. While assembly in the OLC paradigm attempts to find a Hamiltonian path through an overlap graph, the DBG attempts to solve an effectively simpler problem of finding an Eulerian path through a de Bruijn graph. It was later shown that both de Bruijn and overlap graphs can be transformed into string graph form, in which, similar to the DBG, an Eulerian path also needs to be found to obtain the assembly (Myers, 2005). The major differences lie in the implementation specifics of the two approaches. Although the DBG approach is faster, OLC-based algorithms perform better for longer reads (Pop, 2009). Additionally, DBG assemblers depend on finding exact-matching k-mers between reads (typically ∼21 to 127 bases long (Bankevich and Pevzner, 2016)). Given the error rates in third generation sequencing data, this presents a serious limitation. The OLC approach, on the other hand, should be able to cope with higher error rates given a sensitive enough overlapper but, contrary to the DBG, it requires a time-consuming all-to-all pairwise comparison between input reads.
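The trade-off described above can be illustrated with a minimal Python sketch. It is not taken from any of the benchmarked assemblers: the function names are hypothetical, the graph is a toy hash-table construction, and the error model assumes independent, uniformly distributed per-base errors, which is a simplification of real sequencing error profiles.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Toy de Bruijn graph: each k-mer adds an edge from its (k-1)-mer
    prefix to its (k-1)-mer suffix. Overlaps between reads are never
    computed explicitly; they emerge from shared hash-table keys."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def error_free_kmer_probability(error_rate, k):
    """Probability that a k-mer contains no sequencing error,
    ASSUMING independent per-base errors (a simplification)."""
    return (1.0 - error_rate) ** k


def all_to_all_pairs(n_reads):
    """Number of read pairs a naive all-to-all overlapper must compare."""
    return n_reads * (n_reads - 1) // 2


if __name__ == "__main__":
    graph = build_de_bruijn(["ACGTAC", "CGTACG"], k=3)
    for node in sorted(graph):
        print(node, "->", graph[node])

    # Error rates quoted in the text: ~1% for NGS, ~15% for raw long reads.
    for rate in (0.01, 0.15):
        p = error_free_kmer_probability(rate, k=21)
        print(f"error rate {rate:.2f}: P(error-free 21-mer) = {p:.3f}")

    print("pairs for 1000 reads:", all_to_all_pairs(1000))
```

Under these assumptions, roughly 81% of 21-mers are error-free at a 1% error rate, but only about 3% at a 15% error rate, which is why exact k-mer matching breaks down on raw long reads. The quadratic growth of `all_to_all_pairs` shows the cost that the OLC approach pays instead.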

Since the focus in the past decade has been on NGS reads, most of the state-of-the-art assemblers use the DBG paradigm. Hence, there are not many OLC assemblers that could be utilized for long PacBio and ONT reads. In fact, methods developed to handle such data are mostly pipelines based on the Celera assembler, including HGAP (Chin et al., 2013), PBcR (Koren et al., 2012) and the LQS pipeline (Loman et al., 2015). Since its original publication (Myers et al., 2000), Celera has been heavily revised to support newer sequencing technologies, including modifications for second generation data (Miller et al., 2008), adaptations for third generation (single molecule) data via hybrid error correction (Koren et al., 2012), non-hybrid error correction (Berlin et al., 2015; Miller et al., 2008) and hybrid approaches to assembly which combine two or more technologies (Goldberg et al., 2006). All of this contributed to the popularity of Celera, which led to its wide adoption in assembly pipelines for third generation sequencing data. Notably, one of the first such pipelines was the Hierarchical Genome Assembly Process (HGAP). HGAP uses BLASR to detect overlaps between raw reads during the error correction step. Unfortunately, it requires input data to be in PacBio-specific formats, which prevents its application to other (e.g. nanopore) sequencing technologies. The PBcR pipeline employs a similar approach to HGAP: it starts with an error correction step and feeds Celera with the corrected reads. PBcR has recently adopted the MHAP overlapper (Berlin et al., 2015) for sensitive overlapping of reads during the error correction step. Also, recent updates allow it to handle reads from Oxford Nanopore MinION sequencers. The LQS pipeline follows a similar approach, but implements novel error correction (Nanocorrect) and consensus (Nanopolish) steps. Instead of BLASR or MHAP, Nanocorrect uses DALIGNER (Myers, 2014) for overlap detection. Nanopolish presents a new signal-level consensus method for fine-polishing of a draft assembly using raw nanopore data. The LQS pipeline also employs Celera as the middle layer, i.e. for assembly of error-corrected reads.

Until very recently, the only non-hybrid alternative to Celera-based pipelines was Falcon. Falcon is a new experimental diploid assembler developed by Pacific Biosciences, not yet officially published. It is based on a hierarchical approach similar to HGAP, consisting of several steps: (i) raw sub-read overlapping for error correction using DALIGNER, (ii) pre-assembly and error correction, (iii) overlapping of error-corrected reads, (iv) filtering of overlaps, (v) construction of the string graph and (vi) contig construction. Unlike HGAP, it does not use Celera as its core assembler. Since Falcon accepts input reads in the standard FASTA format, and not only in the PacBio-specific format as HGAP does, it can potentially be used on any base-called long-read dataset. Although originally intended for PacBio data, Falcon presents a viable option for assembly of nanopore reads, even though they have notably different error profiles.

In late 2015, the developers of Celera, PBcR and MHAP moved away from the original Celera and PBcR projects and started to develop a new assembler, Canu. Canu is derived from Celera and also utilizes code from Pacific Biosciences’ Falcon and Pbdagcon projects. It is still in a very early phase of development.

Also in late 2015, a new long-read assembly tool, Miniasm, was released (Li, 2016). Miniasm attempts to assemble genomes from noisy long reads (both PacBio and Oxford Nanopore) without performing error correction.

Aside from the methods mentioned above, hybrid assembly approaches present another avenue for utilizing nanopore sequencing data. Liao et al. (2015) recently evaluated several assembly tools on PacBio data, including the hybrid assemblers SPAdes (Bankevich et al., 2012) and ALLPATHS-LG (Gnerre et al., 2011), for which they reported good results. Both of these are DBG-based, use Illumina libraries for the primary assembly and then attempt to scaffold the assemblies using longer, less accurate reads. Furthermore, SPAdes was recently updated and now officially supports nanopore sequencing data as the long-read complement to NGS data.
