Long-read error correction: a survey and qualitative comparison

Abstract

The third-generation sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies became available in 2011 and 2014, respectively. In contrast with second-generation sequencing technologies such as Illumina, these new technologies allow the sequencing of long reads of tens to hundreds of kbp. These so-called long reads are particularly promising, and are expected to help solve various problems such as contig and haplotype assembly or scaffolding. However, these reads are also much more error-prone than second-generation reads, and display error rates reaching 10 to 30%, depending on the sequencing technology and on the version of the chemistry. Moreover, these errors are mainly insertions and deletions, whereas most errors in Illumina reads are substitutions. As a result, long reads require efficient error correction, and a plethora of error correction tools directly targeted at these reads were developed in the past nine years. These methods can adopt a hybrid approach, using complementary short reads to perform correction, or a self-correction approach, making use only of the information contained in the long-read sequences. Both of these approaches rely on various strategies such as multiple sequence alignment, de Bruijn graphs, or hidden Markov models, or even combine different strategies.

In this paper, we survey the state of the art of long-read error correction, reviewing all the methodologies and tools available to date, for both hybrid correction and self-correction. Moreover, long-read characteristics such as sequencing depth, length, error rate, or even sequencing technology can affect how well a given tool or strategy performs, and can thus drastically reduce the correction quality. We therefore also present an in-depth benchmark of available long-read error correction tools on a wide variety of datasets, composed of both simulated and real data, with various error rates, coverages, and read lengths, ranging from small bacterial to large mammalian genomes.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

Pierre Morisse, Thierry Lecroq, Arnaud Lefebvre
doi: https://doi.org/10.1101/2020.03.06.977975
This article is a preprint and has not been certified by peer review.

Proud to present latest (double) contribution:

1) A complete survey on long-read error correction, describing the 29 tools available as of today;

2) An in-depth benchmark of these tools on a wide variety of 20 datasets, ranging from bacterial to human g


