Abstract

Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue commonly exists, but few tools are dedicated for computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public uses.

基因组序列为探索遗传内容提供了单碱基分辨率的基因组图。
测序技术,特别是长读序列技术,已经彻底改变了基因组组装技术,使其能够产生高度连续的基因组序列。
然而,目前的长读测序技术产生了包含许多错误的不准确的读码。
一些错误保留在组合序列中,这些错误通常不能通过长读或更准确的短读完全纠正。
这个问题通常存在,但是很少有工具专门用于计算错误率确定错误位置
在本研究中,我们开发了一种新的方法,称为k-mer丰度差异(KAD),来比较短读表示的每个k-mer的推断拷贝数和在组装中观察到的拷贝数。
简单的KAD度量可以将k-mers分类到反映装配质量的类别中。
具体来说,KAD方法可以用来识别基础误差和估计总体错误率。
此外,还可以检测序列的插入和删除以及序列冗余。
总的来说,KAD对于基因组装配的质量评估是有价值的,并且潜在地提供了一个诊断工具来帮助精确的错误纠正。
KAD软件的开发是为了方便公众使用。

INTRODUCTION

DNA sequencing technologies have revolutionized genetic and genomic analyses, facilitating de novo assemblies of genomes from various species with small to large complex genomes of species such as wheat

(1). Genome assemblies using Illumina technologies, here referred to as short-read sequences, are typically highly fragmented but sequence bases are accurate. The contiguity of genome assemblies can be dramatically improved by using long single-molecule sequencing technologies principally led by Pacific Biosciences SMRT and Oxford Nanopore (ONT) platforms

(2). Sequencing reads yielded from both technologies have relatively high rates of errors, and are dominated by small insertions or deletions. Although consensus sequences from high coverage of sequencing reads reduce errors in genome assemblies, nonrandom errors, or biased errors, in reads can result in inaccurate consensus sequences. Biased errors could be caused by epigenomic modifications that affect sequencing signals for base calling (3,4).

To mitigate per-base errors in a draft assembly, sequencing polishing algorithms have been developed using signal-level raw data (5,6). However, for genome sequences of many species, a great number of errors still exist after multiple rounds of sequence polishing, or error correction (7). Practically, errors can be further reduced via correction with additional Illumina short reads. Variants revealed by alignments of Illumina reads with assembled sequences are typically used for error correction (8). This strategy works well for small low-repetitive genomes because most assembly regions can be uniquely covered by Illumina reads. For large repetitive genomes, the strategy works less well due to a lower proportion of genomes uniquely aligned by Illumina reads, a higher rate of misalignments and even no alignments at some poorly assembled regions (9).

Assembly quality of genome sequences is related to assembly contiguity and completeness, correctness of sequence ordering and consensus base accuracy. Community-based projects such as GAGE (10) and Assemblathon (11) looked at a suite of criteria for a comprehensive assessment when benchmarking different assembly methods when the ground truth assembly is known. The value of N50, the length of the smallest contig of a set of the top long contigs that cover half of assembly space, is widely used as an indicator of assembly contiguity. Alignment rates or number of variants based on alignment to assembled sequences with genome sequencing reads or RNA sequencing reads, and comparison against the reference genomes of related varieties or species can indicate assembly quality. Tools were developed for comparing some of these parameters for genome assemblies, such as QUAST (12). Conserved benchmarking universal single-copy orthologs (BUSCO) (13) and core eukaryotic genes mapping approach (CEGMA) (14) have been used to assess genome completeness simply based on evaluating the coding or gene space. LTR (long terminal repeat) assembly index that indicates the assembly quality of LTR retrotransposons was designed to evaluate assembly continuity, extending assembly quality assessment to repetitive regions (15).

In addition, approaches for genome assembly characterization were also developed based on profiles of k-mers, substrings of length k from longer DNA sequences. K-mer-based approaches have been used to quantify genome size, repetitive levels and heterozygosity in assembled sequences (16–19), and to perform reference-free genome comparisons based on sequencing data (20,21). A method KAT (k-mer analysis toolkit) was developed to profile k-mer spectra of both sequencing reads and assemblies and to visualize the difference of k-mer abundance in the assembly and read data (22). Here, we quantified abundance of k-mers from sequencing reads and k-mer occurrences in the assembly genome, and developed a single value, k-mer abundance difference (KAD), per k-mer. Given a set of input reads, KAD analysis can evaluate the accuracy of nucleotide base quality at both genome-wide and single-locus levels, which, indeed, is appropriate, efficient and powerful for assessing genome sequences assembled with inaccurate long reads.

DNA测序技术已经彻底改变了遗传和基因组分析,促进了对不同物种的基因组进行从头组装,这些物种的基因组由小到大都很复杂,比如小麦

(1).使用Illumina技术的基因组装配,在这里称为短读序列,通常是高度片段化的,但序列碱基是准确的。
通过使用主要由太平洋生物科学公司SMRT和牛津纳米孔(ONT)平台主导的长单分子测序技术,基因组装配的连续性可以显著提高

(2)两种技术产生的测序reads都有较高的错误率,以小的插入或缺失为主。
尽管来自高覆盖率的序列序列减少了基因组装配中的错误,但非随机错误或偏置错误可能会导致不准确的一致序列。
表观基因组修饰会影响碱基调用的序列信号,从而导致偏差(3,4)。

为了减少草图装配中的每基误差,使用信号级原始数据开发了序列抛光算法(5,6)。
然而,对于很多物种的基因组序列,经过多次的序列抛光或纠错后,仍然存在大量的错误(7)。实际上,通过增加Illumina短读来进行纠错可以进一步减少错误。
Illumina reads与已组装序列比对所发现的变异通常用于错误校正(8)。这种策略对小的低重复基因组很有效,因为大多数装配区域可以被Illumina reads唯一覆盖。
对于大量重复的基因组,由于Illumina reads唯一对齐的基因组比例较低,在一些组装不良的区域出现错位甚至没有对齐的比率较高,所以该策略的效果不太好(9)。

基因组序列的装配质量装配的连续性和完整性序列排序的正确性一致性碱基精度有关。
基于社区的项目,如GAGE(10)和Assemblathon(11),在对不同的装配方法进行基准测试时,当ground truth的装配是已知的,考虑了一套全面评估的标准。
N50是覆盖了半个装配空间的顶部长叠维集合的最小叠维的长度,它的值被广泛用作装配连续性的指标。
利用基因组测序reads或RNA测序reads对组装后的序列进行比对,并与相关品种或物种的参考基因组进行比对,可以显示组装质量。
研究人员开发了一些工具来比较这些基因组组装的参数,如QUAST(12)。
保守标杆通用单拷贝标准克隆(BUSCO)(13)和核心真核基因定位方法(CEGMA)(14)已被用于仅基于编码或基因空间评估基因组完整性。
设计了LTR (long terminal repeat)装配指数,用以评价LTR逆转录转座子的装配质量,将装配质量评价扩展到重复区域(15)。

此外,基于k-mers(较长的DNA序列中长度为k的子串)的图谱,还开发了基因组装配表征的方法。
基于k -mer的方法已被用于量化基因组大小、重复水平和组合序列中的杂合性(16-19),并基于测序数据进行无参考的基因组比较(20,21)。
我们开发了KAT (k-mer analysis toolkit)方法来描述测序reads和组件的k-mer光谱,并可视化组装和读取数据中k-mer丰度的差异(22)。
在这里,我们通过测序读值和组装基因组中k-mer的出现量化了k-mers的丰度,并开发出单个值,即每个k-mer的k-mer丰度差异(KAD)。
给定一组输入reads, KAD分析可以评估全基因组和单位点水平上的核苷酸碱基质量的准确性,实际上,这对于评估由不准确的长reads组成的基因组序列是合适的、有效的和强大的。

MATERIALS AND METHODS

Simulation of genome sequences with single-nucleotide substitution errors   模拟基因组序列与单核苷酸替代错误

Genome sequences were simulated with various single-nucleotide substitution errors using the software simuG (23). The parameters of simuG were set as ‘-snp_count < total nucleotide number of genome × variation rate > -titv_ratio 0.5’.

Simulation of genome sequences with different error types                                 模拟不同错误类型的基因组序列

The software simuG was also used to simulate genome sequences with different error types (23). For single-nucleotide substitution errors, the parameters of simuG were set as ‘-snp_count < total nucleotide number of genome × variation rate > -titv_ratio 0.5’. For short insertions and deletions (INDELs), the parameter was set as ‘-indel_count < total nucleotide number of genome × variation rate > -ins_del_ratio 1’. For long sequence redundancy, the parameter was set as ‘-cnv_count < number of long sequence redundancy > -cnv_gain_loss_ratio Inf’. For long sequence deletion, the parameter was set as ‘-cnv_count < number of long sequence redundancy > -cnv_gain_loss_ratio 0’.

Simulation of reads with and without errors                                                          有错误和没有错误的读取模拟

The software DWGSIM (https://github.com/nh13/DWGSIM) was used to generate reads with or without errors (error-free reads) using reference genomes of multiple species with various genome sizes. To simulate reads without errors with different read depths, the parameter was set as ‘-e 0 -E 0 -C < read depth > -1 150 -2 150 -r 0 -R 0 -X 0 -y 0 -c 0 -S 0’. To simulate reads with single-nucleotide substitution errors with 50× read depth, the parameter was set as ‘-e < sequencing error rate > -E < sequencing error rate > -C 50 -1 150 -2 150 -r 0 -R 0 -X 0 -y 0 -c 0 -S 0’. The parameter -C represents read depth, and the parameters -1 and -2 represent the lengths of the first and second reads of paired-end reads. To simulate reads without errors, all the parameters controlling error rates in reads (-r, -R, -X, -y, -c, -S) were set to 0.

Calculation of true capture rate and false capture rate                                        真实捕获率和虚假捕获率的计算

Error k-mers were identified with the KAD script ‘KADprofile.pl’, which determined the KAD value per k-mer and identified error k-mers. To calculate the true capture rate (TCR) value that stands for the percentage of simulated base errors detected by KAD, all the error k-mers detected by KAD were aligned to their simulated genomes with bowtie (24) and the overlapping error k-mers were merged into error regions, which was implemented by the KAD script ‘KADdist.pl’. The ratio of simulated errors located in error regions to total simulated errors was calculated as the TCR value. The false capture rate (FCR) value stands for the percentage of error k-mers that do not overlap with simulated base errors. Therefore, the FCR value was determined by calculating the ratio of the number of error k-mers that do not overlap with simulated errors to the total number of error k-mers detected by KAD.

Xvv1601 whole genome sequencing via PacBio and genome assembly                    Xvv1601全基因组PacBio测序和基因组组装

The strain Xanthomonas vasicola pv. vasculorum (Xvv1601) was collected in a Kansas corn field in 2016. Bacterial growth and DNA extraction referred to a previous procedure (25). The 10–20 kb whole genome shotgun (WGS) libraries were constructed using Xvv1601 genomic DNAs. The library was sequenced with P6-C4 chemistry on SMRT cells of PacBio RS II at the Yale Center for Genomic Analysis. PacBio genome assembly: Canu (v1.3) was previously used for genome assembly (26). PacBio reads with the minimum length of 5 kb were used.

To generate Illumina data for the bacterium Xvv1601, the sequencing library was prepared using the Illumina TruSeq DNA LT Sample Prep Kit. Paired-end 2 × 300 bp reads were generated on an Illumina MiSeq at the Integrated Genomic Facility at Kansas State University. To examine the impact of read depths on error detection, the module ‘sample’ of the software seqtk (https://github.com/lh3/seqtk) was used to downsample Illumina reads to ∼90×, ∼80×, ∼70×, ∼60×, ∼50×, ∼40×, ∼30× and ∼20×.

Identification of polymorphisms between two Xvv1601 assembly versions             两个Xvv1601装配版本间多态性的鉴定

The software MUMmer 4 (27) was used to identify DNA polymorphisms between the two assemblies (canu and final) of the bacterial strain Xvv1601. The two assembly sequences were aligned with the nucmer command. Alignments were filtered with the command delta-filter with (-1 -l 10000 -i 90), which resulted in unique alignments with at least 10 kb matches and at least 90% identity between the two assembled genomes. The alignments passing the filtering were used for the variant discovery with ‘show-snps’.

B71 whole genome sequencing using Nanopore MinION

Wheat blast, a devastating emerging wheat disease, is caused by the fungus Magnaporthe oryzae Triticum (MoT) (28). Nanopore long reads from a virulent MoT strain B71 were produced for the de novo genome assembly. B71 nuclear genomic DNA was prepared as described previously (28). Genomic DNA was subjected to 20 kb size selection using Bluepippin cassette kit BLF7510 with High-Pass Protocol (Sage Science, USA), followed by library preparation with the SQK-LSK109 kit (Oxford Nanopore, UK). Library was loaded to the flow cell FLO-MIN106D (Oxford Nanopore, UK) and sequenced on MinION (Oxford Nanopore, UK). Guppy version 2.2.2 was used to convert Nanopore raw data (fast5) to FASTQ data with default parameters.

B71 Nanopore genome assembly and sequence polishing

Nanopore reads were input to Canu 1.8 for genome assembly with the following parameters: genomeSize = 45m; minReadLength = 5000; minOverlapLength = 1000; corOutCoverage = 80 (26). Nanopore reads were aligned back to the Canu assembly with minimap2 (2.14-r892) with the parameter of (-ax map-ont). Alignments in BAM format converted by Samtools (1.9) were input for assembly polishing with Nanopolish (version 0.11.0) with default parameters (https://github.com/jts/nanopolish). The Nanopolish procedure was repeated twice. Nanopolish-polished sequences were further polished with trimmed reads of Illumina sequencing data (SRA accession SRR6232156) using Pilon (version 1.23) (8). In the Pilon polishing, Illumina reads were aligned to Nanopolish-polished sequences with the aligner bwa (0.7.12-r1039) with default parameters of the module ‘mem’ (29). Pilon used bwa alignments and polish assembled sequences with the parameters of (–minmq 40 –minqual 15). Assembled contigs were renamed based on their similarity to the assembly of B71Ref1 (28). We also manually fixed a misassembly that joined a previously identified mini-chromosome’s sequence with chromosome 6. The assembly and polishing procedure resulted in ONTv0.14. The same Pilon procedure was applied to further polish the PacBio assembly B71Ref1 (28) with Illumina reads, resulting in a new assembly PBRef1.3.

Whole genome sequence alignment via NUCmer

The nucmer command from the software MUMmer 4 (27) was used for whole genome alignment between B71Ref1.3 and ONTv0.14. The parameter of ‘-L 1000’ was used in the command nucmer and the parameter of ‘-L 5000 -I 98’ in the command of show-coords, which resulted in alignments with at least 5 kb matches and at least 98% identity between the two assembled genomes.

KAD analysis to analyze B71 genome assemblies

Using trimmed B71 Illumina reads, the KAD analysis was performed for both assemblies ONTv0.14 and B71Ref1.3 with the script ‘KADprofile.pl’, which determined the KAD value per k-mer and grouped k-mers. The script ‘KADdist.pl’ was used to map k-mers to the assembly genomes and profile distributions of k-mers from each k-mer group, particularly the group of error k-mers.

KAD analysis to assess the maize reference genome

The version 4 of maize reference genome of the bred line B73 was downloaded from ensemblgenomes.org (30). B73 WGS sequencing data were from SRA accessions of SRR4039069 and SRR4039070. KAD was run using trimmed WGS data with 47-mer as the k-mer length.

KAT analysis

The KAT software (version 2.4.2) was downloaded from the KAT GitHub (https://github.com/TGAC/KAT). Different tools in KAT were run with the Xvv1601 and B71 datasets, respectively, to test KAT analysis. The KAT hist was used to identify k-mer spectra from Illumina sequencing data of Xvv1601 and B71, and the parameters of KAT hist were -t 4 -l 1 -h 1000000 -i 1 -m 25 -H 100000000. The KAT gcp was used to calculate GC contents of k-mers in Xvv1601 and B71 sequencing data, and the parameters of KAT gcp were -t 4 -x 1 -y 1000 -m 25 -H 100000000. The KAT comp was used to compare the sequencing data and assemblies. For Xvv1601, two assemblies, canu and final, were compared to Xvv1601 sequencing data with the KAT comp parameters -t 4 -m 25. For B71, two assemblies, ONTv0.14 and B71Ref1.3, were compared to B71 sequencing data with the same parameters.

KAD analysis on heterozygous genomes

The simulated heterozygous Escherichia coli genomes were used to perform the KAD analysis. Simulations were performed on the E. coli reference genome with the single-nucleotide substitution frequencies from 1% to 10%. The E. coli reference genome was combined with those simulated genomes to form heterozygous genomes. From each heterozygous genome, 25× reads with no errors were simulated from each haplotype. Then, the KAD analyses were performed using the E. coli reference genome with each simulated heterozygous read dataset.

RESULTS

Rationale of KAD profiling and software development

With the availability of long-noisy-read and short-accurate-read sequencing technologies, genome sequences nowadays are often constructed by using both long and short WGS reads. The Illumina sequencing platform is the dominant short-read technology and produces accurate reads with ∼0.1% error rate, predominated by single-nucleotide substitutions (31). High-depth sequencing data (e.g. 30× or above) and relatively uniformly distributed reads across the genome enable to quantify genome content through k-mer analysis. Specifically, the abundance of a k-mer from short reads is correlated with occurrence of the k-mer in the genome (21). For most genomes, single-copy k-mers each of which is present once in the genome are dominant among all k-mer sequences (non-redundant k-mers with one or multiple copies) derived from the genome. The mode of sequencing depths of single-copy k-mers ([Math Processing Error]m⁠), representing sequencing depth of read data, can be estimated from the spectrum of abundance of read k-mers that are k-mers generated from sequencing reads. For a given read k-mer, the k-mer abundance, or the count per k-mer in reads, is signified by [Math Processing Error]c⁠. The occurrence or the copy of the k-mer in a given genome can be estimated by [Math Processing Error]c/m⁠. In assembled sequences, the occurrence of the k-mer is signified by [Math Processing Error]n⁠. Therefore, [Math Processing Error]log2⁡(c/mn) represents the copy number difference between the estimate by reads and the copy of the k-mer in assembled sequences. Because the [Math Processing Error]n value per k-mer is 0 for the k-mers that are present in reads but absent in assembled sequences, the formula was adjusted as [Math Processing Error]log2⁡[(c+m)/m(n+1)]⁠, the value of KAD. Using this formula, KADs of k-mers with matching copies indicated by reads and the assembly should be 0 or around 0. If a single-copy k-mer from an assembly results from errors, and no such k-mer is found in reads, the KAD equals −1. Read k-mers missed in the assembly have positive KAD values. In such cases, high-copy k-mers from a genome that are well represented in reads but not in the assembly have high KAD values. Collectively, this simple KAD metric indicates how each k-mer matches with read data in copy number. Therefore, based on the KAD profile of all k-mers together, the quality of an assembly can be assessed using a common standard informed by a read dataset.

Base errors in assemblies can be detected through KAD analysis    组装器中的基本错误可以通过KAD分析检测出来

The use of KAD in error detection was tested by the simulation using an Ecoli reference genome (4.7 Mb). The KAD calculation requires a genome assembly and short reads generated from the genome. We simulated 50× reads without errors (error-free reads) from the E. coli reference genome and 10 sets of genome sequences with 0.1–1% single-nucleotide substitution errors (see the ‘Materials and Methods’ section). The KAD value using 25-mer as the k-mer size was determined for each k-mer derived from simulated genomes that contain varying numbers of errors. A k-mer with the KAD value equaling −1 was referred to as an error k-mer. As expected, the numbers of error k-mers increased with error rates of the simulated genomes (Figure 1A). We extended simulations using larger genomes from four additional species, namely yeast (Saccharomyces cerevisiae, 12.4 Mb), Arabidopsis (Arabidopsis thaliana, 121.6 Mb), rice (Oryza sativa japonica, 381.3 Mb) and maize (Zea mays, 2.2 Gb). For each genome, 50× error-free reads were simulated. Similar to the simulation using the E. coli genome, the number of error k-mers detected by KAD analysis in four species showed a linear correlation with error rates (Figure 1ASupplementary Figure S1 and Supplementary Table S1), which supported the conjecture that the number of error k-mers determined by KAD analysis accurately reflects error rates regardless of genome size and complexity.

Figure 1.

Error detection through KAD analysis.

(A) Number of error k-mers detected in simulated E. coli and maize genomes with different simulated error rates. The left and right y-axes represent the number of error k-mers detected in the simulated E. coli and maize genome sequences, respectively.

(B) Under the 0.5% rate of errors in simulated genomes, ratios of the number of error k-mers to the number of simulated errors in five simulated genomes were plotted versus corresponding genome sizes.

通过KAD分析进行错误检测。
(A)在不同模拟错误率的大肠杆菌和玉米基因组中检测到的k-mers错误数。
左边和右边的y轴分别代表在模拟大肠杆菌和玉米基因组序列中检测到的错误k-mers数量。
(B)在0.5%的模拟基因组错误率下,绘制出5个模拟基因组中错误k-mers数量与模拟错误数量的比值与相应的基因组大小。

Error detection through KAD analysis. (A) Number of error k-mers detected in simulated E. coli and maize genomes with different simulated error rates. The left and right y-axes represent the number of error k-mers detected in the simulated E. coli and maize genome sequences, respectively. (B) Under the 0.5% rate of errors in simulated genomes, ratios of the number of error k-mers to the number of simulated errors in five simulated genomes were plotted versus corresponding genome sizes.

Error k-mers were mapped to simulated genomes and the regions covered by error k-mers were referred to as error regions. The simulated errors located in error regions were considered as errors captured by KAD and the percentage of captured errors was defined as TCR. In relatively small genomes (Ecoli, yeast and Arabidopsis) with error rates ranging from 0.1% to 1%, the TCR values were all >99% (Table 1), which indicated that KAD analysis can inform almost all errors in these simulated genomes. For a genome with a moderate size (rice), the TCR values reduced to ∼95%. The TCR values were further reduced to ∼67% for maize that has a large and complicated genome, suggesting that the size and complexity of a genome have impacts on error detection.

错误k-mers被映射到模拟基因组中,被错误k-mers覆盖的区域被称为错误区域。
将位于误差区域的模拟误差视为KAD捕获的误差,定义捕获误差的百分比为TCR。
在相对较小的基因组(大肠杆菌、酵母和拟南芥)中,错误率在0.1%至1%之间,TCR值均为99%(表1),这表明KAD分析可以揭示这些模拟基因组中几乎所有的错误。
对于中等大小(水稻)的基因组,TCR值降至~ 95%。
对于基因组大而复杂的玉米,TCR值进一步降低到67%,这表明基因组的大小和复杂性对错误检测有影响。

DISCUSSION

Remarkable progress has been made to advance genome assemblies in recent decades, including cost-efficient and accurate high-throughput short-read sequencing, long-read single-molecule sequencing, and improved assembly and error correction algorithms (36,37). The evaluation of multiple versions of assemblies using different assembly algorithms and various assembly procedures is critical for the optimization of genome assemblies. It is also important to assess the final assembly products for information on errors and incompleteness. Here, we developed a simple but effective method for genome evaluation based on the quantitative comparison of k-mer abundances between accurate short-read sequencing data and the assembly sequences. The method, referred to as KAD, depicts how sequence contents from high-depth short reads are represented in each examined genome assembly, identifies the regions in the assembly where base errors occur, estimates the overall error rate of an assembly and finds the regions containing potential sequence redundancy or incompleteness. Additionally, problems with short-read data, if they exist, could be detected. The KAD method has been implemented and the scripts are freely accessible (https://github.com/liu3zhenlab/kad), which would be useful for genome assembly projects of a wide range of species from small bacterial genomes to large eukaryotic genomes.

KAD first analyzes high-depth (40× or higher recommended) Illumina data to determine the sequencing depth, or the coverage of the genome. This analysis assumes that k-mers having a single copy in the genome are most abundant among all k-mers generated from the genome, which is true for genomes of most species if the k-mer length is sufficiently long (e.g. 25 nt). The method can be used to assess genome sequences of heterozygous diploids, but caution is needed when analyzing polyploidy genomes such as wheat. Based on that assumption, the k-mer abundance that occurs most frequently (the mode of k-mer abundances) should represent the sequencing depth of single-copy k-mers if reads contain no sequencing errors. Reads with sequencing errors, even at a low frequency of sequencing errors, generate a very high number of k-mers with small abundance (e.g. 1–3), and can slightly reduce k-mer abundance of each correct k-mer. Such reduction is negligible as it is very small. We thus use the mode of k-mer abundances (the most frequent k-mer abundance that is not small) as the estimate of the sequencing depth, which is an important parameter in the formula to calculate KADs.

近几十年来,基因组装配技术取得了显著进展,包括成本高效、准确的高通量短读测序、长读单分子测序,以及改进的装配和纠错算法(36,37)。
利用不同的装配算法和不同的装配程序对不同版本的装配进行评估是优化基因组装配的关键。
评估最终装配产品的误差和不完整性也是很重要的。
在此,我们开发了一种简单而有效的基因组评估方法,该方法基于准确的短读测序数据和装配序列之间k-mer丰度的定量比较。
方法,称为科安达,描述了如何从最高深度短序列内容读取中表示每个基因组组装、检测识别基本错误发生的区域在组装,估计总体错误率的组装和发现包含潜在的序列冗余或不完备的地区。
此外,如果存在短读数据的问题,则可以检测到。
KAD方法已经实现,并且脚本可以免费获取(https://github.com/liu3zhenlab/kad),这将对从小型细菌基因组到大型真核生物基因组的广泛物种基因组组装项目非常有用。

KAD首先分析深度高(推荐40倍或更高)的Illumina数据,以确定测序深度,或基因组的覆盖范围。
该分析假设基因组中只有一个拷贝的k-mers在所有由基因组产生的k-mers中最为丰富,如果k-mer长度足够长(例如25 nt),那么大多数物种的基因组都是如此。
该方法可用于评估杂合子二倍体的基因组序列,但在分析小麦等多倍体基因组时需要谨慎。
基于这一假设,如果reads不包含测序错误,那么发生最频繁的k-mer丰度(k-mer丰度模式)应该代表单拷贝k-mers的测序深度。
有测序错误的Reads,即使测序错误的频率很低,也会产生大量低丰度的k-mers(例如1-3),并且会略微降低每个正确k-mer的k-mer丰度。
这种降低是可以忽略的,因为它很小。
因此我们使用k-mer丰度模式(最常见的不小的k-mer丰度)作为序列深度的估计,这是计算KADs公式中的一个重要参数。

For a complete assembly with no errors and unbiased sequencing data with no contamination, the KAD profile would be ideal in that KADs of all k-mers would be around 0. In reality, problematic k-mers with KADs distant from 0 can be found. We categorized problematic k-mers into four groups: OverRep, Error, LowUnderRep and HighUnderRep. The group of ‘Error’ is well defined. KAD values of all k-mers of this group are –1, which indicates that no such k-mers are produced from sequencing reads, but they are present in the assembly. Importantly, the number of k-mers in this group (error k-mers) reflects the sequence error rate of the assembly. Our simulation data indicate that the assembly error rate can be estimated using the number of error k-mers. This estimation is particularly useful nowadays because many genome assemblies retain a number of errors from noisy long reads that are used for assemblies (7). Importantly, most of these errors create new sequences that are not present in the actual genome sequences. Therefore, most error k-mers can be unambiguously mapped to error regions in the assembled sequences even though error k-mers are located in highly repetitive regions. This provides a unique approach to identify errors in repetitive sequences that are error-prone. The other three groups are arbitrarily categorized by defining the range of KADs. KAD thresholds to define these categories are adjustable but the defaults worked well for all the genomes that we tested.

The over-represented group includes k-mers with low KAD values (e.g. <−1), representing k-mers occurring multiple times in the assembly, but are absent or represented less frequently in reads than expected. Most likely, these k-mers are derived from sequence redundancy in the assembly, or regions containing systematic errors across multiple locations in the assemblies, or sequences in the genome that are biasedly under-represented in reads. Biased under-representation in reads could occur at extremely highly GC- or AT-rich regions (25). The under-represented groups include k-mers with high positive KAD values (e.g. >0.75), representing k-mers showing fewer copies in the assembly than indicated by reads. For example, k-mers from a region incorrectly collapsed due to tandem repeats would be categorized in these groups. The higher the KAD values, the higher the level of potential assembly incompleteness. However, when k-mers are derived from high-copy organelles or plasmids, the high KAD values reflect the fact that high copies of the sequence are present but only one is probably assembled. In addition, when k-mers are from contamination in short reads, their KAD values could be high. When mapping over- and under-represented k-mers to the assembly, the issue of multiple mapping is frequently encountered. KAD scripts allow users to tune the parameter of the maximum number of locations, which enables the examination of k-mers that are located at multiple genomic regions. However, users need to bear this in mind that the over- or under-represented signal from a region could be repeatedly shown at multiple locations.

对于没有错误的完整装配和没有污染的无偏置测序数据,KAD profile将是理想的,因为所有k-mers的KAD将在0左右。
在现实中,可以找到KADs距离为0的有问题的k-mers。
我们将有问题的k-mers分为四组:OverRep、Error、LowUnderRep和HighUnderRep。
“错误”的组是明确定义的。
这组k-mers的KAD值均为-1,说明该k-mers不是由测序reads产生的,而是存在于assembly中。
重要的是,这一组中k-mers的数量(error k-mers)反映了装配的序列错误率。
我们的仿真数据表明,装配错误率可以用误差k-mers的数目来估计。
这种估计在当今尤为有用,因为许多基因组装配在用于装配的噪声长读取中保留了许多错误(7)。重要的是,这些错误中的大多数产生了实际基因组序列中不存在的新序列。
因此,即使错误k-mers位于高度重复的区域,大多数错误k-mers也可以清晰地映射到已组装序列中的错误区域。
这为识别容易出错的重复序列中的错误提供了一种独特的方法。
通过定义kad的范围,可以任意地对其他三组进行分类。
定义这些类别的KAD阈值是可调整的,但默认值对我们测试的所有基因组都很有效。

过度代表的组包括KAD值较低的k-mers(例如&lt;−1),表示k-mers在组装中多次出现,但在读取中没有出现或出现的频率低于预期。
最有可能的是,这些k-mers来自于装配中的序列冗余,或者装配中的多个位置包含系统错误的区域,或者基因组中的序列在reads中偏少的代表。
读操作中的偏少表示可能发生在GC或at含量极高的区域(25)。
代表性不足的组包括高阳性KAD值的k-mers(例如&gt;0.75),表明k-mers在装配中比reads显示的拷贝数要少。
例如,由于串联重复而不正确折叠的区域的k-mers将被归类到这些组中。
KAD值越高,潜在装配不完全程度越高。
然而,当k-mers来自于高拷贝的细胞器或质粒时,高KAD值反映了序列存在高拷贝但可能只有一个被组装的事实。
此外,当k-mers短时间内来自污染时,它们的KAD值可能很高。
在将过多和过少表示的k-mers映射到程序集时,经常会遇到多重映射的问题。
KAD脚本允许用户调整位置的最大数量的参数,这使得检查位于多个基因组区域的k-mers成为可能。
然而,用户需要记住这一点,即一个区域的信号过多或过少都可能在多个位置重复显示。

In this study, for most analyses, we chose 25-mer as the k-mer size because it is an optimal size in the genome assembler ALLPATHS-LG for analyzing k-mer abundance spectrum (38) and we have successfully applied it to examine repetitive sequences of maize that has a large complex genome (21). A shorter k-mer size can have a higher resolution to pinpoint error regions but compromises the uniqueness of each k-mer in the assembled genome sequences. In our simulation result, we showed that 25-mer has high power and accuracy for detecting base error in assembled genomes with small or moderate sizes. We also showed that large complex genomes like maize need longer k-mers for improving the uniqueness of k-mers in the genomes. However, error regions identified using longer k-mers are wider and more likely to cover multiple errors, and the analysis requires more computation resources. In addition, sequencing reads used in KAD analysis are not error-free. The longer the k-mer size, the higher the likelihood that a k-mer from reads would carry errors (39). Therefore, slightly higher depth than 40× would help ensure reliable error detection when a longer k-mer length (e.g. 49-mer) is used.

The development of the KAD analysis was inspired by a motivation to generate a high-quality genome assembly and to develop a method to compare different assemblies for nomination of the final release. With the KAD bioinformatics pipeline, we can quantify the overall sequence error rate and locate errors. Existing error correction algorithms can use the information provided by KAD for targeted error correction, which should reduce false correction. KAD can also help to determine whether one or multiple rounds of polishing were sufficient to meet the convergence criteria, for example, on the basis of the diagnostic plots including KAD error profiles and KAD landscape plots. In the future, new error correction approaches, particularly approaches that are not based on read alignments, could be developed along with KAD for precise error correction in both non-repetitive and repetitive sequences.

在本研究中,对于大多数分析,我们选择25-mer作为k-mer大小,因为在基因组组装器ALLPATHS-LG中,25-mer是分析k-mer丰度谱的最佳大小(38),并且我们已经成功地应用它来检测具有较大复杂基因组的玉米重复序列(21)。
更短的k-mer大小可以有更高的分辨率,以查明错误区域,但损害了每个k-mer在组装的基因组序列的唯一性。
在我们的模拟结果中,我们表明25-mer在检测小或中等大小的组合基因组中的基误差方面具有高的功率和精度。
我们还发现,像玉米这样的大型复杂基因组需要更长的k-mers来提高基因组中k-mers的唯一性。
然而,使用较长的k-mers识别的错误区域更宽,更可能涵盖多个错误,分析需要更多的计算资源。
此外,在KAD分析中使用的序列读取也不是没有错误的。
k-mer的大小越长,读取的k-mer携带错误的可能性就越大(39)。
因此,当使用较长的k-mer长度(例如49-mer)时,深度略高于40倍将有助于确保可靠的错误检测。

KAD分析的发展是受到一个动机的启发,以产生一个高质量的基因组组装和发展一种方法来比较不同的组装提名最终发布。
利用KAD生物信息学管道,我们可以量化总体序列错误率和定位错误率。
现有的纠错算法可以利用KAD提供的信息进行有针对性的纠错,从而减少误纠错。
KAD还可以帮助确定一次或多次抛光是否足以满足收敛标准,例如,在诊断图的基础上,包括KAD误差轮廓和KAD景观图。
在未来,新的纠错方法,特别是不基于读取对齐的方法,可以与KAD一起开发,用于对非重复序列和重复序列进行精确纠错。

DATA AVAILABILITY

Bacterial sequencing raw data and genome assembly are available in the NCBI BioProject under the accession number of PRJNA421835.

Fungal Nanopore sequencing raw data and genome assembly are available in the NCBI BioProject under the accession number of PRJNA355407.

 KAD codes are available at GitHub (https://github.com/liu3zhenlab/KAD)