A new era of long-read sequencing for cancer genomics

Abstract

Cancer is a disease largely caused by genomic aberrations. Using a range of rapidly emerging sequencing technologies, researchers have studied cancer genomes to understand the molecular status of cancer cells and to reveal their vulnerabilities, such as driver mutations or gene expression. Long-read technologies enable us to identify and characterize novel types of cancerous mutations, including complicated structural variants at haplotype resolution. In this review, we introduce three representative platforms for long-read sequencing and research trends in cancer genomics based on long-read data. Further, we describe how long-read sequencers can elucidate aberrant transcriptome and epigenome statuses, namely fusion transcripts, aberrant transcript isoforms, and the phase information of DNA methylation. Long-read sequencing may shed light on novel types of aberrations in cancer genomics that are being missed by conventional short-read sequencing analyses.

Introduction

Cancer cells harbor mutations in their genomes, some of which affect the function of driver and tumor suppressor genes, resulting in abnormal proliferation and the initiation or progression of carcinogenesis. Drugs targeting driver events show appreciable efficacy in shrinking tumors. For example, EGFR tyrosine kinase inhibitors are effective against lung adenocarcinomas with EGFR mutations [1]. The identification of driver genes and of the vulnerabilities of cancer cells has progressed rapidly by means of sequencing technologies.

Modern sequencing technologies are being developed rapidly, making it easier to identify and characterize the mutations in each cancer case. Many consortia, such as ICGC [2] and TCGA [3], have sequenced, analyzed, and reported on the genomic status specific to each cancer subtype. They have mainly focused on point mutations, such as single-nucleotide variants (SNVs) and short indels, because short-read sequencing techniques are generally used for genotyping. However, other types of genomic aberrations are highly complicated. Detecting and precisely identifying structural variants (SVs) of various sizes, as well as mutations in repetitive regions, is challenging with short reads, which are only a few hundred bases at the longest. Detection accuracy and precision remain limited, even though many bioinformatics tools and pipelines have been developed for these tasks (e.g., Pindel, DELLY2, Manta, SvABA) [4,5,6,7]. Short reads also lack phasing information, which means that we cannot tell on which allele a mutation occurred. To compensate for these weaknesses of short-read sequencing, sequencing technologies that read longer DNA molecules are highly desirable in the field of cancer genomics.
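The phasing problem described above can be made concrete with a toy example. The following is a minimal sketch, not a method from any of the cited tools: reads are represented as hypothetical dictionaries mapping a genomic position to the observed base, and the function name `phase_mutation`, the positions, and the alleles are all assumptions chosen for illustration. A mutation can be assigned to a haplotype only by reads that span both the mutation and a nearby heterozygous SNP.

```python
from collections import Counter


def phase_mutation(reads, snp_pos, mut_pos, snp_to_haplotype, mut_allele):
    """Assign a mutation to a haplotype using reads that span both sites.

    reads: list of dicts {position: base} (hypothetical toy representation).
    snp_to_haplotype: dict mapping the base at the heterozygous SNP to a
        haplotype label, e.g. {"A": "hap1", "G": "hap2"}.
    Returns the supported haplotype label, or None if no read is informative
    or the evidence is tied.
    """
    votes = Counter()
    for read in reads:
        # Only reads covering BOTH positions carry phase information.
        if snp_pos not in read or mut_pos not in read:
            continue
        # Only reads that actually carry the mutant allele vote.
        if read[mut_pos] != mut_allele:
            continue
        haplotype = snp_to_haplotype.get(read[snp_pos])
        if haplotype is not None:
            votes[haplotype] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no confident assignment
    return ranked[0][0]


snp_to_haplotype = {"A": "hap1", "G": "hap2"}

# Short reads: each covers only one of the two sites (5 kb apart).
short_reads = [{100: "A"}, {100: "G"}, {5100: "T"}, {5100: "C"}]

# Long reads: each covers both sites.
long_reads = [{100: "A", 5100: "T"}, {100: "A", 5100: "T"}, {100: "G", 5100: "C"}]

print(phase_mutation(short_reads, 100, 5100, snp_to_haplotype, "T"))
print(phase_mutation(long_reads, 100, 5100, snp_to_haplotype, "T"))
```

In this toy setting the short reads give no answer because none of them covers both sites, whereas the long reads place the mutant allele on one haplotype. Real phasing tools additionally account for sequencing errors and mapping quality.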

Many long-read sequencing technologies have been developed and utilized in recent years. For example, single-molecule real-time (SMRT) sequencing [8] is a long-read method developed by Pacific Biosciences (PacBio). This method is based on a single DNA polymerase molecule fixed in a zero-mode waveguide (ZMW), which is a nanostructure for fluorescence detection. Using SMRT sequencing, we can obtain reads longer than 10 kb. In a recent report, approximately half (at least 26%) of the reads were ≥ 10 kb in length, and these datasets were used to construct comprehensive catalogs of common SVs in the human genome [9].
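Read-length statistics such as the fraction of reads ≥ 10 kb can be computed directly from a list of read lengths. The sketch below is illustrative only: the function name `read_length_summary` and the example lengths are assumptions, not data from the cited report.

```python
def read_length_summary(read_lengths, threshold=10_000):
    """Summarize a list of read lengths (in bases).

    Returns the number of reads, the mean length, and the fraction of
    reads whose length is at least `threshold`.
    """
    if not read_lengths:
        raise ValueError("no reads given")
    n_long = sum(1 for length in read_lengths if length >= threshold)
    return {
        "n_reads": len(read_lengths),
        "mean_length": sum(read_lengths) / len(read_lengths),
        "fraction_long": n_long / len(read_lengths),
    }


# Made-up lengths, for illustration only.
example_lengths = [2_500, 7_800, 10_000, 14_200, 31_000]
summary = read_length_summary(example_lengths)
print(summary)
```

In practice the lengths would be collected from a FASTQ or BAM file; that step is omitted here to keep the example self-contained.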

Nanopore-type sequencers have been commercialized by Oxford Nanopore Technologies. Protein nanopores are arrayed on a membrane and detect changes in the electrical current as a DNA or RNA molecule passes through the pore, permitting direct sequencing of the molecules. MinION is a portable long-read sequencing platform with low initial costs that can yield > 5 Gb in each run. The library preparation is also simple to conduct and takes only ~48 h for each sequencing run. Furthermore, a larger platform, PromethION, can achieve ~10 times the sequencing output of MinION. In our study, we used both MinION and PromethION for whole-genome sequencing of the lung cancer cell line LC2/ad. The mapped reads were ~16 and 14 kb long on average, respectively (up to 32 kb) [10]. For much longer reads, Jain et al. [11] reported a protocol for generating ultra-long reads (up to > 800 kb) to sequence and assemble the human genome, with the intention of characterizing difficult regions that include repetitive sequences and complicated structural variations. Correspondingly, it has also been reported that these long reads can be used to probe regions that were previously inaccessible to conventional short-read sequencers [12], underlining the advantages that long-read sequencing offers. Oxford Nanopore sequencers make it easy to obtain long reads, although their sequencing accuracy is lower than that of short-read sequencing technologies.

In contrast to these physical long-read sequencers, researchers can also obtain synthetic long reads, which are reconstructed from short reads by means of barcode sequences that tag each high-molecular-weight DNA molecule. 10x Genomics released a linked-read technology in which the Chromium system generates oil droplets containing barcoded gel beads, reaction reagents, and DNA molecules (> 100 kb). Only 1 ng of genomic DNA is needed. This method provides the phase information of SNPs for haplotyping the genome (N50 phase block lengths ranged from 0.9 to 2.8 Mb) [13] and enables the detection of SVs by following the molecular barcodes specific to each large DNA fragment.
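The idea behind linked reads can be sketched in a few lines: short reads that share a barcode and map close together are assumed to come from the same long DNA molecule. The code below is a simplified illustration and not the algorithm used by the 10x Genomics software; the function names, the tuple layout of the reads, and the `max_gap` value are all assumptions. A helper for the N50 statistic, which is used above to describe phase block lengths, is included as well.

```python
from collections import defaultdict


def infer_molecules(reads, max_gap=50_000):
    """Group barcoded short reads into inferred long molecules.

    reads: iterable of (barcode, chrom, start, end) tuples.
    Reads sharing a barcode and chromosome are merged into one molecule
    as long as consecutive reads lie within `max_gap` bases of each other.
    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    """
    by_key = defaultdict(list)
    for barcode, chrom, start, end in reads:
        by_key[(barcode, chrom)].append((start, end))

    molecules = []
    for (barcode, chrom), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        n_reads = 1
        for start, end in intervals[1:]:
            if start - cur_end > max_gap:
                # Gap too large: close the current molecule, start a new one.
                molecules.append((barcode, chrom, cur_start, cur_end, n_reads))
                cur_start, cur_end, n_reads = start, end, 1
            else:
                cur_end = max(cur_end, end)
                n_reads += 1
        molecules.append((barcode, chrom, cur_start, cur_end, n_reads))
    return molecules


def n50(lengths):
    """Return the length L such that blocks of length >= L cover at least
    half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Made-up reads, for illustration only.
example_reads = [
    ("AAAC", "chr1", 1_000, 1_150),
    ("AAAC", "chr1", 30_000, 30_150),
    ("AAAC", "chr1", 500_000, 500_150),
    ("GGTT", "chr1", 2_000, 2_150),
]
for molecule in infer_molecules(example_reads):
    print(molecule)
print(n50([100, 200, 300, 400]))
```

A barcode whose reads map to two distant loci, as with the third `AAAC` read here, is the kind of signal that can point to an SV when many barcodes show the same pattern, although it can also arise when two unrelated molecules share a droplet.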

Long-read sequencing is becoming more prevalent, and cancer studies using long-read information are therefore increasing rapidly in an effort to decipher complicated cancer genomes. Here, we introduce recent long-read analyses in cancer research and the new perspectives on cancer genomics that long-read sequencing brings.
