Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics

Simon Ardui, Adam Ameur, Joris R Vermeesch, Matthew S Hestand

Nucleic Acids Research, Volume 46, Issue 5, 16 March 2018, Pages 2159–2168, https://doi.org/10.1093/nar/gky066

Published: 01 February 2018

Abstract

Short read massive parallel sequencing has emerged as a standard diagnostic tool in the medical setting. However, short read technologies have inherent limitations such as GC bias, difficulties mapping to repetitive elements, trouble discriminating paralogous sequences, and difficulties in phasing alleles. Long read single molecule sequencers resolve these obstacles. Moreover, they offer higher consensus accuracies and can detect epigenetic modifications from native DNA. The first commercially available long read single molecule platform was the RS system based on PacBio's single molecule real-time (SMRT) sequencing technology, which has since evolved into their RSII and Sequel systems. Here we capsulize how SMRT sequencing is revolutionizing constitutional, reproductive, cancer, microbial and viral genetic testing.

Issue Section:

SURVEY AND SUMMARY

INTRODUCTION

Modern medical genomics research and diagnostics rely heavily on DNA sequencing. Sequencing technologies are used in a wide range of applications during the entire human lifespan, from prenatal diagnostics, to newborn screening, to diagnosing rare diseases, hereditary forms of cancer, pharmacogenetics testing and predisposition testing for a plethora of diseases. It can even include testing for future generations in terms of carrier screening and pre-implantation genetic diagnoses (1,2).

The history of sequencing technologies can be broken up into three phases: first-, second- and third-generation sequencing (3). Though earlier first-generation technologies provided groundbreaking discoveries, the big revolution in sequencing began with the invention of the ‘chain-termination’ or dideoxy technique, which is today called Sanger sequencing (3,4). Improvements in chemistry and the switch from gels to capillary-based electrophoresis led to the current Sanger machines that provide low-throughput, high quality reads of up to ∼1 kb. Sanger sequencing is still often referred to as the gold standard and is commonly used for diagnosing Mendelian disorders (5) and targeted validation of higher-throughput sequencing results.

The first decade of the 21st century brought forth the development of multiple new methods of DNA sequencing (6). As opposed to first-generation platforms, these new second-generation technologies have considerably shorter reads (up to a few hundred bp), but at massively higher throughput (up to billions of reads per run). Common short-read platforms based on fluorescence include Illumina's bridge amplification and sequencing by synthesis technologies (e.g. HiSeq and MiSeq), Roche 454 pyrosequencers, and Applied Biosystems' sequencing by oligonucleotide ligation and detection (SOLiD) platforms. Additional short-read platforms include the Ion Torrent sequencers, which detect nucleotides by the change in pH caused by hydrogen ions released during polymerisation, as opposed to light signals. Though these short-read platforms have permitted scientists to quickly hunt for causative mutations in a panel of disease genes, the exome, or even the entire human genome in both research and clinical settings (7), they all share common pitfalls and drawbacks. The short read lengths hinder assigning reads to complex parts of the genome (8), phasing of variants (9) and resolving repeat regions (10), and they introduce gaps and ambiguous regions in de novo assemblies (11,12). The amplification steps during library preparation and/or the actual sequencing reaction also introduce chimeric reads (13), variation in repeat size, and an underrepresentation of GC-rich/poor regions. Taken together, these drawbacks hinder the utility of diagnostic variant detection.

Third-generation sequencing is generally characterized as single molecule sequencing and is fundamentally different from clonal-based second-generation sequencing methods. Helicos provided the first commercial application of single molecule sequencing based on fluorescence detection and sequencing by synthesis. Though lacking amplification biases, such as underrepresentation of GC-rich/poor regions, this early single molecule sequencing still produced short (often 35 bp) read lengths (14). Two newer technologies, single molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) (15) and nanopore sequencing by Oxford Nanopore Technologies (16), offer the advantages of single molecule sequencing, including exceptionally long read lengths (>20 kb). These platforms permit sequencing/assembly through repetitive elements, direct variant phasing, and even direct detection of epigenetic modifications (17,18). Sequencing also lasts only several hours, which is very appealing in a diagnostic setting. Though simple and low-cost nanopore-based technologies (reviewed in (18–20)) are catching on and likely represent future platforms, SMRT sequencing is currently more mature and therefore more diagnostically applicable at this time. Here we review how SMRT sequencing is being implemented in human genetic diagnostics.

SMRT SEQUENCING TECHNOLOGY AND TERMINOLOGY

Before SMRT sequencing, a library needs to be prepared from double stranded DNA input material (Figure 1A). This typically requires five or more micrograms of DNA, which can limit some applications. The library preparation consists of simply ligating hairpin adapters onto DNA molecules, thereby circularizing them into a construct termed a SMRTbell (Figure 1B) (21). Next, a primer and a polymerase are annealed to the adapter, whereupon the library is loaded on a SMRT Cell containing 150 000 nanoscale observation chambers (Zero Mode Waveguides (ZMWs)) for the RSII system and up to a million on the newer Sequel platform. The polymerase-bound SMRTbells are then loaded into the ZMWs (Figure 1C). Ideally, as many ZMWs as possible should be loaded with exactly one SMRTbell to maximise throughput and read lengths. For a good run, this is around one third to one half of the ZMWs per SMRT Cell. Hence a SMRT Cell typically produces ∼55 000 reads for the RSII system and ∼365 000 reads for the Sequel system (Table 1). The actual sequencing reaction occurs within each ZMW, whose small diameter only permits the smallest available volume for light detection (22). The polymerase within each ZMW incorporates fluorescently labeled nucleotides, emitting a fluorescent signal that is recorded by a camera in real-time (Figure 1C). These signals are converted to long sequences termed continuous long reads (CLR) (22), linear reads, or polymerase reads. For a short insert library, the circular structure of the molecule results in the insert sequence being covered multiple times by the CLR. Each pass of an original strand is termed a subread. In addition, all subreads from the same molecule can be combined into one highly accurate consensus sequence termed a circular consensus sequence (CCS) or reads-of-insert (ROI) (Figure 1F–H, left panel). These two terms are often used interchangeably, but by definition the difference is that CCS requires two full sequencing passes of the insert whereas ROI can be defined starting from even a partial pass.
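The loading figures above can be sanity-checked with a simple model. The sketch below assumes, which the source does not state, that the number of SMRTbells landing in a ZMW follows a Poisson distribution. Under that assumption the fraction of ZMWs holding exactly one molecule peaks at 1/e (about 37%) when the mean is one molecule per ZMW, which is close to the lower end of the 'one third to one half' range. The function names are illustrative only.

```python
import math


def single_loaded_fraction(mean_molecules_per_zmw: float) -> float:
    """Poisson probability that a ZMW holds exactly one molecule.

    Assumption (not from the source): loading is Poisson-distributed.
    """
    lam = mean_molecules_per_zmw
    return lam * math.exp(-lam)


def expected_reads(n_zmws: int, mean_molecules_per_zmw: float) -> float:
    """Expected number of productive (singly loaded) ZMWs per SMRT Cell."""
    return n_zmws * single_loaded_fraction(mean_molecules_per_zmw)


if __name__ == "__main__":
    # ZMW counts are taken from the text; the mean of 1.0 is the Poisson optimum.
    for platform, n_zmws in [("RSII", 150_000), ("Sequel", 1_000_000)]:
        print(platform, round(expected_reads(n_zmws, 1.0)))
```

Under- or over-loading (a mean below or above one molecule per ZMW) lowers the singly loaded fraction in this model, which is one way to see why loading concentration matters for throughput.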

Figure 1. Overview of SMRT Sequencing Technology. Sequencing starts with preparing a library from double stranded DNA (A) to which hairpin adapters are ligated (B). This library is thereafter loaded onto a SMRT Cell made up of nanoscale observation chambers (Zero Mode Waveguides (ZMWs)). The DNA molecules in the library will be pulled to the bottom of the ZMW where the polymerase will incorporate fluorescently labelled nucleotides (C). Note that not all ZMWs will contain a DNA molecule because the library is loaded by diffusion. The fluorescence emitted by the nucleotides is recorded by a camera in real-time. Hence, not only the fluorescence color can be registered, but also the time between nucleotide incorporations, which is called the interpulse duration (IPD) (D, right panel). When a sequencing polymerase encounters nucleotides on the DNA strand containing an (epigenetic) modification, for example a 6-methyl adenosine modification (E, left panel), then the IPD will be delayed (E, right panel) compared to non-methylated DNA (D, right panel). Due to the circular structure of the library, a short insert will be covered multiple times by the continuous long read (CLR). Each pass of the original DNA molecule is termed a subread, which can be combined into one highly accurate consensus sequence termed a circular consensus sequence (CCS) or reads-of-insert (ROI) (F–H, left panel). Though SMRT sequencing always uses a circular template, long insert libraries typically only have a single pass and hence generate a linear sequence with single pass error rates (black nucleotides) (F and G, right panel). Afterwards, overlapping single passes can be combined into one consensus sequence of high quality (H, right panel). Overall, CCS reads have the advantage of being very accurate while single passes stand out for their long read lengths (>20 kb).

Table 1. Comparison of PacBio sequencing platforms to two current industry standards

Platform	Read length	Number of reads	Error rate	Run time
PacBio RSII (per SMRT Cell)	Average 10–16 kb	∼55 000	13–15%	0.5–6 hours
PacBio Sequel (per SMRT Cell)	Average 10–14 kb	∼365 000	13–15%	0.5–10 hours
Illumina HiSeq 4000	2 × 150 bp	5 billion	∼0.1%	<1–3.5 days
Illumina MiSeq	2 × 300 bp	25 million	∼0.1%	4–55 hours

Numbers from personal experience and company website (www.pacb.com and www.illumina.com) queries on 14 November 2017.

Due to the real-time detection of the nucleotide incorporation rate, the pace of the polymerase progressing through the DNA strand is registered during sequencing (23). The time between nucleotide incorporations is termed the interpulse duration (IPD) and varies with epigenetic changes on the DNA (Figure 1D and E). Since a polymerase is not holding a single nucleotide during sequencing, but approximately twelve nucleotides, an epigenetic change on one nucleotide can actually affect the incorporation rate of surrounding nucleotides. This results in a ‘fingerprint,’ (24) some of which have been characterized, such as for 6-mA, 4-mC and (Tet-converted) 5-mC.
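The IPD comparison described above can be illustrated with a minimal sketch. It assumes per-position lists of IPD measurements (in seconds) from a native sample and from an unmodified control, already aligned to the same reference positions. The function names and the ratio threshold of 2.0 are hypothetical choices for illustration, not values from the source or from PacBio's software.

```python
from statistics import mean


def ipd_ratios(native_ipds, control_ipds):
    """Per-position ratio of mean native IPD to mean control IPD.

    Both arguments are lists (one entry per reference position) of lists of
    IPD observations. A ratio well above 1 means the polymerase paused longer
    on the native template at that position.
    """
    if len(native_ipds) != len(control_ipds):
        raise ValueError("native and control must cover the same positions")
    return [mean(n) / mean(c) for n, c in zip(native_ipds, control_ipds)]


def flag_positions(ratios, threshold=2.0):
    """Indices whose IPD ratio meets a (hypothetical) threshold."""
    return [i for i, r in enumerate(ratios) if r >= threshold]


if __name__ == "__main__":
    native = [[0.5, 0.5], [2.0, 2.0], [0.5, 0.5]]
    control = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    ratios = ipd_ratios(native, control)
    print(ratios, flag_positions(ratios))
```

Because a modification can shift the kinetics of neighbouring positions as well, a real analysis would look at the pattern of ratios around a site rather than at a single position in isolation.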

In addition to fewer but longer reads (Table 1), PacBio data differ from short read sequencing data in several respects. Reads do not have a set length; instead, read lengths follow a distribution that depends on how long each individual polymerase is active. Since there is no need for amplification during library preparation or during the sequencing process, biases such as GC-skewing are nearly absent. In contrast to second-generation platforms, raw PacBio reads also differ in error types (more indels than mismatches) and have a much higher error rate (∼13–15%, Table 1), though the errors are spread randomly across the reads (25,26). This randomness enables highly accurate consensus sequences (>99%) to be built up rapidly by sequencing the same molecule multiple times (CCS reads) (15) or by combining different CLRs derived from the same locus (Figure 1G and H). Also, diffusion loading creates a preference towards shorter molecules, which might negatively impact sequencing runs. This loading bias can be mitigated by using magbead loading, which keeps molecules <1 kb from binding to the bottom of ZMWs, by size selection to remove short molecules, and/or by adding polyethylene glycol during loading to enhance packing of large DNA molecules. It is possible that completely length-independent loading can be achieved in the (near) future by applying an electrical field to force charged molecules into ZMWs (27).
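Why random errors average out can be shown with a small calculation. The sketch below is a deliberately simplified model and not the consensus algorithm used by PacBio's software: it assumes each pass is wrong at a given position independently with the same probability, and it pessimistically treats every wrong pass as voting for the same wrong base. It then computes the chance that a simple majority vote across passes is wrong.

```python
from math import comb


def majority_error(per_pass_error: float, passes: int) -> float:
    """Probability that more than half of the passes are wrong at a position.

    Assumptions (illustrative only): errors are independent between passes,
    and all erroneous passes agree with each other (a worst case).
    Use an odd number of passes to avoid ties.
    """
    p = per_pass_error
    needed = passes // 2 + 1  # smallest number of wrong passes that wins the vote
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    # 15% is the upper end of the raw error rate quoted in Table 1.
    for n in (1, 3, 5, 9, 15):
        print(n, majority_error(0.15, n))
```

In this toy model the per-position error drops below 1% by nine passes at a 15% raw error rate, in line with the >99% consensus accuracy quoted above. Real consensus callers must also handle indels and alignment, which this sketch ignores.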

To address these inherently different reads, bioinformatic analyses require adapting current tools and/or developing new methods, such as for alignment (26,28–32) and assembly (33–39). Many PacBio specific tools and pipelines (including those for demultiplexing, creating CCS reads, long amplicon analyses, de novo assemblies (34) and epigenetic analyses) are available in PacBio's SMRT analysis suite (openly available, www.pacb.com/support/software-downloads/) via the command line or their SMRT Portal and SMRT Link graphical user interfaces.

CONSTITUTIONAL

Tandem repeat disorders

Tandem repeats cause more than 40 neurological, neurodegenerative or neuromuscular diseases when mutated (40). Unfortunately, sequencing those DNA elements is difficult with short-read platforms because the reads are too short to span most tandem repeats. The first tandem repeat studied by SMRT sequencing was the FMR1 CGG repeat (41). Healthy individuals carry around 30 CGG units, mostly interrupted by one or two AGG units. An expansion of the repeat to more than 200 units causes Fragile X Syndrome (FXS), which is one of the most frequent causes of inherited intellectual disability and autism. Loomis et al. (41) showed they could sequence through a long full mutation allele of 750 units, which equals 2 kb of 100% GC and repetitive content. Interestingly, expansions to full mutations only occur upon maternal transmission, whereby the risk directly correlates with increasing repeat size and fewer AGG interruptions (42). SMRT sequencing can be used to determine the repeat size and the number of interrupting AGG units (43). A main advantage of this approach is the unambiguous separation of the two CGG repeats on the different X chromosomes of females, thereby outperforming all other (PCR) approaches. The information generated by SMRT sequencing is then used clinically for improved genetic counselling of women weighing the risk of having a child with FXS (43–45). Another example of tackling a tandem repeat by SMRT sequencing is the ATTCT repeat embedded in intron 9 of the Spinocerebellar ataxia type 10 gene (SCA10) (10). For the first time the full length of an expanded ATTCT repeat was completely sequenced using SMRT technology. The repeat was reconstructed by assembly and both known and novel interruptions were detected (10). The presence of those interruptions influences the phenotype of SCA10 patients and hence knowing the exact repeat structure allows for better genotype-phenotype correlations. It will be interesting to use SMRT sequencing in the near future for other tandem repeats with interruptions, such as those in Myotonic Dystrophy (46) and Friedreich's Ataxia (47), to increase our knowledge of tandem repeat configuration and its influence on the stability of the repeat and the phenotype of an individual.
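Once a read spans the whole FMR1 repeat, sizing it and locating AGG interruptions is conceptually simple. The following is a minimal sketch that assumes the input has already been trimmed to the repeat tract, is error-free (for example a consensus sequence) and starts in frame. The function name and the returned fields are illustrative and do not correspond to any published pipeline.

```python
def summarize_cgg_repeat(repeat_seq: str) -> dict:
    """Split a repeat tract into triplets and report size and interruptions.

    Assumes the sequence is trimmed to the repeat and begins in frame.
    """
    seq = repeat_seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("repeat tract length must be a multiple of 3")
    units = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return {
        "total_units": len(units),
        "cgg_units": units.count("CGG"),
        # 1-based unit positions of AGG interruptions
        "agg_positions": [i + 1 for i, u in enumerate(units) if u == "AGG"],
        # any other triplet, e.g. from residual sequencing errors
        "other_units": [u for u in units if u not in ("CGG", "AGG")],
    }


if __name__ == "__main__":
    # A normal-sized allele with two AGG interruptions (made-up example).
    allele = "CGG" * 9 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 10
    print(summarize_cgg_repeat(allele))
```

For scale, a 750-unit full mutation corresponds to 750 × 3 = 2250 bp, the roughly 2 kb of pure GC-rich repeat mentioned above.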

Whereas all of the above applications use PCR, novel amplification-free enrichment methods are currently being developed. Methods using amplification are very error-prone, especially when amplifying (tandem) repeats (41), and remove all epigenetic marks (48). Thus using amplification impedes a complete genetic and epigenetic characterization of tandem repeats. Currently two methods are under development. The first method, presented by Pham et al. (48), is based on type IIS restriction enzyme digestions, customized hairpin adapters especially designed to anneal at the targeted digest overhangs, and a ‘capture-hook’ method. A second and more recent method (bioRxiv, https://doi.org/10.1101/203919) is based on restriction enzyme digestion followed by cleavage of SMRTbells containing the target of interest using the CRISPR/Cas9 system. By ligating a specific capture adapter at the CRISPR/Cas9 DNA cleavage sites, the SMRTbell molecules of interest can then be selectively pulled down by magnetic beads targeting the capture adapter. The high throughput of SMRT sequencing enables different targets (e.g. FMR1 CGG repeat, C9ORF72 GGGGCC repeat, HTT CAG repeat, SCA10 ATTCT repeat, etc.) from one DNA sample to be simultaneously enriched and sequenced in a single run (bioRxiv, https://doi.org/10.1101/203919).

Both methods have been used to target the FMR1 CGG repeat and showed for the first time the true biological CGG repeat variation in human cell lines (48) (bioRxiv, https://doi.org/10.1101/203919). Besides avoiding amplification biases, these methods permit native DNA capture and hence direct detection of epigenetic modifications. In the future, this technique can possibly be used diagnostically to screen for full mutations and assess the methylation status of the FMR1 CGG repeat, both of which influence the phenotype of FXS (49–51). Traditionally this would be determined by Southern blots, a labour-intensive and inaccurate method. Thus replacing Southern blots with faster and more direct SMRT sequencing will greatly enhance FMR1 and additional repeat disorder diagnostics (49–52). PacBio's enrichment technique has also been used to study patients with expanded SCA10 ATTCT repeats (53). Here, SMRT sequencing revealed a complete absence of interruptions, which could be linked to the parkinsonism phenotype of the patient.

Polymorphic regions

Genotyping the human leukocyte antigen (HLA) region, or the human major histocompatibility complex (MHC), is crucial for diagnosing autoimmune disorders and selection of donors in organ and stem cell transplantation. Genes in the region can be highly polymorphic, HLA-B being the most variable with >2000 alleles already annotated in 2012 (54). The high variability in sequence makes this region exceptionally difficult to map with short reads (54). HLA can be divided into three molecule classes and regions, termed class I, II and III, though the first two are primarily studied. Amplicons of ∼400–900 bp have been used with 454 sequencing to target specific exons of class I genes (55,56). However, considering these genes are ∼3 kb in length, entire alleles, as opposed to exons, can be sequenced in a single PacBio read. Class II genes can exceed 10 kb, making them more difficult, but still possible. Full length class I HLA alleles have been targeted in humans with hybrid PacBio-Illumina approaches (57) and PacBio only approaches (58,59). Many large HLA typing labs, such as the Anthony Nolan Research Institute (58,59), are utilizing or developing SMRT sequencing pipelines of their own or using commercial kits, such as those offered by GenDx (Utrecht, The Netherlands), to now target class I, as well as many class II genes. This is rapidly expanding the number of known HLA alleles (57) and is becoming a gold standard for organ transplant genotyping and blood stem cell transplantation.

Similarly complex regions can also be analyzed with these approaches. The killer cell immunoglobulin-like receptor (KIR) region, whose genes encode proteins with domains that recognize HLA proteins, was recently analyzed with SMRT sequencing and for the first time multiple haplotypes were phased without imputation (60).

Pseudogene discrimination

The high sequence similarity between pseudogenes and their homologous functional genes makes distinguishing variation between the two extremely difficult when using short read technologies. In general, long reads spanning the actual gene regions can be used to anchor to unique regions and/or phase variants to discriminate between the pseudogene and the actual gene. For diagnostics it is common to target a specific locus or set of loci of interest as a cost-effective way to overcome the limited throughput of current generation SMRT sequencing platforms. The easiest option to enrich for specific loci is amplifying the targets by doing a (multiplex) long-range PCR (up to 10 kb). To differentiate samples, barcodes can be added directly during PCR via primers (61,62), by a nested PCR approach (57,61,63,64), or by ligating hairpin adapters containing barcodes during library preparation (Pacific Biosciences Product Note: www.pacb.com/wp-content/uploads/2015/09/ProductNote-Barcoded-Adapters-Barcoded-Universal-Primers.pdf). Therefore, for multiplexed long-amplicon tests only a single library preparation is needed after pooling the barcoded amplicons, as opposed to fragmentation and multiple barcoded library preparations for short-read platforms. This enables fast, cheap library preparations that can be sequenced in just a few hours, permitting the next step in complex gene loci diagnoses.
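The idea of separating pooled, barcoded amplicons back into samples can be sketched as follows. This is a toy demultiplexer, not PacBio's implementation: it assumes each read starts with its sample barcode and requires an exact match, which would only be reasonable for accurate consensus reads. The sample names and barcode sequences are made up for illustration.

```python
def demultiplex(reads, barcodes):
    """Assign reads to samples by an exact leading-barcode match.

    reads:    iterable of read sequences (strings)
    barcodes: dict mapping sample name -> barcode sequence (hypothetical)
    Returns a dict of sample name -> list of reads with the barcode trimmed,
    plus an "unassigned" bin for reads matching no barcode.
    """
    bins = {sample: [] for sample in barcodes}
    bins["unassigned"] = []
    for read in reads:
        for sample, barcode in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:  # no barcode matched this read
            bins["unassigned"].append(read)
    return bins


if __name__ == "__main__":
    barcodes = {"sample_1": "ACGTAC", "sample_2": "TGCATG"}
    reads = ["ACGTACGGGTTT", "TGCATGCCCAAA", "GGGGGGAAAAAA"]
    print(demultiplex(reads, barcodes))
```

Production tools additionally tolerate sequencing errors in the barcode and typically check barcodes at both ends of the molecule. Both are omitted here for brevity.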

One application is using barcoded 6–8 kb amplicons, and potentially nested amplicons, to target the drug metabolism gene CYP2D6 (61,63). This gene has homologous pseudogenes and copy number variants which impair reliable genotyping with short-read platforms (61,63). After SMRT sequencing, reads can be aligned and variants called using alignment-based or ‘Long Amplicon Analysis’ (LAA, included in SMRT analysis) based pipelines. LAA is particularly powerful in that it enables reference-free analyses and phasing of the two alleles (61). The pipeline first demultiplexes reads (if needed), then looks for overlap, performs clustering (i.e. determines different amplicons), phases the clustered reads (i.e. determines different alleles), and determines consensus sequences with Quiver (34). LAA may require optimization of parameters such as the number of reads used for clustering. Too many can result in false alleles and long run times, whereas too few may result in allelic dropouts. Once assembled, alleles can be compared to each other or to a reference genome for annotation. Overall, SMRT sequencing permits expanding from targeting specific CYP2D6 variants/exons to identification of phased variants across the entire locus, including up/downstream regions and all introns, which will enhance identification of metabolizer phenotypes in tested individuals and enhance personalized medicine (61). Similar applications of long-range PCR with PacBio sequencing have been used to genotype and discriminate other genes from pseudogenes (Table 2), including PKD1 for diagnosing autosomal-dominant polycystic kidney disease (64) and IKBKG for diagnosing primary immunodeficiency diseases in patients suffering from life-threatening invasive pyogenic bacterial infections (65).

Table 2. Applications of human SMRT sequencing and clinical utility

Target	Disease	Ref.
Tandem repeat sequencing
FMR1	Fragile X Syndrome	(43), a
HTT	Huntington's Disease	a
C9orf72	Amyotrophic Lateral Sclerosis (ALS)	a
SCA10	Spinocerebellar ataxia type 10, Parkinson's disease	(10,53), a
Highly polymorphic regions
HLA	Autoimmune disorders & transplantation	(57–59)
KIR	Autoimmune diseases & transplantation	(60)
Pseudogene discrimination
CYP2D6	Drug metabolism	(61,63)
PKD1	Autosomal-dominant polycystic kidney disease	(64)
IKBKG	Primary immunodeficiency diseases	(65)
Cancer
BCR-ABL1	Chronic Myeloid Leukemia (CML)	(69)
TP53	Myelodysplastic Syndromes (MDS) and Acute Myeloblastic Leukemia (AML)	(70)
Reproductive genomics
TCOF1	Treacher Collins syndrome	(67)
PTPN11	Noonan syndrome	(67)

a: bioRxiv, https://doi.org/10.1101/203919.

REPRODUCTIVE GENOMICS

Reproductive genomic medicine and associated counseling, including pre-implantation genetic diagnosis (PGD), relies heavily on the ability to haplotype or phase alleles in embryos, patients, and parents. Long reads enable direct phasing of amplicons from targeted loci, which can be used to determine parent-of-origin alleles in embryos or patients (66,67). In a family having one child with Treacher Collins syndrome, SMRT amplicon sequencing was used to confirm the paternal transmission of a TCOF1 variant that affects splicing of the gene and potentially causes the disease (67). For apparent de novo mutations that are a result of germ line mosaicism, determining the frequency of damaging alleles is informative in predicting recurrence in future offspring. For a couple with multiple miscarriages and suspected Noonan syndrome in the fetuses, SMRT amplicon sequencing identified a disease-causing PTPN11 variant in 37% of the father's sperm (67). Digital droplet PCR showed no signs of the variant in the father's blood, but confirmed the variant at a frequency of 40% in the father's sperm (67). This enabled an estimate of the recurrence risk for subsequent pregnancies. Whole-genome single-cell haplotyping based on arrays is already being used in practice for embryo selection before implantation, though phasing still requires additional family members (68). We envision a profound impact on future PGD applications by incorporating long-read whole-genome sequencing for direct phasing to eliminate the need for analyzing additional family members.

CANCER

During treatment of cancer patients, it is crucial to monitor low frequency mutations that can lead to a proliferative advantage of malignant cells. Chronic myeloid leukemia (CML) is a blood cancer that is caused by a translocation between chromosomes 9 and 22, giving rise to the BCR-ABL1 fusion protein. CML patients are normally treated with tyrosine kinase inhibitors (TKIs) to suppress BCR-ABL1, but the therapy can induce point mutations leading to drug resistance. It is therefore important to screen the BCR-ABL1 gene in CML patients responding poorly to TKI treatment and study the mutational landscape. In a study by Cavelier et al. (69), a ∼1.5 kb amplicon was constructed from BCR-ABL1 cDNA. SMRT sequencing allowed for detection of TKI resistance mutations down to a level of 1%, a significantly lower detection threshold than the 15–20% reached by Sanger sequencing. Moreover, it was possible to phase co-existing mutations, thereby giving new information about the clonal distribution of resistance mutations in BCR-ABL1, and also to identify a number of distinct splice isoforms. Apart from BCR-ABL1, a number of other cancer genes are suitable targets for clinical SMRT sequencing (Table 2). In a study of loss-of-function mutations in the tumor suppressor TP53, SMRT sequencing revealed that tumors from acute myeloblastic leukemia (AML) and myelodysplastic syndrome (MDS) patients harbor multiple TP53 mutations distributed in different alleles (70). In the future, detailed information about the subclonal heterogeneity of TP53 could be used to guide the treatment of these patients. Minor variants can also be detected in other types of somatic variation, unrelated to cancer. Gudmundsson et al. (71) used SMRT sequencing to obtain phasing information of somatic mosaicism mutations in GJB2 that led to the repair of skin lesions in a patient with keratitis-ichthyosis-deafness syndrome.
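The difference between measuring how often a mutation occurs and knowing which mutations sit together on the same molecule can be made concrete with a short sketch. It assumes each full-length read has already been reduced to the set of mutations it carries. The mutation labels are placeholders, and the code is an illustration rather than the analysis used in the cited studies.

```python
from collections import Counter


def mutation_frequencies(reads):
    """Fraction of reads carrying each individual mutation."""
    total = len(reads)
    if total == 0:
        return {}
    counts = Counter(m for read in reads for m in set(read))
    return {m: c / total for m, c in counts.items()}


def clone_frequencies(reads):
    """Fraction of reads carrying each exact combination of mutations.

    Because every read spans the whole amplicon, co-occurring mutations are
    phased directly: each distinct combination is a candidate clone.
    """
    total = len(reads)
    if total == 0:
        return {}
    counts = Counter(frozenset(read) for read in reads)
    return {combo: c / total for combo, c in counts.items()}


if __name__ == "__main__":
    # 100 made-up reads: 2 carry both mutations, 3 carry only mutA, 95 none.
    reads = [{"mutA", "mutB"}] * 2 + [{"mutA"}] * 3 + [set()] * 95
    print(mutation_frequencies(reads))
    print(clone_frequencies(reads))
```

In this example the per-mutation view reports mutA at 5% and mutB at 2%, while the phased view shows that mutB only ever occurs together with mutA. That is the kind of clonal information unphased short reads cannot provide for distant sites.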

Whole genome and transcriptome sequencing (addressed in later sections) is at the moment only affordable for research, but in the near future will become a diagnostic option. Already whole genome and transcriptome SMRT sequencing has been applied to breast cancer cell models identifying novel gene fusion events with the known oncogene Her2 (Case Study: www.pacb.com/wp-content/uploads/Case-Study-Scientists-deconstruct-cancer-complexity-through-genome-and-transcriptome-analysis.pdf). Whole transcriptome sequencing of prostate cell models has also identified novel RLN1 and RLN2 gene fusions in prostate cancer (72). Importantly, SMRT sequencing can give a more precise view of the cancer gene structure, as was demonstrated in a study by Kohli et al. where a cryptic exon was detected in AR-V9 that was previously thought to be present only in AR-V7 (73). AR-V7 has been studied as a potential biomarker for drug resistance in prostate cancer, based on knockdown experiments that have in fact targeted both isoforms. Thus, AR-V9 may actually be a predictive biomarker for resistance.

Global changes in epigenetics are also a hallmark of cancer. Single molecule real-time bisulfite sequencing (SMRT-BS) enables quantitative and highly multiplexed detection of methylation in 1.5–2 kb amplicons (74,75). This is an improvement over previous technologies that could only target typical bisulfite PCR sizes (∼300–500 bp) and potentially enables ∼91% of CpG islands in the human genome to be evaluated (75). To date this has been applied to multiple cancer cell lines, including those from an acute myeloid leukemia, chronic myeloid leukemia, anaplastic large cell lymphoma, plasma cell leukemia, Burkitt lymphoma, B-cell lymphoma and multiple myelomas (75). Expanding to genome-wide diagnostics, when whole genome SMRT sequencing is performed on non-amplified material it is theoretically possible to determine epigenetic status across all nucleotides based on IPD ratios. Therefore, we envision that in the near future cancer genomes, transcriptomes and epigenomes will commonly be characterized at previously unparalleled resolution.
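Quantifying methylation from bisulfite amplicon reads rests on a general property of bisulfite treatment: unmethylated cytosines are read as T after conversion and PCR, while methylated cytosines remain C. The sketch below is a simplified illustration under the assumption that all reads are gap-free, on the same strand as the reference, and the same length as the unconverted reference amplicon. It is not the SMRT-BS pipeline itself, and the function name is hypothetical.

```python
def methylation_levels(reference: str, reads) -> dict:
    """Fraction of reads retaining C at each CpG cytosine of the reference.

    reference: unconverted amplicon sequence
    reads:     bisulfite-converted reads aligned gap-free to the reference
    Returns {0-based CpG cytosine position: methylated fraction}.
    """
    ref = reference.upper()
    cpg_positions = [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]
    levels = {}
    for pos in cpg_positions:
        bases = [read[pos].upper() for read in reads]
        methylated = bases.count("C")    # protected from conversion
        unmethylated = bases.count("T")  # converted
        if methylated + unmethylated:
            levels[pos] = methylated / (methylated + unmethylated)
    return levels


if __name__ == "__main__":
    reference = "ACGTTCGA"  # CpG cytosines at positions 1 and 5
    reads = ["ACGTTTGA", "ACGTTTGA", "ACGTTCGA", "ATGTTTGA"]
    print(methylation_levels(reference, reads))
```

Longer amplicons simply mean more CpG positions per read, and because every read covers all of them, the methylation pattern along each individual molecule is also preserved.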

VIRAL AND MICROBIAL MEDICAL SEQUENCING

In infectious disease, SMRT sequencing has been used to analyse influenza viruses (76), hepatitis B viruses (HBV) (77), hepatitis C viruses (HCV) (77,78) and human immunodeficiency viruses (HIV) (79,80) (Table 3). HCV and HIV are RNA viruses with genomes of approximately 9 kb, while HBV is a circular DNA virus with a genome of approximately 3 kb. These viruses are suitable subjects for SMRT sequencing, since the entire virus genome can easily be contained in a single read. For example, Bull et al. (77) developed an assay where the resulting reads covered nearly the entire sequence for all six major HCV genotypes. In addition to determining the genome sequence of the infecting viruses, it is also possible to monitor mutations that develop as a result of drug treatment. For HCV, resistance associated variants (RAVs) in the NS5A gene occurring at a frequency of <0.5% were successfully identified in samples from patients undergoing treatment with direct acting antiviral drugs (DAAs) (78). By full-length sequencing of the HIV-1 provirus, a 9700 bp molecule that encodes nine major proteins via alternative splicing, Ocwieja et al. (80) detected at least 109 different spliced RNAs, including two that encode new proteins. The fact that this relatively small study could generate a lot of novel information about HIV-1, a virus that has already been studied in great detail, demonstrates the advantage of full-length RNA sequencing for studying the distribution of splicing isoforms in specific genes. Results from these types of experiments could possibly open up novel therapeutic opportunities in infectious disease.

Table 3. Medically relevant microbial SMRT sequencing

Target/disease	Ref.
Hepatitis B/C virus	(77,78)
HIV	(79,80)
Influenza viruses	(76)
Tuberculosis bacteria	(85)
E. coli / Hemolytic–Uremic Syndrome	(86)
Salmonella enterica subsp. enterica serovar/gastroenteritis	(87)
Leishmania	(88)
Leptospira interrogans/leptospirosis	(90)
Helicobacter pylori strains/gastrointestinal diseases	(91)


For bacteria, a single SMRT Cell often provides enough data to de novo assemble Escherichia coli-sized genomes into single contigs. HGAP is the most widely used assembler. It works by taking a selection of the longest reads and error correcting them with all reads, followed by Celera assembly (81,82), and finally polishing with all reads aligned to the final assembly (34). These long reads and new algorithms enable PacBio assemblies to be more complete and accurate than those from second-generation sequencing methods (83,84). Clinically relevant bacterial assemblies include a strain of the tuberculosis bacterium Mycobacterium tuberculosis (85), the E. coli strain that caused a Hemolytic–Uremic Syndrome outbreak in Germany in 2011 (86), and strains of Salmonella enterica subsp. enterica serovar that cause gastroenteritis in humans (87) (Table 3). PacBio sequencing and HGAP have also been used to assemble pathogenic single-cell eukaryote genomes that are more complex than a single chromosome, such as a new reference genome for Leishmania (88), a protozoan parasite that kills >30 000 people each year.
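The first step described above, taking a selection of the longest reads, can be sketched in Python as follows. This is only an illustration of the selection idea under stated assumptions: the function name and the `target_coverage` parameter are hypothetical and are not HGAP's actual interface, and the error correction, assembly and polishing steps are not shown.

```python
def select_seed_reads(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until they sum to target_coverage x genome_size."""
    needed = genome_size * target_coverage
    seeds = []
    total = 0
    # Walk through reads from longest to shortest, stopping once enough bases are collected.
    for length in sorted(read_lengths, reverse=True):
        if total >= needed:
            break
        seeds.append(length)
        total += length
    return seeds


# Toy read lengths (hypothetical) for a 10 kb "genome" and 3x seed coverage.
lengths = [12000, 3000, 9000, 15000, 4000, 7000]
seeds = select_seed_reads(lengths, genome_size=10000, target_coverage=3)
print(seeds)
```

The remaining shorter reads are not discarded in the workflow described above; they are the ones used to error correct the selected long reads.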

Though long reads permit superb microbial assemblies, what truly differentiates SMRT sequencing from second-generation machines is the ability to directly determine the epigenetics of these organisms. DNA methylation is ubiquitous in bacterial genomes (89), which simplifies SMRT analysis of epigenetic characteristics in these organisms. Analyses can be performed using IPD ratios of cases versus controls, or versus an in silico control, and compared to known epigenetic signatures for 6-mA, 4-mC and (Tet-converted) 5-mC (available in SMRT Analysis). This has been used to discriminate virulent from avirulent Leptospira interrogans, a cause of leptospirosis in humans (90). The genome sequences show no major differences between strains, but higher levels of methylation are found in the avirulent strain (90). Methylation analysis has also been used to identify virulence factor genotype-dependent motifs in eight different H. pylori strains, a bacterium that can lead to gastrointestinal diseases (91). The simplicity with which one can sequence, assemble, and call nucleotide, structural and epigenetic variation for a complete genome from a single SMRT Cell makes SMRT sequencing a truly revolutionary technology in microbiology.
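The case-versus-control comparison of IPD (interpulse duration) values can be sketched in Python as below. This is a simplified illustration: the IPD values are invented, the fixed threshold of 2.0 is an arbitrary assumption, and the function names are hypothetical. Production tools such as SMRT Analysis rely on statistical models and known modification signatures that are not reproduced here.

```python
from statistics import mean


def ipd_ratios(case_ipds, control_ipds):
    """Per-position ratio of the mean case IPD to the mean control IPD."""
    ratios = {}
    for pos, case_values in case_ipds.items():
        control_values = control_ipds.get(pos)
        # Skip positions lacking observations in either sample.
        if not case_values or not control_values:
            continue
        ratios[pos] = mean(case_values) / mean(control_values)
    return ratios


def flag_positions(ratios, threshold):
    """Return sorted positions whose IPD ratio meets or exceeds the threshold."""
    return sorted(pos for pos, ratio in ratios.items() if ratio >= threshold)


# Invented IPD values (seconds) keyed by genome position.
native = {101: [0.9, 1.1, 1.0], 102: [3.8, 4.2, 4.0], 103: [1.2, 0.8]}
control = {101: [1.0, 1.0, 1.0], 102: [1.0, 1.0, 1.0], 103: [1.0, 1.0]}

ratios = ipd_ratios(native, control)
flagged = flag_positions(ratios, threshold=2.0)
print(flagged)
```

A position where the polymerase consistently pauses longer on native DNA than on the control yields a ratio well above 1, which is the signal interpreted as a candidate base modification.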

FUTURE: WHOLE TRANSCRIPTOME AND GENOME SEQUENCING

Traditionally, RNA is converted to cDNA and then fragmented for short read sequencing (RNA-seq). Assembling the host of exons detected from RNA-seq into individual transcripts is extremely difficult and error prone. SMRT sequencing eliminates the need for fragmentation, instead sequencing cDNAs from the 5′ end of transcripts to the poly-A tail, an approach termed Iso-Seq. This is an ideal method for complete cDNA sequencing (92). Iso-Seq has been used to sequence full transcriptomes from the blood of a normal Chinese adult male (93), a pool of 20 RNAs from different normal human tissues and organs (92), and a trio of lymphoblastoid transcriptomes (94), and to analyse prostate and breast cancer cell models (73) (Case Study: www.pacb.com/wp-content/uploads/Case-Study-Scientists-deconstruct-cancer-complexity-through-genome-and-transcriptome-analysis.pdf). In contrast to complex short-read alignment and re-assembly, these papers demonstrate that long reads can easily detect splicing isoforms in human genes. Besides detecting a vast number of known isoforms, this method has also identified novel splicing forms and genes that had not previously been detected by short-read sequencing (93). Similar to genomic variant phasing, for gene loci with transcribed single nucleotide variants, these variants can be used to determine precisely which allele each isoform is expressed from (94). Though Iso-Seq is exceptional for transcript structure determination, its lower throughput compared to second-generation platforms currently limits its use for expression analysis. However, as costs drop and throughput increases, unbiased PacBio expression and isoform detection will likely become routine in the near future.
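Because each full-length read spans both the splice junctions and any transcribed single nucleotide variant, assigning isoforms to alleles reduces to a tally per read. The Python sketch below assumes each read has already been reduced to a pair of labels (isoform name, base observed at the variant); the labels and counts are invented for illustration and the function name is hypothetical.

```python
from collections import Counter


def allele_isoform_counts(read_records):
    """Count full-length reads per (allele, isoform) combination."""
    counts = Counter()
    for isoform, allele in read_records:
        counts[(allele, isoform)] += 1
    return counts


# Invented records: (isoform label, base observed at the transcribed SNV).
records = (
    [("isoform_1", "A")] * 30
    + [("isoform_1", "G")] * 28
    + [("isoform_2", "G")] * 15
)
counts = allele_isoform_counts(records)

# Isoforms observed on one allele only are candidates for allele-specific expression.
alleles_per_isoform = {}
for (allele, isoform), n in counts.items():
    alleles_per_isoform.setdefault(isoform, set()).add(allele)
allele_specific = sorted(i for i, a in alleles_per_isoform.items() if len(a) == 1)
print(allele_specific)
```

With short reads the variant and the distinguishing splice junction often fall on different fragments, so this direct tally is not possible without additional inference.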

Whole genome sequencing (WGS) has become a widely used method to study variation in the human genome, and several hundred thousand human genomes have been sequenced with short reads during the last few years. However, the nature of these reads permits only relatively small assemblies, and alignments provide only limited information on variation other than SNPs and small insertions/deletions. SMRT sequencing is greatly expanding the utility of WGS, permitting markedly greater assembly completeness (93,95) (BioRxiv: https://doi.org/10.1101/067447), even nearing reference genome contig sizes and including diploid-aware assemblies by applying algorithms like FALCON-unzip (37). These PacBio WGS studies also demonstrate a vast repertoire of variation missed by short read WGS. Low coverage (4–8×) sequencing was recently used to characterize structural variation in chromothripsis-like chromosomes (96) and to identify a pathogenic heterozygous 2184 bp deletion, which could not be identified by short-read sequencing, in a patient who presented with Carney complex (97). Higher coverage sequencing (∼60×) of two haploid genomes has also been used to identify a vast array of structural variations (461 553, from 2 bp to 28 kb in length), >89% of which were missed in the analysis of data from the 1000 Genomes Project (98). From this study, Huddleston et al. (98) estimate a 5× increase in the discovery of indels >7 bp and additional SVs <1 kb, which in total base pairs represent a majority of the difference between genomes. An additional remarkable finding from individual human de novo assemblies is that there appear to exist several megabases of novel sequence, i.e. sequences that are absent from the current (GRCh38) version of the human reference. For example, Shi et al. (93) reported 12.8 Mb of novel sequence in their de novo assembled individual genome, which would correspond to over 0.4% of the entire human genome of size ∼3 Gb. At this point, it is not known whether this novel sequence is common to all human individuals (and thereby missing from GRCh38) or whether it mainly represents sequence variation found only in some specific individuals or population groups. Overall, these WGS studies demonstrate that long-read sequencing can identify a substantial amount of variation missed by short read platforms, including variation relevant to clinical diagnoses.
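The proportion quoted for the novel sequence reported by Shi et al. (93) can be checked with a short calculation, using the 12.8 Mb and ∼3 Gb figures given in the text. The variable names are illustrative only.

```python
# Figures taken from the text: 12.8 Mb of novel sequence, genome size of about 3 Gb.
novel_bp = 12.8e6
genome_bp = 3e9

fraction = novel_bp / genome_bp
percent = fraction * 100

# Roughly 0.43%, consistent with "over 0.4%" of the genome.
print(f"{percent:.2f}%")
```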

CONCLUSIONS

The myth that SMRT sequencing is too error prone to be diagnostically useful is being dispelled and replaced by evidence that it offers advantages over short-read sequencers. SMRT sequencing is opening up new diagnostic avenues, such as the ability to determine tandem repeat lengths, interruptions, and even epigenetics in a single test at base pair resolution. Long read sequencing is already considered the gold standard for some applications, such as HLA genotyping for tissue transplants. While large scale implementation appears to be hampered by cost and limited community expertise, this is likely to change rapidly. In addition to systematic price reductions and a growing customer base, new single molecule technologies such as nanopore-based systems are likely to propel the field. Just as second-generation platforms stepped beyond Sanger sequencing and enabled a revolution in genomic medicine, third-generation single molecule sequencing platforms will likely drive the next revolution in genetic diagnostics.

