  • Interpreting PacBio Sequencer Output Data

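    Today someone asked me how to make sense of the raw output from a third-generation sequencer. I solved their problem, but I still feel I haven't fully understood the data myself.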

    The basic directory structure:

    |-- HG002new_O1l_BP_P6_021315b_MB_100pM
    |   |-- D01_1.c60e446d-f276-41fc-9384-ffa937e22683.tar.gz
    |   |-- D01_2.19ee4f13-c420-4974-8262-cb1da56beccd.tar.gz
    |   |-- D01_3.94e34f0a-eef3-4b71-8f1b-c9790dec647e.tar.gz
    |   |-- D01_4.53ef7aed-e91e-46f9-bb71-8b021344b951.tar.gz
    |   |-- D01_5.55b1f7cb-ad44-4afb-bf2b-5c34fcb0a210.tar.gz
    |   `-- D01_6.b9b564dc-b794-4a7f-bc3b-854a7bc32887.tar.gz
    `-- pacbio_README

    The directory structure after extracting an archive:

    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.bax.h5
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.log
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.subreads.fasta
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.subreads.fastq
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.bax.h5
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.log
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.subreads.fasta
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.subreads.fastq
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.bax.h5
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.log
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.subreads.fasta
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.subreads.fastq
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.bas.h5
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.sts.csv
    Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.sts.xml
    m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.xfer.xml
    m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.xfer.xml
    m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.xfer.xml
    m140612_020550_42156_c100652082550000001823118110071460_s1_p0.mcd.h5
    m140612_020550_42156_c100652082550000001823118110071460_s1_p0.metadata.xml

    As you can see, the data is stored in HDF5 format. For an introduction to the format, see: "The HDF5 Format of PacBio Sequences".
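A quick way to confirm that the `.h5` files really are HDF5 is to look at their first bytes. The sketch below uses only the Python standard library and checks for the 8-byte HDF5 signature at offset 0. The function names are illustrative, and files that carry a user block before the signature are deliberately not handled.

```python
from pathlib import Path

# The 8-byte signature that opens an HDF5 file (HDF5 file format specification).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` starts with the HDF5 signature.

    Only offset 0 is checked; a file with a user block in front of the
    signature would be reported as False by this sketch.
    """
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


def find_hdf5_files(root):
    """Map every *.h5 file under `root` to the result of the signature check."""
    return {str(p): looks_like_hdf5(p) for p in sorted(Path(root).rglob("*.h5"))}
```

Calling `find_hdf5_files(".")` in the extracted directory should report the `bax.h5`, `bas.h5`, and `mcd.h5` files; a `False` value would suggest a truncated or corrupted transfer.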

    So what do the directory and file names above mean? A careful read of the README (pacbio_README) answers that:

    ============
    Introduction
    ============
    These directories contain data from PacBio sequencing of HG002, HG003, and HG004,
    which are the son, father, and mother, respectively, in a trio of Ashkenazim Jewish
    ancestry from the Personal Genome Project and are candidate Reference Materials being
    characterized by NIST and the Genome in a Bottle Consortium. The coverage is approximately
    69X, 32X, and 30X for HG002, HG003, and HG004, respectively. 89.7% of the data is
    from P6-C4 chemistry, and the remaining from P5-C3 chemistry. Library preparation was
    performed at NIST and sequencing was performed at Mt. Sinai School of Medicine. 
    Details of the library preparation, sequencing, and data are provided below.


    ==============================================
    Library Preparation (include reagent versions)
    ==============================================
    SMRTbell library preparation and sequencing of HG002, HG003, and HG004 AJ Trio gDNA

    DNA library preparation and sequencing were performed according to the manufacturer's
    instructions with noted modifications. Following the Pacific Biosciences Protocol,
    "20-kb Template Preparation Using Blue Pippin Size-Selection System", library
    preparation was performed using the Pacific Biosciences SMRTbell Template Prep Kit 1.0
    (PN # 100-259-100).  In short, 10 µg of extracted, high-quality, genomic DNA from each
    of HG-002, HG-003, and HG-004, the AJ trio, was used for library preparation. Genomic
    DNA extracts were verified with the Life Technologies Qubit 2.0 Fluorometer using the
    High Sensitivity dsDNA assay (PN# Q32851) to quantify the mass of double-stranded DNA
    present.  After quantification, each sample was diluted to 150 µL, using kit-provided
    EB, yielding a concentration of ~66 ng/µL.  The 150 µL aliquots were individually
    pipetted into the top chambers of Covaris G-tube (PN# 520079) spin columns and sheared
    for 60 seconds at 4500 rpm using an Eppendorf 5424 benchtop centrifuge. Once complete,
    the spin columns were flipped after verifying that all DNA was now in the lower chamber.
    The columns were spun for another 60 seconds at 4500 rpm to further shear the DNA and
    place the aliquot back into the upper chamber. In some cases G-tubes were centrifuged
    2-3 times, in both directions to ensure all volume had passed into the appropriate
    chamber.  Shearing resulted in ~20,000 bp DNA fragments, verified using an Agilent
    Bioanalyzer DNA 12000 gel chip (PN# 5067-1508).  The sheared DNA isolates were then
    purified using a 0.5X AMPure PB magnetic bead purification step (0.5X AMPure PB beads
    added, by volume, to each DNA sample, vortexed for 10 minutes at 2,000 rpm, followed
    by two washes with 70% alcohol and finally eluted in EB). This AMPure purification
    step ensures removal of small fragments and/or biological contaminants.  The sheared
    DNA concentration was then measured using the Qubit High Sensitivity dsDNA assay. 
    These values were used to calculate actual input mass for library preparation following
    shearing and purification.
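The quoted concentration can be checked with one line of arithmetic. This is a minimal sketch; the function name is illustrative, and it assumes the full 10 µg ends up in the 150 µL volume.

```python
def concentration_ng_per_ul(mass_ug, volume_ul):
    """Concentration in ng/µL for `mass_ug` micrograms in `volume_ul` microlitres."""
    return mass_ug * 1000.0 / volume_ul


# 10 µg diluted to 150 µL, as described in the protocol text.
print(round(concentration_ng_per_ul(10, 150), 1))  # 66.7
```

The result, about 66.7 ng/µL, is in line with the "~66 ng/µL" stated above.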
     
    After purification, ~8-9 µg of each purified sheared sample went through the following
    library preparation process per the aforementioned protocol:
    ------------------------------------
             ExoVII Treatment
           (remove ssDNA ends)
                    |
        DNA Fragment Damage Repair
                    |
         DNA Fragment End Repair
                    |
      Purify Blunt-Ended DNA Fragments
                    |
    Blunt End SMRTbell Adapter Ligation
                    |
          Exonuclease Treatment
      (remove failed ligation product)
                    |
       Size Selection using BluePippin
                    |
    Clean and Concentrate Final Library
    -------------------------------------
       
    All library preparation reaction volumes were scaled to accommodate input mass for a
    given sample.  Library size selection was performed using the Sage Science BluePippin
    0.75% Agarose, Dye Free, PacBio ~20kb templates, S1 cassette (PN# PAC20KB).  Size
    selections were run overnight to maximize recovered mass. Approximately 2-5 µg of
    prepared libraries were size selected using a 10 kb start and 50 kb end in "Range" mode. 
    This selection is necessary to narrow the library distribution and maximize the SMRTbell
    sub-read length for the best de novo assembly possible.  Without selection, smaller
    2000 - 10,000 bp molecules dominate the zero-mode waveguide loading distribution,
    decreasing the sub-read length.  Size selection was confirmed by running pre- and
    post-selection DNA on an Agilent DNA 12000 chip. Final library mass was measured using the
    Qubit High Sensitivity dsDNA Assay. Approximately 15-20% of the initial gDNA input mass
    remained after elution from the agarose cassette, which was enough yield to proceed to
    primer annealing and DNA sequencing on the PacBio RSII instrument.  This entire library
    preparation and selection strategy was conducted 7, 2 and 2 times across HG002, HG003,
    and HG004 respectively, to provide enough library for the duration of this project.

    ==================================================
    Sequencing (include chemistry/instrument versions)
    ==================================================
    Sequencing reflects the P6-C4 sequencing enzyme and chemistry, respectively. (Note that
    10.3% of the data was collected using the P5-C3 enzyme/chemistry prior to the release
    of the P6-C4 enzyme and chemistry.)  Primer was annealed to the size-selected SMRTbell
    with the full-length libraries (80ºC for 2 minutes 30 seconds, followed by decreasing the
    temperature by 0.1ºC/s to 25ºC). To prepare the polymerase-template complex, the SMRTbell template complex
    was then bound to the P6 enzyme using the Pacific Biosciences DNA Polymerase Binding Kit
    P6 v2 (PN# 100-372-700). A ratio of 10:1, polymerase to SMRTbell at 0.5 nM, was prepared
    and incubated for 4 hours at 30ºC and then held at 4ºC until ready for magbead loading
    prior to sequencing.  The Magnetic bead-loading step was conducted using the Pacific
    Biosciences MagBead Kit (PN# 100-133-600) at 4ºC for 60 minutes per the manufacturer's guidelines.
    The magbead-loaded, polymerase-bound, SMRTbell libraries were placed onto the RSII instrument
    at a sequencing concentration of 100 to 40 pM to optimize loading across various SMRTcells.
    Sequencing was performed using the C4 chemistry provided in the Pacific Biosciences DNA
    Sequence Bundle 4.0 (PN# 100-356-400).  The RSII was then configured for 240-minute
    continuous sequencing runs.

    ========================================
    Preliminary Analyses and Quality Control
    ========================================
    Assuming a 3.2 Gb human genome, sequencing was conducted to approximately 69X, 32X,
    and 30X coverage for HG002, HG003, and HG004 across 292, 139, and 132 SMRTcells,
    respectively.  27.4M, 13.2M, and 12.4M subreads were generated, resulting in 220.0,
    101.6, and 94.9 Gb of sequence data with subread length N50 values of 11,087, 10,728,
    and 10,629 basepairs.
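The coverage figures follow from dividing each sample's sequence yield by the assumed genome size. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and variable names are illustrative):

```python
GENOME_SIZE_GB = 3.2  # human genome size assumed in the README

# Sequence yield per sample in Gb, as reported above.
YIELD_GB = {"HG002": 220.0, "HG003": 101.6, "HG004": 94.9}


def coverage(yield_gb, genome_gb=GENOME_SIZE_GB):
    """Average fold coverage: total sequenced bases divided by genome size."""
    return yield_gb / genome_gb


for sample, gigabases in YIELD_GB.items():
    print(f"{sample}: {coverage(gigabases):.2f}X")
```

Rounded to whole numbers, the three values come out at 69X, 32X, and 30X, matching the stated coverage.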

    ================================
    File/Directory naming convention
    ================================
    The file/directory naming convention is defined as follows:

    [SampleName]/[WellName]_[CollectionNumber].[UUID].tar.gz
    Note that SampleName may contain other genomes in the name, but the data directories
    only contain data from HG002, HG003, and HG004.

    For example, for SampleName of HG002new_O1_BP_P6_021815_MB_105pM,
    WellName of A01, and CollectionNumber of 3, you will see a tar.gz file in
    HG002new_O1_BP_P6_021815_MB_105pM directory with name A01_3.[UUID].tar.gz

    The UUID is currently used only for hashing purposes.
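The naming convention can be turned into a small parser. The sketch below is one possible reading of it: the pattern for WellName (one letter followed by two digits) is an assumption based on the examples A01 and D01, and the names `ArchiveName` and `parse_archive_path` are illustrative.

```python
import re
from typing import NamedTuple


class ArchiveName(NamedTuple):
    sample_name: str
    well_name: str
    collection_number: int
    uuid: str


# [SampleName]/[WellName]_[CollectionNumber].[UUID].tar.gz
ARCHIVE_RE = re.compile(
    r"^(?P<sample>[^/]+)/"
    r"(?P<well>[A-Za-z]\d{2})_"          # assumed well format, e.g. A01 or D01
    r"(?P<collection>\d+)\."
    r"(?P<uuid>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
    r"\.tar\.gz$"
)


def parse_archive_path(path):
    """Split a relative archive path into its four named components."""
    match = ARCHIVE_RE.match(path)
    if match is None:
        raise ValueError(f"path does not follow the naming convention: {path!r}")
    return ArchiveName(
        sample_name=match["sample"],
        well_name=match["well"],
        collection_number=int(match["collection"]),
        uuid=match["uuid"],
    )
```

For the first archive in the directory listing, this yields the sample name `HG002new_O1l_BP_P6_021315b_MB_100pM`, well `D01`, and collection number 1.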

    The tar.gz file contains the raw SMRTPortal data, including the following contents:

    tar.gz
    |   [movie name].1.xfer.xml
    |   [movie name].2.xfer.xml
    |   [movie name].3.xfer.xml
    |   [movie name].mcd.h5
    |   [movie name].metadata.xml
    ---Analysis_Results
        |   [movie name].1.bax.h5
        |   [movie name].1.log
        |   [movie name].1.subreads.fasta
        |   [movie name].1.subreads.fastq
        |   [movie name].2.bax.h5
        |   [movie name].2.log
        |   [movie name].2.subreads.fasta
        |   [movie name].2.subreads.fastq
        |   [movie name].3.bax.h5
        |   [movie name].3.log
        |   [movie name].3.subreads.fasta
        |   [movie name].3.subreads.fastq
        |   [movie name].bas.h5
        |   [movie name].sts.csv
        |   [movie name].sts.xml
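Because every archive is expected to hold the same 20 files, a download can be checked against the listing above. The following sketch builds the expected member names for a given movie name and reports any that are absent from an archive. It assumes members are stored with paths relative to the archive root (optionally prefixed with `./`); the function names are illustrative.

```python
import tarfile

TOP_LEVEL_SUFFIXES = [
    ".1.xfer.xml", ".2.xfer.xml", ".3.xfer.xml", ".mcd.h5", ".metadata.xml",
]
PER_PART_EXTENSIONS = [".bax.h5", ".log", ".subreads.fasta", ".subreads.fastq"]
ANALYSIS_SUFFIXES = [
    f".{part}{ext}" for part in (1, 2, 3) for ext in PER_PART_EXTENSIONS
] + [".bas.h5", ".sts.csv", ".sts.xml"]


def expected_members(movie_name):
    """Return the set of member paths the README lists for one movie."""
    top = {movie_name + suffix for suffix in TOP_LEVEL_SUFFIXES}
    analysis = {
        "Analysis_Results/" + movie_name + suffix for suffix in ANALYSIS_SUFFIXES
    }
    return top | analysis


def missing_members(tar_path, movie_name):
    """List expected files that are not present in the tar.gz archive."""
    with tarfile.open(tar_path, "r:gz") as archive:
        names = archive.getnames()
    # Normalise an optional leading "./" so both layouts compare equal.
    present = {n[2:] if n.startswith("./") else n for n in names}
    return sorted(expected_members(movie_name) - present)
```

An empty list from `missing_members` means the archive matches the documented layout.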


    The metadata.xml contains all the metadata of this particular sample in the xml format;
    for example, in the TemplatePrep field you might see "DNA Template Prep Kit 2.0 (3Kb - 10Kb),"
    and in the BindingKit field you might see "DNA/Polymerase Binding Kit P6," etc.
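To pull those fields out of metadata.xml without opening the file by hand, something like the following could be used. It is a sketch under assumptions: the README names the TemplatePrep and BindingKit fields but not the layout beneath them or any XML namespace, so the code ignores namespaces and simply collects whatever text it finds inside the first element with a matching name.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def field_text(root, field):
    """Return the text inside the first element named `field`, or None.

    Text from nested child elements is joined with '; ' because the exact
    structure below the field is not documented in the README.
    """
    for element in root.iter():
        if local_name(element.tag) == field:
            parts = [t.strip() for t in element.itertext() if t.strip()]
            return "; ".join(parts)
    return None


def summarize_metadata(xml_path, fields=("TemplatePrep", "BindingKit")):
    """Read a metadata.xml file and return {field name: text or None}."""
    root = ET.parse(xml_path).getroot()
    return {field: field_text(root, field) for field in fields}
```

Running `summarize_metadata` on each extracted `[movie name].metadata.xml` gives a quick overview of which kits were used for each SMRTcell.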

    For information about bas.h5/bax.h5 files, please see:
    http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf

    For information about subreads, please see:
    https://speakerdeck.com/pacbio/track-1-de-novo-assembly
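The subread length N50 reported in the quality control section can be recomputed from the `subreads.fasta` files. The sketch below uses the common definition of N50 (the length L such that reads of length L or longer contain at least half of all bases); the README does not state which definition was used, so small differences are possible. It uses only the standard library and the function names are illustrative.

```python
def read_fasta_lengths(path):
    """Yield the sequence length of each record in a FASTA file."""
    length = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50 of an iterable of read lengths (0 if it is empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for value in ordered:
        running += value
        if running >= half:
            return value
    return 0
```

To get a per-sample figure, the lengths from all `subreads.fasta` files of that sample would need to be pooled before calling `n50`.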

  • Original article: https://www.cnblogs.com/leezx/p/6108721.html