  • GATK usage (repost)

    http://blog.sciencenet.cn/blog-1469385-819498.html

    Contents

    • I. Preparation
    • II. Pipeline overview
    • III. The pipeline

    First, what can GATK do? It is mainly used for variant calling from sequencing data, covering SNPs and INDELs. For instance, in the now-popular exome sequencing studies, variants are typically found with a BWA + GATK pipeline.

    To run GATK, first get to know its website (http://www.broadinstitute.org/gatk/).

    I. Preparation

    (1) Downloading the program

    http://www.broadinstitute.org/gatk/download

    Be sure to download the latest version, and note that you want the full GATK, not GATK-lite. (It is currently at version 2; who knows when version 3 will arrive. It updates fast, so always use a recent release.) Just unpack it into a directory; no installation is needed. Inside the unpacked directory you will find GenomeAnalysisTK.jar, which contains all the tools we will use.
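
    A quick sanity check that the download works is to ask the jar for its help text (a minimal check; it assumes Java is on your PATH):

    java -jar GenomeAnalysisTK.jar --help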

    (2) Downloading the reference sequence and annotations

    Always run GATK with the sequences and annotations from the GATK resource bundle.

    About the GATK resource bundle:

    http://gatkforums.broadinstitute.org/discussion/1213/whats-in-the-resource-bundle-and-how-can-i-get-it

    GATK resource bundle FTP address:

    http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

    For example, to use hg19 (human genome build 19; b37 is also very common) as the reference,

    enter the following command: lftp ftp.broadinstitute.org -u gsapubftp-anonymous

    Press Enter, then type a space at the password prompt, and you are in the resource bundle. Go into the directory named bundle and find the latest hg19 directory.

    The files to download are:

    ucsc.hg19.fasta

    ucsc.hg19.fasta.fai

    1000G_omni2.5.hg19.vcf

    1000G_omni2.5.hg19.vcf.idx

    1000G_phase1.indels.hg19.vcf

    1000G_phase1.indels.hg19.vcf.idx

    dbsnp_137.hg19.vcf

    dbsnp_137.hg19.vcf.idx

    hapmap_3.3.hg19.vcf

    hapmap_3.3.hg19.vcf.idx

    Mills_and_1000G_gold_standard.indels.hg19.vcf

    Mills_and_1000G_gold_standard.indels.hg19.vcf.idx
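
    For reference, here is a sketch of one such lftp session (the bundle version directory is an assumption; pick the newest one you actually see listed):

    lftp ftp.broadinstitute.org -u gsapubftp-anonymous
    # press Enter at the password prompt, then inside lftp:
    cd bundle/2.3/hg19    # assumed version directory; use the latest
    mget ucsc.hg19.fasta* dbsnp_137.hg19.vcf* hapmap_3.3.hg19.vcf*
    mget 1000G_omni2.5.hg19.vcf* 1000G_phase1.indels.hg19.vcf*
    mget Mills_and_1000G_gold_standard.indels.hg19.vcf*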

    (3) R setup

    Some GATK steps use R to draw plots and will load a number of R packages. If those packages are not installed in the R you are using, the PDF files cannot be generated. You therefore need to install a few R packages, including (but not limited to) the following:

    • bitops
    • caTools
    • colorspace
    • gdata
    • ggplot2
    • gplots
    • gtools
    • RColorBrewer
    • reshape
    • gsalib

    How do you check whether these packages are installed? Take ggplot2 as an example: start R and type library(ggplot2). If no error message appears, ggplot2 is installed; if it is not, run install.packages("ggplot2") and follow the prompts. If a package fails to install, the most likely cause is permissions: the default R was installed by the system administrator and you have no write access under /usr and the like. In that case you can either install your own copy of R in your home directory and install the packages there, or tell install.packages to put the packages in a directory of your own.

    To install R packages into a directory of your own:

    install.packages("ggplot2", lib="/your/own/R-packages/")

    Then add that path to the R_LIBS environment variable.
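
    To save typing, here is a minimal sketch that loops over the list above (gsalib is handled separately below) and installs whatever is missing into your own library; the CRAN mirror and the library path are assumptions, adjust both:

    for pkg in bitops caTools colorspace gdata ggplot2 gplots gtools RColorBrewer reshape; do
        Rscript -e "if (!require('$pkg')) install.packages('$pkg', lib='/your/own/R-packages/', repos='http://cran.r-project.org')"
    done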

    Installing gsalib works a little differently.

    Download gsalib from https://github.com/broadgsa/gatk

    After unpacking, go into the directory containing build.xml and run ant gsalib. That installs gsalib. If gsalib still cannot be loaded, you may need to tell R where it is.

    http://gatkforums.broadinstitute.org/discussion/1244/what-is-a-gatkreport

    That page gives more detailed steps. (Oddly, I have no ~/.Rprofile file, and ant did not add any path either, yet it still worked.)

    Also, building gsalib requires Java 1.6.
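
    Putting the gsalib steps together, a sketch that assumes ant and a 1.6 JDK are already on your PATH (the directory name is hypothetical; use wherever build.xml actually lives):

    # download and unpack https://github.com/broadgsa/gatk first
    cd gatk-master        # hypothetical name: the directory containing build.xml
    ant gsalib
    # if R still cannot load gsalib, point R at its library in ~/.Rprofile, e.g.:
    # .libPaths("/your/own/R-packages/")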

    Setting the paths:

    export R_HOME=/your/R/install/path
    export R_LIBS=/your/R-packages/path

    To find out which R you are currently using, just run which R.

    II. Pipeline overview

    [Figure: GATK Best Practices workflow diagram]

    The figure is taken from http://www.broadinstitute.org/gatk/guide/topic?name=best-practices. When I first came to GATK everyone recommended this Best Practices page, but I simply could not follow it... much of it only makes sense after you have done the work yourself and come back to it. Still, since the old hands all say to read it, do have a look if you are new.

    The good news is that the page now also links the BroadE Workshop materials, some of which are very useful: http://www.broadinstitute.org/gatk/guide/events?id=2038#materials

    Before starting, it is worth reading through these workshop materials once.

    This seqanswers wiki is also good introductory material and shows what each step is doing.

    http://seqanswers.com/wiki/How-to/exome_analysis

    It is rather dated, though: the GATK version it uses is probably pre-2.0, the pipeline has since been adjusted, and some parameter settings have changed. Treat it as a reference only and do not copy its code. For example, its reference-sequence section builds the hg19 genome sequence by hand. But the later steps use various dbSNP and indel annotation files, and their indexing very likely disagrees with a self-built hg19 (besides chr1-22, chrX, chrY and chrM, the resource-bundle hg19 contains a number of extra unassembled sequences), which makes the programs fail and forces you to start over. So, once more: use the files from the GATK resource bundle!

    The same goes for the code I paste below: do not copy it verbatim. Who knows what GATK will change in a few months. It keeps changing all the time... Following the updates on the GATK website is the only reliable way. (Even though the website is, frankly, not great; I often cannot find what I am looking for.)

    III. The pipeline

    (1) Mapping

    I have only worked with paired-end MiSeq and HiSeq data, so that is the example here. BWA is the usual mapper because it allows indels; Bowtie, the other common mapper, cannot produce indels and is mostly used where perfect matches are expected, such as sRNA mapping.

    First, build the index of the reference.

    bwa index -a bwtsw -p hg19 hg19.fasta

    This step takes a long time... above 5 h.

    -p sets the name used for the reference. When it finishes, the following files have been generated:

    hg19.amb

    hg19.ann

    hg19.bwt

    hg19.dict (not produced by bwa index itself; create it with Picard's CreateSequenceDictionary, since GATK requires it)

    hg19.pac

    hg19.sa

    Then the mapping itself; code below. First map pair 1 and pair 2 of the paired-end data separately, producing a sai file each, then combine the two into a SAM file. Finally, use Picard to sort the SAM file and produce a sorted BAM file.

    sample=$1
    data_dir=***
    work_dir=***
    ref_dir=~/database/hg19
    sai_dir=$work_dir/saifiles
    sam_dir=$work_dir/samfiles
    bam_dir=$work_dir/bamfiles
    log_dir=$work_dir/logs
    met_dir=$work_dir/metrics
    other_dir=$work_dir/otherfiles
    gatk=/software/sequencing/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar
    picard=/software/sequencing/picard_latest
    bwa aln $ref_dir/hg19 -t 4 $data_dir/${sample}_1.fastq.gz > $sai_dir/${sample}_1.sai
    bwa aln $ref_dir/hg19 -t 4 $data_dir/${sample}_2.fastq.gz > $sai_dir/${sample}_2.sai
    bwa sampe -r "@RG	ID:$sample	LB:$sample	SM:$sample	PL:ILLUMINA" $ref_dir/hg19 $sai_dir/${sample}_1.sai $sai_dir/${sample}_2.sai $data_dir/${sample}_1.fastq.gz $data_dir/${sample}_2.fastq.gz | gzip > $sam_dir/$sample.sam.gz
    java -Xmx10g -Djava.io.tmpdir=/tmp \
        -jar $picard/SortSam.jar \
        INPUT=$sam_dir/$sample.sam.gz \
        OUTPUT=$bam_dir/$sample.bam \
        SORT_ORDER=coordinate \
        VALIDATION_STRINGENCY=LENIENT \
        CREATE_INDEX=true

    Tips:

    a. BWA and the downstream tools in this series can all read gzip-compressed files directly. To save disk space and the read/write time of compressing and decompressing, use *.gz files as input wherever possible, and gzip the output inside the pipe before writing it out.

    b. Do not omit the -r option in the bwa sampe step.

    c. Picard's SortSam needs a tmp directory for its intermediate files, which can be large, above 10 GB; make sure the directory you point it at has enough space.

    d. For a fastq.gz of roughly 3 GB compressed, bwa aln takes 30-60 min, merging the two sai files into a SAM file takes 60-90 min, and sorting with Picard takes another 30-60 min.

    e. Samtools can also sort SAM files and produce BAM files without the large intermediate files; a sketch follows below.
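
    For tip e, a sketch of that samtools route (legacy samtools 0.1.x syntax assumed), piping bwa sampe straight into a sorted, indexed BAM so no large intermediate SAM touches the disk:

    bwa sampe -r "@RG\tID:$sample\tLB:$sample\tSM:$sample\tPL:ILLUMINA" $ref_dir/hg19 \
        $sai_dir/${sample}_1.sai $sai_dir/${sample}_2.sai \
        $data_dir/${sample}_1.fastq.gz $data_dir/${sample}_2.fastq.gz \
        | samtools view -bS - \
        | samtools sort - $bam_dir/$sample    # writes $bam_dir/$sample.bam
    samtools index $bam_dir/$sample.bam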

    (2) Duplicate marking

    This step marks PCR duplicates. PCR is biased, so some fragments end up overrepresented; they have identical sequence and identical mapping positions and would distort GATK's work. This step sets a flag on such reads so that GATK can recognize them. You can also set REMOVE_DUPLICATES=true to discard the duplicated reads outright, although I have not explored how marking-only versus removal affects the downstream analysis; the official pipeline only marks them.

    java -Xmx15g \
        -Djava.io.tmpdir=/tmp \
        -jar $picard/MarkDuplicates.jar \
        INPUT=$bam_dir/$sample.bam \
        OUTPUT=$bam_dir/$sample.dedupped.bam \
        METRICS_FILE=$met_dir/$sample.dedupped.metrics \
        CREATE_INDEX=true \
        VALIDATION_STRINGENCY=LENIENT

    (3) Local realignment

    Alignments near INDELs are often wrong, so known indel information is used to realign them. Two steps: first RealignerTargetCreator, which outputs a file of possible indel sites; then IndelRealigner, which uses that file to do the realignment.

    The first step needs some known indel information. Which files should be used?

    See the following link: http://gatkforums.broadinstitute.org/discussion/1247/what-should-i-use-as-known-variantssites-for-running-tool-x

    The table there is especially useful: it shows that RealignerTargetCreator and IndelRealigner are each given the two recommended indel annotation files, passed via -known as in the code below.

    Tool                                dbSNP 129   dbSNP >132   Mills indels   1KG indels   HapMap   Omni
    RealignerTargetCreator                                       X              X
    IndelRealigner                                               X              X
    BaseRecalibrator                                X            X              X
    UnifiedGenotyper/HaplotypeCaller                X
    VariantRecalibrator                             X            X                           X        X
    VariantEval                         X

    java -Xmx15g -jar $gatk \
        -T RealignerTargetCreator \
        -R $ref_dir/hg19.fasta \
        -o $other_dir/$sample.realigner.intervals \
        -I $bam_dir/$sample.dedupped.bam \
        -known $ref_dir/1000G_phase1.indels.hg19.vcf \
        -known $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf

    java -Xmx15g -jar $gatk \
        -T IndelRealigner \
        -R $ref_dir/hg19.fasta \
        -I $bam_dir/$sample.dedupped.bam \
        -targetIntervals $other_dir/$sample.realigner.intervals \
        -o $bam_dir/$sample.dedupped.realigned.bam \
        -known $ref_dir/1000G_phase1.indels.hg19.vcf \
        -known $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf

    In some older pipelines (such as the seqanswers one) this step was followed by fixing the mate information of the paired-end data. Recent versions no longer need that; the indel realigner fixes mates itself. See http://gatkforums.broadinstitute.org/discussion/1562/need-to-run-a-step-with-fixmateinformation-after-realignment-step

    (4) Base quality score recalibration

    This step, BQSR for short, recalibrates the base quality scores. It again uses known SNP and indel files (see the table above and the -knownSites options in the code). The first BaseRecalibrator run produces the recal.grp file used for recalibration plus a before-recalibration plot; the second BaseRecalibrator run produces an after-recalibration plot; finally, PrintReads applies recal.grp and writes the recalibrated BAM file.

    java -Xmx15g -jar $gatk \
        -T BaseRecalibrator \
        -R $ref_dir/hg19.fasta \
        -I $bam_dir/$sample.dedupped.realigned.bam \
        -knownSites $ref_dir/dbsnp_137.hg19.vcf \
        -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -knownSites $ref_dir/1000G_phase1.indels.hg19.vcf \
        -o $other_dir/$sample.recal.grp \
        -plots $other_dir/$sample.recal.grp.pdf

    java -Xmx15g -jar $gatk \
        -T BaseRecalibrator \
        -R $ref_dir/hg19.fasta \
        -I $bam_dir/$sample.dedupped.realigned.bam \
        -BQSR $other_dir/$sample.recal.grp \
        -o $other_dir/$sample.post_recal.grp \
        -plots $other_dir/$sample.post_recal.grp.pdf \
        -knownSites $ref_dir/dbsnp_137.hg19.vcf \
        -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -knownSites $ref_dir/1000G_phase1.indels.hg19.vcf

    java -Xmx15g -jar $gatk \
        -T PrintReads \
        -R $ref_dir/hg19.fasta \
        -I $bam_dir/$sample.dedupped.realigned.bam \
        -BQSR $other_dir/$sample.recal.grp \
        -o $bam_dir/$sample.dedupped.realigned.recal.bam

    TIPS:

    If this step produces no PDF, the Rscript call has most likely failed. Check that the required R packages are installed. If you still cannot find the cause, rerun the two BaseRecalibrator steps with -l DEBUG added; the log will then state in detail why Rscript could not run.
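
    For example, a sketch of the first pass with debug logging switched on (all other arguments exactly as in the step above):

    java -Xmx15g -jar $gatk \
        -T BaseRecalibrator -l DEBUG \
        -R $ref_dir/hg19.fasta \
        -I $bam_dir/$sample.dedupped.realigned.bam \
        -knownSites $ref_dir/dbsnp_137.hg19.vcf \
        -knownSites $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -knownSites $ref_dir/1000G_phase1.indels.hg19.vcf \
        -o $other_dir/$sample.recal.grp \
        -plots $other_dir/$sample.recal.grp.pdf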

    (5) Reduce BAM file

    This step essentially compresses the BAM file, shrinking its size and making it easier for the downstream UnifiedGenotyper to read and process the data.

    java -Xmx15g -jar $gatk \
        -T ReduceReads \
        -R $ref_dir/hg19.fasta \
        -I $bam_dir/$sample.dedupped.realigned.recal.bam \
        -o $bam_dir/$sample.dedupped.realigned.recal.reduced.bam

    The BAM files needed for variant calling are now ready. The next part covers variant calling with UnifiedGenotyper and the downstream analysis.

    (6) Variant calling

    GATK offers two tools for variant calling: HaplotypeCaller and UnifiedGenotyper.

    According to the GATK developers, HaplotypeCaller uses a local de novo assembler and an HMM likelihood function and outperforms UnifiedGenotyper, but it is still experimental and may hit problems at runtime. GATK's recommendation is to use HaplotypeCaller whenever you can. Note that HaplotypeCaller currently cannot take reduced BAM files as input and does not support multithreading.

    The GATK tool used here is UnifiedGenotyper.

    Rather than calling each sample separately, multi-sample calling is the better approach: such pooled calling draws on the information of several samples at once, so each site is supported by more evidence and is more trustworthy.

    data_dir=*******
    work_dir=******
    ref_dir=~/database/hg19
    bam_dir=$work_dir/bamfiles
    log_dir=$work_dir/logs
    met_dir=$work_dir/metrics
    other_dir=$work_dir/otherfiles
    vcf_dir=$work_dir/vcf
    gatk=/software/sequencing/GenomeAnalysisTK-2.3-4-g57ea19f/GenomeAnalysisTK.jar
    java -Xmx15g -jar $gatk \
        -glm BOTH \
        -l INFO \
        -R $ref_dir/hg19.fasta \
        -T UnifiedGenotyper \
        -I $bam_dir/sample1.dedupped.realigned.recal.reduced.bam \
        -I ... \
        -I ... \
        -D $ref_dir/dbsnp_137.hg19.vcf \
        -o $vcf_dir/allsamples.vcf \
        -metrics $met_dir/snpcall.metrics \
        -L $work_dir/Truseq.bed

    -glm BOTH calls SNPs and INDELs at the same time.

    -L gives the BED file of the capture regions.

    For deep-coverage exome data this step takes a good while, >12 h. Since each chromosome is called more or less independently, you can save time by calling each chromosome separately and merging the resulting VCFs with CombineVariants (a sketch follows after the links below). UnifiedGenotyper can also run multithreaded; for details see these two articles:

    http://gatkforums.broadinstitute.org/discussion/1988/a-primer-on-parallelism-with-the-gatk#latest

    http://www.broadinstitute.org/gatk/guide/article?id=1975
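
    As mentioned above, a sketch of the per-chromosome route; the chromosome list assumes hg19 naming, each background job needs its own memory, and depending on your GATK version CombineVariants may also need a genotype merge option:

    for chr in chr{1..22} chrX chrY; do
        java -Xmx15g -jar $gatk -T UnifiedGenotyper -glm BOTH \
            -R $ref_dir/hg19.fasta \
            -I $bam_dir/sample1.dedupped.realigned.recal.reduced.bam \
            -D $ref_dir/dbsnp_137.hg19.vcf \
            -L $chr \
            -o $vcf_dir/allsamples.$chr.vcf &    # or submit one job per chromosome to your cluster
    done
    wait    # let all per-chromosome jobs finish
    java -Xmx15g -jar $gatk -T CombineVariants \
        -R $ref_dir/hg19.fasta \
        $(for chr in chr{1..22} chrX chrY; do echo "--variant $vcf_dir/allsamples.$chr.vcf"; done) \
        -o $vcf_dir/allsamples.vcf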

    (7) Variant quality score recalibration

    Before this step it is well worth reading

    http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr

    to understand how VQSR works: it builds a model from known variants and uses it to evaluate and filter new ones.

    Note that VQSR is only appropriate for exome or whole-genome sequencing with more than 30 samples. If you have fewer samples, or the capture region is too small, see http://www.broadinstitute.org/gatk/guide/topic?name=best-practices for the recommended workarounds; the usual one is hard filtering with VariantFiltration.

    Where VQSR does apply, it is done in two steps:

    a. VariantRecalibrator. This step builds the model and produces the files used for evaluation and recalibration.

    java -Xmx15g -jar $gatk \
        -T VariantRecalibrator \
        -R $ref_dir/hg19.fasta \
        -mode SNP \
        -input $vcf_dir/allsamples.vcf \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $ref_dir/hapmap_3.3.hg19.vcf \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 $ref_dir/1000G_omni2.5.hg19.vcf \
        -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 $ref_dir/dbsnp_137.hg19.vcf \
        -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
        -recalFile $other_dir/allsamples_snp.recal \
        -tranchesFile $other_dir/allsamples_snp.tranches \
        -rscriptFile $other_dir/allsamples_snp.plots.R \
        --TStranche 90.0 \
        --TStranche 93.0 \
        --TStranche 95.0 \
        --TStranche 97.0 \
        --TStranche 99.0 \
        --TStranche 99.9 \
        --TStranche 100.0

    Run the code above first. The -an flags should include as many of the variant annotations as possible, so that we can evaluate them afterwards and pick the best combination. This step produces the allsamples_snp.plots.R plots, which look like this:

    [Figure: example pairwise annotation scatter plots produced by VariantRecalibrator]

    Each panel is a scatter plot of two annotations; see the link above for their exact meaning. The task is to pick the annotations that cleanly separate known vs. novel and retained vs. filtered variants. In the figure, MQRankSum and HaplotypeScore cluster the known variants clearly, so variants far from the known cluster can be filtered; those two annotations are therefore kept.

    Choosing the annotations is admittedly a little subjective and takes experience... anyway, good luck.

    Once you have picked all the annotations that discriminate well, run the code above once more, this time dropping the unselected annotations from the -an line.

    If you cannot decide whether a given annotation should stay, look at the .tranches and .tranches.pdf files and compare which combination gives the better result. How to judge "better"? (1) the Ti/Tv ratio; (2) the number of variants retained; (3) the proportion of true positives to false positives. In general, the closer Ti/Tv gets to about 3 the better; the smaller it is, the more false positives.

    The figure below shows the contents of a .tranches file. For each target truth sensitivity it lists how many novel and known variants are included and the corresponding Ti/Tv ratio. The higher the target truth sensitivity, the more variants are included and the more potential false positives; to cut false positives and raise specificity, lower the target truth sensitivity. Target truth sensitivities are set with --TStranche; by default VariantRecalibrator only provides the 99 tranche, so setting several extra thresholds in the command, as above, lets us choose between sensitivity and specificity later.

    [Figure: example contents of a .tranches file]

    INDELs get the same treatment, except that the training resources differ from the SNP ones and, following GATK's advice, a few extra parameters are set.

    java -Xmx15g -jar $gatk \
        -T VariantRecalibrator \
        -R $ref_dir/hg19.fasta \
        -mode INDEL \
        --maxGaussians 4 -std 10.0 -percentBad 0.12 \
        -input $vcf_dir/allsamples.vcf \
        -resource:mills,known=true,training=true,truth=true,prior=12.0 $ref_dir/Mills_and_1000G_gold_standard.indels.hg19.vcf \
        -an MQ -an FS -an InbreedingCoeff \
        -recalFile $other_dir/allsamples_indel.recal \
        -tranchesFile $other_dir/allsamples_indel.tranches \
        -rscriptFile $other_dir/allsamples_indel.plots.R

    Again, settle on the -an flags and then run it a second time.

    b. ApplyRecalibration. Apply the SNP recalibration first, then the INDEL one.

    java -Xmx15g -jar $gatk \
        -T ApplyRecalibration \
        -R $ref_dir/hg19.fasta \
        -mode SNP \
        -input $vcf_dir/allsamples.vcf \
        -tranchesFile $other_dir/allsamples_snp.tranches \
        -recalFile $other_dir/allsamples_snp.recal \
        -o $vcf_dir/allsamples_t90.SNPrecal.vcf \
        --ts_filter_level 90

    java -Xmx15g -jar $gatk \
        -T ApplyRecalibration \
        -R $ref_dir/hg19.fasta \
        -mode INDEL \
        -input $vcf_dir/allsamples_t90.SNPrecal.vcf \
        -tranchesFile $other_dir/allsamples_indel.tranches \
        -recalFile $other_dir/allsamples_indel.recal \
        -o $vcf_dir/allsamples_t90.bothrecal.vcf

    --ts_filter_level 90 means that variants falling within the 90% target truth sensitivity tranche are marked PASS in the FILTER field; the rest are flagged according to the TStranche values we set, e.g. VQSRTrancheSNP90.00to93.00, VQSRTrancheSNP93.00to95.00, and so on.
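
    A quick way to see how many records fell into each tranche is to tally the FILTER column with plain Unix tools:

    grep -v '^#' $vcf_dir/allsamples_t90.bothrecal.vcf | cut -f7 | sort | uniq -c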

    (8) Filtering

    After VQSR you can filter on the FILTER column of the VCF. There is no single standard; it comes down to the trade-off you want between specificity and sensitivity, guided mainly by the numbers in the .tranches file. If you want as many novel variants as possible and do not mind some false positives, pick a high target truth sensitivity such as 97. The strategy is then: keep variants whose FILTER is PASS, VQSRTrancheSNP90.00to93.00, VQSRTrancheSNP93.00to95.00 or VQSRTrancheSNP95.00to97.00, and drop the rest. awk does the job, as sketched below.
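
    A sketch of that awk filter for the 97% strategy just described (file names follow the earlier steps; add the matching VQSRTrancheINDEL values if you want the same cut-off for indels):

    awk -F'\t' '/^#/ { print; next }
        $7 == "PASS" ||
        $7 == "VQSRTrancheSNP90.00to93.00" ||
        $7 == "VQSRTrancheSNP93.00to95.00" ||
        $7 == "VQSRTrancheSNP95.00to97.00"' \
        $vcf_dir/allsamples_t90.bothrecal.vcf > $vcf_dir/allsamples.filtered.vcf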

    (9) Variant evaluation

    The tool here is VariantEval. It reports a lot of summary information, such as each sample's Ti/Tv ratio, variant counts, and so on.

    java -Xmx70g -jar $gatk \
        -T VariantEval \
        -R $ref_dir/genome.fa \
        --eval:allsamples $vcf_dir/allsamples_t90-97.bothrecal.vcf \
        -o $other_dir/allsamples_t90.eval.gatkreport \
        -D $ref_dir/dbsnp_137.hg19.vcf \
        -ST Sample -ST Filter \
        -noEV \
        -EV TiTvVariantEvaluator

    java -Xmx70g -jar $gatk \
        -T VariantEval \
        -R $ref_dir/genome.fa \
        --eval:allsamples $vcf_dir/allsamples_t90-97.bothrecal.vcf \
        -o $other_dir/allsamples_t90.eval_1.gatkreport \
        -D $ref_dir/dbsnp_137.hg19.vcf \
        -ST Sample -ST Filter \
        -noEV \
        -EV CompOverlap \
        -EV CountVariants \
        -EV IndelSummary \
        -EV MultiallelicSummary \
        -EV ValidationReport

    This step is very memory-hungry, so it sometimes has to be run several times, producing part of the results each time.

    -ST stratifies the statistics, here per sample and per FILTER value.

    -noEV: the default set of evaluation modules is too heavy to run in one go, so disable it and pick the metrics of interest with -EV flags.

    This article comes from:

    http://blog.sina.com.cn/s/blog_681aaa550101bhdc.html

    http://blog.sina.com.cn/s/blog_681aaa550101cdmk.html

    GATK UnifiedGenotyper for variant calling

    For the impatient:

    Parameters marked * are the most commonly used.

    $ java -Xmx20g -jar GenomeAnalysisTK.jar   # run the GATK main program
    *    -R ref.fasta   # reference sequence
    *    -T UnifiedGenotyper   # the GATK tool to run
    *    -I sample1.bam [-I sample2.bam] [-I ...]  # input BAM alignment files
         --dbsnp dbSNP.vcf   # copy the variant IDs in this file into the output
         --genotyping_mode GENOTYPE_GIVEN_ALLELES --alleles known.vcf   # genotype only the given known variants
    *    --genotype_likelihoods_model [SNP/INDEL/BOTH]   # call SNPs, INDELs, or both at once
         --min_base_quality_score 17   # minimum base quality for variant calling; ignored for INDEL calling, where low-quality 3' ends are clipped back to Q20 instead and indel sites are identified with the two options below
         --min_indel_count_for_genotyping 5   # at least this many reads must show an indel at a site
         --min_indel_fraction_per_sample 0.25   # at least this fraction of reads must show an indel at a site
    *    --out gatk.vcf   # output file; defaults to standard output
    *    --output_mode [EMIT_VARIANTS_ONLY/EMIT_ALL_CONFIDENT_SITES/EMIT_ALL_SITES]   # output mode; the default emits variant sites only
         --sample_ploidy 2   # sample ploidy
    *    --standard_min_confidence_threshold_for_calling 30.0   # confidence threshold for calling; sites below it are marked low quality
    *    --standard_min_confidence_threshold_for_emitting 30.0   # sites below this value are not reported in the VCF at all

    1. Introduction to UnifiedGenotyper

    UnifiedGenotyper is one of the main tools in GATK (Genome Analysis ToolKit) and is used for variant calling. The GATK website describes it as: A variant caller which unifies the approaches of several disparate callers — Works for single-sample and multi-sample data.

    UnifiedGenotyper calls SNPs and INDELs on single or multiple samples, using a Bayesian genotype likelihood model to determine genotypes and allele frequencies across N samples.

    2. Basic usage and input requirements

    An example of SNP calling on multiple samples with UnifiedGenotyper:

    $ java -jar GenomeAnalysisTK.jar \
        -R resources/Homo_sapiens_assembly18.fasta \
        -T UnifiedGenotyper \
        -I sample1.bam [-I sample2.bam] \
        --dbsnp dbSNP.vcf \
        -o snps.raw.vcf \
        -stand_call_conf [50.0] \
        -stand_emit_conf 10.0 \
        -dcov [50 for 4x, 200 for >30x WGS or Whole exome] \
        [-L targets.interval_list]

    The command above takes two input files: the reference fasta and the BAM alignment results. Those two alone are not enough, though: the fasta's dict and fai files and the BAM's bai index are also required. They are produced with picard and samtools as follows:

    $ java -jar $picardHome/CreateSequenceDictionary.jar R=./resources/Homo_sapiens_assembly18.fasta O=./resources/Homo_sapiens_assembly18.dict
    $ samtools faidx ./resources/Homo_sapiens_assembly18.fasta
    $ samtools index sample1.bam

    3. Additional information

    3.1 Read filters

    When UnifiedGenotyper runs, reads are filtered automatically by the following read filters:
    DuplicateReadFilter
    FailsVendorQualityCheckFilter
    NotPrimaryAlignmentFilter
    MalformedReadFilter
    BadMateFilter
    MappingQualityUnavailableFilter
    UnmappedReadFilter

    3.2 Parallelization

    The tool can run multithreaded via the following options:
    NanoSchedulable (-nct)
    Number of CPU threads per data thread, i.e. how many CPUs one command may use on one chunk of data.
    TreeReducible (-nt)
    Number of data threads: the input is split into this many parts, each handled by its own thread (maximum 24). Merging the results at the end may cost extra time, and memory use grows by the same factor.
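
    A sketch of how the two flags combine on one command line (the core counts are assumptions; remember that memory grows with -nt):

    # 4 data threads x 2 CPU threads each = 8 cores in total
    java -Xmx16g -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R ref.fasta \
        -I sample1.bam \
        -nt 4 -nct 2 \
        -o snps.raw.vcf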

    3.3 Downsampling settings

    This tool overrides the engine’s default downsampling settings.

    Mode: BY_SAMPLE
    To coverage: 250

    3.4 Window size

    This tool uses a sliding window on the reference.

    Window start: -200 bp before the locus
    Window stop: 200 bp after the locus

    4. UnifiedGenotyper parameters

    4.1 GATK-wide parameters

    It accepts the parameters shared by all GATK tools: CommandLineGATK.

    4.2 UnifiedGenotyper-specific parameters

    • --alleles / -alleles RodBinding[VariantContext] default: none

    The set of alleles at which to genotype when --genotyping_mode is GENOTYPE_GIVEN_ALLELES. When the UnifiedGenotyper is put into GENOTYPE_GIVEN_ALLELES mode it will genotype the samples using only the alleles provided in this rod binding. --alleles binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

    • --annotateNDA / -nda default: false

    If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site. Depending on the value of the --max_alternate_alleles argument, we may genotype only a fraction of the alleles being sent on for genotyping. Using this argument instructs the genotyper to annotate (in the INFO field) the number of alternate alleles that were originally discovered at the site.

    • --annotation / -A List[String] default: none

    One or more specific annotations to apply to variant calls. Which annotations to add to the output VCF file. See the VariantAnnotator -list argument to view available annotations.

    • --comp / -comp List[RodBinding[VariantContext]] default: none

    Comparison VCF file. If a call overlaps with a record from the provided comp track, the INFO field will be annotated as such in the output with the track name (e.g. -comp:FOO will have 'FOO' in the INFO field). Records that are filtered in the comp track will be ignored. Note that 'dbSNP' has been special-cased (see the --dbsnp argument). --comp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

    • --computeSLOD / -slod boolean default: false

    If provided, we will calculate the SLOD (SB annotation). Note that calculating the SLOD increases the runtime by an appreciable amount.

    • -contamination / --contamination_fraction_to_filter double default: 0.05

    Fraction of contamination in sequencing data (for all samples) to aggressively remove. If this fraction is greater than zero, the caller will aggressively attempt to remove contamination through biased down-sampling of reads. Basically, it will ignore the contamination fraction of reads for each alternate allele. So if the pileup contains N total bases, then we will try to remove (N * contamination fraction) bases for each alternate allele.

    • -contaminationFile / --contamination_fraction_per_sample_file File

    Tab-separated file giving the fraction of contamination in sequencing data (per sample) to aggressively remove. The file has two columns, "sample" and "contamination" (a double), one pair per line, with no header; it specifies the contamination level for those samples. Samples that do not appear in this file will be processed with CONTAMINATION_FRACTION.

    • --dbsnp / -D RodBinding[VariantContext] default: none

    dbSNP file. rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. dbSNP is not used in any way for the calculations themselves. --dbsnp binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

    • --excludeAnnotation / -XA List[String] default: none

    One or more specific annotations to exclude. Which annotations to exclude from output in the VCF file. Note that this argument has higher priority than the -A or -G arguments, so annotations will be excluded even if they are explicitly included with the other options.

    • --genotype_likelihoods_model / -glm Model default: SNP

    Genotype likelihoods calculation model to employ — SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together.
    The --genotype_likelihoods_model argument is an enumerated type (Model), which can have one of the following values:
    SNP
    INDEL
    GENERALPLOIDYSNP
    GENERALPLOIDYINDEL
    BOTH

    • --genotyping_mode / -gt_mode GENOTYPING_MODE default: DISCOVERY

    Specifies how to determine the alternate alleles to use for genotyping.
    The --genotyping_mode argument is an enumerated type (GENOTYPING_MODE), which can have one of the following values:
    DISCOVERY
    the Unified Genotyper will choose the most likely alternate allele
    GENOTYPE_GIVEN_ALLELES
    only the alleles passed in from a VCF rod bound to the -alleles argument will be used for genotyping

    • --group / -G String[] default: [Standard]

    One or more classes/groups of annotations to apply to variant calls. If specified, all available annotations in the group will be applied. See the VariantAnnotator -list argument to view available groups. Keep in mind that RODRequiringAnnotations are not intended to be used as a group, because they require specific ROD inputs.

    • --heterozygosity / -hets Double default: 0.0010

    The expected heterozygosity value used to compute prior likelihoods for any locus. The default priors are: het = 1e-3, P(hom-ref genotype) = 1 - 3 * het / 2, P(het genotype) = het, P(hom-var genotype) = het / 2

    • --indel_heterozygosity / -indelHeterozygosity double default: 1.25E-4

    Heterozygosity for indel calling. This argument informs the prior probability of having an indel at a site.

    • --indelGapContinuationPenalty / -indelGCP byte default: 10

    Indel gap continuation penalty, as Phred-scaled probability. I.e., 30 => 10^-30/10.

    • --indelGapOpenPenalty / -indelGOP byte default: 45

    Indel gap open penalty, as Phred-scaled probability. I.e., 30 => 10^-30/10.

    • --input_prior / -inputPrior List[Double] default: none

    Input prior for calls. By default, the prior specified with the argument --heterozygosity/-hets is used for variant discovery at a particular locus, using an infinite sites model; see e.g. Watterson (1975) or Tajima (1996). This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k for k = 1:N. There are instances where using this prior might not be desirable, e.g. for population studies where the prior might not be appropriate, such as when the ancestral status of the reference allele is not known. With this argument, the user can manually specify the priors to be used for calling as a vector of doubles, with the following restrictions: a) the user must specify 2N values, where N is the number of samples; b) only diploid calls are supported; c) probability values are specified in double format, in linear space; d) no negative values are allowed; e) the values will be added and Pr(AC=0) will be 1-sum, so that they sum to one; f) if the user-defined values add up to more than one, an error is produced. If the user wants completely flat priors, they should specify the same value (=1/(2*N+1)) 2*N times, e.g. -inputPrior 0.33 -inputPrior 0.33 for the single-sample diploid case.

    • --max_alternate_alleles / -maxAltAlleles int default: 6

    Maximum number of alternate alleles to genotype. If there are more than this number of alternate alleles presented to the genotyper (either through discovery or GENOTYPE_GIVEN_ALLELES), then only this many alleles will be used. Note that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter. As of GATK 2.2 the genotyper can handle a very large number of events, so the default maximum has been increased to 6.

    • --max_deletion_fraction / -deletions Double default: 0.05

    Maximum fraction of reads with deletions spanning this locus for it to be callable [to disable, set to < 0 or > 1; default:0.05].

    • --min_base_quality_score / -mbq int default: 17

    Minimum base quality required to consider a base for calling. The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base is capped by the mapping quality so that bases on reads with low mapping quality may get filtered out depending on this value. Note too that this argument is ignored in indel calling. In indel calling, low-quality ends of reads are clipped off (with fixed threshold of Q20).

    • -minIndelCnt / --min_indel_count_for_genotyping int default: 5

    Minimum number of consensus indels required to trigger genotyping run. A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site. Decreasing this value will increase sensitivity but at the cost of larger calling time and a larger number of false positives.

    • -minIndelFrac / --min_indel_fraction_per_sample double default: 0.25

    Minimum fraction of all reads at a locus that must contain an indel (of any allele) for that sample to contribute to the indel count for alleles. Complementary argument to minIndelCnt. Only samples with at least this fraction of indel-containing reads will contribute to counting and overcoming the threshold minIndelCnt. This parameter ensures that in deep data you don't end up summing lots of super-rare errors to overcome the 5-read default threshold. It should work equally well for low- and high-coverage samples, as low-coverage samples with any indel-containing reads should easily clear this threshold.

    • --out / -o VariantContextWriter default: stdout

    File to which variants should be written. A raw, unfiltered, highly sensitive callset in VCF format.

    • --output_mode / -out_mode OUTPUT_MODE default: EMIT_VARIANTS_ONLY

    Specifies which type of calls we should output.
    The --output_mode argument is an enumerated type (OUTPUT_MODE), which can have one of the following values:
    EMIT_VARIANTS_ONLY
    produces calls only at variant sites
    EMIT_ALL_CONFIDENT_SITES
    produces calls at variant sites and confident reference sites
    EMIT_ALL_SITES
    produces calls at any callable site regardless of confidence; this argument is intended only for point mutations (SNPs) in DISCOVERY mode or generally when running in GENOTYPE_GIVEN_ALLELES mode; it will by no means produce a comprehensive set of indels in DISCOVERY mode

    • --pair_hmm_implementation / -pairHMM HMM_IMPLEMENTATION default: ORIGINAL

    The PairHMM implementation to use for -glm INDEL genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime.
    The --pair_hmm_implementation argument is an enumerated type (HMM_IMPLEMENTATION), which can have one of the following values:
    EXACT
    ORIGINAL
    LOGLESS_CACHING

    • --pcr_error_rate / -pcr_error Double default: 1.0E-4

    The PCR error rate to be used for computing fragment-based likelihoods. The PCR error rate is independent of the sequencing error rate, which is necessary because we cannot necessarily distinguish between PCR errors vs. sequencing errors. The practical implication for this value is that it effectively acts as a cap on the base qualities.

    • --sample_ploidy / -ploidy int default: 2

    Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).

    • -stand_call_conf / --standard_min_confidence_threshold_for_calling double default: 30.0

    The minimum phred-scaled confidence threshold at which variants should be called. The minimum phred-scaled Qscore threshold to separate high confidence from low confidence calls. Only genotypes with confidence >= this threshold are emitted as called sites. A reasonable threshold is 30 for high-pass calling (this is the default).

    • -stand_emit_conf / --standard_min_confidence_threshold_for_emitting double default: 30.0

    The minimum phred-scaled confidence threshold at which variants should be emitted (and filtered with LowQual if less than the calling threshold). This argument allows you to emit low quality calls as filtered records.

    GATK SNP calling workflow

    I. Raw data processing

    Filter the raw sequencing data with NGSQCToolkit ($path points at the NGSQCToolkit installation):
    $path/IlluQC.pl -pe read1.fastq read2.fastq 2 A -l 70 -s 20 -p 2 -z g

    • -pe: paired-end data
    • 2 = Paired End DNA Library
    • A = Automatic detection of FASTQ variant
    • -l: cut-off for the percentage of the read length that must reach the given quality
    • -s (-cutOffQualScore): cut-off for the read quality score
    • -z g: write the output as gzip-compressed files

    II. Pre-GATK processing

    GATK calls SNPs from a sorted BAM file. The BAM file is converted from a SAM file, which is produced by bwa, bowtie, or a similar aligner.

    1. Generate the SAM file with bwa

     # build the reference genome index
     $path/bwa index chlorandrum.fa
     # find the SA coordinates of the input reads
     bwa aln chlorandrum.fa read1.filtered.fq > read1.fq.sai
     bwa aln chlorandrum.fa read2.filtered.fq > read2.fq.sai
     # convert sai format to sam format
     bwa sampe chlorandrum.fa -r "@RG\tID:<ID>\tLB:<LIBRARY_NAME>\tSM:<SAMPLE_NAME>\tPL:ILLUMINA" read1.fq.sai read2.fq.sai read1.filtered.fq read2.filtered.fq > test_reads.sam

    For GATK SNP calling the -r read-group option is mandatory:

    @RG ID:<ID> LB:<LIBRARY_NAME> SM:<SAMPLE_NAME> PL:ILLUMINA

    When calling SNPs on multiple samples, set <SAMPLE_NAME> to the respective sample name for each sample; otherwise all the reads will be treated as coming from one sample.

    2. Re-sort the SAM file

     samtools faidx chlorandrum.fa
     java -jar $picard CreateSequenceDictionary R=chlorandrum.fa O=chlorandrum.dict
     java -jar $picard ReorderSam I=test_reads.sam O=test_reads.reordered.sam R=chlorandrum.fa

    3. Convert the SAM file to a BAM file

    samtools view -bS test_reads.reordered.sam -o test.bam

    4. Sort the BAM file

    java -jar $path/picard.jar SortSam INPUT=test.bam OUTPUT=test.sorted.bam SORT_ORDER=coordinate

    5. Duplicate marking

    Sequencing libraries are made by random shearing, so two completely identical reads should in theory be very rare; since the PCR amplification during library preparation is biased, completely identical reads are therefore marked as duplicates.

    java -jar $picard MarkDuplicates REMOVE_DUPLICATES=false MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 INPUT=test.sorted.bam OUTPUT=test.repeatmark.bam METRICS_FILE=test.bam.metrics

    6. Index the BAM file

    samtools index test.repeatmark.bam

    III. Variant detection with GATK

    1. Generate the raw VCF file

    # for multiple samples, give several -I arguments: -I sample1.bam -I sample2.bam
    java -Xmx2G -jar $GATK \
        -T HaplotypeCaller \
        -R chlorandrum.fa \
        -I test.repeatmark.bam \
        --genotyping_mode DISCOVERY \
        -stand_emit_conf 10 \
        -stand_call_conf 30 \
        -o raw_variants.vcf

    2. Select SNPs

    java -Xmx2g -jar $GATK -R $REF -T SelectVariants -o $Select_SNP --variant $RAW_vcf -selectType SNP 2>select_snp.err

    3. Select INDELs

    java -Xmx2g -jar $GATK -R $REF -T SelectVariants -o $Select_INDEL --variant $RAW_vcf -selectType INDEL 2>select_indel.err

    4. Filter SNPs

    java -Xmx4g -jar $GATK -R $REF -T VariantFiltration \
        --variant $Select_SNP \
        --clusterSize 4 --clusterWindowSize 10 \
        --maskName aroundIndel --mask $Select_INDEL -maskExtend 3 \
        --filterName "lowMQRankSum" --filterExpression "MQRankSum < -12.5" \
        --filterName "highFS" --filterExpression "FS > 60.0" \
        --filterName "lowReadPosRankSum" --filterExpression "ReadPosRankSum < -8.0" \
        --filterName "lowMQ" --filterExpression "MQ < 40.0" \
        --filterName "lowQD" --filterExpression "QD < 2.0" \
        --genotypeFilterName "lowDP" --genotypeFilterExpression "DP < 8.0" \
        --out $Filter_SNP 2>filter_snp.err
    • Detailed parameter notes
    • --clusterSize and --clusterWindowSize are used together to cap the number of variants within a given window: with -window set to 10 and -cluster set to 3, for example, more than 3 variants within 10 bp get flagged as SnpCluster. Clustered SNPs like this are generally considered unreliable (more than 3 within 10 bp; Bowen et al., 2011).
    • --mask flags sites that coincide with records in a given file; it is useful, for example, for filtering sites in repeat regions. It accepts a file in any of many supported formats, so it applies quite broadly; this is only a brief introduction. --maskName, as before, sets the FILTER label attached to the masked sites.
    • --filterExpression and --filterName are likewise used as a pair: the first sets a filtering expression and the second the label that marks those sites in the output. They must be given in matching pairs, and the pair can be repeated several times.
    • --missingValuesInExpressionsShouldEvaluateAsFailing (Boolean with default value false)
      One last parameter: when there are many filtering criteria, some sites may lack one or more of the annotations used in an expression such as
      QUAL < 30.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0
      Not every site carries all of these fields, and GATK then emits a lot of WARNING messages; with this flag, sites missing any of the annotations are also marked as failing the filter.
    • FS: Phred-scaled p-value from Fisher's exact test for strand bias; the smaller, the better. For filtering, FS < 10~20 is a common setting.
    • MQ: RMS Mapping Quality
    • QD: Variant Confidence/Quality by Depth
    • ReadPosRankSum: Z-score from the Wilcoxon rank sum test of Alt vs. Ref read position bias
    • MQRankSum: Z-score from the Wilcoxon rank sum test of Alt vs. Ref read mapping qualities