一个全基因组重测序分析实战

zoukankan html css js c++ java

一个全基因组重测序分析实战
Original 2017-06-08 曾健明生信技能树
这里选取的是 GATK best practice 是目前认可度最高的全基因组重测序分析流程，尤其适用于 人类研究。

PS：其实本文应该属于直播我的基因组系列，有两个原因把它单独拿出来，

首先，直播我的基因组阅读量太低了，可能是大家觉得错过了前面的，后面的看起来没有必要，这里我可以肯定的告诉大家，这一讲是独立的，而且是全流程，你学好了这个，整个直播我的基因组就可以不用看了。

其次，最近有一些朋友写了一些GATK的教程，但是大多不合我意，作为回应，我也写一个，秀出我的教程风格。

流程介绍

bwa(MEM alignment)

picard(SortSam)

picard(MarkDuplicates)

picard(FixMateInfo)

GATK(RealignerTargetCreator)

GATK(IndelRealigner)

GATK(BaseRecalibrator)

GATK(PrintReads)

GATK(HaplotypeCaller)

GATK(GenotypeGVCFs)

在本文，我将会把我的 全基因组重测序数据走完上面所有的流程，并给出代码和时间消耗情况。

准备工作

首先是软件安装

## Download and install BWA

cd ~/biosoft

mkdir bwa && cd bwa

#http://sourceforge.net/projects/bio-bwa/files/

wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2

tar xvfj bwa-0.7.15.tar.bz2 # x extracts, v is verbose (details of what it is doing), f skips prompting for each individual file, and j tells it to unzip .bz2 files

cd bwa-0.7.15

make

## Download and install samtools

## http://samtools.sourceforge.net/

## http://www.htslib.org/doc/samtools.html

cd ~/biosoft

mkdir samtools && cd samtools

wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2

tar xvfj samtools-1.3.1.tar.bz2

cd samtools-1.3.1

./configure --prefix=/home/jianmingzeng/biosoft/myBin

make

make install

~/biosoft/myBin/bin/samtools --help

~/biosoft/myBin/bin/plot-bamstats --help

cd htslib-1.3.1

./configure --prefix=/home/jianmingzeng/biosoft/myBin

make

make install

~/biosoft/myBin/bin/tabix

## Download and install picardtools

## https://sourceforge.net/projects/picard/

## https://github.com/broadinstitute/picard

cd ~/biosoft

mkdir picardtools && cd picardtools

wget http://ncu.dl.sourceforge.net/project/picard/picard-tools/1.119/picard-tools-1.119.zip

unzip picard-tools-1.119.zip

mkdir 2.9.2 && cd 2.9.2

wget https://github.com/broadinstitute/picard/releases/download/2.9.2/picard.jar

## GATK 需要自行申请下载，不能公开

其次是必备数据的下载：

cd ~/reference

mkdir -p genome/human_g1k_v37 && cd genome/human_g1k_v37

# http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

nohup wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz &

gunzip human_g1k_v37.fasta.gz

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/README.human_g1k_v37.fasta.txt

java -jar ~/biosoft/picardtools/picard-tools-1.119/CreateSequenceDictionary.jar R=human_g1k_v37.fasta O=human_g1k_v37.dict

cd ~/reference

mkdir -p index/bwa && cd index/bwa ~/reference/index/bwa/human_g1k_v37 ~/reference/genome/human_g1k_v37/human_g1k_v37.fasta 1>human_g1k_v37.bwa_index.log 2>&1 &

mkdir -p ~/biosoft/GATK/resources/bundle/b37

cd ~/biosoft/GATK/resources/bundle/b37

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.idx.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz

gunzip 1000G_phase1.indels.b37.vcf.idx.gz

gunzip 1000G_phase1.indels.b37.vcf.gz

gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.gz

gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz

mkdir -p ~/annotation/variation/human/dbSNP

cd ~/annotation/variation/human/dbSNP

## https://www.ncbi.nlm.nih.gov/projects/SNP/

## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/

## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/

nohup wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz &

wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz.tbi

只有当软件安装完毕，还有参考基因组等必备文件准备齐全了，才能正式进入全基因组重测序分析流程！

全基因组重测序数据介绍

上面是我的全基因组数据fastq文件的截图，测序分成了5条lane，每条lane的数据量不一致。

数据分析

fastq2bam

首先给出代码：

module load java/1.8.0_91

GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37

GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar

PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar

DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

INDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz

TMPDIR=/home/jianmingzeng/tmp/software

## samtools and bwa are in the environment

## samtools Version: 1.3.1 (using htslib 1.3.1)

## bwa Version: 0.7.15-r1140

: '

'

## please keep the confige in three columns format, which are fq1 fq2 sampe

cat $1 |while read id

do

arr=($id)

fq1=${arr[0]}

fq2=${arr[1]}

sample=${arr[2]}

#####################################################

################ Step 1 : Alignment #################

#####################################################

echo bwa `date`

bwa mem -t 5 -R "@RG ID:$sample SM:$sample LB:WGS PL:Illumina" $INDEX $fq1 $fq2 > $sample.sam

echo bwa `date`

#####################################################

################ Step 2: Sort and Index #############

#####################################################

echo SortSam `date`

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD SortSam SORT_ORDER=coordinate INPUT=$sample.sam OUTPUT=$sample.bam

samtools index $sample.bam

echo SortSam `date`

#####################################################

################ Step 3: Basic Statistics ###########

#####################################################

echo stats `date`

samtools flagstat $sample.bam > ${sample}.alignment.flagstat

samtools stats $sample.bam > ${sample}.alignment.stat

echo plot-bamstats -p ${sample}_QC ${sample}.alignment.stat

echo stats `date`

#####################################################

####### Step 4: multiple filtering for bam files ####

#####################################################

###MarkDuplicates###

echo MarkDuplicates `date`

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD MarkDuplicates

INPUT=$sample.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metrics

echo MarkDuplicates `date`

###FixMateInfo###

echo FixMateInfo `date`

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD FixMateInformation

INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinate

samtools index ${sample}_marked_fixed.bam

echo FixMateInfo `date`

echo ${sample}_marked_fixed.bam >>files.bamlist

rm $sample.sam $sample.bam ${sample}_marked.bam

done

samtools merge -@ 5 -b files.bamlist merged.bam

samtools index merged.bam

上面的代码有一点长，希望大家能用心的来理解，其实就是一个批量处理，对5条lane的测序数据循环处理，其实正式流程里面我一般是并行的，而不是循环，这里是为了给大家秀一下时间消耗情况，让大家对全基因组重测序分析有一个感性的认知。

时间消耗如下：

对L1来说，时间消耗如下：

[main] Real time: 15870.794 sec; CPU: 77463.156 sec

picard.sam.SortSam done. Elapsed time: 45.60 minutes.

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 64.20 minutes.

picard.sam.FixMateInformation done. Elapsed time: 58.05 minutes.

总共耗时约7.2小时，仅仅是对10G的fastq完成比对压缩排序去PCR重复。

如果是其它文件大小的fastq输入数据，那么这个流程耗时如下：

[main] Real time: 9527.240 sec; CPU: 47758.233 sec

[main] Real time: 16000.325 sec; CPU: 80595.629 sec

[main] Real time: 29286.523 sec; CPU: 147524.841 sec

[main] Real time: 28104.568 sec; CPU: 141519.377 sec

picard.sam.SortSam done. Elapsed time: 29.02 minutes.

picard.sam.SortSam done. Elapsed time: 61.26 minutes.

picard.sam.SortSam done. Elapsed time: 98.39 minutes.

picard.sam.SortSam done. Elapsed time: 117.16 minutes.

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 35.52 minutes.

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 54.41 minutes.

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 90.40 minutes.

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 93.03 minutes.

picard.sam.FixMateInformation done. Elapsed time: 35.92 minutes.

picard.sam.FixMateInformation done. Elapsed time: 66.31 minutes.

picard.sam.FixMateInformation done. Elapsed time: 131.65 minutes.

picard.sam.FixMateInformation done. Elapsed time: 122.31 minutes.

前面我们说过，这5条lane的数据其实是可以并行完成这几个步骤的，最长耗时约12小时。每个数据处理我都分配了 5个线程， 40G的内存。

GATK重新处理bam文件

主要是针对上一个步骤合并了5个lane之后的 merge.bam文件

-rw-rw-r-- 1 jianmingzeng jianmingzeng 57G Jun 7 11:32 merged.bam

-rw-rw-r-- 1 jianmingzeng jianmingzeng 8.4M Jun 7 12:05 merged.bam.bai

代码是：

module load java/1.8.0_91

GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37

GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar

PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar

DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

TMPDIR=/home/jianmingzeng/tmp/software

## samtools and bwa are in the environment

## samtools Version: 1.3.1 (using htslib 1.3.1)

## bwa Version: 0.7.15-r1140

sample='merge'

###RealignerTargetCreator###

echo RealignerTargetCreator `date`

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T RealignerTargetCreator

-I ${sample}.bam -R $GENOME -o ${sample}_target.intervals

-known $Mills_indels -known $KG_indels -nt 5

echo RealignerTargetCreator `date`

###IndelRealigner###

echo IndelRealigner `date`

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T IndelRealigner

-I ${sample}.bam -R $GENOME -targetIntervals ${sample}_target.intervals

-o ${sample}_realigned.bam -known $Mills_indels -known $KG_indels

echo IndelRealigner `date`

###BaseRecalibrator###

echo BaseRecalibrator `date`

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T BaseRecalibrator

-I ${sample}_realigned.bam -R $GENOME -o ${sample}_temp.table -knownSites $DBSNP

echo BaseRecalibrator `date`

###PrintReads###

echo PrintReads `date`

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T PrintReads

-R $GENOME -I ${sample}_realigned.bam -o ${sample}_recal.bam -BQSR ${sample}_temp.table

samtools index ${sample}_recal.bam

echo PrintReads `date`

###delete_intermediate_files###

对L1样本来说，时间消耗如下：

INFO 15:50:24,097 ProgressMeter - Total runtime 1165.34 secs, 19.42 min, 0.32 hours

INFO 17:21:00,917 ProgressMeter - Total runtime 4265.44 secs, 71.09 min, 1.18 hours

INFO 19:58:23,969 ProgressMeter - Total runtime 9436.69 secs, 157.28 min, 2.62 hours

INFO 23:41:00,540 ProgressMeter - Total runtime 13349.77 secs, 222.50 min, 3.71 hours

可以看到最耗费时间的步骤是最后一个 PrintReads

如果是对5条lane合并的merged.bam来说，消耗时间如下：

INFO 17:54:59,396 ProgressMeter - Total runtime 5194.10 secs, 86.57 min, 1.44 hours

INFO 02:04:10,907 ProgressMeter - Total runtime 22558.84 secs, 375.98 min, 6.27 hours

···························

···························

可以看到时间消耗与输入的bam文件大小有关,merged.bam是L1.bam的6倍大小，当然，时间上并没有成正比。总之，对这个全基因组数据来说，时间消耗太夸张了，以至于我写完这篇文章这GATK的4个bam操作还没跑完。对L1需要约8小时，那么对merge.bam应该是需要40个小时。

variant calling by gatk hc

代码是：

module load java/1.8.0_91

GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37

GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar

PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar

DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

TMPDIR=/home/jianmingzeng/tmp/software

## samtools and bwa are in the environment

## samtools Version: 1.3.1 (using htslib 1.3.1)

## bwa Version: 0.7.15-r1140

fq1=P_jmzeng_DHG09057_AH33KVALXX_L1_1.clean.fq.gz

fq2=P_jmzeng_DHG09057_AH33KVALXX_L1_2.clean.fq.gz

sample='merge'

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T HaplotypeCaller

-R $GENOME -I ${sample}_recal.bam --dbsnp $DBSNP

-stand_emit_conf 10 -o ${sample}_recal_raw.snps.indels.vcf

java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T HaplotypeCaller

-R $GENOME -I ${sample}_realigned.bam --dbsnp $DBSNP

-stand_emit_conf 10 -o ${sample}_realigned_raw.snps.indels.vcf

时间消耗如下：

INFO 20:40:49,063 ProgressMeter - Total runtime 39243.88 secs, 654.06 min, 10.90 hours

INFO 08:53:17,633 ProgressMeter - Total runtime 43939.69 secs, 732.33 min, 12.21 hours

可以看到对 recal.bam的处理比 recal.bam时间上要少2个小时，但是时间均消耗很长。

全部流程走完输出的文件如下(仅显示L1的流程数据):

流程探究

如果只给代码，那么这个教程意义不大，如果给出了input和output，还给出了时间消耗情况，那么这个教程可以说是中上水平了，读者只需要拿到数据就可以自己重复出来，既能估算硬件配置又能对大致的时间消耗有所了解。

但，这仍然不够，对我来说，我还可以介绍为什么要走每一个流程，以及每一个流程到底做了什么。可以这么说，你看完下面的流程探究，基本上就相当于你自己做过了一个全基因组重测序分析实战

我这里就对 L1样本进行解密：

首先的BWA

这个没什么好说的，基因组数据的比对首选，耗时取决于fastq文件的reads数目。

CMD: bwa mem -t 5 -R @RG ID:L1 SM:L1 LB:WGS PL:Illumina /home/jianmingzeng/reference/index/bwa/human_g1k_v37 P_jmzeng_DHG09057_AH33KVALXX_L1_1.clean.fq.gz P_jmzeng_DHG09057_AH33KVALXX_L1_2.clean.fq.gz

[main] Real time: 15870.794 sec; CPU: 77463.156 sec

接下来是PICARD

共3个步骤用到了这个软件，消耗时间及内存分别如下：

picard.sam.SortSam done. Elapsed time: 44.15 minutes. Runtime.totalMemory()=13184794624

picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 53.71 minutes. Runtime.totalMemory()=39832256512

picard.sam.FixMateInformation done. Elapsed time: 53.79 minutes. Runtime.totalMemory()=9425649664

比对得到的都是sam格式数据，文件占硬盘空间太大，一般需要压缩成二进制的bam格式文件，用的是 SortSam 至于 FixMateInformation步骤仅仅是对bam文件增加了MC和MQ这两个tags

add MC (CIGAR string for mate) and MQ (mapping quality of the mate/next segment) tags

而 markduplicates 步骤就比较复杂了，因为没有选择 REMOVE_DUPLICATES=True 所以并不会去除reads，只是标记一下而已，就是把sam文件的第二列改一下。

Read 119776742 records.

INFO 2017-06-05 10:57:22 MarkDuplicates Marking 14482525 records as duplicates.

INFO 2017-06-05 10:57:22 MarkDuplicates Found 943146 optical duplicate clusters.

下面列出了部分被改变的flag值，可以去下面的网页去查看每个flag的含义。

# https://broadinstitute.github.io/picard/explain-flags.html

# diff -y -W 50 |grep '|'

163 | 1187

83 | 1107

99 | 1123

163 | 1187

147 | 1171

83 | 1107

99 | 1123

99 | 1123

147 | 1171

147 | 1171

99 | 1123

147 | 1171

163 | 1187

83 | 1107

最后是GATK

SplitNCigarReads 这个步骤对基因组数据来说可以略去，主要是针对于转录组数据的

命令是：

Program Args: -T SplitNCigarReads -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

-I L1_marked_fixed.bam -o L1_marked_fixed_split.bam

-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

程序运行的log日志是：

INFO 13:04:52,813 ProgressMeter - Total runtime 2398.74 secs, 39.98 min, 0.67 hours

INFO 13:04:52,854 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)

INFO 13:04:52,854 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 13:04:52,854 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter

INFO 13:04:52,855 MicroScheduler - -> 0 reads (0.00% of total) failing ReassignOneMappingQualityFilter

可以看到，对全基因组测序数据来说，这个步骤毫无效果，而且还耗时40分钟，应该略去。

然后是indel区域的重排，需要结合 RealignerTargetCreator 和 IndelRealigner

命令是：

Program Args: -T RealignerTargetCreator -I L1_marked_fixed_split.bam

-R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -o L1_target.intervals

-known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

-known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf -nt 5

程序运行的log日志是：

INFO 15:50:24,097 ProgressMeter - Total runtime 1165.34 secs, 19.42 min, 0.32 hours

INFO 15:50:24,097 MicroScheduler - 22094746 reads were filtered out during the traversal out of approximately 120826819 total reads (18.29%)

INFO 15:50:24,104 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 15:50:24,104 MicroScheduler - -> 1774279 reads (1.47% of total) failing BadMateFilter

INFO 15:50:24,104 MicroScheduler - -> 14006627 reads (11.59% of total) failing DuplicateReadFilter

INFO 15:50:24,104 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

INFO 15:50:24,104 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter

INFO 15:50:24,104 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

INFO 15:50:24,105 MicroScheduler - -> 6313840 reads (5.23% of total) failing MappingQualityZeroFilter

INFO 15:50:24,105 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

INFO 15:50:24,105 MicroScheduler - -> 0 reads (0.00% of total) failing Platform454Filter

INFO 15:50:24,105 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

命令是：

Program Args: -T IndelRealigner -I L1_marked_fixed_split.bam

-R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

-targetIntervals L1_target.intervals -o L1_realigned.bam

-known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

-known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

程序运行的log日志是：

INFO 17:21:00,917 ProgressMeter - Total runtime 4265.44 secs, 71.09 min, 1.18 hours

INFO 17:21:00,920 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)

INFO 17:21:00,920 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 17:21:00,920 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter

最后是碱基质量的矫正，需要结合 BaseRecalibrator 和 PrintReads

命令是：

Program Args: -T BaseRecalibrator -I L1_realigned.bam

-R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -o L1_temp.table

-knownSites /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

程序运行的log日志是：

INFO 19:58:23,969 ProgressMeter - Total runtime 9436.69 secs, 157.28 min, 2.62 hours

INFO 19:58:23,970 MicroScheduler - 21179430 reads were filtered out during the traversal out of approximately 120614036 total reads (17.56%)

INFO 19:58:23,970 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 19:58:23,970 MicroScheduler - -> 14073643 reads (11.67% of total) failing DuplicateReadFilter

INFO 19:58:23,970 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

INFO 19:58:23,970 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter

INFO 19:58:23,971 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

INFO 19:58:23,971 MicroScheduler - -> 7105787 reads (5.89% of total) failing MappingQualityZeroFilter

INFO 19:58:23,971 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

INFO 19:58:23,971 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

命令是：

Program Args: -T PrintReads -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_realigned.bam -o L1_recal.bam -BQSR L1_temp.table

程序运行的log日志是：

INFO 23:41:00,540 ProgressMeter - Total runtime 13349.77 secs, 222.50 min, 3.71 hours

INFO 23:41:00,542 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)

INFO 23:41:00,542 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 23:41:00,542 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter

可以看到这个步骤非常的耗时，而且bam文件的大小近乎翻倍了。

最后是GATK真正的功能，variant-calling

我这里不仅仅是对最后recal的bam进行variant-calling 步骤，同时也对realign的bam做了，所以下面显示两个时间消耗的记录，因为GATK的 BaseRecalibrator 步骤太耗费时间，而且极大的增加了bam文件的存储，所以有必要确认这个步骤是否有必要。

命令是：

Program Args: -T HaplotypeCaller -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_recal.bam --dbsnp /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz -stand_emit_conf 10 -o L1_recal_raw.snps.indels.vcf

程序运行的log日志是：

INFO 20:40:49,062 ProgressMeter - done 3.101804739E9 10.9 h 12.0 s 100.0% 10.9 h 0.0 s

INFO 20:40:49,063 ProgressMeter - Total runtime 39243.88 secs, 654.06 min, 10.90 hours

INFO 20:40:49,064 MicroScheduler - 22384946 reads were filtered out during the traversal out of approximately 119776742 total reads (18.69%)

INFO 20:40:49,064 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 20:40:49,064 MicroScheduler - -> 13732328 reads (11.46% of total) failing DuplicateReadFilter

INFO 20:40:49,065 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

INFO 20:40:49,065 MicroScheduler - -> 8652618 reads (7.22% of total) failing HCMappingQualityFilter

INFO 20:40:49,066 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter

INFO 20:40:49,066 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

INFO 20:40:49,066 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

INFO 20:40:49,067 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

命令是：

Program Args: -T HaplotypeCaller -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_realigned.bam --dbsnp /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz -stand_emit_conf 10 -o L1_realigned_raw.snps.indels.vcf

程序运行的log日志是：

INFO 08:53:17,633 ProgressMeter - done 3.101804739E9 12.2 h 14.0 s 100.0% 12.2 h 0.0 s

INFO 08:53:17,633 ProgressMeter - Total runtime 43939.69 secs, 732.33 min, 12.21 hours

INFO 08:53:17,634 MicroScheduler - 22384946 reads were filtered out during the traversal out of approximately 119776742 total reads (18.69%)

INFO 08:53:17,634 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter

INFO 08:53:17,635 MicroScheduler - -> 13732328 reads (11.46% of total) failing DuplicateReadFilter

INFO 08:53:17,635 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

INFO 08:53:17,635 MicroScheduler - -> 8652618 reads (7.22% of total) failing HCMappingQualityFilter

INFO 08:53:17,636 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter

INFO 08:53:17,636 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

INFO 08:53:17,636 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

INFO 08:53:17,637 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

如果想了解更多的全基因组重测序分析内容，欢迎点击阅读原文查看，也欢迎把此文转载给有需要的朋友。

对我的全基因组重测序数据来说，处理这个GATK最佳实践，耗时约12+40+15=67小时。

还有关于GATK调用多线程来加快处理步骤的事情，我就不多说了，你其实可以去GATK官网查看详细阅读说明。

猜你喜欢

工作资讯 | 学习课程 | 好书分享

菜鸟入门

Linux | Perl | R语言

数据分析

ChIP-seq（上）| ChIP-seq（下）| RNA-seq

WGS,WES,RNA-seq组与ChIP-seq之间的异同

编程实践

第0题 | 探索人类基因组序列

直播基因组分析

我的基因组 | 解惑帖

一个标准的基因检测报告目录
生信技能树
查看全文

相关阅读:
IOS获取senderTitle
IOSSegmentedControl（添加到Nav上）的简单使用方法
 MinaService
java nio(reactor/selector)
Java，反射机制
 http, get, post, HttpClient
java.util.ConcurrentModificationException
愛上一個不該愛的人
 how to be a good ladyprogrammer?
JQuery（那个夏日的午后）

原文地址：https://www.cnblogs.com/xiaojikuaipao/p/7103878.html

一个全基因组重测序分析实战

流程介绍

准备工作

首先是软件安装

其次是必备数据的下载：

全基因组重测序数据介绍

数据分析

fastq2bam

首先给出代码：

时间消耗如下：

GATK重新处理bam文件

代码是：

variant calling by gatk hc

代码是：

流程探究

首先的BWA

接下来是PICARD

最后是GATK