Hi!
I'd like to perform short germline variant calling on human DNA-seq samples (separate analysis of WES cohort and PCR-free WGS cohort, both paired end). The plan is to follow GATK best practices of short variant discovery with joint genotyping, starting with single-sample gVCF creation through HaplotypeCaller.
I have already analyzed some samples with HaplotypeCaller 4.beta.5, and I was unsure whether any fixes between 4.beta.5 and 4.0.0.0 would necessitate re-running them.
To check, I ran both the 4.beta.5 and the 4.0.0.0 HaplotypeCaller on chr21 of 1000 Genomes sample NA11992.
I ran the same command, on the same machine, in different conda environments:
4.beta.5
gatk4 4.0b5 py27_0 bioconda
picard 2.16.0 py27_0 bioconda
setuptools 38.2.4 py27_0 conda-forge
wheel 0.30.0 py_1 conda-forge
4.0.0.0
gatk4 4.0.0.0 py27_0 bioconda
picard 2.17.2 py27_0 bioconda
setuptools 38.4.0 py27_0 conda-forge
wheel 0.30.0 py27_2 conda-forge
Java
$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15)
OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)
GATK command
gatk-launch HaplotypeCaller -R $refs/GRCh37.71.nochr.fa -I $data/NA11992.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam -O ${version}.NA11992.21.default.vcf.gz -ERC GVCF -L 21
Results showed 9 differences using diff (10 including the different ##GATKCommandLine line in the gVCF header). Sometimes PL or SB fields change, sometimes non-variant blocks are subdivided differently, and sometimes indel calls change. Three differences are below:
$ diff 4.0.0.0.NA11992.21.default.vcf 4.beta.5.NA11992.21.default.vcf
...
36482c36480
< 21 19701769 . AT A,<NON_REF> 33.73 . BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:2,3,0:5:66:0|1:19701769_AT_A:71,0,66,77,75,152:0,2,1,2
---
> 21 19701769 . AT A,<NON_REF> 33.73 . BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PL:SB 0/1:2,3,0:5:66:71,0,66,77,76,152:0,2,1,2
36485,36486c36483,36485
< 21 19701776 . T <NON_REF> . . END=19701778 GT:DP:GQ:MIN_DP:PL 0/0:6:12:6:0,12,180
< 21 19701779 . TG T,<NON_REF> 19.78 . BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:4,2,0:6:57:1|0:19701769_AT_A:57,0,146,69,152,221:1,3,0,2
---
> 21 19701776 . T <NON_REF> . . END=19701777 GT:DP:GQ:MIN_DP:PL 0/0:6:12:6:0,12,180
> 21 19701778 . TTG T,<NON_REF> 0 . DP=6;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=19369.00 GT:AD:DP:GQ:PL:SB 0/0:6,0,0:6:18:0,18,203,18,203,203:1,5,0,0
> 21 19701779 . TG T,<NON_REF> 3.96 . BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PL:SB 0/1:4,2,0:6:39:39,0,146,51,152,204:1,3,0,2
54434c54433,54435
< 21 26959938 . A <NON_REF> . . END=26959949 GT:DP:GQ:MIN_DP:PL 0/0:9:24:8:0,24,260
---
> 21 26959938 . A <NON_REF> . . END=26959943 GT:DP:GQ:MIN_DP:PL 0/0:8:24:8:0,24,260
> 21 26959944 . C <NON_REF> . . END=26959944 GT:DP:GQ:MIN_DP:PL 0/0:9:27:9:0,27,275
> 21 26959945 . A <NON_REF> . . END=26959949 GT:DP:GQ:MIN_DP:PL 0/0:9:24:9:0,24,360
...
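Since a plain diff mixes block-boundary changes with changes to actual calls, one way to narrow it down might be to compare only records that carry a real ALT allele. The following is a minimal sketch, not a validated tool: the function names `variant_sites` and `compare_sites` are made up for illustration, it assumes single-sample, uncompressed gVCF text, and it deliberately compares only REF, ALT, QUAL and GT. A site such as 19701769, where only PL and the phasing fields differ, would therefore not be reported.

```python
import sys
from typing import Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, int]             # (CHROM, POS)
Call = Tuple[str, str, str, str]  # (REF, ALT, QUAL, GT)


def variant_sites(lines: Iterable[str]) -> Dict[Key, Call]:
    """Collect records that have a real ALT allele from single-sample gVCF lines."""
    sites: Dict[Key, Call] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and all header lines, including ##GATKCommandLine
        # Columns are tab-separated in a real gVCF; splitting on any whitespace
        # also handles records pasted with spaces (assumes no spaces inside INFO).
        fields = line.split()
        if len(fields) < 10:
            continue  # not a complete single-sample record
        chrom, pos, _id, ref, alt, qual = fields[:6]
        if alt == "<NON_REF>":
            continue  # pure reference block, so block subdivision is ignored
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        sites[(chrom, int(pos))] = (ref, alt, qual, sample.get("GT", "."))
    return sites


def compare_sites(
    a_lines: Iterable[str], b_lines: Iterable[str]
) -> List[Tuple[Key, Optional[Call], Optional[Call]]]:
    """Return sites whose (REF, ALT, QUAL, GT) differ; None means the site is absent."""
    a, b = variant_sites(a_lines), variant_sites(b_lines)
    return [
        (key, a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    ]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_gvcf.py first.vcf second.vcf
    with open(sys.argv[1]) as first, open(sys.argv[2]) as second:
        for (chrom, pos), call_a, call_b in compare_sites(first, second):
            print(chrom, pos, call_a, call_b, sep="\t")
```

Run on the two chr21 files, this would be expected to list the indel sites around 19701778 and 19701779 while skipping the re-split reference block starting at 26959938. Extending the tuple with PL or SB would make it stricter, at the cost of reporting more of the small likelihood shifts.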
Overall, can I use the 4.beta.5 HaplotypeCaller results in downstream analysis, or should the samples be re-analyzed with the 4.0.0.0 HaplotypeCaller? This comparison showed relatively few differences, but I'm still unsure whether the 4.beta.5 output can be relied on.