I am running HaplotypeCaller locally (non-Spark) in GVCF mode on separate interval chunks, then merging the per-chunk gVCFs with GatherVcfs, and I get very different call results across runs. I would expect the probabilities/confidence values to change slightly, but not the number of calls. Is this normal?
I'm using the gatk from docker://broadinstitute/gatk:4.beta.6 . My BAM/BAI files pass validation.
I have seen other posts about results being non-deterministic, but I'm not passing any of the -nt or -nct flags in this case.
I'm splitting all my contigs (BED file) into roughly equal-sized chunks and calling HaplotypeCaller on each chunk, like so. The VCF file produced changes a lot depending on whether I use 8 chunks or 128. I'm not sure whether using more chunks makes things worse.
# chunk 000
java -jar /gatk/gatk.jar HaplotypeCaller -R ANN0859.bam --emitRefConfidence GVCF -L bed_chunk_000.bed -O ANN0859.bam_000.g.vcf -hets 0.010000
# chunk 001
java -jar /gatk/gatk.jar HaplotypeCaller -R ANN0859.bam --emitRefConfidence GVCF -L bed_chunk_001.bed -O ANN0859.bam_001.g.vcf -hets 0.010000
...
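For context, the splitting step can be sketched roughly as below. This is only an illustration and not necessarily the script used here: the function name `split_bed` is hypothetical, and it assumes a greedy split by cumulative base count that never cuts an individual BED interval.

```python
def split_bed(intervals, n_chunks):
    """Greedily split sorted BED intervals into at most n_chunks groups of
    roughly equal total length, without ever cutting an interval.

    intervals: list of (contig, start, end) tuples, BED-style half-open.
    """
    total = sum(end - start for _, start, end in intervals)
    target = total / n_chunks  # ideal number of bases per chunk
    chunks, current, covered = [], [], 0
    for contig, start, end in intervals:
        current.append((contig, start, end))
        covered += end - start
        # Close the current chunk once the cumulative length reaches its
        # share, but always leave the final chunk open for the remainder.
        if covered >= target * (len(chunks) + 1) and len(chunks) < n_chunks - 1:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks
```

Each returned group would then be written to its own BED file and passed to HaplotypeCaller with -L, as in the commands above.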
I merge them like so (passing all the chunks in order):
java -jar /gatk/gatk.jar GatherVcfs -I ANN0859.bam_000.g.vcf -I ANN0859.bam_001.g.vcf ...
The entire BED file is sorted, and the chunks do not overlap. I've made sure that I'm not losing any contigs when I split my BED file.
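The kind of sanity check described above can be sketched as follows. The function name `check_chunks` is hypothetical, and the sketch assumes the chunking keeps each original interval whole (if intervals are cut during splitting, the first check would need to compare merged coordinates instead).

```python
def check_chunks(original, chunks):
    """Return a list of problems; an empty list means the chunks reproduce
    the original intervals in order, lose no contig, and do not overlap.

    original: list of (contig, start, end) tuples, BED-style half-open.
    chunks:   list of such lists, in the order they are passed to GatherVcfs.
    """
    problems = []
    flat = [iv for chunk in chunks for iv in chunk]
    if flat != list(original):
        problems.append("chunks do not reproduce the original intervals in order")
    missing = {c for c, _, _ in original} - {c for c, _, _ in flat}
    if missing:
        problems.append("missing contigs: " + ", ".join(sorted(missing)))
    # Half-open coordinates: consecutive intervals overlap only if the
    # second one starts before the first one ends.
    for (c1, s1, e1), (c2, s2, e2) in zip(flat, flat[1:]):
        if c1 == c2 and s2 < e1:
            problems.append(f"overlap on {c1}: {s1}-{e1} and {s2}-{e2}")
    return problems
```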
To provide an example difference for one of the chromosomes, I get the following calls (for 128 chunks) in the final output gVCF:
HanXRQChr00c0117 2497 . G <NON_REF> . . END=2580 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
HanXRQChr00c0117 10708 . G <NON_REF> . . END=25539 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
(EOF)
And if I divide the work into 8 (longer) chunks, that last section just explodes into 1960 different calls:
HanXRQChr00c0117 10708 . G <NON_REF> . . END=14265 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
HanXRQChr00c0117 14266 . C <NON_REF> . . END=14267 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,42
...
HanXRQChr00c0117 14309 . T C,<NON_REF> 0.13 . DP=2;MLEAC=0,0;MLEAF=nan,nan;RAW_MQ=7200 GT:PGT:PID ./.:0|1:14309_T_C
HanXRQChr00c0117 14310 . T <NON_REF> . . END=14315 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,45
HanXRQChr00c0117 14316 . T C,<NON_REF> 0.13 . DP=2;MLEAC=0,0;MLEAF=nan,nan;RAW_MQ=7200 GT:PGT:PID ./.:0|1:14309_T_C
HanXRQChr00c0117 14317 . T <NON_REF> . . END=14321 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,45
...
HanXRQChr00c0117 14358 . T <NON_REF> . . END=14359 GT:DP:GQ:MIN_DP:PL 0/0:4:12:4:0,12,180
HanXRQChr00c0117 14360 . A G,<NON_REF> 30.02 . DP=4;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;RAW_MQ=14400 GT:AD:DP:GQ:PL:SB 1/1:0,1,0:1:3:45,3,0,45,3,45:0,0,1,0
...
HanXRQChr00c0117 25479 . T <NON_REF> . . END=25484 GT:DP:GQ:MIN_DP:PL 0/0:8:24:8:0,24,296
HanXRQChr00c0117 25485 . T <NON_REF> . . END=25485 GT:DP:GQ:MIN_DP:PL 0/0:8:21:8:0,21,315
HanXRQChr00c0117 25486 . T <NON_REF> . . END=25521 GT:DP:GQ:MIN_DP:PL 0/0:6:18:6:0,18,217
HanXRQChr00c0117 25522 . A <NON_REF> . . END=25524 GT:DP:GQ:MIN_DP:PL 0/0:7:15:7:0,15,225
HanXRQChr00c0117 25525 . T <NON_REF> . . END=25539 GT:DP:GQ:MIN_DP:PL 0/0:5:9:3:0,9,133
(EOF)
I thought at first that maybe the chunk boundaries were at play, but those contigs are in the middle of a chunk file.
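To find every contig where the two merged gVCFs disagree in record count (rather than spotting them by eye), something like the following sketch could be used. The function names are hypothetical, and it only counts records per contig; it does not compare genotypes or likelihoods.

```python
from collections import Counter

def records_per_contig(vcf_lines):
    """Count non-header records per contig from an iterable of VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        # The first whitespace-separated field of a record is the contig.
        counts[line.split(None, 1)[0]] += 1
    return counts

def differing_contigs(counts_a, counts_b):
    """Map each contig whose record count differs to (count_a, count_b)."""
    return {
        contig: (counts_a.get(contig, 0), counts_b.get(contig, 0))
        for contig in sorted(set(counts_a) | set(counts_b))
        if counts_a.get(contig, 0) != counts_b.get(contig, 0)
    }
```

Opening the 8-chunk and 128-chunk merged gVCFs and passing each file handle to `records_per_contig`, then both results to `differing_contigs`, would list the contigs worth inspecting and show whether they cluster near chunk boundaries.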