I would appreciate some help understanding the differences between haploid and diploid mode for calling and joint genotyping (HaplotypeCaller + GenotypeGVCFs) in GATK, in particular the differences in the reported read depth (DP). Here is my example case.
HaplotypeCaller using haploid mode.
(sample1_chrY.bam)
java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 1 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
--output ~/data/haploid_calls/sample1_chrY.g.vcf.gz
Y 2650345 . A <NON_REF> . . END=2650345 GT:DP:GQ:MIN_DP:PL 0:13:99:13:0,429
HaplotypeCaller using diploid mode.
(sample1_chrY.bam)
java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 2 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
--output ~/data/diploid_calls/sample1_chrY.g.vcf.gz
Y 2650345 . A <NON_REF> . . END=2650345 GT:DP:GQ:MIN_DP:PL 0/0:13:33:13:0,33,495
The only difference between the two HaplotypeCaller calls above is the parameter --sample_ploidy. In both cases (ploidy 1 and ploidy 2), the reference call is supported by 13 reads (DP field). Consistent with this, looking at this position in the BAM file with IGV (see image below) confirms that 14 reads cover the position and that only one base, in one of the reads, is of low quality (QV 2), so a DP of 13 makes sense.
![]()
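To make the comparison of the two gVCF records concrete, here is a minimal Python sketch (my own illustration, not part of GATK). It parses the two records shown above and checks that DP is identical while the number of PL entries follows the ploidy. The helper names `parse_sample`, `ploidy_of`, and `expected_pl_count` are hypothetical, and the records are assumed to be whitespace-separated with a single sample column.

```python
from math import comb

# gVCF records copied from the two HaplotypeCaller runs above.
HAPLOID = "Y 2650345 . A <NON_REF> . . END=2650345 GT:DP:GQ:MIN_DP:PL 0:13:99:13:0,429"
DIPLOID = "Y 2650345 . A <NON_REF> . . END=2650345 GT:DP:GQ:MIN_DP:PL 0/0:13:33:13:0,33,495"


def parse_sample(record, sample_index=0):
    """Return the FORMAT fields of one sample as a dict of strings."""
    cols = record.split()
    keys = cols[8].split(":")                      # FORMAT column
    values = cols[9 + sample_index].split(":")     # sample column
    return dict(zip(keys, values))


def ploidy_of(gt):
    """Infer ploidy from a GT string such as '0' or '0/0'."""
    return len(gt.replace("|", "/").split("/"))


def expected_pl_count(n_alleles, ploidy):
    """Number of unordered genotypes for n_alleles at the given ploidy."""
    return comb(n_alleles + ploidy - 1, ploidy)


for label, record in (("haploid", HAPLOID), ("diploid", DIPLOID)):
    cols = record.split()
    n_alleles = 1 + len(cols[4].split(","))  # REF plus each ALT (<NON_REF> counts)
    fields = parse_sample(record)
    ploidy = ploidy_of(fields["GT"])
    n_pl = len(fields["PL"].split(","))
    print(label, "ploidy", ploidy, "DP", fields["DP"],
          "PL entries", n_pl, "expected", expected_pl_count(n_alleles, ploidy))
```

For both records the sketch reports a DP of 13; only GT, GQ, and the length of PL (2 entries for ploidy 1, 3 for ploidy 2) differ, which is what I would expect at the gVCF stage.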
What's more, the number of reads spanning this position even increases (up to a DP of 24, plus 2 artificial haplotypes) when looking at the same sample and position using the locally realigned reads that GATK can write to a BAM file. See below.
HaplotypeCaller using haploid mode.
(sample1_chrY.bam)
java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 1 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
-forceActive -disableOptimizations --bamOutput ~/data/sample1_RE-AL_HAP_chrY.bam
![]()
HaplotypeCaller using diploid mode.
(sample1_chrY.bam)
java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 2 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
-forceActive -disableOptimizations --bamOutput ~/data/sample1_RE-AL_DIP_chrY.bam
![]()
However, when I do multi-sample joint genotyping with GenotypeGVCFs, the DP values and the number of supporting reads differ significantly between g.vcf files produced in haploid mode and those produced in diploid mode. With the haploid g.vcf files, DP values are substantially reduced, in particular for reference calls, it seems. To keep things simple, I included only one additional sample in the example below.
GenotypeGVCFs
(Using g.vcf files produced with HaplotypeCaller in haploid mode)
java -jar ~/software/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs --intervals Y:2650345-2650345 \
--standard_min_confidence_threshold_for_calling 10 --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta --max_alternate_alleles 3 \
--variant ~/data/haploid_calls/sample1_chrY.g.vcf.gz --variant ~/data/haploid_calls/sample2_chrY.g.vcf.gz \
--out ~/data/raw_vcfs/raw_haploid_calls.vcf.gz
Y 2650345 . A G 497.76 . AC=1;AF=0.500;AN=2;DP=23;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=31.11;SOR=0.941 GT:AD:DP:GQ:PL 0:6,0:6:99:0,109 1:0,16:16:99:529,0
GenotypeGVCFs
(Using g.vcf files produced with HaplotypeCaller in diploid mode)
java -jar ~/software/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs --intervals Y:2650345-2650345 \
--standard_min_confidence_threshold_for_calling 10 --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta --max_alternate_alleles 3 \
--variant ~/data/diploid_calls/sample1_chrY.g.vcf.gz --variant ~/data/diploid_calls/sample2_chrY.g.vcf.gz \
--out ~/data/raw_vcfs/raw_diploid_calls.vcf.gz
Y 2650345 . A G 494.42 . AC=2;AF=0.500;AN=4;DP=30;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=30.90;SOR=0.941 GT:AD:DP:GQ:PL 0/0:13,0:13:33:0,33,495 1/1:0,16:16:48:529,48,0
As can be seen, with the g.vcf files produced in haploid mode, the final DP for sample1 drops to 6 reads, whereas it was 13 in the g.vcf. With the g.vcf files produced in diploid mode, however, the DP of 13 is preserved.
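The same comparison can be made programmatically. Below is a minimal Python sketch (again my own illustration, not a GATK tool) that parses the two joint-genotyped records above and reports the per-sample DP and AD side by side. The helper `parse_record` is hypothetical, and I am assuming the sample columns appear in the order sample1, sample2, since the records are shown without their header line.

```python
# Joint-genotyped records copied from the two GenotypeGVCFs runs above.
HAPLOID_JOINT = ("Y 2650345 . A G 497.76 . "
                 "AC=1;AF=0.500;AN=2;DP=23;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=31.11;SOR=0.941 "
                 "GT:AD:DP:GQ:PL 0:6,0:6:99:0,109 1:0,16:16:99:529,0")
DIPLOID_JOINT = ("Y 2650345 . A G 494.42 . "
                 "AC=2;AF=0.500;AN=4;DP=30;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=30.90;SOR=0.941 "
                 "GT:AD:DP:GQ:PL 0/0:13,0:13:33:0,33,495 1/1:0,16:16:48:529,48,0")

# Assumption: column order is sample1 then sample2 (no header line is shown).
SAMPLE_NAMES = ["sample1", "sample2"]


def parse_record(record):
    """Split a VCF record into an INFO dict and a list of per-sample FORMAT dicts."""
    cols = record.split()
    info = dict(item.partition("=")[::2] for item in cols[7].split(";"))
    keys = cols[8].split(":")
    samples = [dict(zip(keys, col.split(":"))) for col in cols[9:]]
    return info, samples


hap_info, hap_samples = parse_record(HAPLOID_JOINT)
dip_info, dip_samples = parse_record(DIPLOID_JOINT)

# Per-sample DP difference (diploid minus haploid).
dp_drop = {}
for name, hap, dip in zip(SAMPLE_NAMES, hap_samples, dip_samples):
    dp_drop[name] = int(dip["DP"]) - int(hap["DP"])
    print(name, "| haploid DP", hap["DP"], "AD", hap["AD"],
          "| diploid DP", dip["DP"], "AD", dip["AD"],
          "| difference", dp_drop[name])

print("INFO DP haploid", hap_info["DP"], "diploid", dip_info["DP"])
print("AN haploid", hap_info["AN"], "diploid", dip_info["AN"])
```

On these two records the difference is 7 reads for sample1 and 0 for sample2, and the site-level INFO DP differs by the same 7 reads (23 versus 30), so the whole discrepancy comes from the REF call.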
So, why?
I would be very grateful for any help understanding this. Additional information below:
.- I'm using GATK v3.7-0-gcfedb67, Java 1.8.0_40-b26
.- For sample2, the final DP is the same whether haploid or diploid g.vcf files are used. That is an ALT call; the discrepancy does occur in REF calls, as in the sample1 example.
.- This is an example with one SNP, but the issue is widespread across the call set, at least when it comes to REF calls.
.- I've checked in IGV: all 13 reads/base positions that I think should be reported in haploid mode (only 6 are reported) pass -mmq 20 and -mbq 20
.- There is no significant strand bias
.- Data is Illumina, PCR-free, 150 bp paired-end reads, aligned with bwa-mem, with duplicates marked by Picard.
Best,
Rodrigo