Hi GATK team,
I was hoping I could get some insight on determining rate of heterozygosity from a gvcf file. We have three diploid lizard samples. Each was run through our GATK pipeline using HC in GVCF mode with -ERC GVCF
followed by joint genotyping using all three samples.
I want to determine the rate of heterozygosity in each lizard by counting the number of heterozygous sites and dividing by the number of callable sites (i.e., genotype not './.'), counted over every position in the genome (whether or not it is a variant).
However, after reading a response to a question on the forum, http://gatkforums.broadinstitute.org/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf,
"Short answer is that you shouldn't be looking at the genotype calls emitted by HC in GVCF mode. Longer answer, the gVCF is meant to be only an intermediate and the genotype calls are not final"
I am not sure this is the correct way to count heterozygous sites.
From the HC gvcf file, could I extract the number of callable sites from the GVCFBlocks and variant entries in the file to get the total number of callable sites (excluding './.' entries)? Then count the number of heterozygous genotypes from the joint genotyping gvcf output?
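To make the calculation concrete, here is a minimal sketch of the count I have in mind, run on the joint genotyping output. It assumes a standard tab-delimited VCF whose FORMAT column contains GT, treats any genotype other than a no-call as callable, and applies no depth or quality filter. The function name count_het_and_callable is made up for illustration and is not part of GATK.

```python
import re
from collections import defaultdict


def count_het_and_callable(vcf_lines):
    """Return {sample: (het_sites, callable_sites)} from VCF text lines.

    Assumption: a site is "callable" for a sample when its GT has no
    missing allele ('.'); no DP/GQ filtering is applied here.
    """
    samples = []
    het = defaultdict(int)
    callable_sites = defaultdict(int)

    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue

        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_index = fmt.index("GT")

        for sample, entry in zip(samples, fields[9:]):
            parts = entry.split(":")
            gt = parts[gt_index] if gt_index < len(parts) else "."
            alleles = re.split(r"[/|]", gt)  # unphased '/' or phased '|'
            if "." in alleles:
                continue  # no-call such as './.'
            callable_sites[sample] += 1
            if len(set(alleles)) > 1:
                het[sample] += 1  # two different alleles, e.g. 0/1 or 1/2

    return {s: (het[s], callable_sites[s]) for s in samples}
```

The rate for each sample would then be het_sites divided by callable_sites, for example by passing an open file handle for the joint genotyping output to the function.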
HC command:
java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-ERC GVCF \
-R reference.fa \
-I sample001.bam \
-stand_call_conf 30 \
-stand_emit_conf 30 \
-mbq 17 \
-o sample001.rawVAR.vcf
JointGenotyping command:
java -Xmx3g -jar /home/dut/bin/GATK/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
--includeNonVariantSites \
-ploidy 2 \
-R reference.fa \
--variant sample8450.rawVAR.vcf \
--variant sample003.rawVAR.vcf \
--variant sample001.rawVAR.vcf \
-o jg_sample001_sample003_sample8450.vcf
We are using GATK version 3.3.0.
Any suggestions are appreciated! Thank you for your time.
Best,
Morgan