I've been experiencing some apparent errors with HaplotypeCaller that I think could be related to how it chooses candidate haplotypes when performing multi-sample calling. Please see the example files I've uploaded to the server (cooketho_20130103.tar.gz). For instance, if you look at position 3511 in sample 2, there are 14 non-reference reads and 0 reference reads. When HaplotypeCaller is run with just this sample, it calls this locus homozygous non-reference, which seems to me to be the correct behavior. But when run with all 14 samples, it doesn't call a SNP at this locus. Repeating the run in debug mode shows that the (immediate) cause is that 11 candidate haplotypes were found, and not a single one of them had the non-reference allele at position 3511. Why?
I came across an earlier post that suggested that in some cases increasing the --minPruning value can be of use, but I tried this to no avail.
http://gatkforums.broadinstitute.org/discussion/1764/haplotypecaller-in-cohorts
My organism is a plant, and is considerably more heterozygous than human, but changing the --heterozygosity value did not appear to help either. Double-check me on this if you like.
Can you please suggest a fix, or perhaps release some documentation on how HaplotypeCaller selects candidate haplotypes?
P.S. Any idea when the source will be released to the public, or when a more comprehensive manual will be released? Either would be very helpful for figuring out what is going on in cases like this.
Thanks!
Tom