Hello,
I would like to replicate the GATK behaviour described in Mallick et al. 2016 for the Simons Genome Diversity Project data set. In the supplementary information they explain the following:
"GATK UnifiedGenotyper has a built-in prior for Bayesian SNP calling that assumes that the site is more likely to be homozygous for the reference allele than homozygous for the variant allele. For a diploid sample, the default priors for a homozygous reference, heterozygote and homozygous non-reference genotypes are (0.9985, 0.001, 0.0005), respectively. When there is ambiguity in a heterozygote, GATK prefers the reference homozygote. This is a reference bias, and while this bias is not typically problematic for medical studies, it can complicate interpretation of population genetics signals. With the Genome Sequencing and Analysis Group at the Broad Institute, we developed an alternative model that was integrated into the UnifiedGenotyper, allowing reference-bias free priors to be specified. We are using a prior (0.4995, 0.001, 0.4995). Details are at: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php#--input_prior."
I think the following two commands might do it (the first for GATK 3.x, the second for 4.0.x):
java -jar ~/software/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller --emitRefConfidence GVCF --reference_sequence ~/hs37d5.fasta --input_file ~/file.bam --input_prior 0.001 --input_prior 0.4995
java -jar ~/software/gatk-package-4.0.3.0-local.jar -T HaplotypeCaller --emitRefConfidence GVCF --reference_sequence ~/hs37d5.fasta --input_file ~/file.bam --input_prior 0.001 --input_prior 0.4995
Does this make sense?
These examples assume that the two --input_prior values are assigned positionally, the first to AC=1 (genotype 0/1) and the second to AC=2 (genotype 1/1), and that, as stated in the documentation about priors, the prior for AC=0 becomes 1 minus the sum of the other two. Effectively:
prior(0/0)=0.4995, prior(0/1)=0.001, prior(1/1)=0.4995
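To make the arithmetic explicit, here is a minimal Python sketch of the assumption described above. It is not GATK code: the function name `genotype_priors` is hypothetical, and the positional rule (the i-th value is the prior for AC=i, with AC=0 receiving the remainder) is my reading of the documentation, not something I have checked against the GATK source.

```python
def genotype_priors(input_priors):
    """Expand the values given via --input_prior into priors for AC=0, AC=1, ...

    Assumption (from this post, not verified against the GATK source):
    the i-th value is the prior for AC=i, and AC=0 gets 1 minus their sum.
    """
    values = [float(p) for p in input_priors]
    if any(p < 0.0 or p > 1.0 for p in values):
        raise ValueError("each prior must lie in [0, 1]")
    remainder = 1.0 - sum(values)
    if remainder < -1e-12:  # small tolerance for floating-point rounding
        raise ValueError("priors sum to more than 1")
    return [max(remainder, 0.0)] + values


# For one diploid sample, AC=0, 1, 2 correspond to genotypes 0/0, 0/1, 1/1.
GENOTYPES = ["0/0", "0/1", "1/1"]

# Default values quoted in the Mallick et al. supplementary text.
default_priors = genotype_priors([0.001, 0.0005])
# Values passed in the two commands above.
flat_priors = genotype_priors([0.001, 0.4995])

for name, priors in [("default", default_priors),
                     ("reference-bias free", flat_priors)]:
    summary = ", ".join(
        f"prior({gt})={p:.4f}" for gt, p in zip(GENOTYPES, priors)
    )
    print(f"{name}: {summary}")
```

If the positional assumption holds, the first call should reproduce the default (0.9985, 0.001, 0.0005) and the second the (0.4995, 0.001, 0.4995) used in the paper.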
To understand all of this I am building on these previous posts from @tommycarstensen, @magicDGS and @saeschba. Thanks to them too. If you have any extra information or feedback, please let me know.
https://gatkforums.broadinstitute.org/gatk/discussion/8787/input-prior-default-value
https://gatkforums.broadinstitute.org/gatk/discussion/5877/caller-input-prior-option
https://gatkforums.broadinstitute.org/gatk/discussion/9489/should-it-say-ac-0-in-the-input-prior-documentation-for-the-haplotypecaller
This last topic also makes me wonder whether AC would be better understood here in terms of GT. I am mostly familiar with the VCF format, where AC stands for allele count, which is a property of a site across many samples. Here in HaplotypeCaller we go over one sample at a time, not many. Is this perhaps inherited from UnifiedGenotyper?
Best regards and many thanks for your comments,
Rodrigo