Channel: haplotypecaller — GATK-Forum

free of reference bias priors in HaplotypeCaller


Hello,

I would like to replicate the GATK behaviour described in Mallick et al. 2016 for the Simons Genome Diversity Project data set. The supplementary information explains the following:

"GATK UnifiedGenotyper has a built-in prior for Bayesian SNP calling that assumes that the site is more likely to be homozygous for the reference allele than homozygous for the variant allele. For a diploid sample, the default priors for a homozygous reference, heterozygote and homozygous non-reference genotypes are (0.9985, 0.001, 0.0005), respectively. When there is ambiguity in a heterozygote, GATK prefers the reference homozygote. This is a reference bias, and while this bias is not typically problematic for medical studies, it can complicate interpretation of population genetics signals. With the Genome Sequencing and Analysis Group at the Broad Institute, we developed an alternative model that was integrated into the UnifiedGenotyper, allowing reference-bias free priors to be specified. We are using a prior (0.4995, 0.001, 0.4995). Details are at: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_genotyper_UnifiedGenotyper.php#--input_prior."
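
To make the effect of the two priors concrete, here is a minimal Python sketch of the underlying Bayes step (posterior proportional to prior times likelihood). The two prior triples come from the quote above; the genotype likelihoods are made-up toy values that support 0/0 and 1/1 equally, and the `posterior` helper is hypothetical, not GATK code.

```python
def posterior(priors, likelihoods):
    """Normalise prior * likelihood over the three diploid genotypes."""
    unnormalised = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Priors for (0/0, 0/1, 1/1) as quoted from Mallick et al. 2016.
default_prior = (0.9985, 0.001, 0.0005)
symmetric_prior = (0.4995, 0.001, 0.4995)

# Toy likelihoods (an assumption for illustration only): the reads
# support hom-ref and hom-var equally well.
toy_likelihoods = (0.45, 0.10, 0.45)

post_default = posterior(default_prior, toy_likelihoods)
post_symmetric = posterior(symmetric_prior, toy_likelihoods)

for name, post in (("default", post_default), ("symmetric", post_symmetric)):
    print(name, [round(p, 4) for p in post])
```

With the default prior almost all of the posterior mass ends up on 0/0, whereas with the symmetric prior 0/0 and 1/1 come out equal, which is how I understand "reference-bias free".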

I think the following two commands (using either GATK 3.x or 4.0.x) might do the trick:

java -jar ~/software/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller --emitRefConfidence GVCF --reference_sequence ~/hs37d5.fasta --input_file ~/file.bam --input_prior 0.001 --input_prior 0.4995

java -jar ~/software/gatk-package-4.0.3.0-local.jar -T HaplotypeCaller --emitRefConfidence GVCF --reference_sequence ~/hs37d5.fasta --input_file ~/file.bam --input_prior 0.001 --input_prior 0.4995

Does this make sense?
These examples assume that the two --input_prior values are assigned positionally, the first to AC=1 (0/1) and the second to AC=2 (1/1), and that, as stated in the documentation about priors, the prior for AC=0 becomes 1 minus the sum of the other two, so that effectively:

prior(0/0)=0.4995, prior(0/1)=0.001, prior(1/1)=0.4995
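
The arithmetic behind that line can be sketched in a few lines of Python. This only illustrates my assumption about how the values are interpreted (positional for AC=1 and AC=2, with AC=0 taking the remainder); `genotype_priors` is a hypothetical helper, not part of GATK.

```python
def genotype_priors(input_priors):
    """Return priors for AC=0, AC=1, AC=2, ... given the values for AC>=1.

    Assumption: AC=0 receives 1 minus the sum of the supplied values.
    """
    total = sum(input_priors)
    if not 0.0 <= total <= 1.0:
        raise ValueError("supplied priors must sum to a value in [0, 1]")
    return [1.0 - total] + list(input_priors)

# The two values passed via --input_prior in the commands above.
priors = genotype_priors([0.001, 0.4995])

for genotype, prior in zip(["0/0", "0/1", "1/1"], priors):
    print(f"prior({genotype}) = {prior:.4f}")
```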

To understand all of this, I'm building on these previous posts from @tommycarstensen, @magicDGS and @saeschba. Thanks to them as well, and if you have any extra information or feedback, please let me know.

https://gatkforums.broadinstitute.org/gatk/discussion/8787/input-prior-default-value
https://gatkforums.broadinstitute.org/gatk/discussion/5877/caller-input-prior-option
https://gatkforums.broadinstitute.org/gatk/discussion/9489/should-it-say-ac-0-in-the-input-prior-documentation-for-the-haplotypecaller

This last topic also makes me wonder whether AC would be better understood here in terms of GT. I'm mostly familiar with the VCF format, where AC stands for allele count, a property of a site across many samples. In HaplotypeCaller we process one sample at a time, not many. Is this perhaps inherited from UnifiedGenotyper?

Best regards and many thanks for your comments,
Rodrigo

