This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.
Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in -ERC
modes (GVCF
and BP_RESOLUTION
) please see the reference confidence model documentation.
Overview
The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.
1. Preliminary assumptions / limitations
Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).
Up to the released version that is current at time of writing (3.2-2), both the HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume that the organism of study is diploid. This assumption is made formally in the mathematical development of the Bayesian calculation. A generalized form of the genotyping algorithm that can handle ploidies other than 2 will be available in an upcoming version.
Finally, please note that the tools assume that all reads are independent observations.
2. Calculating genotype likelihoods using Bayes' Theorem
We use the approach described in [Li2011] to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.
The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:
$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$
If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:
$$ P(G|D) $$
is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D.
Now let's break down the term on the right:
$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$
We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):
$$ P(G) P(D|G) $$
which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.
The first one is the easiest to understand. The prior probability of the genotype G:
$$ P(G) $$
represents how probably we expect to see this genotype based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value) but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.
The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype,
$$ P(D|G) $$
is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:
$$ P(D|G) = \prod{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$
You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into:
$$ G = H_1H_2 $$
which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.
Now, back to our calculation, what's left to figure out is this:
$$ P(D_j|H_n) $$
which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:
$$ P(G|D) $$
for the genotype G.
3. Selecting a genotype and emitting the call record
We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A, T, and C at a site, the possible genotypes are AA, AT, AC, TT, TC, and CC, and we end up with 6 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.
Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.
Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record. For details of how these metrics are calculated, please see the variant annotations documentation.