This document is out of date and has been replaced by the following: http://www.broadinstitute.org/gatk/guide/article?id=4017
As of GATK 2.7, the HaplotypeCaller is now capable of emitting a per-bp or summarized confidence estimate for a site being strictly homozygous-reference.
How it works
The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either a non-reference call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:
- Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
- Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.
Based on this, we emit the genotype likelihoods (PL
) and compute the GQ
(from the PL
s) for the least confidence of these two models.
We use a symbolic allele, <NON_REF>
, to indicate that the site is homozygous reference, and because we have an ALT allele we can provide allele-specific AD
and PL
field values.
How to use it
This mode can be enabled with the command line argument --emitRefConfidence (-ERC)
, which has 3 options:
NONE
: don't emit anything (default)BP_RESOLUTION
: emit detailed information for each BPGVCF
: emit a block summarized version of theBP_RESOLUTION
data
BP_RESOLUTION mode
This mode emits a very detailed model for the reference confidence at each reference base within the requested calling intervals. The output looks like:
20 10000435 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:54,0:53:54:99:0,160,2385
20 10000436 . G <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:53,0:53:53:99:0,159,2011
20 10000437 . C <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:53,0:53:53:99:0,159,1971
20 10000438 . C <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:53,0:53:53:99:0,159,1816
20 10000439 . T G 1553.77 . AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1582,164,0
20 10000440 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:55,0:55:55:99:0,166,2063
20 10000441 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:55,0:55:55:99:0,165,2000
20 10000442 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2095
20 10000443 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,169,2089
20 10000444 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2093
20 10000445 . C <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,169,2137
This shows each reference base in a 11 bp interval. Only one site (10000439) contains a SNP call. The other 10 bases are reference. For each reference base not part of a variant call we emit a VCF record with have the special symbolic allele <NON_REF>
indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site. The genotype fields include the standard GATK GT
, AD
, DP
, GQ
, and PL
fields, all calculated as ref vs. any non-reference base. There's a new special CD
field as well, that gives the number of indel informative reads at this location, given the maximum size of the indel to consider.
Note that there's no site-level QUAL
field value. We discussed this internally and since the QUAL
is the probability that the site is polymorphic, all of the QUAL
field values should be 0 here, so we decided to drop it.
GVCF mode
GVCF
mode summarizes the output of BP_RESOLUTION
by grouping sequential reference calls with similar GQ
values into blocks, and emitting summarized blocks instead of the values at each base pair. The blocking is discrete, deterministic, and can be controlled by the command line parameter GVCFGQBands
. The algorithm creates bands of GQ
values and merges sequential blocks with GQ
values within the same band into a single synthetic block.
For example, if the GVCFBQBands
argument is -GQB 10 -GQB 20
then we'll create blocks in the GVCF where sequential <NON_REF>
calls have GQ
values either < 10, between 10 and 19, or are 20 or greater.
The GVCF
INFO
and FORMAT
fields have a few new annotations. First, END
and BLOCK_SIZE
indicate the extent of the reference block and its size (redundant yes but easy to read). For each GVCF
sample field there's now only GT
, DP
and GQ
from the standard annotations. GT
is always 0/0 while DP
and GQ
are representative values (currently median) from the bp resolution DP
and GQ
values within the block. The MIN_DP
and MIN_GQ
are the minimum values seen within the block for their respective fields.
The output for the above example in GVCF
mode looks like:
20 10000435 . T <NON_REF> . . BLOCK_SIZE=4;END=10000438 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:53:99:53:159
20 10000439 . T G 1553.77 . AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1582,164,0
20 10000440 . T <NON_REF> . . BLOCK_SIZE=6;END=10000445 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:56:99:55:165
Here's a GVCF output for 100 bp around the same variant:
20 10000400 . T <NON_REF> . . BLOCK_SIZE=39;END=10000438 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:57:99:53:128
20 10000439 . T G 1553.77 . AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1582,164,0
20 10000440 . T <NON_REF> . . BLOCK_SIZE=61;END=10000500 GT:DP:GQ:MIN_DP:MIN_GQ 0/0:54:99:49:136
Current caveats
- Currently only works for single sample calling
- Currently slows down the calculation significantly (but this will be improved in the future)
- Only
GVCF
mode is likely to work with downstream GATK tools right now, as the special<NON_REF>
allele may be treated as a polymorphic site. If you want to use the BP resolution mode you will likely need to split the VCF into the ref sites and polymorphic sites, perform standard filtering etc to the polymorphic VCF, and then merge the ref and non-ref VCFs back together
Feedback
We are interested in community feedback on the output format and additional information we should provide to make the reference model as useful as possible. Please post suggestions as comments here.