Quantcast
Channel: haplotypecaller — GATK-Forum
Viewing all articles
Browse latest Browse all 1335

Using the reference confidence calculation mode and generating a GVCF [RETIRED]

$
0
0

This document is out of date and has been replaced by the following: http://www.broadinstitute.org/gatk/guide/article?id=4017

As of GATK 2.7, the HaplotypeCaller is now capable of emitting a per-bp or summarized confidence estimate for a site being strictly homozygous-reference.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either a non-reference call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the least confidence of these two models.

We use a symbolic allele, <NON_REF>, to indicate that the site is homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

How to use it

This mode can be enabled with the command line argument --emitRefConfidence (-ERC), which has 3 options:

  • NONE: don't emit anything (default)
  • BP_RESOLUTION: emit detailed information for each BP
  • GVCF: emit a block summarized version of the BP_RESOLUTION data

BP_RESOLUTION mode

This mode emits a very detailed model for the reference confidence at each reference base within the requested calling intervals. The output looks like:

20      10000435        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:54,0:53:54:99:0,160,2385
20      10000436        .       G       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:53,0:53:53:99:0,159,2011
20      10000437        .       C       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:53,0:53:53:99:0,159,1971
20      10000438        .       C       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:53,0:53:53:99:0,159,1816
20      10000439        .       T       G       1553.77 .       AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75    GT:AD:DP:GQ:PL  1/1:0,55:55:99:1582,164,0
20      10000440        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:55,0:55:55:99:0,166,2063
20      10000441        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:55,0:55:55:99:0,165,2000
20      10000442        .       T       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,168,2095
20      10000443        .       A       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,169,2089
20      10000444        .       A       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,168,2093
20      10000445        .       C       <NON_REF>       .       .       .       GT:AD:CD:DP:GQ:PL       0/0:56,0:56:56:99:0,169,2137

This shows each reference base in a 11 bp interval. Only one site (10000439) contains a SNP call. The other 10 bases are reference. For each reference base not part of a variant call we emit a VCF record with have the special symbolic allele <NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site. The genotype fields include the standard GATK GT, AD, DP, GQ, and PL fields, all calculated as ref vs. any non-reference base. There's a new special CD field as well, that gives the number of indel informative reads at this location, given the maximum size of the indel to consider.

Note that there's no site-level QUAL field value. We discussed this internally and since the QUAL is the probability that the site is polymorphic, all of the QUAL field values should be 0 here, so we decided to drop it.

GVCF mode

GVCF mode summarizes the output of BP_RESOLUTION by grouping sequential reference calls with similar GQ values into blocks, and emitting summarized blocks instead of the values at each base pair. The blocking is discrete, deterministic, and can be controlled by the command line parameter GVCFGQBands. The algorithm creates bands of GQ values and merges sequential blocks with GQ values within the same band into a single synthetic block.

For example, if the GVCFBQBands argument is -GQB 10 -GQB 20 then we'll create blocks in the GVCF where sequential <NON_REF> calls have GQ values either < 10, between 10 and 19, or are 20 or greater.

The GVCF INFO and FORMAT fields have a few new annotations. First, END and BLOCK_SIZE indicate the extent of the reference block and its size (redundant yes but easy to read). For each GVCF sample field there's now only GT, DP and GQ from the standard annotations. GT is always 0/0 while DP and GQ are representative values (currently median) from the bp resolution DP and GQ values within the block. The MIN_DP and MIN_GQ are the minimum values seen within the block for their respective fields.

The output for the above example in GVCF mode looks like:

20      10000435        .       T       <NON_REF>       .       .       BLOCK_SIZE=4;END=10000438       GT:DP:GQ:MIN_DP:MIN_GQ  0/0:53:99:53:159
20      10000439        .       T       G       1553.77 .       AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75    GT:AD:DP:GQ:PL  1/1:0,55:55:99:1582,164,0
20      10000440        .       T       <NON_REF>       .       .       BLOCK_SIZE=6;END=10000445       GT:DP:GQ:MIN_DP:MIN_GQ  0/0:56:99:55:165

Here's a GVCF output for 100 bp around the same variant:

20      10000400        .       T       <NON_REF>       .       .       BLOCK_SIZE=39;END=10000438      GT:DP:GQ:MIN_DP:MIN_GQ  0/0:57:99:53:128
20      10000439        .       T       G       1553.77 .       AC=2;AF=1.00;AN=2;DP=56;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=221.40;MQ0=0;QD=27.75    GT:AD:DP:GQ:PL  1/1:0,55:55:99:1582,164,0
20      10000440        .       T       <NON_REF>       .       .       BLOCK_SIZE=61;END=10000500      GT:DP:GQ:MIN_DP:MIN_GQ  0/0:54:99:49:136

Current caveats

  • Currently only works for single sample calling
  • Currently slows down the calculation significantly (but this will be improved in the future)
  • Only GVCF mode is likely to work with downstream GATK tools right now, as the special <NON_REF> allele may be treated as a polymorphic site. If you want to use the BP resolution mode you will likely need to split the VCF into the ref sites and polymorphic sites, perform standard filtering etc to the polymorphic VCF, and then merge the ref and non-ref VCFs back together

Feedback

We are interested in community feedback on the output format and additional information we should provide to make the reference model as useful as possible. Please post suggestions as comments here.


Viewing all articles
Browse latest Browse all 1335

Trending Articles