
GVCF - Genomic Variant Call Format


GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variant discovery analyses.

Important notes

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller in GATK versions 3.x and 4.x contain additional information that is formatted in a very specific way. Read on to find out more.

GVCF files produced by HaplotypeCaller from GATK versions 3.x and 4.x are not substantially different. While we don't recommend mixing versions, and we have not tested this ourselves, it should be okay to use GVCFs made by different versions if the annotations and the GVCFBlock definitions (see below) are the same.


General comparison of VCF vs. GVCF

The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

[Figure: schematic comparison of a regular VCF with the two kinds of GVCF produced with -ERC BP_RESOLUTION and -ERC GVCF]

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION GVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of GVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a GVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a GVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock lines of the GVCF header. The purpose of the blocks (also called banding) is to keep file size down, so we recommend using -ERC GVCF rather than -ERC BP_RESOLUTION.


Example GVCF file

This is a banded GVCF produced by HaplotypeCaller with the -ERC GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.2 VCF:

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

One FORMAT annotation is unique to the GVCF format:

##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">

This defines the minimum depth of coverage observed at any one site within a block of records.

The header goes on:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="[full command line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 4:04:34 PM EST">

At this point in the header we see the GVCFBlock definitions, which indicate the GQ ranges used for banding:

[individual blocks from 1 to 55]
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)

In recent versions of GATK, the banding strategy has been tuned to provide high resolution at lower values of GQ (59 and below) and more compression at high values (60 and above). Note that since GQ is capped at 99, records where the corresponding PL is greater than 99 are lumped into the 99-100 band.
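To make this banding scheme concrete, here is a minimal Python sketch (an illustration only, not GATK code) that maps a GQ value to the band it would fall into under the header definitions shown above:

def gq_band(gq):
    """Return the (min inclusive, max exclusive) GQ band for a value,
    following the banding scheme shown in the header above."""
    gq = min(gq, 99)  # GQ is capped at 99, so anything higher lands in the 99-100 band
    if gq < 60:
        return (gq, gq + 1)  # 1-wide bands: high resolution at low GQ
    if gq < 90:
        lower = 60 + ((gq - 60) // 10) * 10
        return (lower, lower + 10)  # 10-wide bands: more compression at high GQ
    if gq < 99:
        return (90, 99)
    return (99, 100)

print(gq_band(57))   # (57, 58)
print(gq_band(64))   # (60, 70)
print(gq_band(120))  # (99, 100)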

After that, the header goes on:

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=GRCh37>
##source=HaplotypeCaller

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10001567 and ends at 20:10001616.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10001567    .   A   <NON_REF>   .   .   END=10001616    GT:DP:GQ:MIN_DP:PL  0/0:38:99:34:0,101,1114
20  10001617    .   C   A,<NON_REF> 493.77  .   BaseQRankSum=1.632;ClippingRankSum=0.000;DP=38;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=136800.00;ReadPosRankSum=0.170    GT:AD:DP:GQ:PL:SB   0/1:19,19,0:38:99:522,0,480,578,538,1116:11,8,13,6
20  10001618    .   T   <NON_REF>   .   .   END=10001627    GT:DP:GQ:MIN_DP:PL  0/0:39:99:37:0,105,1575
20  10001628    .   G   A,<NON_REF> 1223.77 .   DP=37;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=133200.00   GT:AD:DP:GQ:PL:SB   1/1:0,37,0:37:99:1252,111,0,1252,111,1252:0,0,21,16
20  10001629    .   G   <NON_REF>   .   .   END=10001660    GT:DP:GQ:MIN_DP:PL  0/0:43:99:38:0,102,1219
20  10001661    .   T   C,<NON_REF> 1779.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1808,129,0,1808,129,1808:0,0,26,16
20  10001662    .   T   <NON_REF>   .   .   END=10001669    GT:DP:GQ:MIN_DP:PL  0/0:44:99:43:0,117,1755
20  10001670    .   T   G,<NON_REF> 1773.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1802,129,0,1802,129,1802:0,0,25,17
20  10001671    .   G   <NON_REF>   .   .   END=10001673    GT:DP:GQ:MIN_DP:PL  0/0:43:99:42:0,120,1800
20  10001674    .   A   <NON_REF>   .   .   END=10001674    GT:DP:GQ:MIN_DP:PL  0/0:42:96:42:0,96,1197
20  10001675    .   A   <NON_REF>   .   .   END=10001695    GT:DP:GQ:MIN_DP:PL  0/0:41:99:39:0,105,1575
20  10001696    .   A   <NON_REF>   .   .   END=10001696    GT:DP:GQ:MIN_DP:PL  0/0:38:97:38:0,97,1220

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
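If you need to inspect a GVCF programmatically, use an established library rather than parsing the text yourself. Below is a minimal sketch using the htslib-based pysam library (assuming it is installed; the file path is hypothetical) that separates variant records from non-variant block records:

import pysam

vcf = pysam.VariantFile("sample.g.vcf.gz")  # hypothetical input path
for rec in vcf:
    # Ignore the symbolic <NON_REF> allele when deciding if this is a variant record
    real_alts = [a for a in (rec.alts or ()) if a != "<NON_REF>"]
    if real_alts:
        print(f"variant at {rec.chrom}:{rec.pos} {rec.ref}->{','.join(real_alts)}")
    else:
        # For non-variant blocks, rec.stop reflects the END tag in the INFO field
        print(f"non-variant block {rec.chrom}:{rec.pos}-{rec.stop}")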


VCF - Variant Call Format


This document describes "regular" VCF files produced for GERMLINE short variant (SNP and indel) calls (e.g. by HaplotypeCaller in "normal" mode and by GenotypeGVCFs). For information on the special kind of VCF called GVCF produced by HaplotypeCaller in -ERC GVCF mode, please see the GVCF entry. For information specific to SOMATIC calls, see the Mutect2 documentation.


Contents

  1. Overview
  2. Structure of a VCF file
  3. Interpreting the header information
  4. Structure of variant call records
  5. Interpreting genotype and other sample-level information
  6. Basic operations: validating, subsetting and exporting from a VCF
  7. Merging VCF files

1. Overview

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and further development have been taken over by the Genomic Data Toolkit team of the Global Alliance for Genomics and Health. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specifications like SAM/BAM/CRAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.

[Figure: structure of a VCF file, showing the header section followed by the variant call records]

The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also includes the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so, as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
20  10001019    .   T   G   364.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=0.699;ClippingRankSum=0.00;DP=34;ExcessHet=3.0103;FS=3.064;MLEAC=1;MLEAF=0.500;MQ=42.48;MQRankSum=-3.219e+00;QD=11.05;ReadPosRankSum=-6.450e-01;SOR=0.537   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480
20  10001298    .   T   A   884.77  .   AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.49;SOR=1.765    GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=0.836    GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0
20  10001474    .   C   T   843.77  .   AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=1.302    GT:AD:DP:GQ:PL  1/1:0,27:27:81:872,81,0
20  10001617    .   C   A   493.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=1.63;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=1.323;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=12.99;ReadPosRankSum=0.170;SOR=1.179   GT:AD:DP:GQ:PL  0/1:19,19:38:99:522,0,480

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs and indels, but other variation types could be described (see the VCF specification for details). Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.


3. Interpreting the header information

The following is a valid VCF header produced by GenotypeGVCFs on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself.

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Fri Jan 20 11:14:15 EST 2017",Epoch=1484928855435,CommandLineOptions="[command-line goes here]">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="[command-line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 5:45:56 PM EST">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=20,length=63025520>
##reference=file:///data/ref/ref.fasta
##source=GenotypeGVCFs

That's a lot of lines, so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

VCF spec version

The first line:

##fileformat=VCFv4.2

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation (at least if you're using a civilized program that writes definition lines to the header).

GATKCommandLine

The GATKCommandLine lines contain all the parameters that were used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include all the arguments that the tool accepts, along with the values that were applied (if you don't pass one, a default is applied); so it's not just the arguments specified explicitly by the user in the command line.

Contig lines and Reference

These lines list the contig names and lengths, and indicate which reference assembly was used with the input BAM file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for many organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

For more information on genome references, see the corresponding Dictionary entry.


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 8 fields are required by the VCF format and must be present, although some can be empty (in practice, there has to be a dot, ie ., to serve as a placeholder).

CHROM and POS

The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.

ID

An optional identifier for the variant, based on the contig and position of the call and on whether a record exists at this site in a reference database such as dbSNP. A typical identifier is the dbSNP ID, which in human data would look like rs28548431, for example.

REF and ALT

The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending how the VCF was generated). The REF and ALT alleles are the only required elements of a VCF record that tell us whether the variant is a SNP or an indel (or in complex cases, a mixed-type variant). If we look at the following three sites, we see the first is a SNP, the second is an insertion and the third is a deletion:

20  10001298    .   T   A   884.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0
20  10004769    .   TAAAACTATGC T   622.73  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,17:35:99:660,0,704

Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

QUAL

The Phred-scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a 100 indicates a 1 in 10^10 chance (see the FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.

Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
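If the Phred arithmetic feels abstract, this small Python sketch converts between a Phred-scaled value and the corresponding error probability:

import math

def phred_to_error_prob(qual):
    # QUAL = -10 * log10(P(error)), so invert the scaling
    return 10 ** (-qual / 10.0)

def error_prob_to_phred(p):
    return -10 * math.log10(p)

print(phred_to_error_prob(10))   # 0.1   -> a 1 in 10 chance of error
print(phred_to_error_prob(100))  # 1e-10 -> a 1 in 10^10 chance of error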

FILTER

This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

INFO

Various site-level annotations. The INFO column itself is always present, but it can be left empty (just a dot, ie .) if there are no annotations.

The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by semicolons, ie ; as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.


5. Interpreting genotype and other sample-level information

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but it's actually not that hard to interpret once you understand that it's just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

20  10001019    .   T   G   364.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480
20  10001298    .   T   A   884.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0

Looking at that last column, here is what the tags mean:

GT

The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

- 0/0 : the sample is homozygous reference
- 0/1 : the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
- 1/1 : the sample is homozygous alternate

In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, A/A and AAGGCT/AAGGCT respectively. For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT (e.g. 1); for polyploids there will be more, e.g. 4 values for a tetraploid organism (e.g. 0/0/1/1).
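To make the encoding concrete, here is a small Python sketch that translates a GT value into actual alleles given the REF and ALT fields of a record (an illustration only; use a proper VCF library for real work):

def decode_gt(gt, ref, alts):
    """Map GT allele indexes (0 = REF, 1 = first ALT, ...) to allele strings.
    Works for any ploidy; '|' denotes phased and '/' unphased genotypes."""
    alleles = [ref] + list(alts)
    sep = "|" if "|" in gt else "/"
    return [alleles[int(i)] for i in gt.split(sep)]

print(decode_gt("0/1", "T", ["G"]))       # ['T', 'G']
print(decode_gt("1/1", "A", ["AAGGCT"]))  # ['AAGGCT', 'AAGGCT']
print(decode_gt("0/0/1/1", "C", ["T"]))   # tetraploid: ['C', 'C', 'T', 'T']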

AD and DP

Allele depth (AD) and depth of coverage (DP). These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.

AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.

DP is the filtered depth, at the sample level. This gives you the total number of reads at the site that passed the variant caller's filters; you can check the variant caller's documentation to see which filters are applied by default. However, unlike the AD calculation, uninformative reads are included in DP.

See the Tool Documentation for more details on AD (DepthPerAlleleBySample) and DP (Coverage).

PL

"Normalized" Phred-scaled likelihoods of the possible genotypes. For the typical case of a monomorphic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (assigned in the GT field) is 0 in the Phred scale. We use "normalized" in quotes because these are not probabilities. We set the most likely genotype PL to 0 for easy reading purpose.The other values are scaled relative to this most likely genotype.

Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.

GQ

The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.

Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.

Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
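In code, the relationship between PL and GQ described above comes down to a couple of lines; a minimal Python sketch:

def gq_from_pls(pls):
    """GQ is the second smallest PL (the smallest is always 0 after
    normalization), capped at 99."""
    return min(sorted(pls)[1], 99)

print(gq_from_pls([393, 0, 480]))  # 99 (second smallest is 393, capped at 99)
print(gq_from_pls([73, 0, 20]))    # 20 (matches the low-confidence example below)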

A few examples

With all the definitions out of the way, let's interpret the genotype information for a few records from our NA12878 callset, starting with the one at position 10001019 on chromosome 20:

20  10001019    .   T   G   364.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480

At this site, the called genotype is GT = 0/1, which corresponds to a heterozygous genotype with alleles T/G. The confidence indicated by GQ = 99 is very good; there were a total of 33 informative reads at this site (DP=33), 18 of which supported the REF allele (=had the reference base) and 15 of which supported the ALT allele (=had the alternate base) (indicated by AD=18,15). The degree of certainty in our genotype is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele; the next PL is PL(0/0) = 393, corresponding to 10^(-39.3), or 5.0118723e-40 which is a very small number indeed; and the next one will be even smaller. The GQ ends up being 99 because of the capping as explained above.

Now let's look at a site where our confidence is quite a bit lower:

20  10024300    .   C   CTT 43.52   .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:1,4:6:20:73,0,20

Here we have an indel -- specifically an insertion of TT after the reference C base at position 10024300. The called genotype is GT = 0/1 again, but this time the GQ = 20 indicates that even though this is probably a real variant (the QUAL is not too bad), we're not sure we have the right genotype. Looking at the coverage annotations, we see we only had 6 reads there, of which 1 supported REF and 4 supported ALT (and one read must have been considered uninformative, possibly due to quality issues). With so little coverage, we can't be sure that the genotype shouldn't in fact be homozygous variant.

Finally, let's look at a more complicated example:

20  10009875    .   A   G,AGGGAGG   1128.77 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/2:0,11,5:16:99:1157,230,161,487,0,434

This site is a doozy; two credible ALT alleles were observed, but the REF allele was not -- so technically this is a biallelic site in our sample, but will be considered multiallelic because there are more than two alleles notated in the record. It's also a mixed-type record, since one of the ALTs by itself would make it an A->G SNP, and the other would make it an insertion of GGGAGG after the reference A. The called genotype is GT = 1/2, which means it's a heterozygous genotype composed of two different ALT alleles. The coverage wasn't great, and wasn't all that balanced between the two ALTs (since one was supported by 11 reads and the other by 5) but it was sufficient for the program to have high confidence in its call.


6. Basic operations: validating, subsetting and exporting from a VCF

These are a few common things you may want to do with your VCFs that don't deserve their own tutorial. Let us know if there are other operations you think we should cover here.

Validate your VCF

By that I mean check that the format of the file is correct, follows the specification, and will therefore not break any well-behaved tool you choose to run on it. You can do this very simply with ValidateVariants. Note that ValidateVariants can also be used on GVCFs if you use the --gvcf argument.

Subset records from your VCF

Sometimes you want to subset just one or a few samples from a big cohort. Sometimes you want to subset to just a genomic region. Sometimes you want to do both at the same time! Well, the same tool can do both, and more; it's called SelectVariants and has a lot of options for doing things like that (including operating over intervals in the usual way). There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency; this is done using JEXL expressions. Other VCF files can also be used to modify the selection based on concordance or discordance between different callsets (see the --discordance / --concordance arguments in the Tool Doc).

Important notes about subsetting operations

  • In the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset.

  • By default, SelectVariants will keep all ALT alleles, even if they are no longer supported by any samples after subsetting. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. In some cases this will produce monomorphic records, i.e. where no ALT alleles are supported. The tool accepts flags that exclude unsupported alleles and/or monomorphic records from the output.

Extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal according to the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.
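If you work in Python, an htslib-backed library such as pysam is one battle-tested alternative to a home-brewed parser. A minimal extraction sketch (assuming pysam is installed; the file path, region, and fields are hypothetical, and fetching a region requires an index):

import pysam

vcf = pysam.VariantFile("calls.vcf.gz")  # hypothetical input path
for rec in vcf.fetch("20", 10001000, 10002000):
    gt = rec.samples["NA12878"]["GT"]  # genotype as a tuple of allele indexes, e.g. (0, 1)
    qd = rec.info.get("QD")            # site-level INFO annotation; None if absent
    print(rec.chrom, rec.pos, rec.ref, rec.alts, rec.qual, gt, qd)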

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)


7. Merging VCF files

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

  1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or GVCF files) per-chromosome. For that case, you can use the Picard tool MergeVcfs to merge the files. See the relevant Tool Doc page for usage details.

  2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION mode to call variants on a large cohort, producing many GVCF files. You then need to consolidate them before joint-calling variants with GenotypeGVCFs (for performance reasons). This can be done with either the CombineGVCFs or GenomicsDBImport tool, both of which are specifically designed to handle GVCFs in this way. See the relevant Tool Doc pages for usage details and the Best Practices workflow documentation to learn more about the logic of this workflow.

  3. The third case is when you want to compare variant calls that were produced from the same samples but using different methods. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same workflow with different parameters. For this case, we recommend taking a different approach; rather than merging the VCF files (which can have all sorts of complicated consequences), you can use the VariantAnnotator tool to annotate one of the VCFs with the other treated as a resource. See the relevant Tool Doc page for usage details.

There is actually one more reason why you might want to combine variants from different files into one, but we do not recommend doing it: you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.

How should I cite GATK in my own publications?


To date we have published three papers on GATK, plus a preprint in bioRxiv (citation details below). You're welcome to choose which paper is most representative of what aspect of GATK you called on in your work.


Poplin et al. 2017 : Detailed description of HaplotypeCaller; best reference for germline joint calling

The fourth paper, technically just a manuscript deposited in bioRxiv -- but it counts! This is a good citation to include in a Materials and Methods section or in a Discussion if you're talking about the joint calling process.

Scaling accurate genetic variant discovery to tens of thousands of samples Ryan Poplin, Valentin Ruano-Rubio, Mark A. DePristo, Tim J. Fennell, Mauricio O. Carneiro, Geraldine A. Van der Auwera, David E. Kling, Laura D. Gauthier, Ami Levy-Moonshine, David Roazen, Khalid Shakir, Joel Thibault, Sheila Chandran, Chris Whelan, Monkol Lek, Stacey Gabriel, Mark J. Daly, Benjamin Neale, Daniel G. MacArthur, Eric Banks, 2017 bioRxiv

Article


Van der Auwera et al. 2013 : Hands-on tutorial with step-by-step explanations

The third GATK paper describes the Best Practices for Variant Discovery (version 2.x). It is intended mainly as a learning resource for first-time users and as a protocol reference. This is a good citation to include in a Materials and Methods section.

From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M, 2013 CURRENT PROTOCOLS IN BIOINFORMATICS 43:11.10.1-11.10.33

Article | PubMed

Remember that as our work continues and our Best Practices recommendations evolve, specific command lines, argument values and even tool choices described in the paper become obsolete. Be sure to always refer to our Best Practices documentation for the most up-to-date and version-appropriate recommendations.


DePristo et al. 2011 : First incarnation of the Best Practices workflow

The second GATK paper describes in more detail some of the key tools commonly used in the GATK for high-throughput sequencing data processing and variant discovery. The paper covers base quality score recalibration, indel realignment, SNP calling with UnifiedGenotyper, variant quality score recalibration and their application to deep whole genome, whole exome, and low-pass multi-sample calling. This is a good citation if you use the GATK for variant discovery.

A framework for variation discovery and genotyping using next-generation DNA sequencing data DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M, 2011 NATURE GENETICS 43:491-498

Article | PubMed

Note that the workflow described in this paper corresponds to the version 1.x to 2.x best practices. Some key steps for variant discovery have been significantly modified in later versions (3.x onwards). This paper should not be used as a definitive guide to variant discovery with GATK. For that, please see our online documentation guide.


McKenna et al. 2010 : Original description of the GATK framework

The first GATK paper covers the computational philosophy underlying the GATK and is a good citation for the GATK in general.

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA, 2010 GENOME RESEARCH 20:1297-303

Article | PubMed


Example

We sequenced 10 samples on 10 lanes on an Illumina HiSeq 2000, aligned the resulting reads to the hg19 reference genome with BWA (Li & Durbin), applied GATK (McKenna et al., 2010) base quality score recalibration, indel realignment, duplicate removal, and performed SNP and INDEL discovery and genotyping across all 10 samples simultaneously using standard hard filtering parameters or variant quality score recalibration according to GATK Best Practices recommendations (DePristo et al., 2011; Van der Auwera et al., 2013).

Variant annotations


Variant annotations can be produced by HaplotypeCaller, Mutect2, VariantAnnotator and GenotypeGVCFs. The available annotations are listed under Annotations in the Tool Documentation.

Note that some annotation values calculated by different tools may differ for the same original data. There are a few things that generally account for differences in annotation values, linked to the normal behaviors of the tools. In all cases, you can end up looking at different sets or numbers of reads, which causes some of the annotation values to be different. It's usually not a cause for alarm. Remember that many of these annotations should be interpreted relatively, not absolutely.

Local realignment

HaplotypeCaller and Mutect2 apply a read realignment step that can modify local coverage counts and other annotation values that take coverage into account. In contrast, VariantAnnotator will calculate annotations either based on the pileup if you give it the original BAM file, or it will calculate summary metrics based on existing VCF fields. GenotypeGVCFs only calculates annotations based on existing VCF fields (that's pretty much its raison d'être).

Read filtering

Some tools apply different read filters by default.

Downsampling

Some tools apply downsampling in order to ensure good performance, but they may do so to different depths of coverage by default compared to others.


See related forum discussions here and here.

HaplotypeCaller in a nutshell


This document outlines the basic operation of the HaplotypeCaller run in its default mode on a single sample, and does not cover the additional processing and calculations done when it is run in "GVCF mode" (with -ERC GVCF or -ERC BP_RESOLUTION) or when it is run on multiple samples. For more details and discussion of the GVCF workflow, see the Best Practices documentation on germline short variant discovery as well as the HaplotypeCaller manuscript on bioRxiv.

Overview

The core operations performed by HaplotypeCaller can be grouped into these major steps:

[Figure: diagram of the four major steps performed by HaplotypeCaller, listed below]

1. Define active regions. The program determines which regions of the genome it needs to operate on, based on the presence of significant evidence for variation.

2. Determine haplotypes by re-assembly of the active region. For each ActiveRegion, the program builds a De Bruijn-like graph to reassemble the ActiveRegion and identifies the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.

3. Determine likelihoods of the haplotypes given the read data. For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles per read for each potentially variant site.

4. Assign sample genotypes. For each potentially variant site, the program applies Bayes’ rule, using the likelihoods of alleles given the read data to calculate the posterior likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.


1. Define active regions

In this first step, the program traverses the sequencing data to identify regions of the genome in which the samples being analyzed show substantial evidence of variation relative to the reference. The resulting areas are defined as “active regions”, and will be passed on to the next step. Areas that do not show any variation beyond the expected levels of background noise will be skipped in the next step. This aims to accelerate the analysis by not wasting time performing reassembly on regions that are identical to the reference anyway.

To define these active regions, the program operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints. For more details on how the activity profile is computed and processed, as well as what options are available to modify the active region parameters, please see this article.

Once this process is complete, the program applies a few post-processing steps to finalize the active regions (see detailed doc above). The final output of this process is a list of intervals corresponding to the active regions which will be processed in the next step.
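As a toy illustration of those three phases (and emphatically not GATK's actual implementation), the following Python sketch smooths a raw activity profile with a moving average and extracts the intervals that rise above a fixed threshold:

def active_regions(raw_activity, window=1, threshold=0.5):
    """Toy sketch: smooth a raw per-position activity profile with a moving
    average, then return [start, end) intervals above the threshold."""
    n = len(raw_activity)
    smoothed = [
        sum(raw_activity[max(0, i - window):i + window + 1])
        / (min(n, i + window + 1) - max(0, i - window))
        for i in range(n)
    ]
    regions, start = [], None
    for i, a in enumerate(smoothed):
        if a > threshold and start is None:
            start = i                   # profile rises above the threshold
        elif a <= threshold and start is not None:
            regions.append((start, i))  # profile falls back below it
            start = None
    if start is not None:
        regions.append((start, n))
    return regions

print(active_regions([0, 0, 0.2, 0.9, 1.0, 0.8, 0.1, 0, 0]))  # [(3, 6)]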


2. Determine haplotypes by local assembly of the active region.

The goal of this step is to reconstruct the possible sequences of the real physical segments of DNA present in the original sample organism. To do this, the program goes through each active region and uses the input reads that mapped to that region to construct complete sequences covering its entire length, which are called haplotypes. This process will typically generate several different possible haplotypes for each active region due to:

  • real diversity on polyploid (including CNV) or multi-sample data
  • possible allele combinations between variant sites that are not totally linked within the active region
  • sequencing and mapping errors

In order to generate a list of possible haplotypes, the program first builds an assembly graph for the active region using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used for scoring and genotyping in the next step.

The assembly and haplotype determination procedure is described in full detail in this method article.
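For intuition only (the real assembler is far more sophisticated; see the method article above), a k-mer graph with read-support counts can be sketched in a few lines of Python:

from collections import Counter

def kmer_graph(reads, k=4):
    """Toy sketch: count (k-1)-mer -> (k-1)-mer edges across all reads.
    Well-supported paths through such a graph suggest candidate haplotypes."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1  # overlapping (k-1)-mers form an edge
    return edges

reads = ["ACGTAC", "CGTACG", "GTACGT"]
for (src, dst), support in kmer_graph(reads).items():
    print(f"{src} -> {dst} (support: {support})")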

Once the haplotypes have been determined, each one is realigned against the original reference sequence in order to identify potentially variant sites. This produces the set of sites that will be processed in the next step. A subset of these sites will eventually be emitted as variant calls to the output VCF.


3. Evaluating the evidence for haplotypes and variant alleles

Now that we have all these candidate haplotypes, we need to evaluate how much evidence there is in the data to support each one of them. So the program takes each individual read and aligns it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm, which takes into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). This outputs a score for each read-haplotype pairing, expressing the likelihood of observing that read given that haplotype.

Those scores are then used to calculate how much evidence there is for individual alleles at the candidate sites that were identified in the previous step. This process is called marginalization over alleles, and it produces the actual numbers that will finally be used to assign a genotype to the sample in the next step.

For further details on the pairHMM output and the marginalization process, see this document.
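As a toy sketch of the marginalization idea (the exact computation is described in the document referenced above), suppose we have per-read likelihoods for each haplotype and we know which allele each haplotype carries at the site of interest; one common approach is to keep, for each allele, the best likelihood among the haplotypes carrying it:

def marginalize(read_hap_likelihoods, hap_to_allele):
    """Toy sketch: reduce per-read haplotype likelihoods to per-read allele
    likelihoods by taking, for each allele, the best likelihood among the
    haplotypes that carry it."""
    per_read = []
    for hap_liks in read_hap_likelihoods:  # one dict per read: haplotype -> likelihood
        allele_liks = {}
        for hap, lik in hap_liks.items():
            allele = hap_to_allele[hap]
            allele_liks[allele] = max(allele_liks.get(allele, lik), lik)
        per_read.append(allele_liks)
    return per_read

# Hypothetical values: two haplotypes carry REF at this site, one carries ALT
hap_to_allele = {"hap1": "REF", "hap2": "REF", "hap3": "ALT"}
reads = [{"hap1": 0.9, "hap2": 0.7, "hap3": 0.01},
         {"hap1": 0.05, "hap2": 0.02, "hap3": 0.95}]
print(marginalize(reads, hap_to_allele))
# [{'REF': 0.9, 'ALT': 0.01}, {'REF': 0.05, 'ALT': 0.95}]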


4. Assigning per-sample genotypes

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

For further details on the genotyping calculations, see this document.
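A toy sketch of this genotyping step in Python (an illustration of the Bayesian idea, not GATK's implementation; the priors are hypothetical): each read is assumed to come from either of a diploid sample's two chromosomes with equal probability, reads are treated as independent, and the posterior of each genotype is proportional to its likelihood times its prior.

from itertools import combinations_with_replacement

def genotype_likelihood(reads, genotype):
    """P(reads | genotype) for a diploid genotype: each read is drawn from
    either allele of the genotype with probability 1/2; reads are independent."""
    lik = 1.0
    for allele_liks in reads:  # one dict per read: allele -> P(read | allele)
        lik *= 0.5 * allele_liks[genotype[0]] + 0.5 * allele_liks[genotype[1]]
    return lik

def call_genotype(reads, alleles, priors):
    """Bayes' rule: posterior is proportional to likelihood * prior; return
    the most likely genotype."""
    posteriors = {
        gt: genotype_likelihood(reads, gt) * priors[gt]
        for gt in combinations_with_replacement(alleles, 2)
    }
    return max(posteriors, key=posteriors.get)

reads = [{"REF": 0.9, "ALT": 0.1}, {"REF": 0.2, "ALT": 0.8},
         {"REF": 0.85, "ALT": 0.15}, {"REF": 0.1, "ALT": 0.9}]
priors = {("REF", "REF"): 0.7, ("REF", "ALT"): 0.2, ("ALT", "ALT"): 0.1}  # hypothetical
print(call_genotype(reads, ["REF", "ALT"], priors))  # ('REF', 'ALT')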

This concludes the overview of how HaplotypeCaller works.

HaplotypeCaller Reference Confidence Model (GVCF mode)


This document describes the reference confidence model applied by HaplotypeCaller to generate a per-sample GVCF, invoked by -ERC GVCF or -ERC BP_RESOLUTION.

As explained here, HaplotypeCaller works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. At that point, we can calculate the likelihoods of each possible genotype and emit variant calls.

What that article does not explain is how HaplotypeCaller additionally estimates the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the REF base vs. all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for whichever of these two models yields the lower confidence. We use a symbolic ALT allele, <NON_REF>, to hold the likelihood that the site is not homozygous reference, as well as allele-specific AD and PL field values.

We do this at all sites in the territory covered by the analysis, including homozygous-reference sites, both inside and outside the ActiveRegions determined by HaplotypeCaller.

Calculation of PL and GQ by HaplotypeCaller and GenotypeGVCFs


PL is a sample-level annotation calculated by HaplotypeCaller and GenotypeGVCFs, recorded in the sample-level columns of variant records in VCF files. This annotation represents the normalized Phred-scaled likelihoods of the genotypes considered in the variant record for each sample.

This article clarifies how the PL values are calculated and how this relates to the value of the GQ field.

Contents

  1. The basic math
  2. Example and interpretation
  3. Special case: non-reference confidence model (GVCF mode)

1. The basic math

The basic formula for calculating PL is:

$$ PL = -10 \log_{10} P(\text{Genotype} \mid \text{Data}) $$

where P(Genotype | Data) is the conditional probability of the Genotype given the sequence Data that we have observed. The process by which we determine the value of P(Genotype | Data) is described here.

Once we have that probability, we simply take the log of it and multiply it by -10 to put it into Phred scale. Then we normalize the values across all genotypes so that the PL value of the most likely genotype is 0, which we do simply by subtracting the value of the lowest PL from all the values.

The reason we like to work in Phred scale is because it makes it much easier to work with the very small numbers involved in these calculations. One thing to keep in mind of course is that Phred is a log scale, so whenever we need to do a division or multiplication operation (e.g. multiplying probabilities), in Phred scale this will be done as a subtraction or addition.


2. Example and interpretation

Here’s a worked-out example to illustrate this process. Suppose we have a site where the reference allele is A, we observed one read that has a non-reference allele T at the position of interest, and we have in hand the conditional probabilities calculated by HaplotypeCaller based on that one read (if we had more reads, their contributions would be multiplied -- or in log space, added).

Please note that the values chosen for this example have been simplified and may not reflect actual probabilities calculated by HaplotypeCaller.

# Alleles
Reference: A
Read: T

# Conditional probabilities calculated by HC
P(AA | Data) = 0.000001
P(AT | Data) = 0.000100
P(TT | Data) = 0.010000

Calculate the raw PL values

We want to determine the PLs of the genotype being 0/0, 0/1, and 1/1, respectively. So we apply the formula given earlier, which yields the following values:

Genotype    A/A                           A/T                           T/T
Raw PL      -10 * log10(0.000001) = 60    -10 * log10(0.000100) = 40    -10 * log10(0.010000) = 20

Our first observation here is that the genotype for which the conditional probability was the highest turns out to get the lowest PL value. This is expected because, as described here, PL is a Phred-scaled (negative log) value, so the scale is inverted relative to probability (rather unintuitively if you're not a stats buff): low values mean a genotype is more likely, and high values mean it's less likely.

Normalize

At this point we have one more small transformation to make before we emit the final PL values to the VCF: we are going to normalize the values so that the lowest PL value is zero, and the rest are scaled relative to that. Since we’re in log space, we do this simply by subtracting the lowest value, 20, from the others, yielding the following final PL values:

Genotype         A/A            A/T            T/T
Normalized PL    60 - 20 = 40   40 - 20 = 20   20 - 20 = 0

We see that there is a direct relationship between the spacing of the PLs and the original probabilities: we had chosen probabilities that were each 100 times more or less likely than the next, and in the final PLs the values are separated by 20 Phred units, which is the Phred-scale equivalent of a factor of 100. This gives us a very convenient way to estimate how the numbers relate to each other -- and how reliable the genotype assignment is -- with just a glance at the PL field in the VCF record.

Genotype quality

We formalize this assessment of genotype quality in the GQ annotation, as also described here. The value of GQ is simply the difference between the second lowest PL and the lowest PL (which is always 0). So, in our example, GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.
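The arithmetic above is small enough to reproduce end to end. Here is a minimal Python sketch using the example's probabilities; it illustrates the formulas in this article and is not HaplotypeCaller's actual implementation.

    import math

    # Conditional genotype probabilities from the worked example above
    probs = {"A/A": 0.000001, "A/T": 0.000100, "T/T": 0.010000}

    # Raw Phred-scaled likelihoods: PL = -10 * log10(P(Genotype | Data))
    raw_pl = {gt: -10 * math.log10(p) for gt, p in probs.items()}

    # Normalize so the most likely genotype has PL = 0
    min_pl = min(raw_pl.values())
    pl = {gt: round(v - min_pl) for gt, v in raw_pl.items()}

    # GQ is the second lowest PL minus the lowest (always 0), capped at 99
    ordered = sorted(pl.values())
    gq = min(ordered[1] - ordered[0], 99)

    print(pl)  # {'A/A': 40, 'A/T': 20, 'T/T': 0}
    print(gq)  # 20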


3. Special case: non-reference confidence model (GVCF mode)

When you run HaplotypeCaller with -ERC GVCF to produce a GVCF, there is an additional calculation to determine the genotype likelihoods associated with the symbolic <NON_REF> allele (which represents the possibilities that remain once you’ve eliminated the REF allele and any ALT alleles that are being evaluated explicitly).

The PL values for any possible genotype that includes the <NON_REF> allele have to be calculated a little differently than explained above, because HaplotypeCaller cannot directly determine the conditional probabilities of genotypes involving <NON_REF>. Instead, it uses base quality scores to model the genotype likelihoods.

See also this article.

Local re-assembly and haplotype determination (HaplotypeCaller & Mutect2)


This document details the procedure used by HaplotypeCaller to re-assemble read data and determine candidate haplotypes as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Reference graph assembly
  3. Threading reads through the graph
  4. Graph refinement
  5. Select best haplotypes
  6. Identify potential variation sites

1. Overview

The previous step produced a list of ActiveRegions that showed some evidence of possible variation (see ActiveRegion procedure documentation). Now, we need to process each Active Region in order to generate a list of possible haplotypes based on the sequence data we have for that region.

To do so, the program first builds an assembly graph for each active region (determined in the previous step) using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used to call variants and assign per-sample genotypes in the next steps.


2. Reference graph assembly

First, we construct the reference assembly graph, which starts out as a simple directed de Bruijn graph. This involves decomposing the reference sequence into a succession of kmers (pronounced "kay-mers"), which are small sequence subunits that are k bases long. Each kmer overlaps the previous kmer by k-1 bases. The resulting graph can be represented as a series of nodes and connecting edges indicating the sequential relationship between adjacent bases. At this point, all the connecting edges have a weight of 0.

In addition to the graph, we also build a hash table of unique kmers, which we use to keep track of the position of nodes in the graph. At the beginning, the hash table only contains unique kmers found in the reference sequence, but we will add to it in the next step.

A note about kmer size: by default, the program will attempt to build two separate graphs, using kmers of 10 and 25 bases in size, respectively, but other kmer sizes can be specified from the command line with the -kmerSize argument. The final set of haplotypes will be selected from the union of the graphs obtained using each k.
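To illustrate the construction described above, here is a minimal Python sketch of the reference graph and the kmer hash table. It is a simplification for illustration only; GATK's actual implementation is in Java and handles many more details.

    from collections import defaultdict

    def build_reference_graph(reference, k):
        """Decompose a reference sequence into overlapping kmers, linking
        consecutive kmers with edges that start at weight 0."""
        kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
        kmer_table = {}               # hash table of unique kmers -> first position
        edges = defaultdict(int)      # (kmer, next_kmer) -> edge weight
        for pos, kmer in enumerate(kmers):
            kmer_table.setdefault(kmer, pos)
        for a, b in zip(kmers, kmers[1:]):
            edges[(a, b)] = 0         # reference edges start unweighted
        return kmer_table, edges

    kmer_table, edges = build_reference_graph("ACGTACGGA", 4)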


3. Threading reads through the graph

This is where our simple reference graph turns into a read-threading graph, so-called because we're going to take each read in turn and try to match it to a path in the graph.

We start with the first read and compare its first kmer to the hash table to find if it has a match. If there is a match, we look up its position in the reference graph and record that position. If there is no match, we consider that it is a new unique kmer, so we add that unique kmer to the hash table and add a new node to the graph. In both cases, we then move on and repeat the process with the next kmer in the read until we reach the end of the read.

When two consecutive kmers in a read belong to two nodes that were already connected by an edge in the graph, we increase the weight of that edge by 1. If the two nodes were not connected yet, we add a new edge to the graph with a starting weight of 1. As we repeat the process on each read in turn, edge weights will accumulate along the paths that are best supported by the read data, which will help us select the most likely paths later on.
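Continuing the sketch above, threading a single read could look like the following. Again, this is an illustrative simplification, not GATK's code.

    from collections import defaultdict

    def thread_read(read, k, kmer_table, edges):
        """Match each kmer of the read against the hash table, adding novel
        kmers as new nodes and accumulating edge weights along the path."""
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            if kmer not in kmer_table:
                kmer_table[kmer] = None   # novel kmer: new node, no reference position
        for a, b in zip(kmers, kmers[1:]):
            edges[(a, b)] += 1            # existing edge +1, new edge starts at 1

    # e.g. starting from an empty graph:
    kmer_table, edges = {}, defaultdict(int)
    thread_read("ACGTTCG", 4, kmer_table, edges)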

Note on graph complexity, cycles and non-unique kmers

For this process to work properly, we need the graph to be sufficiently complex (the number of non-unique kmers must be less than 4-fold the number of unique kmers found in the data) and free of cycles. In certain genomic regions where there are a lot of repeated sequences, these conditions may not be met, because repeats cause cycles and diminish the number of available unique kmers. If none of the kmer sizes provided results in a viable graph (complex enough and without cycles), the program will automatically try the operation again with larger kmer sizes. Specifically, we take the largest k provided by the user (or by the default settings) and increase it by 10 bases. If no viable graph can be obtained after iterating over increased kmer sizes 6 times, we give up and skip the active region entirely.


4. Graph refinement

Once all the reads have been threaded through the graph, we need to clean it up a little. The main cleaning-up operation is called pruning (like the gardening technique). The goal of the pruning operation is to remove noise due to errors. The basic idea is that sections of the graph that are supported by very few reads are most probably the result of stochastic errors, so we are going to remove any sections that are supported by fewer than a certain threshold number of reads. By default the threshold value is 2, but this can be controlled from the command line using the -minPruning argument. In practice, this means that linear chains in the graph (linear sequence of vertices and edges without any branching) where all edges have fewer than 2 supporting reads will be removed. Increasing the threshold value will lead to faster processing and higher specificity, but will decrease sensitivity. Decreasing this value will do the opposite, decreasing specificity but increasing sensitivity.
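As a simplified per-edge version of the pruning described above, the filter could be sketched as below. Note this is an assumption-laden simplification: the real operation removes entire low-support linear chains, not individual edges.

    def prune_graph(edges, min_pruning=2):
        """Drop edges with fewer supporting reads than the pruning threshold
        (mirrors -minPruning, default 2). Simplified: GATK prunes whole
        linear chains whose edges all fall below the threshold."""
        return {pair: weight for pair, weight in edges.items() if weight >= min_pruning}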

At this stage, the program also performs graph refinement operations, such as recovering dangling heads and tails from the splice junctions to compensate for issues that are related to limitations in graph assembly.

Note that if you are calling multiple samples together, the program also looks at how many of the samples support each segment, and only prunes segments for which fewer than a certain number of samples have the minimum required number of supporting reads. By default this sample number is 1, so as long as one sample in the cohort passes the pruning threshold, the segment will NOT be pruned. This is designed to avoid losing singletons (variants that are unique to a single sample in a cohort). This parameter can also be controlled from the command line using the -minPruningSamples argument, but keep in mind that increasing the default value may lead to decreased sensitivity.


5. Select best haplotypes

Now that the graph is all cleaned up, the program builds haplotype sequences by traversing all possible paths in the graph and calculates a likelihood score for each one. This score is calculated as the product of transition probabilities of the path edges, where the transition probability of an edge is computed as the number of reads supporting that edge divided by the sum of the support of all edges that share that same source vertex.

In order to limit the amount of computation needed for the next step, we limit the number of haplotypes that will be considered for each value of k (remember that the program builds graphs for multiple kmer sizes). This is easy to do since we conveniently have scores for each haplotype; all we need to do is select the N haplotypes with the best scores. By default that number is very generously set to 128 (so the program would proceed to the next step with up to 128 haplotypes per value of k) but this can be adjusted from the command line using the -maxNumHaplotypesInPopulation argument. You would mainly want to decrease this number in order to improve speed; increasing that number would rarely be reasonable, if ever.
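The path-scoring rule described above (transition probability = edge weight divided by the total outgoing weight at the source vertex) can be sketched as follows; path enumeration and the top-N selection are omitted for brevity, and this is an illustration rather than GATK's implementation.

    from collections import defaultdict

    def path_score(path, edges):
        """Score a haplotype path as the product of its edge transition
        probabilities. `path` is a list of (source, target) edges."""
        out_totals = defaultdict(int)
        for (src, _), weight in edges.items():
            out_totals[src] += weight
        score = 1.0
        for src, dst in path:
            score *= edges[(src, dst)] / out_totals[src]
        return score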


6. Identify potential variation sites

Once we have a list of plausible haplotypes, we perform a Smith-Waterman alignment (SWA) of each haplotype to the original reference sequence across the active region in order to reconstruct a CIGAR string for the haplotype. Note that indels will be left-aligned; that is, their start position will be set as the leftmost position possible.

This finally yields the potential variation sites that will be put through the variant modeling step next. Note that this list of candidate sites is essentially a super-set of what will eventually be the final set of called variants. Every site that will be called variant is in the super-set, but not every site that is in the super-set will be called variant.


ActiveRegion determination (HaplotypeCaller & Mutect2)


This document details the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.

Note that some of the command line argument names in this article may not be up to date. If you encounter any problems, please let us know in the comments so we can fix them.


Contents

  1. Overview
  2. Calculating the raw activity profile
  3. Smoothing the activity profile
  4. Setting the ActiveRegion thresholds and intervals

1. Overview

To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints.


2. Calculating the raw activity profile

Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus.

In the normal case (no special mode is enabled) the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment.

If using the mode for genotyping given alleles (GGA) or the advanced-level flag -useAlleleTrigger, and the site is overlapped by an allele in the VCF file provided through the -alleles argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0.

This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the -L argument).


3. Smoothing the activity profile

The final profile is calculated by smoothing the initial raw profile in three steps. The first two steps spread each position's raw profile value to neighboring bases. As a result, each position accumulates several profile values, which are summed in the third and final step to obtain a single smoothed value per position.

  1. Unless one of the special modes is enabled (GGA or allele triggering), the position's profile value is copied over to adjacent positions if enough high-quality soft-clipped bases immediately precede or follow that position in the original alignment. At the time of writing, high-quality soft-clipped bases are those with a quality score of Q29 or more, and we consider there to be enough of them when the average number of high-quality bases per soft-clip is 7 or more. In this case the site's profile value is copied to all bases within a radius equal to the average soft-clip length, up to a maximum of 50bp.

  2. Each profile value is then divided and spread out using a Gaussian kernel covering up to 50bp radius centered at its current position with a standard deviation, or sigma, set using the -bandPassSigma argument (current default is 17 bp). The larger the sigma, the broader the spread will be.

  3. For each position, the final smoothed value is calculated as the sum of all its profile values after steps 1 and 2.
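The spreading and summing in steps 2 and 3 can be sketched as below. This illustrates the band-pass idea only; the kernel truncation and normalization details are assumptions, not GATK's exact implementation.

    import math

    def smooth_profile(raw, sigma=17.0, radius=50):
        """Spread each raw activity value over neighboring positions with a
        truncated Gaussian kernel (sigma mirrors -bandPassSigma's default),
        then sum the contributions at each position."""
        kernel = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in range(-radius, radius + 1)]
        total = sum(kernel)
        kernel = [k / total for k in kernel]
        smoothed = [0.0] * len(raw)
        for i, value in enumerate(raw):
            for j, k in enumerate(kernel):
                pos = i + j - radius
                if 0 <= pos < len(raw):
                    smoothed[pos] += value * k   # step 3: sum contributions per position
        return smoothed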


4. Setting the ActiveRegion thresholds and intervals

The resulting profile line is cut into regions where it crosses the non-active to active threshold (currently set to 0.002). Then we make some adjustments to these boundaries so that the regions to be considered active, i.e. those with a profile running over that threshold, fall within the minimum (fixed at 50bp) and maximum region size (customizable using -activeRegionMaxSize).

  • If the region size falls within the limits we leave it untouched (it's good to go).

  • If the region size is shorter than the minimum, it is greedily extended forward, ignoring that cut point, and we come back to step 1. Only if this is not possible, because we hit a hard limit (the end of the chromosome or of the requested analysis interval), do we accept the small region as it is.

  • If it is too long, we find the lowest local minimum between the maximum and minimum region size. A local minimum is a profile value preceded by a larger value immediately upstream (-1bp) and followed by an equal or larger value immediately downstream (+1bp). In case of a tie, the one further downstream takes precedence. If there is no local minimum we simply force the cut so that the region has the maximum active region size.

Of the resulting regions, those with a profile that runs over the threshold are considered active regions and progress to variant discovery and/or calling, whereas regions whose profile runs under the threshold are considered inactive and are discarded, except when running HaplotypeCaller in reference confidence mode.
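A bare-bones version of the threshold cut (without the minimum/maximum boundary adjustments described in the bullets above) might look like this sketch; it is an illustration only.

    def cut_active_regions(profile, threshold=0.002):
        """Cut the smoothed profile into candidate regions wherever it
        crosses the activity threshold. Boundary adjustment for the
        min/max region sizes (described above) is omitted."""
        regions, start = [], None
        for pos, value in enumerate(profile):
            if value > threshold and start is None:
                start = pos
            elif value <= threshold and start is not None:
                regions.append((start, pos - 1))
                start = None
        if start is not None:
            regions.append((start, len(profile) - 1))
        return regions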

There is a final post-processing step to clean up and trim the ActiveRegion:

  • Remove bases at each end of the read (hard-clipping) until reaching a base with a call quality equal to or greater than the minimum base quality score (customizable parameter -mbq, 10 by default).

  • Include or exclude remaining soft-clipped ends. Soft-clipped ends will be used for assembly and calling provided the user has not requested their exclusion (using -dontUseSoftClippedBases), the read and its mate map to the same chromosome, and they are in the standard orientation (i.e. LR and RL).

  • Clip off adaptor sequences of the read if present.

  • Discard all reads that no longer overlap with the ActiveRegion after the trimming operations described above.

  • Downsample remaining reads to a maximum of 1000 reads per sample, while respecting a minimum of 5 reads starting per position. This is performed after any downsampling by the traversal itself (-dt, -dfrac, -dcov etc.) and cannot be overridden from the command line.

Evaluating the evidence for haplotypes and variant alleles (HaplotypeCaller & Mutect2)


This document details the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Evaluating the evidence for each candidate haplotype
  3. Evaluating the evidence for each candidate site and corresponding alleles

1. Overview

The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.


2. Evaluating the evidence for each candidate haplotype

We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.

Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). Note: If reads from a pair overlap at a site and they have the same base, the base quality is capped at Q20 for both reads (Q20 is half the expected PCR error rate). If they do not agree, we set both base qualities to Q0.

This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.


3. Evaluating the evidence for each candidate site and corresponding alleles

Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.

This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods.
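The cherry-picking procedure described above amounts to taking, for each read and each allele, the maximum likelihood among the haplotypes carrying that allele. Here is a minimal Python sketch; the allele-to-haplotype assignments and likelihood values are illustrative made-up numbers.

    # Per-read likelihoods of each haplotype (illustrative values)
    read_hap_liks = {
        "read1": {"hap1": 0.90, "hap2": 0.05, "hap3": 0.40},
        "read2": {"hap1": 0.10, "hap2": 0.80, "hap3": 0.70},
    }
    # Which haplotypes carry which allele at the candidate site
    allele_haps = {"REF": ["hap1"], "ALT": ["hap2", "hap3"]}

    def marginalize(read_hap_liks, allele_haps):
        """Per-read allele likelihood = best likelihood among the haplotypes
        that carry that allele."""
        return {
            read: {a: max(liks[h] for h in haps) for a, haps in allele_haps.items()}
            for read, liks in read_hap_liks.items()
        }

    print(marginalize(read_hap_liks, allele_haps))
    # {'read1': {'REF': 0.9, 'ALT': 0.4}, 'read2': {'REF': 0.1, 'ALT': 0.8}}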

At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.

Assigning per-sample genotypes (HaplotypeCaller)


This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as the one on PL and GQ.

This procedure is NOT applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Preliminary assumptions / limitations
  3. Calculating genotype likelihoods using Bayes' Theorem
  4. Selecting a genotype and emitting the call record

1. Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains is to evaluate those likelihoods in aggregate to determine the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihood of each possible genotype and selecting the most likely one. This produces a genotype call as well as various metrics that will be annotated in the output VCF if a variant call is emitted.

Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in GVCF modes (-ERC GVCF and -ERC BP_RESOLUTION) please see the reference confidence model documentation.


2. Preliminary assumptions / limitations

Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

Ploidy

Both the HaplotypeCaller and GenotypeGVCFs assume that the organism of study is diploid by default, but the desired ploidy can be set using the -ploidy argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation using a generalized form of the genotyping algorithm that can handle ploidies other than 2. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered. There are some arguments that aim to mitigate those limitations but they are not fully documented yet.

Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.


3. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype G:

$$ P(G) $$

represents how probable we expect this genotype to be based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value), but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod_{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype G.
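Putting the pieces together, the diploid P(D|G) computation reduces to a product over reads of the average of the two per-allele likelihoods. Here is a minimal Python sketch with illustrative numbers in place of the PairHMM-derived table:

    # Per-read likelihoods of each allele (from the marginalization step)
    read_allele_liks = [
        {"A": 0.9, "T": 0.1},   # read 1
        {"A": 0.2, "T": 0.8},   # read 2
    ]

    def data_likelihood(read_allele_liks, h1, h2):
        """Diploid P(D|G): product over reads of (P(Dj|H1)/2 + P(Dj|H2)/2)."""
        p = 1.0
        for liks in read_allele_liks:
            p *= liks[h1] / 2 + liks[h2] / 2
        return p

    print(data_likelihood(read_allele_liks, "A", "T"))  # het genotype A/T: 0.25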


4. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.
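Enumerating the candidate genotypes and picking the maximum could be sketched as follows, inlining the diploid data-likelihood formula from above and assuming a flat prior (so the prior term cancels out of the comparison):

    def call_genotype(read_allele_liks, alleles):
        """Enumerate diploid genotypes from the observed alleles, score each
        under a flat prior, and return the most likely one."""
        best, best_p = None, -1.0
        for i, a1 in enumerate(alleles):
            for a2 in alleles[i:]:
                p = 1.0
                for liks in read_allele_liks:
                    p *= liks[a1] / 2 + liks[a2] / 2   # diploid P(D|G)
                if p > best_p:
                    best, best_p = (a1, a2), p
        return best, best_p

    # e.g. alleles ["A", "T"] yields the candidates A/A, A/T and T/T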

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.

Allele Depth (AD) is lower than expected


The problem:

You're trying to evaluate the support for a particular call, but the numbers in the DP (total depth) and AD (allele depth) fields aren't making any sense. For example, the sum of all the ADs doesn't match up to the DP, or even more baffling, the AD for an allele that was called is zero!

For example, sometimes a VCF may contain a variant call that looks like this:

2 151214 . G A 673.77 . AN=2;DP=10;FS=0.000;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:10:38:702,0,38

You can see in the FORMAT field that the AD values are 0 for both alleles. However, in both the INFO and FORMAT fields, the DP is 10. Because the DP in the INFO field is unfiltered and the DP in the FORMAT field is filtered, you know none of the reads were filtered out by the engine's built-in read filters. And if you look at the "bamout", you see 10 reads covering the position! So why is the VCF reporting an AD value of 0?


The explanation: uninformative reads

This is not actually a bug -- the program is doing what we expect; this is an interpretation problem. The answer lies in uninformative reads.

We call a read “uninformative” when it passes the quality filters, but the likelihood of the most likely allele given the read is not significantly larger than the likelihood of the second most likely allele given the read. Specifically, the difference between the log-scaled (base 10) likelihoods must be greater than 0.2 to be considered significant. In other words, the most likely allele must be roughly 60% more likely than the second most likely allele.

Let’s walk through an example to make this clearer. Let’s say we have 2 reads and 2 possible alleles at a site. All of the reads have passed HaplotypeCaller’s quality filters, and the likelihoods of the alleles given the reads are in the table below.

Reads    Likelihood of A    Likelihood of T
1        3.8708e-7          3.6711e-7
2        4.9992e-7          2.8425e-7

Note: Keep in mind that HaplotypeCaller marginalizes the likelihoods of the haplotypes given the reads to get the likelihoods of the alleles given the reads. The table above shows the likelihoods of the alleles given the reads. For additional details, please see the HaplotypeCaller method documentation.

Now, let’s convert the likelihoods into log scale. To do this, we simply take the base-10 log of the likelihoods.

Reads    Log-scaled likelihood of A    Log-scaled likelihood of T
1        -6.4122                       -6.4352
2        -6.3011                       -6.5463

Now, we want to determine if read 1 is informative. To do this, we simply look at the log-scaled likelihoods of the most likely allele and the second most likely allele. The log-scaled likelihood of the most likely allele (A) is -6.4122, and that of the second most likely allele (T) is -6.4352. The difference between the two is 0.023. Because 0.023 is less than 0.2, read 1 is considered uninformative.

To determine if read 2 is informative, we take -6.3011 - (-6.5463). This gives us 0.2452, which is greater than 0.2. Read 2 is considered informative.

How does a difference of 0.2 mean the most likely allele is ~60% more likely than the second most likely allele? Because the likelihoods are log10-scaled, a difference of 0.2 corresponds to a factor of 10^0.2 ≈ 1.585, which is approximately 60% greater.
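Here is a small Python sketch that reproduces the informativeness check with the example's numbers; the 0.2 threshold is the one quoted above.

    import math

    reads = {
        "read1": {"A": 3.8708e-7, "T": 3.6711e-7},
        "read2": {"A": 4.9992e-7, "T": 2.8425e-7},
    }

    def is_informative(allele_liks, min_diff=0.2):
        """A read is informative when its best allele's log10 likelihood
        beats the runner-up by more than 0.2 (a factor of ~1.585)."""
        logs = sorted(math.log10(p) for p in allele_liks.values())
        return (logs[-1] - logs[-2]) > min_diff

    for name, liks in reads.items():
        print(name, is_informative(liks))  # read1 False, read2 True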


Conclusion

So, now that we know the math behind determining which reads are informative, let’s look at how this affects the record output to the VCF. If a read is considered informative, it gets counted toward the AD and DP of the variant allele in the output record. If a read is considered uninformative, it is counted towards the DP, but not the AD. That way, the AD value reflects how many reads actually contributed support for a given allele at the site. We would not want to include uninformative reads in the AD value because we don’t have confidence in them.

Please note, however, that although an uninformative read is not reported in the AD, it is still used in calculations for genotyping. In the future we may add an annotation to indicate counts of reads that were considered informative vs. uninformative. Let us know in the comments if you think that would be helpful.

In most cases, you will have enough coverage at a site to disregard small numbers of uninformative reads. Unfortunately, sometimes uninformative reads are the only reads you have at a site. In this case, we report the potential variant allele, but keep the AD values at 0. The uncertainty at the site will be reflected in the GQ and PL values.

Missing annotations in the output callset VCF


The problem

You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, GenotypeGVCFs and VariantAnnotator), but that annotation did not show up in your output VCF.

Keep in mind that all annotations that are necessary to run our Best Practices are annotated by default, so you should generally not need to request annotations unless you're doing something a bit special.

Why this happens & solutions

There can be several reasons why this happens, depending on the tool, the annotation, and your data. These are the four we see most often; if you encounter another that is not listed here, let us know in the comments.

1. You requested an annotation that cannot be calculated by the tool

For example, you're running Mutect2 but requested an annotation that is specific to HaplotypeCaller. There should be an error message to that effect in the output log. It's not possible to override this; but if you believe the annotation should be available to the tool, let us know in the forum and we'll consider putting in a feature request.

2. You requested an annotation that can only be calculated if an optional input is provided

For example, you're running HaplotypeCaller and you want InbreedingCoefficient, but you didn't specify a pedigree file. There should be an error message to that effect in the output log. The solution is simply to provide the missing input file. Another example: you're running VariantAnnotator and you want to annotate Coverage, but you didn't specify a BAM file. The tool needs to see the read data in order to calculate the annotation, so again, you simply need to provide the BAM file.

3. You requested an annotation that has requirements which are not met by some or all sites

For example, you're looking at RankSumTest annotations, which require heterozygous sites in order to perform the necessary calculations, but you're running on haploid data so you don't have any het sites. There is no workaround; the annotation is not applicable to your data. Another example: you requested InbreedingCoefficient, but your population includes fewer than 10 founder samples, which are required for the annotation calculation. There is no workaround; the annotation is not applicable to your data.

4. You requested an annotation that is already applied by default by the tool you are running

For example, you requested Coverage from HaplotypeCaller, which already annotates this by default. There is currently a bug that causes some default annotations to be dropped from the list if specified on the command line. This will be addressed in an upcoming version. For now the workaround is to check what annotations are applied by default and NOT request them with -A.

Expected variant at a specific site was not called


This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

This article may contain argument names that have not yet been updated for GATK4. Let us know if you run into any problems and we'll fix them.


Contents

  1. Generate the bamout and compare it to the input bam
  2. Check the base qualities of the non-reference bases
  3. Check the mapping qualities of the reads that support the non-reference allele(s)
  4. Check how many alternate alleles are present
  5. Check for systematic biases introduced by your sequencing technology
  6. Try fiddling with graph arguments (ADVANCED)

1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present; however, after reassembly, we find there is really a large deletion!

image


2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.


3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.


4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max-alternate-alleles argument (see the HaplotypeCaller documentation page to find out what is the default value for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.


5. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to show reference bias, which can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.


6. Try fiddling with graph arguments (ADVANCED)

This is highly experimental, but if all else fails, worth a shot (with HaplotypeCaller and Mutect2).

Fiddle with kmers

In some difficult sequence contexts (e.g. repeat regions), when some default-sized kmers are non-unique, cycles get generated in the graph. By default the program increases the kmer size automatically to try again, but after several attempts it will eventually quit trying and fail to call the expected variant (typically because the variant gets pruned out of the read-threading assembly graph, and is therefore never assembled into a candidate haplotype). We've seen cases where it's still possible to force a resolution using -allowNonUniqueKmersInRef and/or increasing the --kmer-size (or range of permitted sizes: 10, 25, 35 for example).

Note: While --allowNonUniqueKmersInRef allows missed calls to be made in repeat regions, it should not be used in all regions as it may increase false positives. We have plans to improve variant calling in repeat regions, but for now please try this flag if you notice calls being missed in repeat regions.

Fiddle with pruning

Decreasing the value of -minPruning and/or -minDanglingBranchLength (i.e. decreasing the amount of evidence necessary to keep a path in the graph) can recover variants, at the risk of taking on more false positives.

Best strategy to "fix" the Haplotype Caller - GenotypeGVCF "missing DP field" bug??


Hi,

I've run into the (already reported http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller ) bug of the missing DP format field in my callings.

I've run the following (relevant) commands:

Haplotype Caller -> Generate GVCF:

    java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
       -T HaplotypeCaller \
       -R ${ref} \
       -I ${NEWTMPDIR}/${prefix}.realigned.fixed.recal.bam \
       -L ${reg} \
       -ERC GVCF \
       -nct ${nct} \
       --genotyping_mode DISCOVERY \
       -stand_emit_conf 10 \
       -stand_call_conf 30  \
       -o ${prefix}.raw_variants.annotated.g.vcf \
       -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage

That generates GVCF files that DO HAVE the DP field for all reference positions, but DO NOT HAVE the DP format field for any called variant (but still keep the DP in the INFO field):

18      11255   .       T       <NON_REF>       .       .       END=11256       GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
18      11257   .       C       G,<NON_REF>     229.77  .       BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.377;ReadPosRankSum=0.489      GT:AD:GQ:PL:SB  0/1:10,8,0:99:258,0,308,288
18      11258   .       G       <NON_REF>       .       .       END=11260       GT:DP:GQ:MIN_DP:PL      0/0:17:48:16:0,48,530

Later, I ran Genotype GVCF joining all the samples with the following command:

java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R ${ref} \
   -L ${pos} \
   -o ${prefix}.raw_variants.annotated.vcf \
   --variant ${variant} [...]

This generated VCF files where the DP field is present in the FORMAT description and IS present in the homozygous REF samples, but IS MISSING in any heterozygous or homozygous-ALT samples.

22  17280388    .   T   C   18459.8 PASS    AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523    GT:AD:DP:GQ:PL  0/0:29,0:29:81:0,81,1118    0/1:20,22:.:99:688,0,682    1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664    0/1:11,17:.:99:532,0,330    0/1:14,14:.:99:431,0,458    0/0:28,0:28:81:0,81,1092    0/0:35,0:35:99:0,99,1326    0/1:14,20:.:99:631,0,453    0/1:13,16:.:99:511,0,423    0/1:38,29:.:99:845,0,1231   0/1:20,10:.:99:282,0,671    0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350    0/1:12,12:.:99:378,0,391    0/1:14,26:.:99:865,0,433    0/0:37,0:37:99:0,105,1406   0/0:44,0:44:99:0,120,1800   0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250    0/0:31,0:31:90:0,90,1350    0/1:15,25:.:99:827,0,462    0/0:35,0:35:99:0,99,1445    0/0:29,0:29:72:0,72,1089    1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718    1/1:0,40:.:99:1539,120,0    0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0    0/1:18,14:.:99:428,0,606    0/0:32,0:32:90:0,90,1158    0/1:24,22:.:99:652,0,816    0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501    0/0:31,0:31:90:0,90,1350    0/1:15,18:.:99:570,0,480    0/1:22,13:.:99:384,0,742    0/1:19,11:.:99:318,0,632    0/0:28,0:28:75:0,75,1125    0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108    0/1:16,16:.:99:479,0,493    0/1:14,22:.:99:745,0,439    0/0:31,0:31:90:0,90,1252
22  17280822    .   G   A   5491.56 PASS    AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695   GT:AD:DP:GQ:PL  0/0:27,0:27:72:0,72,1080    0/0:34,0:34:90:0,90,1350    0/1:15,16:.:99:528,0,491    0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453    0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:87:0,87,1305    0/0:40,0:40:99:0,108,1620   0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476    0/0:27,0:27:69:0,69,1035    0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215    0/0:36,0:36:99:0,99,1435    0/0:34,0:34:99:0,99,1299    0/0:35,0:35:99:0,102,1339   0/0:38,0:38:99:0,102,1520   0/0:36,0:36:99:0,99,1476    0/0:31,0:31:81:0,81,1215    0/0:31,0:31:75:0,75,1125    0/0:35,0:35:99:0,99,1485    0/0:37,0:37:99:0,99,1485    0/0:35,0:35:90:0,90,1350    0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474    0/0:32,0:32:90:0,90,1350    0/0:35,0:35:99:0,99,1467    0/1:27,36:.:99:1169,0,831   0/0:28,0:28:75:0,75,1125    0/0:36,0:36:81:0,81,1215    0/0:35,0:35:90:0,90,1350    0/0:28,0:28:72:0,72,1080    0/0:31,0:31:81:0,81,1215    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:84:0,84,1260    0/0:39,0:39:99:0,101,1575   0/0:37,0:37:96:0,96,1440    0/0:34,0:34:99:0,99,1269    0/0:30,0:30:81:0,81,1215    0/0:36,0:36:99:0,99,1485    0/1:17,17:.:99:567,0,530    0/0:26,0:26:72:0,72,1008    0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260    0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215    0/0:35,0:35:99:0,99,1485

I've just discovered this issue, and I need to run an analysis of the differential depth of coverage in different regions, and of whether there is a DP bias between called/not-called samples.

I have thousands of files and I've spent almost 1 year generating all these callings, so redoing the callings is not an option.

What would be the best/fastest strategy to either fix my final vcfs with the DP data present in all intermediate gvcf files (preferably) or, at least, extracting this data for all snps and samples?

Thanks in advance,

Txema

PS: Recalling the individual samples from bamfiles is not an option. Fixing the individual gvcfs and redoing the joint GenotypeGVCFs could be.


Can HaplotypeCaller be used on drug treated samples?


Hello, I am working on RNA-Seq data consisting of liver samples from donors. It is a case-control study where 12 samples are divided into Normal (control) and Rifampin Treated (case). I want to create a sample-specific VCF file. I was going through the documentation and got a bit confused between HaplotypeCaller and Mutect2. Which one should I use to get my VCF file?

In addition, is there a decent way to add gene name, symbol and other annotations to the INFO field of the VCF file?

Any help is much appreciated.

Regards,
Anurag

What are the differences between Mutect2 and HaplotypeCaller?


They share graph assembly and haplotype determination -- but the similarities end there

Operationally, Mutect2 works similarly to HaplotypeCaller in that they share the active region-based processing, assembly-based haplotype reconstruction and pairHMM alignment of reads to haplotypes. However, they use fundamentally different models for estimating variant likelihoods and genotypes. The HaplotypeCaller model uses ploidy in its genotype likelihood calculations. The Mutect2 model does not. We explain why this is so.

image


Germline caller versus Somatic caller

The main difference is that HaplotypeCaller is designed to call germline variants, while Mutect2 is designed to call somatic variants. Neither is appropriate for the other use case.

Germline variants are straightforward. They vary against the reference. Germline calling typically assumes a fixed ploidy and calling includes genotyping sites. HaplotypeCaller allows setting a different ploidy than diploid with the -ploidy argument. HaplotypeCaller can call germline variants on one or multiple samples and the tool can use evidence of variation across the samples to increase confidence in a variant call.

Somatic variants contrast between two samples against the reference. What do we mean by somatic? The Greek word soma refers to parts of an organism other than the reproductive cells. For example, our skin cells are soma-tic and accumulate mutations from sun exposure that presumably our seed or germ cells are protected from. In this example, variants in skin cells that are not variant in the blood cells are somatic.

Mutect2 works primarily by contrasting the presence or absence of evidence for variation between two samples, the tumor and matched normal, from the same individual. The tool can run on unmatched tumors but this produces high rates of false positives. Technically speaking, somatic variants are both (i) different from the control sample and (ii) different from the reference. What this means is that if a site is variant in the control but in the somatic sample reverts to the reference allele, then it is not a somatic variant.


Here are some more specific differences

  1. Mutect2 is incapable of calculating reference confidence, which is a feature in HaplotypeCaller that is key to producing GVCFs. As a result, there is currently no way to perform joint calling for somatic variant discovery.
  2. Because a somatic callset is based on a single individual rather than a cohort, annotations in the INFO column of a Mutect2 VCF only refer to the ALT alleles and do not include values for the REF allele. This differs from a germline cohort callset, in which annotations in the INFO field are typically derived from data related to all observed alleles including the reference.
  3. While HaplotypeCaller relies on a fixed ploidy assumption to calculate the genotype likelihoods that are the basis for genotype probabilities (PL), Mutect2 allows for varying ploidy in the form of allele fractions for each variant. Varying allele fractions are often seen within a tumor sample due to fractional purity, multiple subclones and copy number variation.
  4. Mutect2 also differs from HaplotypeCaller in that it can apply various prefilters to sites and alleles depending on the use of a matched normal, a panel of normals (PoN) and a common population variant resource containing allele-specific frequencies. If a PoN or matched normal is provided, Mutect2 can use either to filter sites before reassembly, and it can use a germline resource to filter alleles.
  5. The variant site annotations that HaplotypeCaller and Mutect2 apply by default are very different; see their respective tool documentation for details.
  6. Finally, Mutect2 has additional parameters not available to HaplotypeCaller. These parameters factor towards the decision to perform reassembly on a region, towards whether to emit a variant and towards whether to filter a site:
    • For one, the frequency of alleles not in the germline resource (--af-of-alleles-not-in-resource) defines the germline variant prior, which Mutect2 uses in likelihood calculations of a variant being germline.
    • Second, the log somatic prior (--log-somatic-prior) defines the somatic variant prior, which Mutect2 uses in likelihood calculations of a variant being somatic.
    • Third, the normal log odds ratio (--normal-lod) defines the filter threshold for variants in the tumor not being in the normal, i.e. the germline risk factor.
    • Fourth, the tumor log odds ratio for emission (--tumor-lod-to-emit) defines the cutoff for a tumor variant to appear in a callset.

Historical perspective explains some quirks of somatic calling

Somatic calling is NOT a simple subtraction of control variant alleles from case sample variant alleles. The reason for this stems from the original intent for somatic callsets.

  • Somatic calling was originally designed for cancer research--specifically, computational research that focuses on triangulating driver mutation loci in cancer cohorts. Analyses require callsets with high specificity. What this means is that researchers prefer to remove false positives even at the expense of losing some true positives.
  • Another consideration is patient privacy. Germline variants, in particular those in untranslated or noncoding regions of the genome, can identify individuals. To protect patient identities, somatic calling was designed to avoid passing on any identifying germline variation.

Somatic callers reflect these two preferences in their stringent filtering, either upfront such that a variant call is not emitted or downstream such that a site is annotated in the FILTER column with the filter name.

A somatic caller should detect low-fraction alleles, should make no explicit ploidy assumption, and omits genotyping in the traditional sense. Mutect2 adheres to all of these criteria. Several characteristics of cancer samples necessitate such caller features. For one, biopsied tumor samples are commonly contaminated with normal cells, and the normal fraction of a sample can be much higher than the tumor fraction. Second, a tumor can be heterogeneous in its mutations. Third, these mutations commonly include aneuploid events that change the copy number of a cell's genome in patchwork fashion.

A variant allele in the case sample is not called if the site is variant in controls. We explain an exception for GATK4 Mutect2 in a bit.

Historically, somatic callers have called somatic variants at the site level. That is, if a variant site in the case is also variant in the matched control or in a population resource, e.g. dbSNP, it is discounted from the somatic callset even when the case's variant allele differs from the control or resource allele. This practice stems in part from cancer study designs in which the control normal sample is sequenced at much lower depth than the case tumor sample. Because of the assumption that mutations strike randomly, cancer geneticists view mutations at sites of common germline variation with skepticism. Remember that in humans, common germline variant sites occur on average at roughly one in a thousand reference bases. So if a commonly variant site accrues additional mutations, we must weigh the chance that it arose from a true somatic event against the chance that it is something else that will likely not add value to downstream analyses. For most sites and typical analyses, the latter is the case: the variant is unlikely to have arisen from a somatic event and is more likely to be an artifact or a germline variant, e.g. from mapping error or cross-sample contamination.

GATK4 Mutect2 still applies this practice in part. The tool discounts variant sites that are shared with the panel of normals or that are unambiguously variant in the matched normal control. If the matched normal's variant allele is supported by only a few reads at low allele fraction, the tool accounts for the possibility that the site is not a germline variant.

When it comes to the population germline resource, GATK4 Mutect2 distinguishes between the variant alleles in the germline resource and those in the case sample. That is, Mutect2 will call a variant site somatic if the case allele differs from the allele in the germline resource. Blog#10911 explains this in more detail, including how Mutect2 factors germline variant allele frequencies into calling.

Somatic workflows filter case sites with multiple variant alleles. By a logic similar to that outlined above, and with the assumption that common variant sites are biallelic, any site that presents multiple variant alleles in the case sample is suspect. Mutect2 still calls such sites and the contrasting variant alleles; however, in the next step of the workflow, FilterMutectCalls flags such sites with the multiallelic filter. It is possible that a multiallelic site in the case sample represents a somatic event, but it is more likely that the site is a germline variant site or an artifact.
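For reference, a minimal FilterMutectCalls sketch with assumed file names; sites that fail a filter, including the multiallelic filter, are annotated accordingly in the FILTER column of the output:

gatk FilterMutectCalls \
    -V somatic.vcf.gz \
    -O somatic.filtered.vcf.gz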


  • Tutorial#2801 outlines how to call germline short variants with HaplotypeCaller.
  • Tutorial#11136 outlines the GATK4 somatic short variant discovery workflow.
  • For differences between GATK4 Mutect2 and GATK3 MuTect2, see Blog#10911.
  • HaplotypeCaller tool documentation is here.
  • GATK4 Mutect2 tool documentation is here.



VariantFiltration | HaplotypeCaller - ignoring variants close (5 bp) to the 3′ and 5′ read ends


Hi,

I am currently working with data from the HaloPlex Target Enrichment System. HaloPlex uses restriction enzymes to digest the DNA, thus producing non-random reads that often carry false mutations at the 3′ and 5′ ends caused by adapter remnants. The problem with adapter-remnant mutations has previously been handled using custom scripts, as described in Gréen et al. (https://doi.org/10.1016/j.jmoldx.2014.09.006): _First, the cleaned index-sorted paired-end reads were scanned for flanking HaloPlex adapter sequences, ie, 5′-AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3′ and 5′-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3′. However, the adapter 5′-recognition motif was restricted to 6 to 13 bp depending on the position of the adapter in the read. A perfect match was required in each case, which is simpler and faster compared with the procedure recommended by the HaloPlex development team. The minimal sequence for identification of an adapter at the 3′ end of a read was set to AGATCG. The adapter sequences were removed in the following way: i) five bases were removed from the 3′ end of all reads lacking identified adapter sequence (resulting in approximately 146-bp reads), ii) reads with adapter sequence within 50 bp of the 5′ end were discarded, and iii) reads with flanking adapter sequence in the 3′ end were trimmed by removal of the corresponding number of nucleotides._

My question: Is there an option in VariantFiltration or HaplotypeCaller that can mask/ignore variants detected within X (e.g. 5) bp of the 3′ and 5′ ends of the reads?

Thank you!
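One possible approach, sketched here under the assumption of GATK3 tools and 151 bp reads (adjust the cycle ranges to your read length), is to soft-clip a fixed number of cycles at each read end with ClipReads and then tell HaplotypeCaller to ignore soft-clipped bases:

# Soft-clip the first and last 5 cycles of each read
java -jar GenomeAnalysisTK.jar -T ClipReads \
    -R reference.fasta \
    -I input.bam \
    -o clipped.bam \
    -CT "1-5,147-151" \
    -CR SOFTCLIP_BASES

# Keep HaplotypeCaller from rescuing the clipped bases
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R reference.fasta \
    -I clipped.bam \
    -o output.vcf \
    --dontUseSoftClippedBases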

Differences between GATK 4.beta.5 vs 4.0.0.0 HaplotypeCaller results


Hi!
I'd like to perform short germline variant calling on human DNA-seq samples (separate analyses of a WES cohort and a PCR-free WGS cohort, both paired-end). The plan is to follow the GATK Best Practices for short variant discovery with joint genotyping, starting with single-sample gVCF creation via HaplotypeCaller.

I have analyzed some samples with HaplotypeCaller 4.beta.5, and was unsure whether there had been any fixes between 4.beta.5 and 4.0.0.0 that would necessitate re-running the samples.

To check, I ran the 4.beta.5 and 4.0.0.0 HaplotypeCaller on chr21 of 1000 Genomes sample NA11992.

I ran the same command, on the same machine, in different conda environments:

4.beta.5

gatk4                     4.0b5                    py27_0    bioconda
picard                    2.16.0                   py27_0    bioconda
setuptools                38.2.4                   py27_0    conda-forge
wheel                     0.30.0                     py_1    conda-forge

4.0.0.0

gatk4                     4.0.0.0                  py27_0    bioconda
picard                    2.17.2                   py27_0    bioconda
setuptools                38.4.0                   py27_0    conda-forge
wheel                     0.30.0                   py27_2    conda-forge
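For reproducibility, environments like these can be created along the following lines (the environment names are arbitrary; the version pins come from the listings above):

conda create -n gatk4beta5 -c bioconda gatk4=4.0b5
conda create -n gatk400 -c bioconda gatk4=4.0.0.0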

Java

$ java -version
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (Zulu 8.20.0.5-linux64) (build 1.8.0_121-b15)
OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-linux64) (build 25.121-b15, mixed mode)

GATK command

gatk-launch HaplotypeCaller -R $refs/GRCh37.71.nochr.fa -I $data/NA11992.mapped.ILLUMINA.bwa.CEU.exome.20130415.bam -O ${version}.NA11992.21.default.vcf.gz -ERC GVCF -L 21
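A sketch of the comparison setup, assuming the bgzipped outputs were first decompressed (the file names follow the -O pattern above):

zcat 4.beta.5.NA11992.21.default.vcf.gz > 4.beta.5.NA11992.21.default.vcf
zcat 4.0.0.0.NA11992.21.default.vcf.gz > 4.0.0.0.NA11992.21.default.vcf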

Results showed 9 differences using diff (10 including the different ##GATKCommandLine lines in the gVCF headers). Sometimes PL or SB fields change, sometimes non-variant blocks are subdivided differently, and sometimes indel calls change. Three differences are below:

$diff 4.0.0.0.NA11992.21.default.vcf 4.beta.5.NA11992.21.default.vcf
...
36482c36480
< 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PGT:PID:PL:SB       0/1:2,3,0:5:66:0|1:19701769_AT_A:71,0,66,77,75,152:0,2,1,2
---
> 21    19701769        .       AT      A,<NON_REF>     33.73   .       BaseQRankSum=-0.253;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.842;RAW_MQ=19369.00;ReadPosRankSum=0.524 GT:AD:DP:GQ:PL:SB       0/1:2,3,0:5:66:71,0,66,77,76,152:0,2,1,2

36485,36486c36483,36485
< 21    19701776        .       T       <NON_REF>       .       .       END=19701778    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
< 21    19701779        .       TG      T,<NON_REF>     19.78   .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PGT:PID:PL:SB       0/1:4,2,0:6:57:1|0:19701769_AT_A:57,0,146,69,152,221:1,3,0,2
---
> 21    19701776        .       T       <NON_REF>       .       .       END=19701777    GT:DP:GQ:MIN_DP:PL      0/0:6:12:6:0,12,180
> 21    19701778        .       TTG     T,<NON_REF>     0       .       DP=6;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=19369.00 GT:AD:DP:GQ:PL:SB       0/0:6,0,0:6:18:0,18,203,18,203,203:1,5,0,0
> 21    19701779        .       TG      T,<NON_REF>     3.96    .       BaseQRankSum=0.431;ClippingRankSum=0.000;DP=6;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-0.967;RAW_MQ=19369.00;ReadPosRankSum=1.282 GT:AD:DP:GQ:PL:SB       0/1:4,2,0:6:39:39,0,146,51,152,204:1,3,0,2

54434c54433,54435
< 21    26959938        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:8:0,24,260
---
> 21    26959938        .       A       <NON_REF>       .       .       END=26959943    GT:DP:GQ:MIN_DP:PL      0/0:8:24:8:0,24,260
> 21    26959944        .       C       <NON_REF>       .       .       END=26959944    GT:DP:GQ:MIN_DP:PL      0/0:9:27:9:0,27,275
> 21    26959945        .       A       <NON_REF>       .       .       END=26959949    GT:DP:GQ:MIN_DP:PL      0/0:9:24:9:0,24,360
...

Overall, can I use 4.beta.5 HaplotypeCaller results in downstream analysis, or should results be re-analyzed with 4.0.0.0 HaplotypeCaller? This analysis showed relatively few differences, but I'm still unsure about 4.beta.5 HaplotypeCaller.

A logical problem with SplitCommonSuffices and MergeCommonSuffices


@Sheila @valentin @depristo
For example:
A+x -> y (A+x and y are single vertices)
B+x -> y
After SplitCommonSuffices:
A -> x -> y (A, B, x, and y are separate vertices)
B -> x -> y
After MergeCommonSuffices:
A -> x => A+x
B -> x => B+x
Then SplitCommonSuffices applies again after MergeCommonSuffices, and so on: the two transformations can keep undoing each other. Of course, the SeqVertex IDs produced by SplitCommonSuffices are sometimes larger than before, which can eventually break the cycle, but that seems to happen by accident rather than by design.
