Channel: haplotypecaller — GATK-Forum

When to apply the assembly-region-padding step


Hi, I find that there are multiple steps in determining active regions in HaplotypeCaller, so I wonder when assembly-region-padding is applied: during the steps that determine active regions, or after the active regions have been determined?


HaplotypeCaller pooled sequence problem


Hi,

I have a number of samples that each consist of multiple individuals from the same population pooled together, and I have been trying to use HaplotypeCaller to call the variants. I have set the ploidy to 2 * (number of individuals), but I keep getting the same or similar error message after running for several hours or days:

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: the combination of ploidy (180) and number of alleles (9) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus
ERROR ------------------------------------------------------------------------------------------

and I'm not sure what I can do to rectify it... Obviously I can't limit the ploidy (it is what it is), and I thought that HC only allowed a maximum of six alleles anyway?

My code is below, and any help would be appreciated.

java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller \
-nct 6 \
-R ~/my_ref_sequence \
--intervals ~/my_intervals_file \
-ploidy 180 \
-log my_log_file \
-I ~/my_input_bam \
-o ~/my_output_vcf
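The error itself is just combinatorics: the number of distinct unordered genotypes for ploidy P and A alleles is the multiset coefficient C(P + A - 1, A - 1), which at ploidy 180 with 9 alleles overflows a signed 32-bit integer. A minimal sketch of the arithmetic (`num_genotypes` is an illustrative helper, not a GATK function):

```python
from math import comb

def num_genotypes(ploidy, num_alleles):
    """Number of distinct unordered genotypes for a given ploidy and
    allele count: the multiset coefficient C(P + A - 1, A - 1)."""
    return comb(ploidy + num_alleles - 1, num_alleles - 1)

# The combination from the error message overflows a signed 32-bit int:
print(num_genotypes(180, 9) > 2**31 - 1)   # True
# A diploid site with one ALT allele has the familiar 3 genotypes:
print(num_genotypes(2, 2))                 # 3
```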

How to run GATK directly on SRA files


Hello, I recently saw a webinar by NCBI, "Advanced Workshop on SRA and dbGaP Data Analysis" (ftp://ftp.ncbi.nlm.nih.gov/pub/education/public_webinars/2016/03Mar23_Advanced_Workshop/). They mentioned that they were able to run GATK directly on SRA files.

I downloaded the GenomeAnalysisTK-3.5 jar file to my computer and tried both of these commands:

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R SRRFileName -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T SRRFileName -R SRR1718738 -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

For both these commands, I got this error:
ERROR MESSAGE: Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension and lists of BAM/CRAM files with the .list extension, but the file SRR1718738 has neither extension. Please ensure that your BAM/CRAM file or list of BAM/CRAM files is in the correct format, update the extension, and try again.

I don't see any documentation here about this, so I wanted to check whether you or anyone else has had any experience with it.

Thanks
K

Haplotypecaller calls variants at a deletion region


Hi,
I'm having a confusing problem when using HaplotypeCaller.

Basically, I'm using HaplotypeCaller to call variants across more than 400 M. tuberculosis samples sequenced on the HiSeq 2500 platform. I followed the workflow for calling variants on cohorts of samples as described here: https://gatkforums.broadinstitute.org/gatk/discussion/3893/calling-variants-on-cohorts-of-samples-using-the-haplotypecaller-in-gvcf-mode

I found a problem with some samples when checking the SNPs called by this procedure. For example, in Sample1 (see figure; not reproduced here) there appears to be a deletion at position 2866805. However, GATK 3.8 called a SNP at this position, as shown in this excerpt from the VCF file:

NC_000962.3 2866805 . C G 8160 . AC=1;AF=1.00;AN=1;DP=182;FS=0.000;GQ_MEAN=8190.00;MLEAC=1;MLEAF=1.00;MQ=50.38;MQ0=0;NCC=0;QD=31.09;SOR=0.917 GT:AD:GQ:PL 1:0,176:99:8190,0

In total, HaplotypeCaller called 11 SNPs in this deletion region.

So I'm confused: why did HaplotypeCaller call a SNP when the BAM file shows a deletion? I would really appreciate it if you could help me figure this out. Thank you in advance!

P.S. After finding this problem, we also tried UnifiedGenotyper on Sample1, and this time the variants in the deletion region were not called.

VCF - Variant Call Format


This document describes "regular" VCF files produced for GERMLINE short variant (SNP and indel) calls (e.g. by HaplotypeCaller in "normal" mode and by GenotypeGVCFs). For information on the special kind of VCF called GVCF produced by HaplotypeCaller in -ERC GVCF mode, please see the GVCF entry. For information specific to SOMATIC calls, see the Mutect2 documentation.


Contents

  1. Overview
  2. Structure of a VCF file
  3. Interpreting the header information
  4. Structure of variant call records
  5. Interpreting genotype and other sample-level information
  6. Basic operations: validating, subsetting and exporting from a VCF
  7. Merging VCF files

1. Overview

VCF stands for Variant Call Format. It is a standardized text file format for representing SNP, indel, and structural variation calls. The VCF specification used to be maintained by the 1000 Genomes Project, but its management and further development has been taken over by the Genomic Data Toolkit team of the Global Alliance for Genomics and Health. The full format spec can be found in the Samtools/Hts-specs repository along with other useful specifications like SAM/BAM/CRAM. We highly encourage you to take a look at those documents, as they contain a lot of useful information that we don't go over in this document.

VCF is the primary (and only well-supported) format used by the GATK for variant calls. We prefer it above all others because while it can be a bit verbose, the VCF format is very explicit about the exact type and sequence of variation as well as the genotypes of multiple samples for this variation.

That being said, this highly detailed information can be challenging to understand. The information provided by the GATK tools that infer variation from high-throughput sequencing data, such as the HaplotypeCaller, is especially complex. This document describes the key features and annotations that you need to know about in order to understand VCF files output by the GATK tools.

Note that VCF files are plain text files, so you can open them for viewing or editing in any text editor, with the following caveats:

  • Some VCF files are very large, so your personal computer may struggle to load the whole file into memory. In such cases, you may need to use a different approach, such as using UNIX tools to access the part of the dataset that is relevant to you, or subsetting the data using tools like GATK's SelectVariants.

  • NEVER EDIT A VCF IN A WORD PROCESSOR SUCH AS MICROSOFT WORD BECAUSE IT WILL SCREW UP THE FORMAT! You have been warned :)

  • Don't write home-brewed VCF parsing scripts. It never ends well.


2. Structure of a VCF file

A valid VCF file is composed of two main parts: the header, and the variant call records.


The header contains information about the dataset and relevant reference sources (e.g. the organism, genome build version etc.), as well as definitions of all the annotations used to qualify and quantify the properties of the variant calls contained in the VCF file. The header of VCFs generated by GATK tools also include the command line that was used to generate them. Some other programs also record the command line in the VCF header, but not all do so as it is not required by the VCF specification. For more information about the header, see the next section.

The actual data lines will look something like this:

[HEADER LINES]
#CHROM  POS ID      REF ALT QUAL    FILTER  INFO          FORMAT          NA12878
20  10001019    .   T   G   364.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=0.699;ClippingRankSum=0.00;DP=34;ExcessHet=3.0103;FS=3.064;MLEAC=1;MLEAF=0.500;MQ=42.48;MQRankSum=-3.219e+00;QD=11.05;ReadPosRankSum=-6.450e-01;SOR=0.537   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480
20  10001298    .   T   A   884.77  .   AC=2;AF=1.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=29.49;SOR=1.765    GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   AC=2;AF=1.00;AN=2;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=0.836    GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0
20  10001474    .   C   T   843.77  .   AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=1.302    GT:AD:DP:GQ:PL  1/1:0,27:27:81:872,81,0
20  10001617    .   C   A   493.77  .   AC=1;AF=0.500;AN=2;BaseQRankSum=1.63;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=1.323;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=12.99;ReadPosRankSum=0.170;SOR=1.179   GT:AD:DP:GQ:PL  0/1:19,19:38:99:522,0,480

After the header lines and the field names, each line represents a single variant, with various properties of that variant represented in the columns. Note that all the lines shown in the example above describe SNPs and indels, but other variation types could be described (see the VCF specification for details). Depending on how the callset was generated, there may only be records for sites where a variant was identified, or there may also be "invariant" records, ie records for sites where no variation was identified.

You will sometimes come across VCFs that have only 8 columns, and contain no FORMAT or sample-specific information. These are called "sites-only" VCFs, and represent variation that has been observed in a population. Generally, information about the population of origin should be included in the header.
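Just to make the column layout concrete -- and strictly for illustration, since (as noted above) you should not roll your own parser for real work -- here is a minimal Python sketch that splits one tab-delimited data line into the fixed columns plus per-sample entries (`parse_record` is a hypothetical helper name):

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
                     "QUAL", "FILTER", "INFO"]

def parse_record(line, sample_names):
    """Split one tab-delimited VCF data line into a dict of the fixed
    columns plus the raw per-sample FORMAT values."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    if len(fields) > 8:  # sites-only VCFs stop at the INFO column
        record["FORMAT"] = fields[8]
        record["samples"] = dict(zip(sample_names, fields[9:]))
    return record

rec = parse_record(
    "20\t10001298\t.\tT\tA\t884.77\t.\tAC=2;AF=1.00\tGT:AD\t1/1:0,30",
    ["NA12878"])
print(rec["POS"], rec["samples"]["NA12878"])  # 10001298 1/1:0,30
```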


3. Interpreting the header information

The following is a valid VCF header produced by GenotypeGVCFs on an example data set (derived from our favorite test sample, NA12878). You can download similar test data from our resource bundle and try looking at it yourself.

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine.HaplotypeCaller=<ID=HaplotypeCaller,Version=3.7-0-gcfedb67,Date="Fri Jan 20 11:14:15 EST 2017",Epoch=1484928855435,CommandLineOptions="[command-line goes here]">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="[command-line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 5:45:56 PM EST">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=20,length=63025520>
##reference=file:///data/ref/ref.fasta
##source=GenotypeGVCFs

That's a lot of lines, so let's break it down into digestible bits. Note that the header lines are always listed in alphabetical order.

VCF spec version

The first line:

##fileformat=VCFv4.2

tells you the version of the VCF specification to which the file conforms. This may seem uninteresting but it can have some important consequences for how to handle and interpret the file contents. As genomics is a fast moving field, the file formats are evolving fairly rapidly, so some of the encoding conventions change. If you run into unexpected issues while trying to parse a VCF file, be sure to check the version and the spec for any relevant format changes.

FILTER lines

The FILTER lines tell you what filters have been applied to the data. In our test file, one filter has been applied:

##FILTER=<ID=LowQual,Description="Low quality">

Records that fail any of the filters listed here will contain the ID of the filter (here, LowQual) in their FILTER field (see how records are structured further below).

FORMAT and INFO lines

These lines define the annotations contained in the FORMAT and INFO columns of the VCF file, which we explain further below. If you ever need to know what an annotation stands for, you can always check the VCF header for a brief explanation (at least if you're using a civilized program that writes definition lines to the header).

GATKCommandLine

The GATKCommandLine lines record all the parameters used by the tool that generated the file. Here, GATKCommandLine.HaplotypeCaller refers to a command line invoking HaplotypeCaller. These parameters include every argument the tool accepts, along with the value that was applied (if you don't pass one, the default is applied), so it's not just the arguments specified explicitly by the user on the command line.

Contig lines and Reference

These contain the contig names, lengths, and which reference assembly was used with the input BAM file. This can come in handy when someone gives you a callset but doesn't tell you which reference it was derived from -- remember that for many organisms, there are multiple reference assemblies, and you should always make sure to use the appropriate one!

For more information on genome references, see the corresponding Dictionary entry.


4. Structure of variant call records

For each site record, the information is structured into columns (also called fields) as follows:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

The first 8 columns of the VCF records (up to and including INFO) represent the properties observed at the level of the variant (or invariant) site. Keep in mind that when multiple samples are represented in a VCF file, some of the site-level annotations represent a summary or average of the values obtained for that site from the different samples.

Sample-specific information such as genotype and individual sample-level annotation values are contained in the FORMAT column (9th column) and in the sample-name columns (10th and beyond). In the example above, there is one sample called NA12878; if there were additional samples there would be additional columns to the right. Most programs order the sample columns alphabetically by sample name, but this is not always the case, so be aware that you can't depend on ordering rules for parsing VCF output!

Site-level properties and annotations

These first 7 fields are required by the VCF format and must be present, although they can be empty (in practice, there has to be a dot, ie . to serve as a placeholder).

CHROM and POS

The contig and genomic coordinates on which the variant occurs. Note that for deletions the position given is actually the base preceding the event.

ID

An optional identifier for the variant. Based on the contig and position of the call and whether a record exists at this site in a reference database such as dbSNP. A typical identifier is the dbSNP ID, which in human data would look like rs28548431, for example.

REF and ALT

The reference allele and alternative allele(s) observed in a sample, set of samples, or a population in general (depending on how the VCF was generated). The REF and ALT alleles are the only required elements of a VCF record that tell us whether the variant is a SNP or an indel (or, in complex cases, a mixed-type variant). If we look at the following three sites, we see that the first is a SNP, the second is an insertion and the third is a deletion:

20  10001298    .   T   A   884.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0
20  10004769    .   TAAAACTATGC T   622.73  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,17:35:99:660,0,704

Note that REF and ALT are always given on the forward strand. For insertions, the ALT allele includes the inserted sequence as well as the base preceding the insertion so you know where the insertion is compared to the reference sequence. For deletions, the ALT allele is the base before the deletion.

QUAL

The Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. Because the Phred scale is -10 * log10(1-p), a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance of error (see the Dictionary entry). These values can grow very large when a large amount of data is used for variant calling, so QUAL is not often a very useful property for evaluating the quality of a variant call. See our documentation on filtering variants for more information on this topic.

Not to be confused with the sample-level annotation GQ; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
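As a quick sanity check of that arithmetic, the Phred-to-probability conversion takes only a couple of lines of Python (`phred_to_error_prob` is an illustrative name):

```python
def phred_to_error_prob(qual):
    """Convert a Phred-scaled quality into the probability the call is wrong."""
    return 10 ** (-qual / 10)

print(phred_to_error_prob(10))   # 0.1  (1 in 10 chance of error)
print(phred_to_error_prob(100))  # 1e-10
```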

FILTER

This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See our documentation on filtering variants for more information on this topic.

INFO

Various site-level annotations. This field is not required to be present in the VCF.

The annotations contained in the INFO field are represented as tag-value pairs, where the tag and value are separated by an equal sign, ie =, and pairs are separated by semicolons, ie ;, as in this example: MQ=99.00;MQ0=0;QD=17.94. They typically summarize context information from the samples, but can also include information from other sources (e.g. population frequencies from a database resource). Some are annotated by default by the GATK tools that produce the callset, and some can be added on request. They are always defined in the VCF header, so that's an easy way to check what an annotation means if you don't recognize it. You can also find additional information on how they are calculated and how they should be interpreted in the "Annotations" section of the Tool Documentation.
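For illustration only (again, prefer a real VCF library for production work), splitting an INFO string into tag-value pairs might look like this; note that Flag-type annotations such as DS carry no value at all:

```python
def parse_info(info):
    """Parse the semicolon-separated INFO field into a dict.
    Flag-type annotations (no '=') map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

print(parse_info("MQ=99.00;MQ0=0;QD=17.94;DS"))
# {'MQ': '99.00', 'MQ0': '0', 'QD': '17.94', 'DS': True}
```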

Sample-level annotations

At this point you've met all the fields up to INFO in this lineup:

#CHROM  POS ID  REF ALT     QUAL    FILTER  INFO    FORMAT  NA12878 [other samples...]

All the rest is going to be sample-level information. Sample-level annotations are tag-value pairs, like the INFO annotations, but the formatting is a bit different. The short names of the sample-level annotations are recorded in the FORMAT field. The annotation values are then recorded in corresponding order in each sample column (where the sample names are the SM tags identified in the read group data). Typically, you will at minimum have information about the genotype and confidence in the genotype for the sample at each site. See the next section on genotypes for more details.
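The pairing of FORMAT keys with a sample column is a simple positional zip; a minimal sketch (illustrative helper name):

```python
def parse_sample(format_field, sample_field):
    """Pair the colon-separated FORMAT keys with the corresponding
    colon-separated values from one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

print(parse_sample("GT:AD:DP:GQ:PL", "0/1:18,15:33:99:393,0,480"))
# {'GT': '0/1', 'AD': '18,15', 'DP': '33', 'GQ': '99', 'PL': '393,0,480'}
```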


5. Interpreting genotype and other sample-level information

The sample-level information contained in the VCF (also called "genotype fields") may look a bit complicated at first glance, but they're actually not that hard to interpret once you understand that they're just sets of tags and values.

Let's take a look at three of the records shown earlier, simplified to just show the key genotype annotations:

20  10001019    .   T   G   364.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480
20  10001298    .   T   A   884.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,30:30:89:913,89,0
20  10001436    .   A   AAGGCT  1222.73 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/1:0,28:28:84:1260,84,0

Looking at that last column, here is what the tags mean:

GT

The genotype of this sample at this site. For a diploid organism, the GT field indicates the two alleles carried by the sample, encoded by a 0 for the REF allele, 1 for the first ALT allele, 2 for the second ALT allele, etc. When there's a single ALT allele (by far the more common case), GT will be either:

- 0/0 : the sample is homozygous reference
- 0/1 : the sample is heterozygous, carrying 1 copy of each of the REF and ALT alleles
- 1/1 : the sample is homozygous alternate

In the three sites shown in the example above, NA12878 is observed with the allele combinations T/G, A/A and AAGGCT/AAGGCT respectively. For non-diploids, the same pattern applies; in the haploid case there will be just a single value in GT (e.g. 1); for polyploids there will be more, e.g. 4 values for a tetraploid organism (e.g. 0/0/1/1).
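The index-to-allele translation can be sketched in a few lines; this toy `decode_gt` (an illustrative name) handles any ploidy and both phased (|) and unphased (/) separators, though not missing calls (.):

```python
def decode_gt(gt, ref, alts):
    """Translate GT allele indices (0 = REF, 1+ = ALT) into allele strings.
    Works for any ploidy; '/' is unphased, '|' is phased."""
    alleles = [ref] + alts
    sep = "|" if "|" in gt else "/"
    return [alleles[int(i)] for i in gt.split(sep)]

print(decode_gt("0/1", "T", ["G"]))            # ['T', 'G']
print(decode_gt("1/1", "A", ["AAGGCT"]))       # ['AAGGCT', 'AAGGCT']
print(decode_gt("1/2", "A", ["G", "AGGGAGG"])) # ['G', 'AGGGAGG']
```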

AD and DP

Allele depth (AD) and depth of coverage (DP). These are complementary fields that represent two important ways of thinking about the depth of the data for this sample at this site.

AD is the unfiltered allele depth, i.e. the number of reads that support each of the reported alleles. All reads at the position (including reads that did not pass the variant caller’s filters) are included in this number, except reads that were considered uninformative. Reads are considered uninformative when they do not provide enough statistical evidence to support one allele over another.

DP is the filtered depth, at the sample level. This gives you the total number of reads at the site that passed the variant caller's internal quality filters; you can check the variant caller's documentation to see which filters are applied by default. However, unlike the AD calculation, uninformative reads are included in DP.

See the Tool Documentation on AD (DepthPerAlleleBySample) and DP (Coverage) for more details.

PL

"Normalized" Phred-scaled likelihoods of the possible genotypes. For the typical case of a biallelic site (where there is only one ALT allele) in a diploid organism, the PL field will contain three numbers, corresponding to the three possible genotypes (0/0, 0/1, and 1/1). The PL values are "normalized" so that the PL of the most likely genotype (the one assigned in the GT field) is 0 on the Phred scale. We use "normalized" in quotes because these are not probabilities; the most likely genotype's PL is set to 0 purely for ease of reading, and the other values are scaled relative to it.

Keep in mind, if you're not familiar with the statistical lingo, that when we say PL is the "Phred-scaled likelihood of the genotype", we mean it is "How much less likely that genotype is compared to the best one". Have a look at this article for an example of how PL is calculated.

GQ

The Genotype Quality represents the Phred-scaled confidence that the genotype assignment (GT) is correct, derived from the genotype PLs. Specifically, the GQ is the difference between the PL of the second most likely genotype, and the PL of the most likely genotype. As noted above, the values of the PLs are normalized so that the most likely PL is always 0, so the GQ ends up being equal to the second smallest PL, unless that PL is greater than 99. In GATK, the value of GQ is capped at 99 because larger values are not more informative, but they take more space in the file. So if the second most likely PL is greater than 99, we still assign a GQ of 99.

Basically the GQ gives you the difference between the likelihoods of the two most likely genotypes. If it is low, you can tell there is not much confidence in the genotype, i.e. there was not enough evidence to confidently choose one genotype over another. See the FAQ article on the Phred scale to get a sense of what would be considered low.

Not to be confused with the site-level annotation QUAL; see this FAQ article for an explanation of the differences in what they mean and how they should be used.
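The normalization and capping described above can be sketched as follows (an illustrative helper; real PLs come from the caller's likelihood model, not from this arithmetic):

```python
def normalize_pls_and_gq(raw_pls, cap=99):
    """Shift Phred-scaled likelihoods so the best genotype's PL is 0,
    then take GQ as the second-smallest normalized PL, capped at 99."""
    best = min(raw_pls)
    pls = [pl - best for pl in raw_pls]
    gq = min(sorted(pls)[1], cap)
    return pls, gq

# PLs from the 0/1:18,15:33:99:393,0,480 record shown earlier:
print(normalize_pls_and_gq([393, 0, 480]))  # ([393, 0, 480], 99)
# A low-confidence call:
print(normalize_pls_and_gq([73, 0, 20]))    # ([73, 0, 20], 20)
```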

A few examples

With all the definitions out of the way, let's interpret the genotype information for a few records from our NA12878 callset, starting with at position 10001019 on chromosome 20:

20  10001019    .   T   G   364.77  .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:18,15:33:99:393,0,480

At this site, the called genotype is GT = 0/1, which corresponds to a heterozygous genotype with alleles T/G. The confidence indicated by GQ = 99 is very good; there were a total of 33 informative reads at this site (DP=33), 18 of which supported the REF allele (=had the reference base) and 15 of which supported the ALT allele (=had the alternate base) (indicated by AD=18,15). The degree of certainty in our genotype is evident in the PL field, where PL(0/1) = 0 (the normalized value that corresponds to a likelihood of 1.0) as is always the case for the assigned allele; the next PL is PL(0/0) = 393, corresponding to 10^(-39.3), or 5.0118723e-40 which is a very small number indeed; and the next one will be even smaller. The GQ ends up being 99 because of the capping as explained above.

Now let's look at a site where our confidence is quite a bit lower:

20  10024300    .   C   CTT 43.52   .   [CLIPPED]   GT:AD:DP:GQ:PL  0/1:1,4:6:20:73,0,20

Here we have an indel -- specifically an insertion of TT after the reference C base at position 10024300. The called genotype is GT = 0/1 again, but this time the GQ = 20 indicates that even though this is probably a real variant (the QUAL is not too bad), we're not sure we have the right genotype. Looking at the coverage annotations, we see we only had 6 reads there, of which 1 supported REF and 4 supported ALT (and one read must have been considered uninformative, possibly due to quality issues). With so little coverage, we can't be sure that the genotype shouldn't in fact be homozygous variant.

Finally, let's look at a more complicated example:

20  10009875    .   A   G,AGGGAGG   1128.77 .   [CLIPPED]   GT:AD:DP:GQ:PL  1/2:0,11,5:16:99:1157,230,161,487,0,434

This site is a doozy; two credible ALT alleles were observed, but the REF allele was not -- so technically this is a biallelic site in our sample, but will be considered multiallelic because there are more than two alleles notated in the record. It's also a mixed-type record, since one of the ALTs by itself would make it an A->G SNP, and the other would make it an insertion of GGGAGG after the reference A. The called genotype is GT = 1/2, which means it's a heterozygous genotype composed of two different ALT alleles. The coverage wasn't great, and wasn't all that balanced between the two ALTs (since one was supported by 11 reads and the other by 5) but it was sufficient for the program to have high confidence in its call.


6. Basic operations: validating, subsetting and exporting from a VCF

These are a few common things you may want to do with your VCFs that don't deserve their own tutorial. Let us know if there are other operations you think we should cover here.

Validate your VCF

By that I mean check that the format of the file is correct, follows the specification, and will therefore not break any well-behaved tool you choose to run on it. You can do this very simply with ValidateVariants. Note that ValidateVariants can also be used on GVCFs if you use the --gvcf argument.

Subset records from your VCF

Sometimes you want to subset just one or a few samples from a big cohort. Sometimes you want to subset to just a genomic region. Sometimes you want to do both at the same time! Well, the same tool can do both, and more; it's called SelectVariants and has a lot of options for doing this and that (including operating over intervals in the usual way). There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions. Other VCF files can also be used to modify the selection based on concordance or discordance between different callsets (see the --discordance / --concordance arguments in the Tool Doc).

Important notes about subsetting operations

  • In the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset.

  • By default, SelectVariants will keep all ALT alleles, even if they are no longer supported by any samples after subsetting. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. In some cases this will produce monomorphic records, i.e. where no ALT alleles are supported. The tool accepts flags that exclude unsupported alleles and/or monomorphic records from the output.
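The recalculation of AN/AC/AF after subsetting amounts to recounting alleles over the remaining genotypes; a minimal sketch (illustrative helper, ignoring phasing and treating . as uncalled):

```python
def recompute_ac_an_af(genotypes, num_alts):
    """Recount AN (called alleles), AC and AF per ALT allele
    from a subset's GT strings; '.' alleles are uncalled."""
    an = 0
    ac = [0] * num_alts
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            an += 1
            if int(allele) > 0:
                ac[int(allele) - 1] += 1
    af = [round(c / an, 3) if an else 0.0 for c in ac]
    return an, ac, af

# Keep two samples from a larger cohort:
print(recompute_ac_an_af(["0/1", "0/0"], num_alts=1))  # (4, [1], [0.25])
```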

Extract information from a VCF in a sane, (mostly) straightforward way

Use VariantsToTable.

No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.

Seriously. The VCF format lends itself really poorly to parsing methods like regular expressions, and we hear sob stories all the time from perfectly competent people whose home-brewed parser broke because it couldn't handle a more esoteric feature of the format. We know we broke a bunch of people's scripts when we introduced a new representation for spanning deletions in multisample callsets. OK, we ended up replacing it with a better representation a month later that was a lot less disruptive and more in line with the spirit of the specification -- but the point is, that first version was technically legal according to the 4.2 spec, and that sort of thing can happen at any time. So yes, the VCF is a difficult format to work with, and one way to deal with that safely is to not home-brew parsers.

(Why are we sticking with it anyway? Because, as Winston Churchill famously put it, VCF is the worst variant call representation, except for all the others.)


7. Merging VCF files

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

  1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or GVCF files) per-chromosome. For that case, you can use the Picard tool MergeVcfs to merge the files. See the relevant Tool Doc page for usage details.

  2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION mode to call variants on a large cohort, producing many GVCF files. You then need to consolidate them before joint-calling variants with GenotypeGVCFs (for performance reasons). This can be done with either the CombineGVCFs or GenomicsDBImport tools, both of which are specifically designed to handle GVCFs in this way. See the relevant Tool Doc pages for usage details and the Best Practices workflow documentation to learn more about the logic of this workflow.

  3. The third case is when you want to compare variant calls that were produced from the same samples but using different methods. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same workflow with different parameters. For this case, we recommend taking a different approach: rather than merging the VCF files (which can have all sorts of complicated consequences), you can use the VariantAnnotator tool to annotate one of the VCFs with the other treated as a resource. See the relevant Tool Doc page for usage details.

There is actually one more reason why you might want to combine variants from different files into one, but we do not recommend doing it: you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.
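For the first case, the merge is conceptually just a concatenation of records under a single header; the real tools additionally reconcile header lines, enforce contig ordering, and write an index. A toy sketch of the idea, purely illustrative and not a substitute for MergeVcfs:

```python
def merge_vcf_texts(vcf_texts):
    """Concatenate per-chunk VCFs given as strings: keep the header (lines
    starting with '#') from the first file only, and append the data records
    from every file in order. Assumes the chunks are already sorted and
    non-overlapping, as they would be after scattered calling."""
    out = []
    for i, text in enumerate(vcf_texts):
        for line in text.splitlines():
            if line.startswith("#"):
                if i == 0:          # header only from the first chunk
                    out.append(line)
            else:
                out.append(line)    # data records from all chunks
    return "\n".join(out) + "\n"
```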

How to best minimize variation between runs of HaplotypeCaller in GVCF mode?


I am using a combination of HaplotypeCaller local (non-spark), in GVCF mode, followed by GatherVcfs to merge them, and I get very different call results across runs. I would expect the probabilities/confidence values to change slightly, but not so much the number of calls. Is this normal?

I'm using the gatk from docker://broadinstitute/gatk:4.beta.6 . My BAM/BAI files pass validation.

I see other posts about results being non-deterministic. But I'm not passing any of the -nt or -nct flags in this case.

I'm splitting all my contigs (bed file) into roughly equal-sized chunks and calling HaplotypeCaller like so. The VCF file produced changes a lot if I do 8 chunks vs 128. I'm not sure whether that makes things worse.

# chunk 000
java -jar /gatk/gatk.jar HaplotypeCaller -R <reference.fasta> -I ANN0859.bam --emitRefConfidence GVCF -L bed_chunk_000.bed -O ANN0859.bam_000.g.vcf -hets 0.010000
# chunk 001
java -jar /gatk/gatk.jar HaplotypeCaller -R <reference.fasta> -I ANN0859.bam --emitRefConfidence GVCF -L bed_chunk_001.bed -O ANN0859.bam_001.g.vcf -hets 0.010000
...

I merge them like so (passing all the chunks in order):

java -jar /gatk/gatk.jar GatherVcfs -I ANN0859.bam_000.g.vcf -I ANN0859.bam_001.g.vcf ...

The entire bed is sorted, and the chunks are not overlapping. I've made sure that I'm not losing any contigs when I split my bed file.
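One way to sanity-check a chunking script is to confirm that the chunks exactly partition the original intervals. A minimal sketch of splitting a sorted BED into roughly equal-sized pieces (hypothetical helper, not part of GATK):

```python
def chunk_bed(intervals, n_chunks):
    """Split sorted, non-overlapping (chrom, start, end) intervals into
    n_chunks lists of roughly equal total length. Intervals are kept whole,
    so a chunk boundary never cuts a BED line in half."""
    total = sum(end - start for _, start, end in intervals)
    target = total / n_chunks
    chunks, current, size = [], [], 0
    for iv in intervals:
        current.append(iv)
        size += iv[2] - iv[1]
        # Close the current chunk once it reaches the target size,
        # leaving the last chunk to absorb any remainder.
        if size >= target and len(chunks) < n_chunks - 1:
            chunks.append(current)
            current, size = [], 0
    chunks.append(current)
    return chunks
```

Flattening the chunks back together should reproduce the input interval list exactly; if it doesn't, intervals are being lost or duplicated at chunk boundaries.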

To provide an example difference for one of the chromosomes, I get the following calls (for 128 chunks) in the final output gVCF:

HanXRQChr00c0117        2497    .       G       <NON_REF>       .       .       END=2580        GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
HanXRQChr00c0117        10708   .       G       <NON_REF>       .       .       END=25539       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
(EOF)

And if I divide the work in 8 (longer) chunks, that last section just explodes into 1960 different calls:

HanXRQChr00c0117        10708   .       G       <NON_REF>       .       .       END=14265       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0
HanXRQChr00c0117        14266   .       C       <NON_REF>       .       .       END=14267       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,42
...
HanXRQChr00c0117        14309   .       T       C,<NON_REF>     0.13    .       DP=2;MLEAC=0,0;MLEAF=nan,nan;RAW_MQ=7200        GT:PGT:PID      ./.:0|1:14309_T_C
HanXRQChr00c0117        14310   .       T       <NON_REF>       .       .       END=14315       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,45
HanXRQChr00c0117        14316   .       T       C,<NON_REF>     0.13    .       DP=2;MLEAC=0,0;MLEAF=nan,nan;RAW_MQ=7200        GT:PGT:PID      ./.:0|1:14309_T_C
HanXRQChr00c0117        14317   .       T       <NON_REF>       .       .       END=14321       GT:DP:GQ:MIN_DP:PL      0/0:1:3:1:0,3,45
...
HanXRQChr00c0117        14358   .       T       <NON_REF>       .       .       END=14359       GT:DP:GQ:MIN_DP:PL      0/0:4:12:4:0,12,180
HanXRQChr00c0117        14360   .       A       G,<NON_REF>     30.02   .       DP=4;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;RAW_MQ=14400        GT:AD:DP:GQ:PL:SB       1/1:0,1,0:1:3:45,3,0,45,3,45:0,0,1,0
...
HanXRQChr00c0117        25479   .       T       <NON_REF>       .       .       END=25484       GT:DP:GQ:MIN_DP:PL      0/0:8:24:8:0,24,296
HanXRQChr00c0117        25485   .       T       <NON_REF>       .       .       END=25485       GT:DP:GQ:MIN_DP:PL      0/0:8:21:8:0,21,315
HanXRQChr00c0117        25486   .       T       <NON_REF>       .       .       END=25521       GT:DP:GQ:MIN_DP:PL      0/0:6:18:6:0,18,217
HanXRQChr00c0117        25522   .       A       <NON_REF>       .       .       END=25524       GT:DP:GQ:MIN_DP:PL      0/0:7:15:7:0,15,225
HanXRQChr00c0117        25525   .       T       <NON_REF>       .       .       END=25539       GT:DP:GQ:MIN_DP:PL      0/0:5:9:3:0,9,133
(EOF)

I thought at first that maybe the chunk boundaries were at play, but those contigs are in the middle of a chunk file.

(How to) generate a complete realigned bam file using -bamout argument in HaplotypeCaller?


Hello, I want to get a realigned BAM file for other tools to call variants, so I used the -bamout argument in HaplotypeCaller. I found that the BAM file is incomplete when I used only the -bamout argument. When I set the --disable-optimizations and -bamout arguments and added the -forceActive and -dontTrimActiveRegions flags, the error message said "A USER ERROR has occurred: f is not a recognized option". Maybe the program didn't recognize these flags. My command line is shown below:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /software/bin/gatk-package-4.0.3.0-local.jar HaplotypeCaller -R /root/data/reference/gatk_bundle/Homo_sapiens_assembly38.fasta -I /root/data/output/6_BQSR/SRR2188163.bqsr.bam --dbsnp /root/data/reference/gatk_bundle/dbsnp_146.hg38.vcf.gz -O SRR2188163.raw.2.vcf -bamout SRR2188163.bamout.2.bam --disable-optimizations true -forceActive -dontTrimActiveRegions
Could you tell me how I can use HaplotypeCaller to get a complete realigned BAM file? Thanks a lot.
The first picture is a screenshot of the output BAM file that I generated with -bamout alone. I used the -bamout and --disable-optimizations arguments without adding any other flags to get the result in the second picture; adding the extra flags failed as described above.

free of reference bias priors in HaplotypeCaller


Hello,

I would like to replicate the behaviour of GATK described in Mallick et al. 2016 for the Simons genomes data set. They explain the following in the supplementary information:

"GATK UnifiedGenotyper has a built-in prior for Bayesian SNP calling that assumes that the site is more likely to be homozygous for the reference allele than homozygous for the variant allele. For a diploid sample, the default priors for a homozygous reference, heterozygote and homozygous non-reference genotypes are (0.9985, 0.001, 0.0005), respectively. When there is ambiguity in a heterozygote, GATK prefers the reference homozygote. This is a reference bias, and while this bias is not typically problematic for medical studies, it can complicate interpretation of population genetics signals. With the Genome Sequencing and Analysis Group at the Broad Institute, we developed an alternative model that was integrated into the UnifiedGenotyper, allowing reference-bias free priors to be specified. We are using a prior (0.4995, 0.001, 0.4995). Details are at: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_ genotyper_UnifiedGenotyper.php#--input_prior."

I think these two examples might just do the thing:
(using either 3.x or 4.0.x)

java -jar ~/software/GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -T HaplotypeCaller --emitRefConfidence GVCF --reference_sequence ~/hs37d5.fasta --input_file ~/file.bam --input_prior 0.001 --input_prior 0.4995

java -jar ~/software/gatk-package-4.0.3.0-local.jar HaplotypeCaller -ERC GVCF -R ~/hs37d5.fasta -I ~/file.bam --input-prior 0.001 --input-prior 0.4995

Does this make sense?
These examples assume the two prior options have positional assignments to AC=1 -> 0/1 and AC=2 -> 1/1, and that, as stated in the documentation about priors, AC=0 becomes 1 minus the sum of the two previous, thus effectively:

prior(0/0)=0.4995, prior(0/1)=0.001, prior(1/1)=0.4995
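A quick arithmetic check of that reading of --input_prior, just restating the documented rule that the AC=0 prior is one minus the sum of the supplied values:

```python
# Reconstruct the three diploid genotype priors from the two --input_prior
# values, per the documented convention: prior(0/0) = 1 - sum(input_priors),
# with the supplied values taken positionally as prior(0/1), prior(1/1).
def genotype_priors(input_priors):
    p_hom_ref = 1.0 - sum(input_priors)
    return (p_hom_ref, *input_priors)

# --input_prior 0.001 --input_prior 0.4995
# -> approximately (0.4995, 0.001, 0.4995)
```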

To understand the whole thing I'm building on these previous posts from @tommycarstensen, @magicDGS and @saeschba. Thanks guys, and if you have any info or extra feedback, please let me know.

https://gatkforums.broadinstitute.org/gatk/discussion/8787/input-prior-default-value
https://gatkforums.broadinstitute.org/gatk/discussion/5877/caller-input-prior-option
https://gatkforums.broadinstitute.org/gatk/discussion/9489/should-it-say-ac-0-in-the-input-prior-documentation-for-the-haplotypecaller

This last question/topic also makes me wonder whether AC would be better understood here in terms of GT. I'm mostly familiar with the VCF format, where AC stands for allele count, which is a property of a site across many samples. Here in HaplotypeCaller we go over one sample at a time, not many. Maybe some inheritance from UnifiedGenotyper?

Best regards and many thanks for your comments,
Rodrigo


HaplotypeCaller: Alternate allele get called or not depending on -ip option


Hi, I'm currently analyzing some data (exome-seq) using HaplotypeCaller and get what seems to me an odd behaviour:
The problem is that I've got a position which is clearly bi-allelic in IGV but is reported with only the reference allele in the gVCF I'm generating with HaplotypeCaller.

Here is the command line I used:

nohup java -jar PATH/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R PATH/ucsc.hg19_noHaps.fasta \
-I PATH/JLCL254.realigned.recalibrated.bam \
-L PATH/merged.bed \
-ip 50 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-o JLCL254.vcf 

Here is the gVCF line of the variant of interest:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT JLCL254

chr1 912049 . T <NON_REF> . . END=912049 GT:DP:GQ:MIN_DP:PL 0/0:68:0:68:0,0,0

The variant of interest is located 19bp away from the captured region but with "-ip 50" it should be detected.

To check what is really analyzed, I output the bamout for all analyzed regions (-L PATH/merged.bed, -ip 50) and saw that the location of the variant is not analyzed (If I'm correct: as there is no coverage, this is not an active region).

Then I forced the bamout at the location +/-20nt around my variant to check whether some reads with the alternate allele are still kept. I used:
-L chr1:912029-912069 \
-forceActive \
-disableOptimizations

Doing so, I've been able to see that many reads with the alternate allele are indeed still kept. The gVCF file generated along with the bamout file now contains my variant of interest with the alternate allele:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT JLCL254

chr1 912049 rs9803103 T C,NON_REF 1235.77 . BaseQRankSum=-0.677;ClippingRankSum=-0.942;DB;DP=61;MLEAC=1,0;MLEAF=0.500,0.00;MQ=54.88;MQRankSum=-0.600;ReadPosRankSum=-0.195 GT:AD:DP:GQ:PL:SB 0/1:19,42,0:61:99:1264,0,448,1321,574,1895:3,16,6,36

Please see an IGV screenshot:

Tracks are (from top to bottom):
* the original bam file
* the bamout for all captured regions (known from the file -L PATH/merged.bed)
* the forced bamout (at the location of the variant i.e -L chr1:912029-912069)
* merged.bed is the file used with the -L option.

Finally, I tried to call variants changing the -ip option to 100 and got the alternate allele called.

Please note that:
If I manually add/subtract 50bp to the closest target region boundaries, I've got the same result as with -ip 50.
If I manually add/subtract 100bp to the closest target region boundaries, I've got the same result as with -ip 100.

I tried several versions of GATK (3.4-46, 3.7, 4.0.4.0) and always got the same results.

I may have missed something, but so far I can't explain what's happening. Do you see any explanation for what I observe? Do you see any options I should use to overcome this?
Many thanks in advance for your help.

NB: java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)

Should I provide the exome target list (-L argument) even while calling the gVCF file using HaplotypeCaller?


Hi,

Recently we performed exome sequencing using Nextera Illumina platform for three samples (Father, Mother and Son). I downloaded the exome interval list from Illumina's website.

1) Trimmed the raw reads
2) Aligned the trimmed reads against the human reference hg19 as recommended for exome-sequencing
3) Then sorted, deduped, recalibrated the bam file.
4) Then performed variant calling in two steps process for all three samples individually
4.1) Used the GATK Haplotype Caller tool in GVCF mode
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /GATK_bundle/hg19.fa -I sample1.sorted.dedup.recal.bam --emitRefConfidence GVCF --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.raw.g.vcf
4.2) Used GenotypeGVCFs (Joint SNP calling) for all three samples together
Command: java -Xmx16g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /GATK_bundle/hg19.fa --variant sample1.raw.g.vcf --variant sample2.raw.g.vcf --variant sample3.raw.g.vcf --dbsnp /GATK_bundle/dbsnp.138.hg19.vcf -o sample1.2.3.trio.raw.vcf

In the above command, I didn't use the Illumina's exome interval list used for targeting the exomes in sequencing process.

As per this link "https://software.broadinstitute.org/gatk/documentation/article.php?id=4669", under the example section of GATK command lines, the article suggests providing the exome targets using the -L argument for exome sequencing.

I have the following queries, as per the aforementioned article:
1) Should I provide the exome target list (-L argument) only while calling a regular VCF file using HaplotypeCaller?
or
2) Should I provide the exome target list (-L argument) even while calling a gVCF file using HaplotypeCaller?

Is there a paper describing the Haplotype Caller algorithm?


Hi,

I'd like to ask whether there is a paper describing the Haplotype Caller algorithm; if so, could you please send me the reference? I have tried to find it, but I only found the paper on GATK, which is great, but it doesn't describe the Haplotype Caller algorithm in detail.

thank you,

HaplotypeCaller sensitivity in large(ish) cohorts


One of my projects currently has ~150 patients (exomes) that I've been processing through the standard pipeline (2.8-1, including ReduceReads). In my most recent run through HC, I split the cohort in half for the sake of time. A subset of these patients have undergone targeted genotyping in the clinic, and I have a list of 36 validated variants in 28 samples. When I checked these variants in the final VCF, 5 of 36 were not called by HaplotypeCaller and have moderate to excellent support in the BAM. Several of these (possibly all of them? Not sure) were present in previous HC and UG runs with fewer samples, and I verified that the one I'm focusing on is called correctly when I only use five samples.

Debugging runs on a small region have revealed the following:

  1. ReduceReads does not seem to be the culprit, my variant is still uncalled when using the un-reduced bams
  2. My variant is not inside an Active Region
  3. When I force it to be with -forceActive, it's not in the trimmed ActiveRegion
  4. I've tried increasing -maxNumHaplotypesInPopulation as high as 1024, and the trimmed region still doesn't include my variant
  5. I've also tried running with -dontTrimActiveRegions, but haven't successfully finished yet (runtime increases from 30 seconds to over an hour, I keep trying to run it in short queues while I'm doing other stuff and getting killed by the scheduler)

A couple of other random notes that may or may not be applicable: These are rare variants that I only expect to see in 1 or 2 samples. My testing region is ~400bp around the variant in question. There is a variant in another sample at an immediately adjacent nucleotide that is also not called (and, perhaps obviously, is also outside the active regions).

Do you have any suggestions for approaching this? I haven't messed with -minPruning yet, as increasing that value should result in a loss of sensitivity and reducing it seems like a bad idea. I suppose I could split my cohort into subsets of 30 or 40 samples, but that doesn't seem like the best approach.

Phantom indels from HaplotypeCaller?


Dear GATK users and developers,

I am running HaplotypeCaller followed by ValidateVariants, and the latter complains about variants with a called alternative allele that has no supporting observations.

ERROR MESSAGE: File /storage/rafal.gutaker/NEXT_test/work/4f/6f8738a66d1c9d12651b76b7ef8819/IRIS_313-15896.g.vcf fails strict validation: one or more of the ALT allele(s) for the record at position LOC_Os01g01010:6190 are not observed at all in the sample genotypes |
ERROR ------------------------------------------------------------------------------------------

Here is an example of a site that ValidateVariants complains about:

LOC_Os01g01010 6190 . GT G,<NON_REF> 0 . DP=4;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=14400.00 GT:AD:DP:GQ:PL:SB 0/0:4,0,0:4:12:0,12,135,12,135,135:4,0,0,0
LOC_Os01g01010 6192 . T <NON_REF> . . END=6192 GT:DP:GQ:MIN_DP:PL 0/0:8:0:8:0,0,254

In general it seems harmless, so I am thinking of removing this check, but why HaplotypeCaller finds phantom variants is a mystery to me.

Thank you and

Best!
Rafal

Allele Depth (AD) / Allele Balance (AB) Filtering in GATK 4


Hi,

I am trying to filter my GATK 4.0.3 - HaplotypeCaller generated multi-sample VCF for allele depth (AD) annotation at sample genotype-level (so available in "FORMAT" fields of each sample).

I think prior to GATK 4 this annotation was available as "Allele Balance" (AB) ratios (generated by AlleleBalanceBySample), but it is not available anymore in GATK 4. So I tried to filter genotypes based on the AD field, which contains the same information but in "X,Y" format, i.e. an array of integers. This array format makes it difficult to filter on the depth of the alternative allele divided by the total depth at a specific site.

Can you please recommend any solution to this problem? If I could turn this array into a ratio, I could easily filter genotypes using VariantFiltration or other tools such as vcflib/vcffilter. I also tried the below code (following https://gatkforums.broadinstitute.org/gatk/discussion/1255/what-are-jexl-expressions-and-how-can-i-use-them-with-the-gatk):

gatk VariantFiltration -R $ref -V $vcf -O $output --genotype-filter-expression 'vc.getGenotype("Sample1").getAD().1 / vc.getGenotype("Sample1").getAD().0 > 0.33' --set-filtered-genotype-to-no-call --genotype-filter-name 'ABfilter'

This worked, but strangely it filters the variant for all samples if only one of the samples has allele depths that are out of balance (as defined by the filter). If it worked only for Sample1, I was planning to write a quick loop over all the samples. I tried the same with GATK 3.8, but it still filters the whole variant for all samples if it is filtered in just one sample.
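As a post-processing workaround, the AD array can be turned into a per-sample ratio with a small script. A minimal sketch of the idea (the helper names here are made up for illustration; this is not a GATK or VariantFiltration API):

```python
def allele_balance(ad):
    """ad: per-sample AD values [ref_depth, alt_depth, ...].
    Returns alt / (ref + alt) for the first ALT allele, or None if no depth."""
    total = ad[0] + ad[1]
    return ad[1] / total if total else None

def flag_samples(format_keys, sample_fields, max_ab=0.33):
    """Flag each sample column of one VCF data line whose allele balance
    exceeds max_ab. format_keys is the FORMAT column (e.g. 'GT:AD:DP');
    sample_fields is the list of per-sample columns."""
    ad_idx = format_keys.split(":").index("AD")
    flags = []
    for sample in sample_fields:
        ad = [int(x) for x in sample.split(":")[ad_idx].split(",")[:2]]
        ab = allele_balance(ad)
        flags.append(ab is not None and ab > max_ab)
    return flags
```

Because this operates per sample column, it avoids the site-level behaviour described above: one unbalanced sample flags only that sample's genotype, not the whole record.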

SNP calling using pooled RNA-seq data


Hello,

First of all, thank you for your detailed best practice pipeline for SNP calling from RNA-seq data.

I have pooled RNA seq data which I need to call SNP from. Each library consists of a pooled sample of 2-3 individuals of the same sex-tissue combination.

I was wondering whether HaplotypeCaller can handle SNP calling from pooled sequences, or whether it would be better to use FreeBayes?

I understand that these results come from experimenting with the data but it would be great if you could share your experiences with me on this.

Cheers,
Homa


HaplotypeCaller on whole genome or chromosome by chromosome: different results


Hi,

I'm working on targeted resequencing data and I'm doing a multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions by doing the calling at one time on a cluster. I thus specified all the targeted regions with the -L option.

Then, as it was taking too long, I decided to cut my interval list, chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCFs files that I had obtained for the callings on the different chromosomes.

Then, I compared this merged VCF file with the VCF file that I obtained by doing the calling on all the targeted regions at one time. I noticed about 1% variation between the two variant lists, and I can't explain this stochasticity. Any suggestions?

Thanks!

Maguelonne

Haplotype caller not picking up variants for HiSeq Runs


Hello,
We were sequencing all our data on a HiSeq and have now moved to a NextSeq. We have sequenced the same batch of samples on both sequencers, and both are processed using the same pipeline/parameters.
What I have noticed is that GATK 3.7 HC is not picking up variants, even though the coverage is good and the variants are evidently present in the BAM file.

For example, the screenshot below shows the BAM files for both the NextSeq and HiSeq samples. There are at least 3 variants in the region 22:29885560-29885861 (NEPH, exon 5) that are expected to be picked up for HiSeq.

These variants are picked up for the NextSeq samples (even though the coverage for HiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this can happen ?

Many thanks,

GATK HaplotypeCaller missing SNPs at the terminals of the segment when calling SNPs for Influenza A


We are trying to call variants for Influenza A virus sequenced by MiSeq using HaplotypeCaller, following GATK best practices (GATK version 3.7). However, when checking the called variants against the BAM file in IGV, we frequently identify SNPs at the beginning or end of a segment that are missed by HaplotypeCaller. The missing ones are well supported by the reads, and are called by samtools and UnifiedGenotyper with high confidence.

As one example (shown below), there are three rows of called variants at the top; from top to bottom, called by UnifiedGenotyper, samtools, and HaplotypeCaller. The rightmost SNP is called by the first two tools but missed by HaplotypeCaller, although the supporting reads consistently show the SNP.

Just to show that this SNP is well supported by the reads, here is the record reporting it in the VCF generated by UnifiedGenotyper:

A-New_Jersey-NHRC_93408-2016-H3N2(KY078630)-HA 15 . A T 166598 . AC=1;AF=1.00;AN=1;DP=3970;Dels=0.00;FS=0.000;HaplotypeScore=26.7856;MLEAC=1;MLEAF=1.00;MQ=59.99;MQ0=0;QD=34.24;SOR=4.823 GT:AD:DP:GQ:PL 1:0,3969:3970:99:166628,0

A close check of the BAM file generated by HaplotypeCaller for debugging showed that the variant is consistently missing from the de novo generated haplotypes.

There are also other cases of missing SNPs. What they have in common is that they are always at the end of a segment, well supported by reads, and only HaplotypeCaller misses them. However, for some samples, similar variants at the ends are called by HaplotypeCaller.

My questions are the following:

  • Is this a bug in HaplotypeCaller? If so, has it been fixed?
  • If it is not a bug, is there a HaplotypeCaller parameter that can be set to guarantee that it will not miss good-quality variants at the ends of segments?

Many thanks.

I am running HaplotypeCaller on one BAM file


java -jar GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference/GRCh37/hs37d5.fa -I output.bam --dbsnp reference/gatkbundle/dbsnp_138.b37.vcf -o output.g.vcf -ERC GVCF

I am trying to add one more BAM file to my cohort. I ran one BAM file separately, but when applying GVCF mode it does not run; it throws the error below.

MESSAGE: Invalid command line: Argument emitRefConfidence has a bad value: Can only be used in single sample mode currently. Use the sample_name argument to run on a single sample out of a multi-sample BAM file

Phased Heterozygous SNP


Dear all,

I have difficulties understanding the genotypes of phased SNPs. Here I have a SNP where only one read has the reference allele and 11 reads have the alternate allele, and it is called as a heterozygous SNP.

 chr15  8485088 .   G   T   4936.33 PASS     
 BaseQRankSum=1.82;ClippingRankSum=0;ExcessHet=0;FS=2.399;InbreedingCoeff=0.721;
 MQ=60;MQRankSum=0;QD=32.86;ReadPosRankSum=0.267;SOR=1.167;
 DP=10789;AF=0.013;MLEAC=13;MLEAF=0.012;AN=1300;AC=28    
GT:AD:DP:GQ:PGT:PID:PL  0/1:1,12:13:3:0|1:8485088_G_T:485,0,3

The genotype for a single sample from a multi-sample VCF is shown here. Could someone shed light on how to interpret this genotype as heterozygous when only one read has the reference allele? It should have been called as a homozygous SNP. Is this a bug, or am I missing something? Also, IGV does not show the reference read. (GATK version 3.7-0-gcfedb67)
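For what it's worth, the GT here follows directly from the PL field: PL=485,0,3 gives the Phred-scaled likelihoods for 0/0, 0/1 and 1/1, the smallest value (0) marks the called genotype, and GQ is the gap to the second-best, here only 3, which is why this is a very low-confidence het rather than the 1/1 one might expect from the read counts. A small sketch of that convention (standard VCF semantics, not GATK internals):

```python
def call_from_pl(pl):
    """pl: Phred-scaled genotype likelihoods for 0/0, 0/1, 1/1 (min is 0).
    Returns (index of the best genotype, GQ = gap to the second-best).
    Note: in real VCFs, GQ is additionally capped at 99."""
    order = sorted(range(len(pl)), key=lambda i: pl[i])
    best, second = order[0], order[1]
    return best, pl[second] - pl[best]
```

For PL=485,0,3 this returns genotype index 1 (0/1) with GQ=3, i.e. the caller itself considers 1/1 nearly as likely as 0/1 at this site.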
