Channel: haplotypecaller — GATK-Forum

HaplotypeCaller may fail to detect a variant from the same reads when the extraction range differs.


I have run into a confusing variant detection issue. The attached PNG shows results from the exact same NextSeq experiment; only the read extraction range differs.

NextSeq2_point.bam: a BAM composed only of reads that cover position chr16:89100686.

NextSeq2_region.bam: a BAM composed of reads that cover the region chr16:89100686 ±100 bp.
At position chr16:89100686, I presume a T>C variant should be detected, but HaplotypeCaller fails to detect it with NextSeq2_region.bam.

NextSeq2_point.vcf:
chr16 89100686 . T C,<NON_REF> 7397.77 . DP=199;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=716400.00 GT:AD:DP:GQ:PL:SB 1/1:0,199,0:199:99:7426,599,0,7426,599,7426:0,0,155,44

NextSeq2_region.vcf:
chr16 89100686 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:0,199:199:0:0,0,0

What causes the difference and why?

--- GATK Version (Docker latest)
    Using GATK wrapper script /gatk/build/install/gatk/bin/gatk
    Running:
        /gatk/build/install/gatk/bin/gatk HaplotypeCaller --version
    Version:4.0.1.2
---

--- Command used
gatk HaplotypeCaller -I /temp/NextSeq2_region.bam -O /temp/NextSeq2_region.vcf -R /temp/genome.fa -L /temp/only16.bed --debug true --output-mode EMIT_ALL_SITES --all-site-pls true --dont-trim-active-regions true --emit-ref-confidence BP_RESOLUTION
---
--- Genome Version: hg38
--- bed
chr16   89100681    89101347    NM_174917.4_cds_2_0_chr16_89100682_f    0   +

If you need the bams and vcfs, I can post them here.


Loss of rsIDs in GenotypeGVCFs output, possibly an issue with the dbSNP filter.vcf file from SelectVariants


Hello,

I've run into this problem a few times now and have attempted to debug it in various ways. The first time it occurred, I was using a VCF file containing the rsIDs I wished to genotype against a given GVCF file. There was no error message that I could discern; however, the resulting VCF file contained less than a fifth of the rsIDs of interest from the filter file.

I assumed the issue related to the GVCF file, so I acquired the BAM file to restart the workflow from the beginning and stick to best practices; however, I have now run into the same issue using HaplotypeCaller with the same filter.vcf file as the -L and -D arguments.

This leads me to believe the filter file is the issue, but when I look at the file I can see all of the rsIDs there, so I'm not sure what is causing the loss. Of lesser concern is the loss of 7 rsIDs when I initially used SelectVariants to produce the filter, but I will also have to address that.

I have attempted to reduce the minimum base quality to 1, but it has not resulted in any increase in variants called. I also looked into generating a bamout file, which I will return to when possible; however, I am currently working remotely and unable to install IGV on this device.

Many thanks for any tips

GATK HaplotypeCaller stalls after initializing bam readers


Hello, I am running GATK's HaplotypeCaller on 290 bam files. GATK starts fine and finishes the step stating "Done initializing BAM readers". However, it never proceeds to outputting a vcf file and has been stuck in this state for 10+ hours.

Is this normal behavior, and are there any steps you would recommend other than waiting indefinitely for this step to finish or for an actual error message? These are relatively small scaffolds (only a couple of megabases long). The BAM files are split by scaffold, so each run of GATK is over 290 files of 3-7 MB each.

Thanks in advance for your help! Here is my code for running HaplotypeCaller and the output it produces; please let me know if there is any more information that would be helpful!

code:
chrom=$(head -$index scaffolds.txt | tail -1) ; java -jar ./Programs/GenomeAnalysisTK.jar -T HaplotypeCaller -R $path_genome -I list.$chrom.list -o calls.raw.$chrom.vcf -stand_call_conf 20.0 -stand_emit_conf 20.0 -L $chrom

output:
INFO 16:47:19,176 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:47:19,178 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 16:47:19,178 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:47:19,178 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:47:19,181 HelpFormatter - Program Args: -T HaplotypeCaller -R ./genomes/assembly.fasta -I list.scaffold5.list -o calls.raw.scaffold5.vcf -stand_call_conf 20.0 -stand_emit_conf 20.0 -L scaffold5
INFO 16:47:19,200 HelpFormatter - Executing as v3@xxx on Linux 2.6.32-642.13.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_45-b14.
INFO 16:47:19,200 HelpFormatter - Date/Time: 2018/12/06 16:47:19
INFO 16:47:19,201 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:47:19,201 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:47:19,718 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:47:20,174 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 16:47:20,181 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:47:27,330 SAMDataSource$SAMReaders - Init 50 BAMs in last 7.15 s, 50 of 290 in 7.15 s / 0.12 m (7.00 tasks/s). 240 remaining with est. completion in 34.31 s / 0.57 m
INFO 16:47:33,970 SAMDataSource$SAMReaders - Init 50 BAMs in last 6.64 s, 100 of 290 in 13.79 s / 0.23 m (7.25 tasks/s). 190 remaining with est. completion in 26.20 s / 0.44 m
INFO 16:47:40,313 SAMDataSource$SAMReaders - Init 50 BAMs in last 6.34 s, 150 of 290 in 20.13 s / 0.34 m (7.45 tasks/s). 140 remaining with est. completion in 18.79 s / 0.31 m
INFO 16:47:47,006 SAMDataSource$SAMReaders - Init 50 BAMs in last 6.69 s, 200 of 290 in 26.82 s / 0.45 m (7.46 tasks/s). 90 remaining with est. completion in 12.07 s / 0.20 m
INFO 16:47:53,468 SAMDataSource$SAMReaders - Init 50 BAMs in last 6.46 s, 250 of 290 in 33.29 s / 0.55 m (7.51 tasks/s). 40 remaining with est. completion in 5.33 s / 0.09 m
INFO 16:47:56,553 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 36.37

Question about the alignment performed in HaplotypeCaller (pairHMM)

Hello.

After looking at the implementation of the pair-HMM algorithm used in HaplotypeCaller, I was left with a question.
From the source code comments, it seems that local alignment is supposed to be used.
Links to the comments on GitHub:
* https://github.com/broadinstitute/gatk/blob/67f0f0f2e59185b721398b17c24eba487a2ac76c/src/main/java/org/broadinstitute/hellbender/utils/pairhmm/PairHMM.java#L23
* https://github.com/broadinstitute/gatk/blob/67f0f0f2e59185b721398b17c24eba487a2ac76c/src/main/java/org/broadinstitute/hellbender/utils/pairhmm/Log10PairHMM.java#L11

This cites Figure 4.3 of the Durbin et al. (1998) book, which shows the finite state automaton (FSA) for local alignment. However, reading the implementation of the forward algorithm, it seems that the FSA of Figure 4.2 (the model with the M, X, and Y states) is used instead:
* https://github.com/broadinstitute/gatk/blob/67f0f0f2e59185b721398b17c24eba487a2ac76c/src/main/java/org/broadinstitute/hellbender/utils/pairhmm/PairHMMModel.java

The forward algorithm used seems to be the one presented in section 4.2 of the book, which implements Figure 4.2 (not 4.3). That algorithm performs global alignment, not local alignment.

My question is: which alignment is actually computed (the score of the best alignment), local or global?

I cannot see the random-model RX/RY states (and their associated transition probabilities) from the local-alignment FSA of Figure 4.3 in the implementation. I may simply have missed them; if so, could you please point them out to me?

I'd like to be sure, because the difference between the two FSAs is as important as the difference between running the Needleman-Wunsch and Smith-Waterman algorithms (global vs. local).
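To make the distinction concrete, here is a minimal Python sketch of the global (Figure 4.2-style) three-state forward recursion as I understand it; the transition and emission parameters (delta, epsilon, p_match) are toy values of my own, not GATK's actual parameters:

```python
import math

NEG_INF = float("-inf")

def _logsum(*vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    mx = max(vals)
    if mx == NEG_INF:
        return mx
    return mx + math.log(sum(math.exp(v - mx) for v in vals))

def forward_global(x, y, delta=0.1, epsilon=0.3, p_match=0.9):
    """Forward algorithm for the 3-state (M/X/Y) global pair HMM of
    Durbin et al. Fig 4.2. Toy emission model: p_match for a matching
    pair, (1-p_match)/3 for a mismatch, uniform 0.25 for gap emissions.
    Returns the log of the total probability over all global alignments."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # start state, empty prefixes
    t_mm = math.log(1 - 2 * delta)  # stay in M
    t_mg = math.log(delta)          # M -> X or Y (gap open)
    t_gg = math.log(epsilon)        # gap extend
    t_gm = math.log(1 - epsilon)    # gap close
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                e = math.log(p_match if x[i-1] == y[j-1] else (1 - p_match) / 3)
                M[i][j] = e + _logsum(t_mm + M[i-1][j-1],
                                      t_gm + X[i-1][j-1],
                                      t_gm + Y[i-1][j-1])
            if i > 0:  # emit x[i-1] against a gap
                X[i][j] = math.log(0.25) + _logsum(t_mg + M[i-1][j],
                                                   t_gg + X[i-1][j])
            if j > 0:  # emit y[j-1] against a gap
                Y[i][j] = math.log(0.25) + _logsum(t_mg + M[i][j-1],
                                                   t_gg + Y[i][j-1])
    # global alignment: both sequences must be fully consumed
    return _logsum(M[n][m], X[n][m], Y[n][m])
```

In the local-alignment FSA of Figure 4.3, there would be additional RX/RY "random" flanking states allowing unaligned prefixes and suffixes; no such states appear in this recursion, which is what makes it global.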


Thank you for your insights.
Regards. Rick

Should I use UnifiedGenotyper or HaplotypeCaller to call variants on my data?


Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also offers several unique functionalities, such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

  • If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
  • If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
  • If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

HaplotypeCaller: SNPs with three genotypes have higher missing rates

Dear GATK team,

I called SNPs from 150 samples of WGS data on a non-model species (a coral). The reference is a 500 Mb draft genome, and each sample has roughly 15 M paired-end reads. The species is diploid.

I ran the GATK pipeline following the Best Practices and called genotypes using HaplotypeCaller. Since I needed to speed up calculations and be as conservative as possible, I set the --min-pruning flag to 10. Next, I kept only bi-allelic SNPs.

Now, I have noticed that the missing rates per SNP are generally higher for SNPs with three genotypes (the most frequent case) than for those with two genotypes.

In other words, if I filter SNPs by missing rate, I end up with a genotype matrix in which most of the SNPs have only two genotypes (one homozygote plus the heterozygote).

Any idea what could be the cause? Is it possible that a --min-pruning value of 10 systematically creates more missing calls at SNPs with three genotypes (particularly for homozygotes)?

thank you in advance

best

OS

Variant calling: high (and strange) number of alternative alleles


Dear GATK team,

I am calling variants on a trio (mother, father, and offspring) of Macaca mulatta. I have 60X whole genome sequencing for each individual. I use GATK 4.0.7.0: I call variants with HaplotypeCaller in BP_RESOLUTION mode, combine per chromosome with GenomicsDBImport, and genotype with GenotypeGVCFs.

I am interested in the number of sites where I have only the reference allele (AD=0 for the alternative) and the number of sites where some reads support the ALT allele (AD>0) in the parents.

I found a lot of sites (for each individual) where I have AD>0 in the GVCF file (per-individual, combined, and after genotyping). I looked at the sites that are HomRef, and for each individual fewer than 30% of the HomRef sites have AD=0 for the alternative allele. I know that HaplotypeCaller does a realignment step that may change the positions of the reads, but 70% of sites with AD>0 seems like a lot. I looked back at the BAM file, and those alternative alleles don't seem to be there. I tried calling again using the -bamout option, and here again I don't see that many alternative alleles. However, I see that sometimes a read that carried no alternative allele in the input BAM carries one in the output.
Also, I have tried samtools mpileup, and in that case almost 90% of the HomRef sites have AD=0 for the alternative allele.

Just as an example, below is the VCF output from HaplotypeCaller for one individual, followed by a picture of both the input BAM file and the output BAM file.
For chr1 pos 24203380 the ref is A and I have:
Vcf --> DP=96 AD=92,4
Bam input --> DP 93, 92,1 (N)
Bam out --> DP=80, 79,1 (N)

chr1 24203380 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,4:96:57:0,57,5771
chr1 24203381 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:90,5:95:0:0,0,5897
chr1 24203382 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:78:0,78,6075
chr1 24203383 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:68:0,68,6127
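For reference, this is roughly how I count HomRef sites with ALT support (a quick sketch of my own, assuming single-sample records like the ones above):

```python
def homref_with_alt_support(vcf_lines):
    """Count 0/0 sites and how many of them have any reads assigned
    to an ALT allele (ALT AD > 0). Assumes single-sample data lines
    with GT and AD present in the FORMAT column."""
    total = with_alt = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.split()
        fmt_keys = fields[8].split(":")
        sample_vals = fields[9].split(":")
        rec = dict(zip(fmt_keys, sample_vals))
        if rec.get("GT") not in ("0/0", "0|0"):
            continue  # only HomRef calls
        total += 1
        ad = [int(x) for x in rec["AD"].split(",")]
        if any(depth > 0 for depth in ad[1:]):  # any non-REF depth
            with_alt += 1
    return total, with_alt
```

Running it over the four records above gives 4 HomRef sites, all 4 with ALT support (AD of 4, 5, 3, and 3).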

Just in case here is my code:
gatk --java-options "-XX:ParallelGCThreads=16 -Xmx64g" HaplotypeCaller -R /PATH/rheMac8.fa -I /PATH/R01068_sorted.merged.addg.uniq.rmdup.bam -O /PATH/R01068_res.g.vcf -ERC BP_RESOLUTION \

I don't know why I have this high number of alternative alleles, or how to get rid of them to obtain the 'real' number of alternative alleles per position. The problem persists in the genotyped VCF files, with some alternative alleles that are not present in any BAM (input or HaplotypeCaller output).

I hope I gave you enough details so that you have a clear idea of my problem and are able to help me.
Best,

I have two bams with equivalent sams; one yields a validation error and the other does not.


Hi! I have two BAM files whose SAM equivalents are identical -- as in:

diff <(samtools view -h small.bam) <(samtools view -h smalltest.bam)

yields nothing, and when I run haplotype caller on one file I get errors that say (for every read):

Ignoring SAM validation error: ERROR: Record 1, Read name RSRS1, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned

and no SNPs are generated, while the other file processes just fine.

Needless to say, the bin fields are the same.

To be clear, I generated one of the files, it generated the error, and when I converted from bam->sam->bam, GATK processed it correctly.
Ideas?

I'm using gatk-4.0.11.0 and samtools version 1.7 (htslib 1.9).

-August


HaplotypeCaller Incompatible Contigs DNASeq

I'm using GATK 4.0.11 and I'm getting the following error message when I run HaplotypeCaller on DNAseq data:

10:19:17.089 INFO HaplotypeCaller - ------------------------------------------------------------

10:19:17.089 INFO HaplotypeCaller - ------------------------------------------------------------

10:19:17.090 INFO HaplotypeCaller - HTSJDK Version: 2.16.1

10:19:17.090 INFO HaplotypeCaller - Picard Version: 2.18.13

10:19:17.091 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2

10:19:17.091 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false

10:19:17.091 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true

10:19:17.091 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false

10:19:17.091 INFO HaplotypeCaller - Deflater: IntelDeflater

10:19:17.091 INFO HaplotypeCaller - Inflater: IntelInflater

10:19:17.091 INFO HaplotypeCaller - GCS max retries/reopens: 20

10:19:17.092 INFO HaplotypeCaller - Requester pays: disabled

10:19:17.092 INFO HaplotypeCaller - Initializing engine

10:19:17.536 INFO HaplotypeCaller - Shutting down engine

[January 5, 2019 10:19:17 AM EST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.12 minutes.


Runtime.totalMemory()=311427072



A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.

reference contigs = [chr17:c43125483-43044295]

reads contigs = []



I then tried another file from NCBI:



A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.

reference contigs = [chr17:c43125483-43044295]


reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

The preceding steps were FastqToSam, BWA, and MarkDuplicates.

Any suggestions?

Coverage bias in HaplotypeCaller


Hi,

I am doing joint variant calling for Illumina paired-end data of 150 monkeys. Coverage varies from 3X to 30X, with most individuals at around 4X.

I was doing all of the variant detection and hard-filtering (GATK Best Practices) with both UnifiedGenotyper and HaplotypeCaller.

My problem is that HaplotypeCaller shows a much stronger bias for calling the reference allele in low coverage individuals than UnifiedGenotyper does. Is this a known issue?

In particular, consider pairwise differences across individuals:
The absolute values are lower for low coverage individuals than for high coverage, for both methods, since it is more difficult to make calls for them.
However, for UnifiedGenotyper, I can correct for this by calculating the "accessible genome size" for each pair of individuals: subtracting from the total reference length all filtered sites and all sites where one of the two individuals has no genotype call (./.). If I do this, there is no bias in pairwise differences for UnifiedGenotyper; values are comparable for low and high coverage individuals (if both pairs consist of members of similar populations).
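As a sketch, the correction looks like this (toy code of my own; the per-site difference used here is a crude allele-dosage distance, not a proper nucleotide-diversity estimator):

```python
def pairwise_diff_per_site(gts_a, gts_b, ref_length, n_filtered):
    """Pairwise differences divided by an 'accessible genome size'
    that excludes filtered sites and sites where either individual
    has no call. gts_a/gts_b map position -> diploid genotype string
    (e.g. "0/1"); a missing position or "./." counts as no call."""
    def alt_dosage(gt):
        # number of ALT alleles in a diploid genotype like "0/1"
        return sum(a == "1" for a in gt.replace("|", "/").split("/"))
    diffs = 0
    inaccessible = n_filtered
    for pos in set(gts_a) | set(gts_b):
        a, b = gts_a.get(pos), gts_b.get(pos)
        if a is None or b is None or "." in a or "." in b:
            inaccessible += 1  # no-call in either individual
            continue
        diffs += abs(alt_dosage(a) - alt_dosage(b))
    accessible = ref_length - inaccessible
    return diffs / accessible
```

Positions absent from both individuals are treated as accessible homozygous-reference sites, which matches the idea of only subtracting filtered and no-call sites from the reference length.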

However, for HaplotypeCaller, this correction does not remove bias due to coverage. Hence, it seems that for UnifiedGenotyper low coverage individuals are more likely to have no call (./.) but if there is a call it is not biased towards reference or alternative allele (at least compared to high coverage individuals). For HaplotypeCaller, on the other hand, it seems that in cases of doubt the genotype is more likely to be set to reference. I can imagine that this is an effect of looking for similar haplotypes in the population.

Can you confirm this behaviour? For population genetic analyses this effect is highly problematic. I would trade in more false positives if this removed the bias. Note that when running HaplotypeCaller, I used a value of 3*10^(-3) for the expected heterozygosity (--heterozygosity), which is the average cross-individual diversity and thus already at the higher end for within-individual heterozygosity. I would expect the problem to be even worse with lower values.

Can you give me a recommendation: should I go back to using UnifiedGenotyper, or is there a way to solve this problem?

Many thanks in advance,
Hannes

GVCF - Genomic Variant Call Format


GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variant discovery analyses.

Important notes

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller in GATK versions 3.x and 4.x contain additional information that is formatted in a very specific way. Read on to find out more.

GVCF files produced by HaplotypeCaller from GATK versions 3.x and 4.x are not substantially different. While we don't recommend mixing versions, and we have not tested this ourselves, it should be okay to use GVCFs made by different versions if the annotations and the GVCFBlock definitions (see below) are the same.


General comparison of VCF vs. GVCF

The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

[figure: example records from a regular VCF compared with the two types of GVCF (-ERC GVCF and -ERC BP_RESOLUTION)]

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION GVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of GVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a GVCF with an individual record at every site: either a variant record or a non-variant record. With GVCF, you get a GVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock lines of the GVCF header. The purpose of the blocks (also called banding) is to keep file size down, so we recommend using the GVCF mode over BP_RESOLUTION.
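For illustration, here is a small Python sketch (not part of GATK) that parses ##GVCFBlock header lines of the form shown later in this document and maps a site's GQ to its band:

```python
import re

# Matches header lines such as:
# ##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
_BAND = re.compile(
    r"##GVCFBlock\d+-\d+=minGQ=(\d+)\(inclusive\),maxGQ=(\d+)\(exclusive\)"
)

def parse_gvcf_bands(header_lines):
    """Collect sorted (minGQ, maxGQ) pairs from ##GVCFBlock header lines."""
    bands = []
    for line in header_lines:
        m = _BAND.match(line)
        if m:
            bands.append((int(m.group(1)), int(m.group(2))))
    return sorted(bands)

def band_for_gq(gq, bands):
    """Return the band a GQ falls into (min inclusive, max exclusive)."""
    for lo, hi in bands:
        if lo <= gq < hi:
            return (lo, hi)
    raise ValueError("GQ %d is not covered by any band" % gq)
```

For example, with a 60-70 band defined in the header, a non-variant site with GQ 62 would be grouped into the 60-70 block together with its neighbors.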


Example GVCF file

This is a banded GVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.2 VCF:

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

One FORMAT annotation is unique to the GVCF format:

##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">

This gives the minimum depth of coverage observed at any one site within a block of records.

The header goes on:

##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="[full command line goes here]",Version=4.beta.6-117-g4588584-SNAPSHOT,Date="December 23, 2017 4:04:34 PM EST">

At this point in the header we see the GVCFBlock definitions, which indicate the GQ ranges used for banding:

[individual blocks from 1 to 55]
##GVCFBlock55-56=minGQ=55(inclusive),maxGQ=56(exclusive)
##GVCFBlock56-57=minGQ=56(inclusive),maxGQ=57(exclusive)
##GVCFBlock57-58=minGQ=57(inclusive),maxGQ=58(exclusive)
##GVCFBlock58-59=minGQ=58(inclusive),maxGQ=59(exclusive)
##GVCFBlock59-60=minGQ=59(inclusive),maxGQ=60(exclusive)
##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
##GVCFBlock90-99=minGQ=90(inclusive),maxGQ=99(exclusive)
##GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)

In recent versions of GATK, the banding strategy has been tuned to provide high resolution at lower values of GQ (59 and below) and more compression at high values (60 and above). Note that since GQ is capped at 99, records where the corresponding PL is greater than 99 are lumped into the 99-100 band.

After that, the header goes on:

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=GRCh37>
##source=HaplotypeCaller

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10001567 and ends at 20:10001616.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10001567    .   A   <NON_REF>   .   .   END=10001616    GT:DP:GQ:MIN_DP:PL  0/0:38:99:34:0,101,1114
20  10001617    .   C   A,<NON_REF> 493.77  .   BaseQRankSum=1.632;ClippingRankSum=0.000;DP=38;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=136800.00;ReadPosRankSum=0.170    GT:AD:DP:GQ:PL:SB   0/1:19,19,0:38:99:522,0,480,578,538,1116:11,8,13,6
20  10001618    .   T   <NON_REF>   .   .   END=10001627    GT:DP:GQ:MIN_DP:PL  0/0:39:99:37:0,105,1575
20  10001628    .   G   A,<NON_REF> 1223.77 .   DP=37;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=133200.00   GT:AD:DP:GQ:PL:SB   1/1:0,37,0:37:99:1252,111,0,1252,111,1252:0,0,21,16
20  10001629    .   G   <NON_REF>   .   .   END=10001660    GT:DP:GQ:MIN_DP:PL  0/0:43:99:38:0,102,1219
20  10001661    .   T   C,<NON_REF> 1779.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1808,129,0,1808,129,1808:0,0,26,16
20  10001662    .   T   <NON_REF>   .   .   END=10001669    GT:DP:GQ:MIN_DP:PL  0/0:44:99:43:0,117,1755
20  10001670    .   T   G,<NON_REF> 1773.77 .   DP=42;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=151200.00   GT:AD:DP:GQ:PGT:PID:PL:SB   1/1:0,42,0:42:99:0|1:10001661_T_C:1802,129,0,1802,129,1802:0,0,25,17
20  10001671    .   G   <NON_REF>   .   .   END=10001673    GT:DP:GQ:MIN_DP:PL  0/0:43:99:42:0,120,1800
20  10001674    .   A   <NON_REF>   .   .   END=10001674    GT:DP:GQ:MIN_DP:PL  0/0:42:96:42:0,96,1197
20  10001675    .   A   <NON_REF>   .   .   END=10001695    GT:DP:GQ:MIN_DP:PL  0/0:41:99:39:0,105,1575
20  10001696    .   A   <NON_REF>   .   .   END=10001696    GT:DP:GQ:MIN_DP:PL  0/0:38:97:38:0,97,1220

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
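To make the END tag concrete, here is a small Python sketch (illustrative, not a GATK utility) that expands a banded non-variant record like the first one above into one entry per covered position:

```python
def expand_block(record_line):
    """Expand a non-variant GVCF block record into per-position entries.

    Uses the END tag in the INFO field; if no END is present (e.g. for
    a variant record), the record covers only its own POS.
    """
    fields = record_line.rstrip("\n").split("\t")
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    end = pos
    for kv in info.split(";"):
        if kv.startswith("END="):
            end = int(kv[4:])
    # every site in [POS, END] shares this record's genotype fields
    return [(chrom, p) for p in range(pos, end + 1)]
```

Applied to the first record in the snippet (POS 10001567, END=10001616), this yields 50 sites, all sharing the record's GT 0/0, GQ 99, and MIN_DP 34.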

Analysis Pipeline Discrepancy in SNP Calling and Coverage

Hi, All,

So I am new to GATK, so please bear with me... Essentially, I have developed a Unix script to analyze the fastq sequencing output for a novel targeting technique. I am only targeting 27 SNPs with a small amplicon size, and the coverage is much higher than with traditional sequencing methods. I want to report the genotype and coverage at each location (even the homozygous reference sites).

One major issue I have witnessed: at a given SNP in IGV, I have approx. 20,000X coverage with perfectly paired reads (paired-end) at MQ 60; however, following my analysis pipeline, my VCF reports 10X. I cannot figure out what I am doing wrong! Also, at other SNP sites, the VCF reports balanced allele depth (AD) for a given heterozygous genotype (which is correct), but the genotype call (GT) in the VCF reports as 0/0 (homozygous reference).

Below are the general script, a screenshot of IGV for the SNP with 20,000X coverage, and a screenshot of IGV for the SNP that is reported as homozygous when it is a true heterozygote. Please help!

Thank you all! You are great!
Rachel



#!/bin/bash

fastqDir='DirectorywithfastqR1andR2'
refGenome='referencegenome.fa'

for i in "$fastqDir"/*R1*.fastq   # glob must be outside the quotes, or it never expands

do
sample=$(basename "$i" .fastq)
file2="$fastqDir/$(echo "$sample" | sed 's/_R1_/_R2_/').fastq"

# look for a read-pair
if [ ! -f "$file2" ] # none detected
then
file2=""   # plain assignment; "$file2 = ..." would try to run a command
fi

bwa mem -R "@RG\tID:$sample\tSM:$sample\tPL:ILLUMINA\tLB:$sample" -t 10 "$refGenome" "$i" $file2 | samtools view -bSh | samtools sort -m 10G -o "$sample".bam -T Temp
samtools index "$sample".bam

gatk HaplotypeCaller --arguments_file gatkArgumentsFile.txt --reference "$refGenome" --input "$sample".bam --output "$sample".vcf --intervals SNPCoordinates.bed --emit-ref-confidence BP_RESOLUTION

#gatk HaplotypeCaller --arguments_file gatkArgumentsFile.txt --input "$sample".bam --output "$sample".2.vcf --intervals SNPCoordinates.bed --disable-tool-default-read-filters true --emit-ref-confidence BP_RESOLUTION
#bcftools mpileup -q 5 -d 9999999 -f reference.fasta "$sample".bam | bcftools call -g 10 -a FORMAT/DP -f GQ,GP -m -T SNPCoordinates.bed -o "$sample".vcf

done

Detailed documentation of how GATK tools employ SPARK


Good afternoon,

It's been a while since GATK4 came out and the Spark tools were introduced (yay! :)), but so far I haven't been able to find a good link explaining how exactly GATK employs Spark.

If you could fill these pages with some content, that would be great (single multi-core machine, Spark cluster). In particular, I'm interested in how the jobs are managed: if running locally with, for instance, local[40], how does HaplotypeCaller traverse the data? Does the active-region traversal still apply for the Spark tools? What about the concept of walkers? How many blocks of data does each Spark RDD contain? Have you done tests to improve performance, or do you mostly rely on default Spark settings to manage parallelism?

Best regards,
Pedro

minor error in the documentation regarding --genotyping-mode


Hi GATK team,

I just wanted to let you know that there is a minor mistake in the documentation here:
https://software.broadinstitute.org/gatk/documentation/tooldocs/4.1.0.0/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--pcr-indel-model

The file in which you want to replace --genotyping_mode with --genotyping-mode is this one:
gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/StandardCallerArgumentCollection.java

I don't seem to be able to create a pull request, so I'll leave it to you.

Thanks,
Tommy

HaplotypeCaller Missing variant with stand_call_conf 30

Dear GATK users,

I have a strange case to debug with HaplotypeCaller GATK v3.7 and the -stand_call_conf 30 parameter. In essence, the variant below is found when -stand_call_conf 30 is not used, and missing when it is.

 chr12  54677628    .   G   A   6505.60 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-12.044;DP=486;ExcessHet=3.0103;FS=1.754;MLEAC=1;MLEAF=0.500;MQ=53.96;MQRankSum=-3.227;QD=13.41;ReadPosRankSum=0.672;SOR=0.835  GT:AD:DP:GQ:PL  0/1:261,224:485:99:6513,0,8151

Could someone explain how -stand_call_conf 30 causes this variant to be missed despite the high depth and QUAL?
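For what it's worth, -stand_call_conf is compared against QUAL, so a record with QUAL 6505.60 should comfortably clear a threshold of 30; the real cause is likely elsewhere in the genotyping step. The QUAL-threshold semantics themselves can be emulated on a VCF body with awk (toy records, made up for illustration):

```shell
# Emulate the QUAL >= threshold part of -stand_call_conf on a toy VCF body
# (header lines omitted; column 6 is QUAL).
cat > calls.vcf <<'EOF'
chr12 54677628 . G A 6505.60 . AC=1
chr12 54677700 . T C 12.30 . AC=1
EOF
# Keep only records whose QUAL meets the confidence threshold.
awk -v conf=30 '$6 + 0 >= conf' calls.vcf > kept.vcf
cat kept.vcf
```

The record at 54677628 survives a threshold of 30, so if HaplotypeCaller drops it, the interaction is not a simple QUAL cutoff.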


GATK resource bundles scattered_calling_intervals exclude small contigs

Hi there,

I was just going over some HaplotypeCaller and VQSR results generated using your best-practices Cromwell workflows, and found that the scattered_calling_intervals files you provide (and which those workflows operate over) do not cover the whole genome. For hg38, chrM and all of the alt/unplaced contigs are excluded. For b37, chrY is also excluded.

https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37
https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/

This seems like a fairly major bug that would cause people running your best practices to lose a good number of potentially important variants.
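One way to check this on your own bundle is to compare the contig names in the scattered interval files against the reference index. A minimal sketch with made-up toy inputs (.fai columns are name, length, offset, linebases, linewidth):

```shell
# Toy reference index and interval list; chrM deliberately has no intervals.
cat > ref.fa.fai <<'EOF'
chr1 248956422 112 80 81
chrY 57227415 500 80 81
chrM 16569 900 80 81
EOF
cat > scatter.bed <<'EOF'
chr1 0 248956422
chrY 0 57227415
EOF
# List contigs present in the reference but absent from the intervals.
awk '{print $1}' ref.fa.fai | sort > all.contigs
awk '{print $1}' scatter.bed | sort -u > covered.contigs
comm -23 all.contigs covered.contigs > uncovered.contigs
cat uncovered.contigs
```

Running the same comparison against the real bundle files (concatenating all scatter shards first) shows exactly which contigs the workflows never touch.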

HaplotypeCaller, vcf mode, ./.

Hi,

maybe it is an old question but I can not find it in the forum...

I was using HaplotypeCaller (Linux server, v4.1.0.0) in the CNN pipeline and saw that it calls variants with a ./. genotype.

My VCF has 31127 variants, but 46 look like the following...

chr1 154590149 . G C 0 . FS=0.000;MBQ=0,0;MFRL=0,0;MLEAC=0;MLEAF=NaN;MMQ=0,0;MPOS=0;SOR=0.693 GT ./.

The code is:

${GATK4} --java-options "${javaOpt3}" HaplotypeCaller \
-R ${hg38} -I ${bqsr_BAM} -O ${VCF} -L ${INTERVAL} \
-bamout ${bamout_BAM} \
--dont-trim-active-regions -stand-call-conf 0 \
-A Coverage -A ChromosomeCounts -A BaseQuality -A FragmentLength -A MappingQuality -A ReadPosition \
--tmp-dir ${tmp}/

How/why does HaplotypeCaller call variants with a ./. genotype (i.e., with no genotype)?

In the end, all ./. variants are excluded by the FilterVariantTranches (CNN_2D) tool, with values ranging from -10.410 to -3.456.

Many thanks
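If you want to drop the no-call records yourself rather than rely on downstream filtering, the sample column can be screened directly. A minimal sketch on a toy single-sample VCF body (records are made up for illustration):

```shell
# Toy single-sample VCF body; column 10 is the sample's genotype field.
cat > sample.vcf <<'EOF'
chr1 154590149 . G C 0 . SOR=0.693 GT ./.
chr1 154590200 . A T 812 . SOR=1.1 GT:DP 0/1:30
EOF
# Keep records whose sample field does not start with the ./. no-call.
awk '$10 !~ /^\.\/\./' sample.vcf > called.vcf
cat called.vcf
```

On real data you would run this only on non-header lines (or use a dedicated VCF tool), but the idea is the same: a ./. record carries no genotype call to act on.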

Variants found in GVCF but not in VCF

I suppose the main difference between a GVCF and a VCF is the non-variant records.
Curiously, I found some variants that were called in the GVCF but not in the VCF.
This is a variant I found in the GVCF but couldn't find in the VCF:

19 55665584 . A C,<NON_REF> 10.01 . DP=4;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=14400.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:36,6,0,36,6,36:0,0,2,0

Although I can filter it out by its low DP, I still want to know why it happened.
Is there some difference in the calling steps or algorithm when HaplotypeCaller produces a GVCF versus a VCF that makes this difference?
Many thanks
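A likely factor is that GVCF mode deliberately emits low-confidence records (this one has QUAL 10.01 and DP 4) that the later joint-genotyping step then drops. Whatever the cause, such records can be screened by DP with awk; a minimal sketch on toy records (made up for illustration):

```shell
# Toy GVCF variant records; column 8 is INFO, which carries DP=<n>.
cat > variants.g.vcf <<'EOF'
19 55665584 . A C,<NON_REF> 10.01 . DP=4;ExcessHet=3.0103
19 55665600 . G T,<NON_REF> 620.40 . DP=35;ExcessHet=3.0103
EOF
# Extract DP from the INFO field and keep records with DP >= 10.
awk '{
  info = ";" $8                              # prepend ";" so DP= at the
  if (match(info, /;DP=[0-9]+/)) {           # start of INFO also matches
    dp = substr(info, RSTART + 4, RLENGTH - 4)
    if (dp + 0 >= 10) print
  }
}' variants.g.vcf > deep.vcf
cat deep.vcf
```

The DP=4 record is dropped, which mirrors what you would do manually; it does not, of course, explain the internal difference between the two calling modes.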

How to find HaplotypeScore?

I was trying to calculate HaplotypeScore and got a warning message: "Annotation will not be calculated, must be called from UnifiedGenotyper."

Can you please tell me the command to calculate HaplotypeScore using UnifiedGenotyper?
I have seen the -A parameter (for annotations), but I don't know what values can be passed to it.
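For reference, in GATK3 annotations are requested by name with -A on the UnifiedGenotyper command line. A hedged sketch (the jar name and all file paths are placeholders, not files from this thread):

```shell
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I sample.bam \
    -A HaplotypeScore \
    -o calls.vcf
```

The values accepted by -A are the annotation class names listed in the GATK3 VariantAnnotator documentation.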

How is the phasing done in single sample HaplotypeCaller?

Hi, after running HaplotypeCaller with this command:
gatk --java-options "-Xmx4g" HaplotypeCaller -R $refGenome -I /home/ready.bam -ERC GVCF -O /home/GATK4-HC.g.vcf

I obtain positions in the GVCF file like these:
1 1243896 . C T, 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT ./.
1 1243929 . G T, 0.01 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID ./.:0|1:1243929_G_T
1 4204648 . CTACCA C, 0 . MLEAC=0,0;MLEAF=NaN,NaN GT:PGT:PID ./.:0|1:4204601_T_C
1 6292991 . C . . END=6293126 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

How is it possible that those positions have 0|1 if I have not used any database or any other samples?

How is ./. different from 0/0:0:0:0:0,0,0 in the GT field?

Thank you very much
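The 0|1 in PGT comes from HaplotypeCaller's read-backed physical phasing of nearby sites on the same assembled haplotype within one sample — no database or second sample is needed. Records that share a PID belong to one phase group, which can be seen by grouping on the PID field; a sketch with toy records (the second site is made up for illustration):

```shell
# Toy GVCF records; column 9 is FORMAT, column 10 the sample field,
# and PID (last sub-field) names the physical-phasing group.
cat > hc.g.vcf <<'EOF'
1 1243929 . G T,<NON_REF> 0.01 . MLEAC=0,0 GT:PGT:PID ./.:0|1:1243929_G_T
1 1243950 . A C,<NON_REF> 0.01 . MLEAC=0,0 GT:PGT:PID ./.:0|1:1243929_G_T
EOF
# Print "PID position" so sites phased onto the same haplotype line up.
awk '$9 ~ /PID/ { n = split($10, f, ":"); print f[n], $2 }' hc.g.vcf > phase_groups.txt
cat phase_groups.txt
```

Both sites report the same PID (named after the first site in the group), meaning the reads supported them on the same local haplotype. As for the GT question: ./. means no call could be made, whereas 0/0:0:0:0:0,0,0 is an explicit homozygous-reference call, albeit one with zero depth and zero genotype quality.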