Channel: haplotypecaller — GATK-Forum

Recommended protocol for bootstrapping HaplotypeCaller and BaseRecalibrator outputs?


I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecalibrator, the following is recommended when a reliable set of sequence variants is lacking:
"You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecalibrator in a loop should be fairly straightforward. What is a good strategy for comparing VCF files and assessing convergence?
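
The convergence check itself can be as simple as measuring site concordance between successive rounds. Here is a minimal Python sketch, assuming plain (optionally gzipped) VCFs; the file names and the 0.99 cutoff are illustrative assumptions, not a GATK recommendation:

import gzip

def variant_keys(path):
    # Collect (CHROM, POS, REF, ALT) keys from a VCF, skipping header lines.
    opener = gzip.open if path.endswith(".gz") else open
    keys = set()
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            f = line.rstrip("\n").split("\t")
            keys.add((f[0], f[1], f[3], f[4]))
    return keys

def jaccard(a, b):
    # Fraction of sites shared between the two call sets.
    return len(a & b) / len(a | b) if (a | b) else 1.0

prev = variant_keys("round1.filtered.vcf.gz")   # hypothetical file names
curr = variant_keys("round2.filtered.vcf.gz")
concordance = jaccard(prev, curr)
print(f"site concordance: {concordance:.4f}")
if concordance > 0.99:                          # illustrative cutoff
    print("call sets have converged; stop bootstrapping")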


How to get exact allele frequency using HaplotypeCaller-GATK4 and speed up the running?


Hi there,

I have run HaplotypeCaller in GATK4 (version 4.0.9.0) for variant calling of germline DNA. Here are some results from the VCF file.

chr1 17365 rs369606208 C G 146.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.215;DB;DP=53;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=40.27;MQRankSum=-0.719;QD=2.77;ReadPosRankSum=-0.581;SOR=0.664 GT:AD:DP:GQ:PL 0/1:43,10:53:99:175,0,1354
chr1 17407 rs372841554 G A 249.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.924;DB;DP=106;ExcessHet=3.0103;FS=2.034;MLEAC=1;MLEAF=0.500;MQ=46.70;MQRankSum=-0.797;QD=2.36;ReadPosRankSum=0.729;SOR=0.433 GT:AD:DP:GQ:PL 0/1:84,22:106:99:278,0,2088
chr1 981931 rs2465128 A G 2859.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=0.989;DB;DP=98;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=29.18;ReadPosRankSum=-0.013;SOR=1.313 GT:AD:DP:GQ:PL 1/1:2,96:98:99:2888,239,0

As you can see, all the AF values are the estimated 0.5 or 1.0, not exact fractions. Meanwhile, I ran Mutect2-GATK4 on another sample, and I could get the exact AF from Mutect2. Would you please help me figure this out?

Here is the script for HaplotypeCaller:
gatk --java-options "-Xmx40g" HaplotypeCaller -R /data3/IonProton/bwa-index_hg19/hg19_forGATK_sinica.fa -I 13029-10ng_F0x100_intersect.bam -O 13029-10ng_F0x100_intersect_HC.vcf.gz --dbsnp /data3/IonProton/bwa-index_hg19/dbsnp_138.hg19.vcf

In addition, is there any way to speed up the running of HaplotypeCaller and Mutect2? A single job (sample) took about 14 hr for HaplotypeCaller and 19 hr for Mutect2.

Thank you!

Variant not being called by HC GATK v3.7-0-gcfedb67


Hello,
We are calling variants on data sequenced on the NextSeq platform. We have been using the same pipeline, with the same commands, for a year, and every run includes a control sample to check that sequencing and variant calling were done right. For this particular run, a known SNP at 7:143013285 that was called in the same sample in the previous 5 runs (over the last year) was missed by HaplotypeCaller. On looking at the BAM file, the variant seems to be present (highlighted BAM file). The two BAM files above are from the same sample, called in previous runs, where HC was able to pick it up. The commands I use are as follows:

trim_galore -q 0 --paired --fastqc $R1_fastq $R2_fastq --output_dir $FASTQ

bwa mem -M -t 8 $ind.fa $FASTQ/${s_id}.R1_val_1.fq.gz $FASTQ/${s_id}.R2_val_2.fq.gz | sambamba_v0.6.6 view -t 8 -S -h -f bam -o $s_id.bam /dev/stdin
sambamba_v0.6.6 sort -t 8 -o $s_id.sorted.bam $s_id.bam
sambamba_v0.6.6 index -t 8 $s_id.sorted.bam

java -jar $picard AddOrReplaceReadGroups I=$s_id.sorted.bam O=$s_id.sorted.RG.bam SORT_ORDER=coordinate RGID=$s_id RGLB=$flowcell RGPL=illumina RGPU=U RGSM=$RUNNAME
sambamba_v0.6.6 index $s_id.sorted.RG.bam 
sambamba_v0.6.6 markdup -t 8 $s_id.sorted.RG.bam $s_id.markdup.bam 
sambamba_v0.6.6 index $s_id.markdup.bam 

java -Xmx8g -Djava.io.tmpdir=/ionng/tmp -jar $gatk -T BaseRecalibrator \
    -I $s_id.markdup.bam \
    -R $ind.fa \
        -knownSites dbsnp_138.b37.vcf \
        -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf \
        -knownSites 1000G_phase1.indels.b37.vcf \
    -o $s_id.recal_data.table \
    -L $bed 

#Apply the Recalibration
java -Xmx8g -Djava.io.tmpdir=$TMPDIR -jar $gatk -T PrintReads \
    -I $s_id.markdup.bam \
    -R $ind.fa \
    -BQSR $s_id.recal_data.table \
    -o $s_id.${RUNNAME}.variant_ready.bam 

java -Xmx32g -jar $gatk -T HaplotypeCaller \
    -R $ind.fa --dbsnp $dbsnp_138.b37.vcf \
    -I $s_id.${RUNNAME}.variant_ready.bam \
    -stand_call_conf 30.0 \
    -L $bed \
    -o $s_id.${RUNNAME}.g.vcf

Things I have tried that did not work:
1. Running HC with the option -allowNonUniqueKmers
2. Changing parameters to -stand_call_conf 2.0 -mmq 5
3. Running with -ERC BP_RESOLUTION, which results in
7 143013285 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:9,2:11:0:0,0,152

NOTE: The variant was picked up when I ran FREEBAYES and VARSCAN using default parameters.

FREEBAYES
7 143013285 . C T 206.73 . AB=0.394737;ABP=6.66752;AC=1;AF=0.5;AN=2;AO=15;CIGAR=1X;DP=38;DPB=38;DPRA=0;EPP=20.5268;EPPR=37.093;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=47.6012;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=495;QR=811;RO=23;RPL=2;RPP=20.5268;RPPR=37.093;RPR=13;RUN=1;SAF=15;SAP=35.5824;SAR=0;SRF=23;SRP=52.9542;SRR=0;TYPE=snp;technology.illumina=1 GT:DP:AD:RO:QR:AO:QA:GL 0/1:38:23,15:23:811:15:495:-33.4265,0,-61.8717

VARSCAN
7 143013285 . C T . PASS ADP=38;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:52:38:38:23:15:39.47%:5.4463E-6:35:33:23:0:15:0

I can email the bamout file if required (though I am not allowed to upload it publicly.)

Any suggestions would be helpful. Thank you.

ActiveRegion determination (HaplotypeCaller & Mutect2)


This document details the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.

Note that some of the command line argument names in this article may not be up to date. If you encounter any problems, please let us know in the comments so we can fix them.


Contents

  1. Overview
  2. Calculating the raw activity profile
  3. Smoothing the activity profile
  4. Setting the ActiveRegion thresholds and intervals

1. Overview

To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints.


2. Calculating the raw activity profile

Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus.

In the normal case (when no special mode is enabled), the per-position score is the probability that the position contains a variant, as calculated using the reference-confidence model applied to the original alignment.

If using the mode for genotyping given alleles (GGA) or the advanced-level flag -useAlleleTrigger, and the site is overlapped by an allele in the VCF file provided through the -alleles argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0.

This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the -L argument).


3. Smoothing the activity profile

The final profile is calculated by smoothing the initial raw profile in three steps. The first two steps spread each position's raw profile value out to contiguous bases, so each position accumulates several raw profile values; in the third and last step these are summed to obtain a single, smoothed value per position (a toy sketch of the kernel smoothing follows the list below).

  1. Unless one of the special modes is enabled (GGA or allele triggering), the position's profile value is copied over to adjacent regions if enough high-quality soft-clipped bases immediately precede or follow that position in the original alignment. At the time of writing, high-quality soft-clipped bases are those with a quality score of Q29 or more, and there are considered to be enough of them when the average number of high-quality bases per soft-clip is 7 or more. In this case the site's profile value is copied to all bases within a radius of that position as large as the average soft-clip length, up to a maximum of 50bp.

  2. Each profile value is then divided and spread out using a Gaussian kernel covering up to 50bp radius centered at its current position with a standard deviation, or sigma, set using the -bandPassSigma argument (current default is 17 bp). The larger the sigma, the broader the spread will be.

  3. For each position, the final smoothed value is calculated as the sum of all its profile values after steps 1 and 2.
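
To make the kernel smoothing concrete, here is a minimal Python sketch of steps 2 and 3, assuming a symmetric Gaussian kernel truncated at a 50bp radius and normalized so each raw value is divided across positions rather than inflated (the normalization detail is an assumption for illustration; the soft-clip spreading of step 1 is omitted):

import math

def smooth_profile(raw, sigma=17.0, max_radius=50):
    # Precompute truncated Gaussian weights centered on offset 0.
    weights = [math.exp(-(d * d) / (2 * sigma * sigma))
               for d in range(-max_radius, max_radius + 1)]
    total = sum(weights)
    weights = [w / total for w in weights]

    smoothed = [0.0] * len(raw)
    for pos, value in enumerate(raw):
        if value == 0.0:
            continue                              # nothing to spread
        for k, w in enumerate(weights):
            target = pos + k - max_radius
            if 0 <= target < len(raw):
                smoothed[target] += value * w     # step 3: sum contributions
    return smoothed

profile = [0.0] * 200
profile[100] = 1.0                  # a single active site...
print(max(smooth_profile(profile))) # ...becomes a smooth bump ~100bp wide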


4. Setting the ActiveRegion thresholds and intervals

The resulting profile line is cut into regions wherever it crosses the non-active-to-active threshold (currently set to 0.002); a toy segmentation sketch appears after the list below. We then make some adjustments to these boundaries so that the regions to be considered active, with a profile running over that threshold, fall within the minimum (fixed at 50bp) and maximum region size (customizable using -activeRegionMaxSize).

  • If the region size falls within the limits we leave it untouched (it's good to go).

  • If the region size is shorter than the minimum, it is greedily extended forward, ignoring that cut point, and we come back to step 1. Only if this is not possible because we hit a hard limit (the end of the chromosome or of the requested analysis interval) do we accept the small region as it is.

  • If it is too long, we find the lowest local minimum between the maximum and minimum region size. A local minimum is a profile value preceded by a larger value immediately upstream (-1bp) and followed by an equal or larger value immediately downstream (+1bp). In case of a tie, the one further downstream takes precedence. If there is no local minimum, we simply force the cut so that the region has the maximum active region size.

Of the resulting regions, those with a profile that runs over the threshold are considered active regions and progress to variant discovery and/or calling, whereas regions whose profile runs under the threshold are considered inactive and are discarded, unless we are running HC in reference confidence mode.
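
Here is a toy Python segmentation of a smoothed profile at the 0.002 threshold; the minimum-size extension and local-minimum splitting described above are deliberately omitted to keep the sketch short:

def active_intervals(profile, threshold=0.002):
    # Return 0-based inclusive (start, end) runs where the profile
    # exceeds the activity threshold.
    intervals, start = [], None
    for pos, value in enumerate(profile):
        if value > threshold and start is None:
            start = pos                          # entering an active run
        elif value <= threshold and start is not None:
            intervals.append((start, pos - 1))   # leaving an active run
            start = None
    if start is not None:                        # profile ends while active
        intervals.append((start, len(profile) - 1))
    return intervals

print(active_intervals([0.0, 0.01, 0.05, 0.001, 0.003, 0.004, 0.0]))
# -> [(1, 2), (4, 5)]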

There is a final post-processing step to clean up and trim the ActiveRegion:

  • Remove bases at each end of the read (hard-clipping) until reaching a base with a call quality equal to or greater than the minimum base quality score (customizable parameter -mbq, 10 by default).

  • Include or exclude remaining soft-clipped ends. Soft-clipped ends will be used for assembly and calling, provided the user has not requested their exclusion (using -dontUseSoftClippedBases), the read and its mate map to the same chromosome, and they are in the standard orientation (i.e. LR and RL).

  • Clip off adaptor sequences of the read if present.

  • Discard all reads that no longer overlap with the ActiveRegion after the trimming operations described above.

  • Downsample remaining reads to a maximum of 1000 reads per sample, while respecting a minimum of 5 reads starting per position. This is performed after any downsampling by the traversal itself (-dt, -dfrac, -dcov etc.) and cannot be overridden from the command line.

Meaning of --min_base_quality_score


What does --min_base_quality_score mean?
Is it based on the mapping quality in the SAM/BAM file or on the sequencing base quality?

I'm a little bit confused by the description in the tool documentation for UnifiedGenotyper and HaplotypeCaller.

HaplotypeCaller says dict does not exist, but it does!


Hi All,

I am running HaplotypeCaller and getting the error:

ERROR MESSAGE: Fasta dict file /net/rcnfs02/srv/export/duraisingh_lab/share_root/data/Plasmodium_knowlesi/jva/PlasmoDB-26_PknowlesiH_Genome_02.dict for reference /net/rcnfs02/srv/export/duraisingh_lab/share_root/data/Plasmodium_knowlesi/jva/PlasmoDB-26_PknowlesiH_Genome_02.fasta does not exist

BUT... the dictionary DOES exist! I made it with CreateSequenceDictionary.jar and it looks OK.

The reference dict and fasta are symbolically linked to the working directory. I did some googling on this but no luck.

Best,

Jon

HaplotypeCaller optional arguments


Hello GATK,

I'm using GATK 4.0.6 HaplotypeCaller to call SNP and indel sites in a wild bird.

Command
nohup java -jar /opt/GATK-4.0.6/GATK-4.0.6-local.jar HaplotypeCaller -R bird_genome_reference.fa -I input.bam --minimum-mapping-quality 25 -mbq 13 -O VCF_two_type/HC.vcf --output-mode EMIT_VARIANTS_ONLY &

When I check the output in the nohup file, I notice that many optional arguments are listed there, but I don't see some of them in the ToolDoc for HaplotypeCaller.

Why aren't all the available arguments listed in the ToolDoc?

I want to set the thresholds of two short-variant callers (bcftools mpileup & HaplotypeCaller) as similarly as possible, and then extract the concordant short-variant sites with SelectVariants to improve my SNP and indel quality.

HaplotypeCaller reassembles reads so that no variants are called in a specific region


Hi.

gatk-4.0.7.0
java openjdk version "1.8.0_181"

I've encountered a problem while running HaplotypeCaller in -ERC BP_RESOLUTION --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES mode on single-sample paired-end WGS data. The problem is that HC reassembles reads such that a particular region of interest contains no supporting reads in the corresponding bamout file, while the original BAM file does contain them. The region of interest is the CYP2D6 gene on chr22.

Commands

After trimming the reads I align them to GRCh38 with Bowtie2:

bowtie2 --threads 16 --trim5 0 --trim3 0 --phred33 --no-mixed --no-discordant --no-unal --local -x /hg38/bowtie2/hg38 -q -1 /trimmomatic_output/sample_R1.paired.fastq.gz -2 /trimmomatic_output/sample_R2.paired.fastq.gz -t -S /bowtie2_output/sample_raw.sam --met-file /path_to_met_file

This is followed by sorting the resulting SAM file and its conversion to BAM, running Picard's MarkDuplicates on the resulting BAM and further filtering using samtools with flags -q 10 -f 0x2 -F 0x4 -F 0x100 -F 0x400 -F 0x800. BQSR is applied next followed by calling on separate chromosomes/ROIs:

/home/ubuntu/Tools/gatk-4.0.7.0/gatk  HaplotypeCaller --reference /hg38.fa --input /bqsr_output/sample.bqsr.bam --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES --read-filter NotSecondaryAlignmentReadFilter --read-filter NotDuplicateReadFilter --read-filter MappingQualityAvailableReadFilter  --read-filter NotSupplementaryAlignmentReadFilter --smith-waterman FASTEST_AVAILABLE --min-base-quality-score 10 --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --output /gatk_gvcf_output/sample.diploid.raw.g.vcf.gz --emit-ref-confidence BP_RESOLUTION --sample-ploidy 2 --native-pair-hmm-threads 16 -L chr22

The resulting files are combined with CombineGVCFs, followed by genotype calling using GenotypeGVCFs (GATK 3.8-1-0-gf15c1c3ef). This old version is used because the --includeNonVariantSites option is not implemented in the up-to-date package (as far as I know), while I need the reference calls for further analysis.

Problems

The sample output from the final VCF file from GenotypeGVCFs for all positions in the region of interest looks like this (zero AD and DP, which seems consistent with the bamout containing no supporting reads):

chr22   42130000    .   G   .   .   .   .   GT:AD:DP:RGQ    ./.:0,0:0:0

The IGV screenshots of the original BAM file after samtools filtering and of the bamout BAM file from HC are attached below.

I have dug into some of the old threads on similar problems (link, link, link, link) and found no appropriate solution. I tried to run HC as above but with the added parameters --dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref. As in the previous case, this resulted in a complete absence of supporting reads in the bamout (the upper empty panel is the bamout output, the lower is the original BAM):

Finally I tried running HC with --dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref and --read-filter AllowAllReadsReadFilter --disable-tool-default-read-filters True. The IGV screenshot is below:


Getting out-of-memory errors while running the workflow for germline short variant discovery


I'm trying to run the wdl posted on the gatk-workflows Github page, under the gatk4-germline-snps-indels repository. The wdl is "haplotypecaller-gvcf-gatk4.wdl"

I'm attempting to run this wdl locally on my computer. The wdl script uses GATK in Docker containers to execute tools such as HaplotypeCaller and MergeVcfs. I'm using Cromwell in "run mode" to run the wdl script, with the exact inputs listed in the haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json file.

The BAM file is NA12878_24RG_small.hg38.bam, which is about 5 GB in size.
The FASTA file is Homo_sapiens_assembly38.fasta, which is about 3 GB in size.

Any time I run this I eventually get out-of-memory errors. It seems like 50 GATK Docker containers get spun up and run HaplotypeCaller in parallel; I think this is due to the number of interval lists declared in hg38_wgs_scattered_calling_intervals.txt?

I'm running it on a machine with 32G of RAM and 512GB of disk space. My questions are basically:

  1. How much RAM is needed to run this workflow?
  2. Should I set a limit on how much memory each docker container can use in the Cromwell configuration file, and if so, how much should I set it to?
  3. What should the Java heap size be set to?
  4. It looks like it is using the "scatter-gather" technique for parallelization. Does this require me to set up a cluster of servers to run the workflow? I'm not sure if I can run it like this on just my local computer.

Any insight would be greatly appreciated. Thank you!

Clarification of parameters for HaplotypeCaller


Hi, I'm new to GATK and am looking into using HaplotypeCaller to call chloroplast variants in a plant. I am interested in both the variant sites and the invariant sites in my sample (relative to my reference).

I know that in order to get confidence in homozygous reference sites, I should use:

--emit-ref-confidence BP_RESOLUTION 

But how does this differ from

--output-mode EMIT_ALL_CONFIDENT_SITES

?

Also can someone please clarify the difference between

--base-quality-score-threshold

and

--min-base-quality-score? 

Many thanks!

PS: I am also looking into Mutect2 for calling chloroplast variation, but am not quite convinced that a somatic caller is what I want here, so for now I would just like to understand HaplotypeCaller better. Thanks!

Examples of exome fastq and vcfs using GATK4 HaplotypeCaller + bwa mem 0.7.17?


Hi all,

I'm working on a GATK 4.0 fastq to vcf pipeline, and I'm wondering for validation purposes if anyone knows where I could find, or could share with me, exome fastq files and their corresponding vcfs after running HaplotypeCaller (GATK4) and bwa-mem (0.7.17)? Thanks so much.

How the HaplotypeCaller's reference confidence model works


This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFs), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

  • Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
  • Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site, by calculating the number of reads that provide evidence against such an indel, and from this value estimating the chance that we would have failed to see the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the less confident of these two models.

We use a symbolic allele, <NON_REF>, to represent any possible non-reference allele at the site; because we then have an ALT allele, we can provide allele-specific AD and PL field values.
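
As a rough illustration of how PLs and GQ can fall out of such a contrast, here is a toy Python sketch that scores REF/REF, REF/<NON_REF> and <NON_REF>/<NON_REF> for a diploid pileup using only base qualities. This is a simplified stand-in under an assumed error model and allele-mixture model, not the actual GATK reference model:

import math

def ref_confidence_pls(base_quals, is_ref):
    # log10 P(D|G) for non-ref allele dosage g = 0, 1, 2
    log10 = [0.0, 0.0, 0.0]
    for q, ref in zip(base_quals, is_ref):
        err = 10 ** (-q / 10.0)                  # base-quality error rate
        p_read_ref = 1.0 - err if ref else err   # P(base | true allele is ref)
        p_read_alt = err if ref else 1.0 - err   # P(base | true allele is non-ref)
        for g in (0, 1, 2):
            p = ((2 - g) / 2.0) * p_read_ref + (g / 2.0) * p_read_alt
            log10[g] += math.log10(p)
    raw = [-10.0 * v for v in log10]             # Phred scale
    best = min(raw)
    pls = [round(v - best) for v in raw]         # normalize so best PL = 0
    gq = min(sorted(pls)[1], 99)                 # distance to second-best, capped
    return pls, gq

# 20 ref bases and 1 non-ref base, all at Q20
print(ref_confidence_pls([20] * 21, [True] * 20 + [False]))  # ([0, 42, 379], 42)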

For details of the gVCF format, please see the document that explains what a gVCF is.

MNP and HaplotypeCaller GVCF mode


Hello

I am attempting to run HaplotypeCaller in a way that will merge adjacent SNPs into MNPs.
To do so I set --max-mnp-distance to 1 or 2.

This worked well when I did not use GVCF mode.
However, when I attempted this in GVCF mode I got the following error:
A USER ERROR has occurred: Illegal argument value: Non-zero maxMnpDistance is incompatible with GVCF mode.
(I am using GATK 4.0.8.1).

I am not sure I understand this conceptually:
If my callset contains two (or more) heterozygous SNPs at adjacent genomic sites, they can only be determined to constitute a single MNP if both SNPs originate from the same chromosome/haplotype.
This is determined by phasing the callset, which, as explained in the "Purpose and operation of Read-backed Phasing" page, is only enabled when HaplotypeCaller is run in GVCF or BP_RESOLUTION mode.

Following this reasoning, it appears to me that merging SNPs into MNPs only makes sense in one of these modes, since otherwise SNPs from different haplotypes could be merged erroneously.

Therefore I do not understand why MNP merging is possible without GVCF mode, yet incompatible with GVCF mode.

I will be very glad for an explanation.

HC overview: How the HaplotypeCaller works


This document describes the methods involved in variant calling as performed by the HaplotypeCaller. Please note that we are still working on producing supporting figures to help explain the sometimes complex operations involved.

Overview

The core operations performed by HaplotypeCaller can be grouped into these major steps:

[image: overview diagram of the major HaplotypeCaller steps]

1. Define active regions. The program determines which regions of the genome it needs to operate on, based on the presence of significant evidence for variation.

2. Determine haplotypes by re-assembly of the active region. For each ActiveRegion, the program builds a De Bruijn-like graph to reassemble the ActiveRegion and identifies the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.

3. Determine likelihoods of the haplotypes given the read data. For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles per read for each potentially variant site.

4. Assign sample genotypes. For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data, to calculate the posterior probability of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.


1. Define active regions

In this first step, the program traverses the sequencing data to identify regions of the genome in which the samples being analyzed show substantial evidence of variation relative to the reference. The resulting areas are defined as “active regions” and will be passed on to the next step. Areas that do not show any variation beyond the expected levels of background noise will be skipped in the next step. This aims to accelerate the analysis by not wasting time performing reassembly on regions that are identical to the reference anyway.

To define these active regions, the program operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints. For more details on how the activity profile is computed and processed, as well as what options are available to modify the active region parameters, please see this method article.

Note that the process for determining active region intervals is modified slightly when HaplotypeCaller is run in one of the special modes, e.g. the reference confidence mode (-ERC GVCF or -ERC BP_RESOLUTION), Genotype Given Alleles (-gt_mode GENOTYPE_GIVEN_ALLELES), or when active regions are triggered using advanced arguments such as -allelesTrigger, --forceActive or --activeRegionIn. This is covered in the method article referenced above.

Once this process is complete, the program applies a few post-processing steps to finalize the active regions (see the detailed doc above). The final output of this process is a list of intervals corresponding to the active regions that will be processed in the next step.


2. Determine haplotypes by re-assembly of the active region.

The goal of this step is to reconstruct the possible sequences of the real physical segments of DNA present in the original sample organism. To do this, the program goes through each active region and uses the input reads that mapped to that region to construct complete sequences covering its entire length, which are called haplotypes. This process will typically generate several different possible haplotypes for each active region due to:

  • real diversity on polyploid (including CNV) or multi-sample data
  • possible allele combinations between variant sites that are not totally linked within the active region
  • sequencing and mapping errors

In order to generate a list of possible haplotypes, the program first builds an assembly graph for the active region using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used for scoring and genotyping in the next step.
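
A toy Python sketch of this graph-building idea, using made-up sequences and skipping everything the real assembler handles (dangling ends, pruning, kmer-size retries): each k-mer node records how often each successor base follows it, so well-supported paths carry heavier edge weights:

from collections import defaultdict

def build_kmer_graph(sequences, k=10):
    # Map each k-mer to counts of the base that follows it.
    edges = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(len(seq) - k):
            edges[seq[i:i + k]][seq[i + k]] += 1
    return edges

ref   = "ACGTACGTACGTAAACCCGGG"
reads = ["ACGTACGTACGTAAACCAGGG",   # both reads carry a C>A mismatch
         "ACGTACGTACGTAAACCAGG"]
graph = build_kmer_graph([ref] + reads, k=10)
# At the divergence point, the read-supported branch outweighs the reference:
print(dict(graph["TACGTAAACC"]))    # {'C': 1, 'A': 2}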

The assembly and haplotype determination procedure is described in full detail in this method article.

Once the haplotypes have been determined, each one is realigned against the original reference sequence in order to identify potentially variant sites. This produces the set of sites that will be processed in the next step. A subset of these sites will eventually be emitted as variant calls to the output VCF.


3. Evaluating the evidence for haplotypes and variant alleles

Now that we have all these candidate haplotypes, we need to evaluate how much evidence there is in the data to support each one of them. So the program takes each individual read and aligns it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm, which takes into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). This outputs a score for each read-haplotype pairing, expressing the likelihood of observing that read given that haplotype.

Those scores are then used to calculate how much evidence there is for individual alleles at the candidate sites identified in the previous step. This process is called marginalization over alleles, and it produces the actual numbers that will finally be used to assign a genotype to the sample in the next step.
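
A minimal sketch of one common way to marginalize, assuming a read's likelihood for an allele is its best likelihood over the haplotypes carrying that allele at the site (the exact GATK bookkeeping differs):

def marginalize(read_hap_logliks, hap_allele):
    # read_hap_logliks: one dict per read mapping haplotype -> log-likelihood
    # hap_allele: which allele each haplotype carries at the candidate site
    per_read = []
    for hap_logliks in read_hap_logliks:
        allele_lik = {}
        for hap, loglik in hap_logliks.items():
            a = hap_allele[hap]
            allele_lik[a] = max(allele_lik.get(a, float("-inf")), loglik)
        per_read.append(allele_lik)
    return per_read

hap_allele = {"hap1": "C", "hap2": "C", "hap3": "T"}   # two haplotypes carry C
reads = [{"hap1": -1.2, "hap2": -0.8, "hap3": -9.5},   # this read favors C
         {"hap1": -8.9, "hap2": -7.6, "hap3": -0.4}]   # this read favors T
print(marginalize(reads, hap_allele))
# [{'C': -0.8, 'T': -9.5}, {'C': -7.6, 'T': -0.4}]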

For further details on the pairHMM output and the marginalization process, see this document.


4. Assigning per-sample genotypes

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

For further details on the genotyping calculations, see this document.

This concludes the overview of how HaplotypeCaller works.

java.lang.NullPointerException in HaplotypeCaller when generating gVCF with gatk 4.0.11.0


I am running the latest gatk 4.0.11.0 on aligned reads from whole exome sequencing from TCGA to generate gVCF files. After generating the gVCF file, gatk is crashing with a null pointer exception. I get this exception only when I try to generate gVCF, but not regular VCF, from the same exact input. I also get the exception when I use different reference genomes and input bam files. The generated gVCF looks okay, but it is still strange that the software crashes. I was wondering if you have any suggestions?

Here is how I run gatk and the relevant console output:

$ gatk HaplotypeCaller -R ../../hg38.canonical_chromosomes/hg38.canonical_chromosomes.fa -I C828.TCGA-EB-A3XB-10B-01D-A23B-08.1_gdc_realn.sorted.bam --emit-ref-confidence GVCF -O C828.TCGA-EB-A3XB-10B-01D-A23B-08.1_gdc_realn.sorted.bam.genomic.hg38_canonical_chromosomes.vcf.gz

Using GATK jar /home/pfiziev/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/pfiziev/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar HaplotypeCaller -R ../../hg38.canonical_chromosomes/hg38.canonical_chromosomes.fa -I C828.TCGA-EB-A3XB-10B-01D-A23B-08.1_gdc_realn.sorted.bam --emit-ref-confidence GVCF -O C828.TCGA-EB-A3XB-10B-01D-A23B-08.1_gdc_realn.sorted.bam.genomic.hg38_canonical_chromosomes.vcf.gz
11:17:56.245 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/pfiziev/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:17:57.952 INFO HaplotypeCaller - ------------------------------------------------------------
11:17:57.953 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.11.0
11:17:57.953 INFO HaplotypeCaller - For support and documentation go to
11:17:57.953 INFO HaplotypeCaller - Executing as pfiziev@node005 on Linux v3.10.0-693.11.6.el7.x86_64 amd64
11:17:57.953 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_161-b14
11:17:57.953 INFO HaplotypeCaller - Start Date/Time: November 8, 2018 11:17:56 AM PST
11:17:57.953 INFO HaplotypeCaller - ------------------------------------------------------------
11:17:57.954 INFO HaplotypeCaller - ------------------------------------------------------------
11:17:57.954 INFO HaplotypeCaller - HTSJDK Version: 2.16.1
11:17:57.954 INFO HaplotypeCaller - Picard Version: 2.18.13
11:17:57.955 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
11:17:57.955 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:17:57.955 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:17:57.955 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:17:57.955 INFO HaplotypeCaller - Deflater: IntelDeflater
11:17:57.955 INFO HaplotypeCaller - Inflater: IntelInflater
11:17:57.955 INFO HaplotypeCaller - GCS max retries/reopens: 20
11:17:57.955 INFO HaplotypeCaller - Requester pays: disabled
11:17:57.955 INFO HaplotypeCaller - Initializing engine
11:17:58.487 INFO HaplotypeCaller - Done initializing engine
11:17:58.489 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
11:17:58.499 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
11:17:58.499 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
11:17:58.512 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/pfiziev/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
11:17:58.514 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/pfiziev/software/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
11:17:58.571 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
11:17:58.572 INFO IntelPairHmm - Available threads: 56
11:17:58.572 INFO IntelPairHmm - Requested threads: 4
11:17:58.572 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
11:17:58.682 INFO ProgressMeter - Starting traversal
11:17:58.683 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
11:18:03.297 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
11:18:03.297 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null

14:10:58.160 INFO ProgressMeter - chrY:27588245 173.0 10830630 62608.0
14:11:08.290 INFO ProgressMeter - chrY:37023245 173.2 10862080 62728.5
14:11:18.292 INFO ProgressMeter - chrY:46002245 173.3 10892010 62840.9
14:11:28.292 INFO ProgressMeter - chrY:54567245 173.5 10920560 62945.1
14:11:32.729 WARN DepthPerSampleHC - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
14:11:32.729 WARN StrandBiasBySample - Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null
14:11:33.219 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 2.404777391
14:11:33.219 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 255.64041847700003
14:11:33.219 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 417.38 sec
14:11:33.219 INFO HaplotypeCaller - Shutting down engine
[November 8, 2018 2:11:33 PM PST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 173.62 minutes.
Runtime.totalMemory()=12706119680
java.lang.NullPointerException
at org.broadinstitute.hellbender.engine.AssemblyRegion.getReference(AssemblyRegion.java:443)
at org.broadinstitute.hellbender.engine.AssemblyRegion.getAssemblyRegionReference(AssemblyRegion.java:464)
at org.broadinstitute.hellbender.engine.AssemblyRegion.getAssemblyRegionReference(AssemblyRegion.java:450)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerUtils.createReferenceHaplotype(AssemblyBasedCallerUtils.java:149)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.referenceModelForNoVariation(HaplotypeCallerEngine.java:682)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:521)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:240)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:291)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)


Variant discovery starting from gVCF file


Hello, as the title suggests, I'm looking to use the variant discovery tools, specifically for SNP discovery. However, I am not starting with a FASTA or BAM file (indeed, I do not currently have access to them); instead, I'm starting with a gVCF file. Many of the relevant commands, such as HaplotypeCaller, do not accept gVCFs as input, and for the commands that do use gVCFs, it is implied that they should have been produced by HaplotypeCaller.

Using HaplotypeCaller 3.5 vs HaplotypeCaller 4.0 joint calling large cohorts


We've been testing the published Broad "production" workflow for paired-end single-sample alignment and variant calling, github.com/gatk-workflows/gatk4-germline-snps-indels. The workflow in the "production" pipeline appears to use HaplotypeCaller from GATK 3.5 while using other components from GATK 4. There is a separately published pipeline, github.com/gatk-workflows/gatk4-germline-snps-indels, that uses GATK 4's HaplotypeCaller. We have tried running the joint discovery pipeline using both the HC 3.5 and HC 4 gvcfs as inputs, using the default parameters provided in the repositories. However, the results appear to be almost indistinguishable when run on the NIST NA12878 reference sample. This surprised us, since the model/parameters have changed between 3.5 and 4.

Question: Why is the Broad "production" pipeline still using HaplotypeCaller 3.5 instead of some 4.x version?
Question: We intend to align and joint-call upwards of 8000 WGS samples on Google Cloud. Is it recommended to use the output from HaplotypeCaller 3.5 (we were trying to use the Broad "production" pipeline with as few modifications as possible), or to use HaplotypeCaller 4 instead before running joint calling?

A question about the kmer lengths used during the second step of HaplotypeCaller



As shown in my screenshot, I found that HC parses the sequence corresponding to the ActiveRegion on the reference genome, and the reads, into kmers of length 10 and 25, respectively.

Furthermore, (https://software.broadinstitute.org/gatk/documentation/article.php?id=4146) here you state that in the read-threading process, HC starts with the first read and compares its first kmer to the hash table to find a match.

Given this, I have two points of confusion:
Shouldn't the kmer length be an odd number?
If the kmer length is not consistent between the ref-kmers and the read-kmers, how can a read-kmer be considered a match to a ref-kmer in the hash table?

Another small inquiry: as of the time of my post, I cannot load the web page of your Bundle via FTP. Every time I try to open that page, a little window pops up requiring a username and password. I enter the username and leave the password blank as instructed, but it does not work; the window keeps popping up every time I hit Enter.

Inconsistent results with HaplotypeCaller on haploid organism


Hello GATK team,

I would appreciate some help in understanding how GATK works in GVCF mode on my data.
Here is an example from my data (I'm using GATK v3.8):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 328-16 983-16
NC_018661.1 859953 . C T 31035.30 SnpCluster AC=15;AF=0.536;AN=28;BaseQRankSum=-1.134e+00;ClippingRankSum=0.00;DP=7961;FS=3.292;MLEAC=15;MLEAF=0.536;MQ=56.52;MQRankSum=0.00;QD=7.76;ReadPosRankSum=-5.480e-01;SOR=0.486 GT:AD:DP:GQ:PL 0:354,0:354:99:0,106 1:157,157:314:99:361,0

  • The first weird thing is that the variant looks heterozygous yet is called with the highest GQ (99), even though we are analyzing a haploid sample; this differs from the explanation given in this post.

  • The second issue appears when we inspect this position in IGV in our aligned reads from bwa mem (BAM format). There we see that both samples appear to have this site at ~50% AD, yet HaplotypeCaller calls them completely differently.

These are the parameters we use for one sample:
java -Djava.io.tmpdir=/processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/TMP/ -Xmx10g -jar /opt/gatk/gatk-3.8.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/REFERENCES/GCF_000299475.1_ASM29947v1_genomic_NoPlasmid.fna -I /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/Alignment/BAM/328-16/328-16.woduplicates.bam -o /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/variants/328-16.g.vcf -stand_call_conf 30 --emitRefConfidence GVCF -ploidy 1 -S LENIENT -log /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/snp_indels.vcf-HaplotypeCaller.log

How is this even possible? (I have checked many times that the BAM files viewed in IGV and passed to GATK are the same, you never know...)

Could the effect referred to in this thread somehow be affecting the variant calling? Should we use BP_RESOLUTION? What is the main difference between GVCF and BP_RESOLUTION mode?

Our first idea was to filter our GVCFs by AD using JEXL expressions, but since the GVCF has reference blocks with no AD, the command fails:

ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![35,47]: 'vc.getGenotype('328-16').getAD().1.floatValue() / vc.getGenotype('328-16').getDP() > 0.90;' attempting to call method on null

I could filter them manually before GenotypeGVCFs, but is that good practice? As I read in this thread, it is not recommended, obviously because we would override the GATK model, which takes many more variables into account...
Any ideas? We are kind of struggling; maybe it is something trivial, but we can't see it. Any help will be much appreciated.

Thanks very much in advance,
Best Regards
Sara

HC step 4: Assigning per-sample genotypes


This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as PL and GQ.

Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in -ERC modes (GVCF and BP_RESOLUTION) please see the reference confidence model documentation.

Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.


1. Preliminary assumptions / limitations

Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

Ploidy

Both HaplotypeCaller and GenotypeGVCFs (but not UnifiedGenotyper) assume that the organism of study is diploid by default, but the desired ploidy can be set using the -ploidy argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation. The generalized form of the genotyping algorithm that can handle ploidies other than 2 is available as of version 3.3-0. Note that using high ploidy for pooled experiments is subject to some practical limitations, due to the number of possible genotype combinations resulting from the interaction between ploidy and the number of alternate alleles under consideration (currently, the maximum "workable" ploidy is ~20 for a max number of alt alleles = 6); a quick sketch of this combinatorial growth follows. Future developments will aim to mitigate those limitations.
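
To see that growth concretely: the number of unordered genotypes for ploidy P over A alleles is the multiset coefficient C(P + A - 1, P). A quick Python check (the formula is standard combinatorics; the specific numbers come from the text above):

from math import comb

def genotype_count(ploidy, n_alleles):
    # Multisets of size `ploidy` drawn from `n_alleles` distinct alleles.
    return comb(ploidy + n_alleles - 1, ploidy)

print(genotype_count(2, 2))    # diploid, ref + 1 alt    -> 3 (AA, AT, TT)
print(genotype_count(20, 7))   # ploidy 20, ref + 6 alts -> 230230 genotypes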

Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.


2. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype G:

$$ P(G) $$

represents how probable we expect this genotype to be, based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value), but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod_{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype G.


3. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.
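
Putting the pieces together, here is a toy Python sketch of this selection step for a diploid site, applying the P(D|G) product formula above with a flat prior (which cancels when ranking genotypes); the per-read allele likelihoods are made-up numbers standing in for the marginalized PairHMM output:

import math
from itertools import combinations_with_replacement

def pick_genotype(read_allele_liks, alleles):
    # read_allele_liks: one dict per read mapping allele -> P(D_j | allele)
    log10 = {}
    for h1, h2 in combinations_with_replacement(alleles, 2):  # e.g. AA, AT, TT
        total = 0.0
        for liks in read_allele_liks:
            # P(D_j|G) = P(D_j|H1)/2 + P(D_j|H2)/2
            total += math.log10(liks[h1] / 2.0 + liks[h2] / 2.0)
        log10[(h1, h2)] = total
    return max(log10, key=log10.get), log10

# 6 reads supporting A and 5 supporting T
reads = [{"A": 0.99, "T": 0.01}] * 6 + [{"A": 0.01, "T": 0.99}] * 5
best, table = pick_genotype(reads, ["A", "T"])
print(best)   # ('A', 'T'): the heterozygous genotype wins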

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ -- see the linked docs for details. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.
