Channel: haplotypecaller — GATK-Forum

Why do mutations have much less read depth after variant calling by HaplotypeCaller (GATK 4.0.4.0)?

Hi, I'm new here. I am trying to find germline mutations in multiplex-PCR generated NGS data with HaplotypeCaller (GATK 4.0.4.0). I am confused about why mutations have much less read depth after variant calling compared with the input BAM. For example:

17 41223094 . T C 947.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=-6.201;ClippingRankSum=0.000;DP=75;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=12.64;ReadPosRankSum=-0.181;SOR=0.572 GT:AD:DP:GQ:PL 0/1:55,55:110:99:1586,0,1853

This is a mutation called by HaplotypeCaller. The VCF shows the read depth at this position is 110, but IGV shows the read depth at this position in the BAM is around 1000. What makes this difference?
And I checked that no reads were filtered by HaplotypeCaller.

17:32:41.050 INFO HaplotypeCaller - No reads filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)

Thanks!


Default read filters in HaplotypeCaller

Hi,

Where can I find which read filters are applied by default by HaplotypeCaller in GATK4? (GATK web page or GitHub page)

HaplotypeCaller GVCF very slow on draft WGS

Dear GATK people,

I am trying to call SNPs from 150 samples of WGS data for a non-model species (coral). The reference is a draft genome of 500 Mb; each sample has roughly 15 M paired-end reads. The species is diploid.

I followed the suggested pipeline for data preparation, which resulted in BAM files of roughly 5 GB each.

I then wanted to follow the suggested approach: HaplotypeCaller in GVCF mode -> merge gVCFs -> GenotypeGVCFs.

I am stuck at the first step: HaplotypeCaller with the --ERC GVCF flag.

I first tried it with GATK 3.8: the ProgressMeter estimated 5 hours for the first sample, so about a month for all the samples (using -nct multithreading on 10 cores, with 6 GB available per core).

I then tried switching to GATK 4.0.8: the ProgressMeter no longer shows the expected runtime (why?), but I expect it to be even slower since -nct multithreading is no longer implemented.

I always use the default settings; the only extra flag is --ERC GVCF.

Is it normal for this to take so long? Any suggestions on how to speed up the whole thing?

thank you in advance

OSelm

Omission of IndelRealignment in production pipelines using GATK 3.5

It was announced with the release of GATK 3.6 that it is no longer necessary to run IndelRealigner when HaplotypeCaller will be used to call variants.

We are testing the published "prod" single-sample workflow and notice that it is still using GATK 3.5 for the HaplotypeCaller step, but omits IndelRealigner (implicitly, since GATK4 is used for e.g. BQSR and other upstream steps).

Question: Does this mean that the advice not to bother with IndelRealigner prior to running HaplotypeCaller applies to the GATK 3.5 version of HaplotypeCaller as well, not just 3.6 and later?

What is a GVCF and how is it different from a 'regular' VCF?

Overview

GVCF stands for Genomic VCF. A GVCF is a kind of VCF, so the basic format specification is the same as for a regular VCF (see the spec documentation here), but a Genomic VCF contains extra information.

This document explains what that extra information is and how you can use it to empower your variants analyses.

Important caveat

What we're covering here is strictly limited to GVCFs produced by HaplotypeCaller in GATK versions 3.0 and above. The term GVCF is sometimes used simply to describe VCFs that contain a record for every position in the genome (or interval of interest) regardless of whether a variant was detected at that site or not (such as VCFs produced by UnifiedGenotyper with --output_mode EMIT_ALL_SITES). GVCFs produced by HaplotypeCaller 3.x contain additional information that is formatted in a very specific way. Read on to find out more.

General comparison of VCF vs. gVCF

The key difference between a regular VCF and a gVCF is that the gVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a gVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model.

[Figure: the -ERC output modes compared (regular VCF vs. BP_RESOLUTION and banded GVCF)]

Note that some other tools (including the GATK's own UnifiedGenotyper) may output an all-sites VCF that looks superficially like the BP_RESOLUTION gVCFs produced by HaplotypeCaller, but they do not provide an accurate estimate of reference confidence, and therefore cannot be used in joint genotyping analyses.

The two types of gVCFs

As you can see in the figure above, there are two options you can use with -ERC: GVCF and BP_RESOLUTION. With BP_RESOLUTION, you get a gVCF with an individual record at every site: either a variant record, or a non-variant record. With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band. The GQ ranges are defined in the ##GVCFBlock line of the gVCF header. The purpose of the blocks (also called banding) is to keep file size down, and there is no downside for the downstream analysis, so we do recommend using the -GVCF option.

Example gVCF file

This is a banded gVCF produced by HaplotypeCaller with the -GVCF option.

Header:

As you can see in the first line, the basic file format is a valid version 4.1 VCF:

##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)
##GVCFBlock=minGQ=60(inclusive),maxGQ=2147483647(exclusive)
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##contig=<ID=20,length=63025520,assembly=b37>
##reference=file:///humgen/1kg/reference/human_g1k_v37.fasta

Toward the middle you see the ##GVCFBlock lines (after the ##FORMAT lines) (repeated here for clarity):

##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##GVCFBlock=minGQ=20(inclusive),maxGQ=60(exclusive)
##GVCFBlock=minGQ=5(inclusive),maxGQ=20(exclusive)

which indicate the GQ ranges used for banding (corresponding to the boundaries [5, 20, 60]).

You can also see the definition of the MIN_DP annotation in the ##FORMAT lines.

Records

The first thing you'll notice, hopefully, is the <NON_REF> symbolic allele listed in every record's ALT field. This provides us with a way to represent the possibility of having a non-reference allele at this site, and to indicate our confidence either way.

The second thing to look for is the END tag in the INFO field of non-variant block records. This tells you at what position the block ends. For example, the first line is a non-variant block that starts at position 20:10000000 and ends at 20:10000116.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
20  10000000    .   T   <NON_REF>   .   .   END=10000116    GT:DP:GQ:MIN_DP:PL  0/0:44:99:38:0,89,1385
20  10000117    .   C   T,<NON_REF> 612.77  .   BaseQRankSum=0.000;ClippingRankSum=-0.411;DP=38;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.39;MQ0=0;MQRankSum=-2.172;ReadPosRankSum=-0.235   GT:AD:DP:GQ:PL:SB   0/1:17,21,0:38:99:641,0,456,691,519,1210:6,11,11,10
20  10000118    .   T   <NON_REF>   .   .   END=10000210    GT:DP:GQ:MIN_DP:PL  0/0:42:99:38:0,80,1314
20  10000211    .   C   T,<NON_REF> 638.77  .   BaseQRankSum=0.894;ClippingRankSum=-1.927;DP=42;MLEAC=1,0;MLEAF=0.500,0.00;MQ=221.89;MQ0=0;MQRankSum=-1.750;ReadPosRankSum=1.549    GT:AD:DP:GQ:PL:SB   0/1:20,22,0:42:99:667,0,566,728,632,1360:9,11,12,10
20  10000212    .   A   <NON_REF>   .   .   END=10000438    GT:DP:GQ:MIN_DP:PL  0/0:52:99:42:0,99,1403
20  10000439    .   T   G,<NON_REF> 1737.77 .   DP=57;MLEAC=2,0;MLEAF=1.00,0.00;MQ=221.41;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,56,0:56:99:1771,168,0,1771,168,1771:0,0,0,0
20  10000440    .   T   <NON_REF>   .   .   END=10000597    GT:DP:GQ:MIN_DP:PL  0/0:56:99:49:0,120,1800
20  10000598    .   T   A,<NON_REF> 1754.77 .   DP=54;MLEAC=2,0;MLEAF=1.00,0.00;MQ=185.55;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,53,0:53:99:1788,158,0,1788,158,1788:0,0,0,0
20  10000599    .   T   <NON_REF>   .   .   END=10000693    GT:DP:GQ:MIN_DP:PL  0/0:51:99:47:0,120,1800
20  10000694    .   G   A,<NON_REF> 961.77  .   BaseQRankSum=0.736;ClippingRankSum=-0.009;DP=54;MLEAC=1,0;MLEAF=0.500,0.00;MQ=106.92;MQ0=0;MQRankSum=0.482;ReadPosRankSum=1.537 GT:AD:DP:GQ:PL:SB   0/1:21,32,0:53:99:990,0,579,1053,675,1728:9,12,10,22
20  10000695    .   G   <NON_REF>   .   .   END=10000757    GT:DP:GQ:MIN_DP:PL  0/0:48:99:45:0,120,1800
20  10000758    .   T   A,<NON_REF> 1663.77 .   DP=51;MLEAC=2,0;MLEAF=1.00,0.00;MQ=59.32;MQ0=0  GT:AD:DP:GQ:PL:SB   1/1:0,50,0:50:99:1697,149,0,1697,149,1697:0,0,0,0
20  10000759    .   A   <NON_REF>   .   .   END=10001018    GT:DP:GQ:MIN_DP:PL  0/0:40:99:28:0,65,1080
20  10001019    .   T   G,<NON_REF> 93.77   .   BaseQRankSum=0.058;ClippingRankSum=-0.347;DP=26;MLEAC=1,0;MLEAF=0.500,0.00;MQ=29.65;MQ0=0;MQRankSum=-0.925;ReadPosRankSum=0.000 GT:AD:DP:GQ:PL:SB   0/1:19,7,0:26:99:122,0,494,179,515,694:12,7,4,3
20  10001020    .   C   <NON_REF>   .   .   END=10001020    GT:DP:GQ:MIN_DP:PL  0/0:26:72:26:0,72,1080
20  10001021    .   T   <NON_REF>   .   .   END=10001021    GT:DP:GQ:MIN_DP:PL  0/0:25:37:25:0,37,909
20  10001022    .   C   <NON_REF>   .   .   END=10001297    GT:DP:GQ:MIN_DP:PL  0/0:30:87:25:0,72,831
20  10001298    .   T   A,<NON_REF> 1404.77 .   DP=41;MLEAC=2,0;MLEAF=1.00,0.00;MQ=171.56;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,41,0:41:99:1438,123,0,1438,123,1438:0,0,0,0
20  10001299    .   C   <NON_REF>   .   .   END=10001386    GT:DP:GQ:MIN_DP:PL  0/0:43:99:39:0,95,1226
20  10001387    .   C   <NON_REF>   .   .   END=10001418    GT:DP:GQ:MIN_DP:PL  0/0:41:42:39:0,21,315
20  10001419    .   T   <NON_REF>   .   .   END=10001425    GT:DP:GQ:MIN_DP:PL  0/0:45:12:42:0,9,135
20  10001426    .   A   <NON_REF>   .   .   END=10001427    GT:DP:GQ:MIN_DP:PL  0/0:49:0:48:0,0,1282
20  10001428    .   T   <NON_REF>   .   .   END=10001428    GT:DP:GQ:MIN_DP:PL  0/0:49:21:49:0,21,315
20  10001429    .   G   <NON_REF>   .   .   END=10001429    GT:DP:GQ:MIN_DP:PL  0/0:47:18:47:0,18,270
20  10001430    .   G   <NON_REF>   .   .   END=10001431    GT:DP:GQ:MIN_DP:PL  0/0:45:0:44:0,0,1121
20  10001432    .   A   <NON_REF>   .   .   END=10001432    GT:DP:GQ:MIN_DP:PL  0/0:43:18:43:0,18,270
20  10001433    .   T   <NON_REF>   .   .   END=10001433    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1201
20  10001434    .   G   <NON_REF>   .   .   END=10001434    GT:DP:GQ:MIN_DP:PL  0/0:44:18:44:0,18,270
20  10001435    .   A   <NON_REF>   .   .   END=10001435    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,1130
20  10001436    .   A   AAGGCT,<NON_REF>    1845.73 .   DP=43;MLEAC=2,0;MLEAF=1.00,0.00;MQ=220.07;MQ0=0 GT:AD:DP:GQ:PL:SB   1/1:0,42,0:42:99:1886,125,0,1888,126,1890:0,0,0,0
20  10001437    .   A   <NON_REF>   .   .   END=10001437    GT:DP:GQ:MIN_DP:PL  0/0:44:0:44:0,0,0

Note that toward the end of this snippet, you see multiple consecutive non-variant block records. These were not merged into a single record because the sites they contain belong to different ranges of GQ (which are defined in the header).
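
To make the banding concrete, here is a minimal sketch (plain Python, not a GATK API; the band boundaries are taken from the ##GVCFBlock lines above) of how a genotype's GQ maps to a band and how the END position can be read from a non-variant block's INFO field:

def gq_band(gq, boundaries=(5, 20, 60)):
    """Return the (min inclusive, max exclusive) GQ band for a GQ value."""
    edges = (0,) + tuple(boundaries) + (2147483647,)
    for lo, hi in zip(edges, edges[1:]):
        if lo <= gq < hi:
            return lo, hi
    raise ValueError("GQ out of range")

def block_end(info_field):
    """Extract the END position from an INFO string such as 'END=10000116'."""
    for entry in info_field.split(";"):
        if entry.startswith("END="):
            return int(entry[len("END="):])
    return None

# The consecutive blocks at 20:10001426 (GQ 0) and 20:10001428 (GQ 21) above
# stay separate because their GQs fall into different bands:
print(gq_band(0))                 # (0, 5)
print(gq_band(21))                # (20, 60)
print(block_end("END=10000116"))  # 10000116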

What is the difference between the data fed to the 1st and 3rd steps of the HaplotypeCaller?

Hello everyone,

Can anyone please explain whether we need all the reads for the 3rd step of HaplotypeCaller ("Determine likelihoods of the haplotypes given the read data") as evidence, or only the reads that fall within the active regions?
In other words, is the data set for determining the haplotype likelihoods the same data set that we feed into HaplotypeCaller to begin with?
If the data set is the same as the initial BAM file, does that mean that we need more data to perform the PairHMM algorithm in the 3rd step than to perform the Smith-Waterman algorithm in the 2nd step?

Hope my question makes sense and thank you in advance!

Evaluating the evidence for haplotypes and variant alleles (HaplotypeCaller & Mutect2)

This document details the procedure used by HaplotypeCaller to evaluate the evidence for variant alleles based on candidate haplotypes determined in the previous step for a given ActiveRegion. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

This procedure is also applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Evaluating the evidence for each candidate haplotype
  3. Evaluating the evidence for each candidate site and corresponding alleles

1. Overview

The previous step produced a list of candidate haplotypes for each ActiveRegion, as well as a list of candidate variant sites borne by the non-reference haplotypes. Now, we need to evaluate how much evidence there is in the data to support each haplotype. This is done by aligning each sequence read to each haplotype using the PairHMM algorithm, which produces per-read likelihoods for each haplotype. From that, we'll be able to derive how much evidence there is in the data to support each variant allele at the candidate sites, and that produces the actual numbers that will finally be used to assign a genotype to the sample.


2. Evaluating the evidence for each candidate haplotype

We originally obtained our list of haplotypes for the ActiveRegion by constructing an assembly graph and selecting the most likely paths in the graph by counting the number of supporting reads for each path. That was a fairly naive evaluation of the evidence, done over all reads in aggregate, and was only meant to serve as a preliminary filter to whittle down the number of possible combinations that we're going to look at in this next step.

Now we want to do a much more thorough evaluation of how much evidence we have for each haplotype. So we're going to take each individual read and align it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm (see Durbin et al., 1998). If you're not familiar with PairHMM, it's a lot like the BLAST algorithm, in that it's a pairwise alignment method that uses a Hidden Markov Model (HMM) and produces a likelihood score. In this use of the PairHMM, the output score expresses the likelihood of observing the read given the haplotype by taking into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). Note: If reads from a pair overlap at a site and they have the same base, the base quality is capped at Q20 for both reads (Q20 is half the expected PCR error rate). If they do not agree, we set both base qualities to Q0.
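
As a side note on the overlapping-mate handling described above, here is a minimal sketch (function and variable names are hypothetical, not GATK code) of that quality adjustment: if both mates cover a site and report the same base, both base qualities are capped at Q20; if they disagree, both are set to Q0.

Q_CAP = 20  # roughly half the expected PCR error rate, per the text above

def adjust_overlapping_bases(base1, qual1, base2, qual2):
    """Return adjusted (qual1, qual2) for a site covered by both mates of a pair."""
    if base1 == base2:
        return min(qual1, Q_CAP), min(qual2, Q_CAP)
    return 0, 0

print(adjust_overlapping_bases("A", 37, "A", 33))  # (20, 20)
print(adjust_overlapping_bases("A", 37, "C", 33))  # (0, 0)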

This produces a big table of likelihoods where the columns are haplotypes and the rows are individual sequence reads. The table essentially represents how much supporting evidence there is for each haplotype (including the reference), itemized by read.


3. Evaluating the evidence for each candidate site and corresponding alleles

Having per-read likelihoods for entire haplotypes is great, but ultimately we want to know how much evidence there is for individual alleles at the candidate sites that we identified in the previous step. To find out, we take the per-read likelihoods of the haplotypes and marginalize them over alleles, which produces per-read likelihoods for each allele at a given site. In practice, this means that for each candidate site, we're going to decide how much support each read contributes for each allele, based on the per-read haplotype likelihoods that were produced by the PairHMM.

This may sound complicated, but the procedure is actually very simple -- there is no real calculation involved, just cherry-picking appropriate values from the table of per-read likelihoods of haplotypes into a new table that will contain per-read likelihoods of alleles. This is how it happens. For a given site, we list all the alleles observed in the data (including the reference allele). Then, for each read, we look at the haplotypes that support each allele; we select the haplotype that has the highest likelihood for that read, and we write that likelihood in the new table. And that's it! For a given allele, the total likelihood will be the product of all the per-read likelihoods.
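
Here is a toy sketch of that cherry-picking step (made-up numbers, not GATK code): for each read and each allele, we keep the best likelihood among the haplotypes that carry that allele at the candidate site.

# log10 likelihoods of each read given each haplotype (rows: reads, columns: haplotypes)
read_hap_loglik = {
    "read1": {"hapRef": -1.0, "hapAlt1": -6.0, "hapAlt2": -5.5},
    "read2": {"hapRef": -7.0, "hapAlt1": -0.8, "hapAlt2": -1.2},
}

# which haplotypes carry which allele at the candidate site
allele_to_haps = {"REF": ["hapRef"], "ALT": ["hapAlt1", "hapAlt2"]}

def marginalize(read_hap_loglik, allele_to_haps):
    """Per-read, per-allele log likelihood = best over the supporting haplotypes."""
    table = {}
    for read, by_hap in read_hap_loglik.items():
        table[read] = {allele: max(by_hap[h] for h in haps)
                       for allele, haps in allele_to_haps.items()}
    return table

print(marginalize(read_hap_loglik, allele_to_haps))
# {'read1': {'REF': -1.0, 'ALT': -5.5}, 'read2': {'REF': -7.0, 'ALT': -0.8}}
# In log space, the per-allele total over all reads is the sum of these values
# (the product of the likelihoods mentioned above).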

At the end of this step, sites where there is sufficient evidence for at least one of the variant alleles considered will be called variant, and a genotype will be assigned to the sample in the next (final) step.

Variant calling on BAMS from different Alignment Tools

I have a question about performing variant calling on multiple BAM files that have been generated using 2 different alignment tools (for example - 'BWA-mem' and 'Novoalign' in my case).

For example - Total 4 Samples and one Bam per Sample. Sample 1 & 2 BAMs are from BWA-mem. Sample 3 & 4 BAMs are from Novoalign. And all 4 of these BAMs go through the usual best practices QC after alignment (Duplicate removal/BQSR etc.)

  1. Is it recommended to use HaplotypeCaller on Sample 1, 2, 3 & 4 together (either in gVCF mode or regular mode)? (Since their BAMs were generated using different alignment tools)
  2. If yes, any specific QC to perform pre-calling on BAMs (or post-calling on VCFs) to ensure the compatibility of BAMs with one another?

Shalabh


Alternate Alleles in VCF are more than 1 base

Hi there,

I've removed INDELs from a multi-sample VCF from HaplotypeCaller using SelectVariants. However, the ALT 'SNPs' are more than a single-nucleotide substitution. E.g.

TTTTTTGTTTTTTGTTTT,GTTTTTGTTTT,G
TTTTTTTA,*
TTTTTTTAG,*
TTTTTTTATTTTTCATTTA,*
TTTTTGTTTTTTTA,TC,*

Q1) What is the meaning of the * symbol?
Q2) Is it to be expected that these SNPs are more than a single nucleotide substitution?

Thanks,
Tom

HaplotypeCaller in a nutshell

This document outlines the basic operation of the HaplotypeCaller run in its default mode on a single sample, and does not cover the additional processing and calculations done when it is run in "GVCF mode" (with -ERC GVCF or -ERC BP_RESOLUTION) or when it is run on multiple samples. For more details and discussion of the GVCF workflow, see the Best Practices documentation on germline short variant discovery as well as the HaplotypeCaller manuscript on bioRxiv.

Overview

The core operations performed by HaplotypeCaller can be grouped into these major steps:

[Figure: overview of the major steps of the HaplotypeCaller workflow]

1. Define active regions. The program determines which regions of the genome it needs to operate on, based on the presence of significant evidence for variation.

2. Determine haplotypes by re-assembly of the active region. For each ActiveRegion, the program builds a De Bruijn-like graph to reassemble the ActiveRegion and identifies the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.

3. Determine likelihoods of the haplotypes given the read data. For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles per read for each potentially variant site.

4. Assign sample genotypes. For each potentially variant site, the program applies Bayes’ rule, using the likelihoods of alleles given the read data to calculate the posterior likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.


1. Define active regions

In this first step, the program traverses the sequencing data to identify regions of the genome in which the samples being analyzed show substantial evidence of variation relative to the reference. The resulting areas are defined as “active regions” and will be passed on to the next step. Areas that do not show any variation beyond the expected levels of background noise will be skipped in the next step. This aims to accelerate the analysis by not wasting time performing reassembly on regions that are identical to the reference anyway.

To define these active regions, the program operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints. For more details on how the activity profile is computed and processed, as well as what options are available to modify the active region parameters, please see this article.
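
The following toy sketch (the window size, threshold and activity values are made up, not GATK's actual parameters) illustrates the three phases: a raw per-position activity score, a smoothing pass that averages over a small window, and a threshold that turns the smoothed profile into candidate intervals.

def smooth(raw, window=2):
    """Moving average of the raw activity profile."""
    out = []
    for i in range(len(raw)):
        lo, hi = max(0, i - window), min(len(raw), i + window + 1)
        out.append(sum(raw[lo:hi]) / (hi - lo))
    return out

def active_intervals(profile, threshold=0.2):
    """Contiguous runs of positions whose smoothed activity exceeds the threshold."""
    intervals, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(profile) - 1))
    return intervals

raw = [0.0, 0.0, 0.1, 0.9, 1.0, 0.8, 0.1, 0.0, 0.0, 0.0]
print(active_intervals(smooth(raw)))  # [(1, 6)]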

Once this process is complete, the program applies a few post-processing steps to finalize the active regions (see detailed doc above). The final output of this process is a list of intervals corresponding to the active regions which will be processed in the next step.


2. Determine haplotypes by local assembly of the active region.

The goal of this step is to reconstruct the possible sequences of the real physical segments of DNA present in the original sample organism. To do this, the program goes through each active region and uses the input reads that mapped to that region to construct complete sequences covering its entire length, which are called haplotypes. This process will typically generate several different possible haplotypes for each active region due to:

  • real diversity on polyploid (including CNV) or multi-sample data
  • possible allele combinations between variant sites that are not totally linked within the active region
  • sequencing and mapping errors

In order to generate a list of possible haplotypes, the program first builds an assembly graph for the active region using the reference sequence as a template. Then, it takes each read in turn and attempts to match it to a segment of the graph. Whenever portions of a read do not match the local graph, the program adds new nodes to the graph to account for the mismatches. After this process has been repeated with many reads, it typically yields a complex graph with many possible paths. However, because the program keeps track of how many reads support each path segment, we can select only the most likely (well-supported) paths. These likely paths are then used to build the haplotype sequences which will be used for scoring and genotyping in the next step.
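
The sketch below (a few made-up reads and a tiny k, not GATK's assembler) illustrates the support-counting idea: break each read into k-mers, count how many reads support each edge (a k-mer followed by a particular next base), and favour well-supported edges when spelling out candidate haplotypes.

from collections import defaultdict

def edge_support(reads, k=4):
    """Count read support for each (k-mer -> next base) edge."""
    support = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k):
            support[(read[i:i + k], read[i + k])] += 1
    return support

# Most of these toy reads carry a T where the reference has a G at the eighth base,
# so the edge taking the alternate path collects more support than the reference edge.
reference = "ACGTACGGTTCA"  # shown only for comparison; the toy graph is built from the reads
reads = ["ACGTACGTTTCA", "CGTACGTTTC", "GTACGGTTCA", "ACGTACGTTT"]

for (kmer, base), count in sorted(edge_support(reads).items()):
    print(kmer, "->", base, "supported by", count, "read(s)")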

The assembly and haplotype determination procedure is described in full detail in this method article.

Once the haplotypes have been determined, each one is realigned against the original reference sequence in order to identify potentially variant sites. This produces the set of sites that will be processed in the next step. A subset of these sites will eventually be emitted as variant calls to the output VCF.


3. Evaluating the evidence for haplotypes and variant alleles

Now that we have all these candidate haplotypes, we need to evaluate how much evidence there is in the data to support each one of them. So the program takes each individual read and aligns it against each haplotype in turn (including the reference haplotype) using the PairHMM algorithm, which takes into account the information we have about the quality of the data (i.e. the base quality scores and indel quality scores). This outputs a score for each read-haplotype pairing, expressing the likelihood of observing that read given that haplotype.

Those scores are then used to calculate how much evidence there is for individual alleles at the candidate sites that were identified in the previous step. The process is called marginalization over alleles and produces the actual numbers that will finally be used to assign a genotype to the sample in the next step.

For further details on the pairHMM output and the marginalization process, see this document.


4. Assigning per-sample genotypes

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

For further details on the genotyping calculations, see this document.

This concludes the overview of how HaplotypeCaller works.

Missing annotations in the output callset VCF

The problem

You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, GenotypeGVCFs and VariantAnnotator), but that annotation did not show up in your output VCF.

Keep in mind that all annotations that are necessary to run our Best Practices are annotated by default, so you should generally not need to request annotations unless you're doing something a bit special.

Why this happens & solutions

There can be several reasons why this happens, depending on the tool, the annotation, and your data. These are the four we see most often; if you encounter another that is not listed here, let us know in the comments.

1. You requested an annotation that cannot be calculated by the tool

For example, you're running Mutect2 but requested an annotation that is specific to HaplotypeCaller. There should be an error message to that effect in the output log. It's not possible to override this; but if you believe the annotation should be available to the tool, let us know in the forum and we'll consider putting in a feature request.

2. You requested an annotation that can only be calculated if an optional input is provided

For example, you're running HaplotypeCaller and you want InbreedingCoefficient, but you didn't specify a pedigree file. There should be an error message to that effect in the output log. The solution is simply to provide the missing input file. Another example: you're running VariantAnnotator and you want to annotate Coverage, but you didn't specify a BAM file. The tool needs to see the read data in order to calculate the annotation, so again, you simply need to provide the BAM file.

3. You requested an annotation that has requirements which are not met by some or all sites

For example, you're looking at RankSumTest annotations, which require heterozygous sites in order to perform the necessary calculations, but you're running on haploid data so you don't have any het sites. There is no workaround; the annotation is not applicable to your data. Another example: you requested InbreedingCoefficient, but your population includes fewer than 10 founder samples, which are required for the annotation calculation. There is no workaround; the annotation is not applicable to your data.

4. You requested an annotation that is already applied by default by the tool you are running

For example, you requested Coverage from HaplotypeCaller, which already annotates this by default. There is currently a bug that causes some default annotations to be dropped from the list if specified on the command line. This will be addressed in an upcoming version. For now the workaround is to check what annotations are applied by default and NOT request them with -A.

HaplotypeCaller producing output without any vcf

Hello,
I am using GATK 4.0.8.1 in Ubuntu 18.04 LTS and javac -version => javac 1.8.0_181.

I am basically trying to call variants from targeted DNA sequencing data using HaplotypeCaller. HaplotypeCaller executes, but it is not calling any variants. After seeing some other posts I made sure that I have Java 8 and validated the BAM file using Picard ValidateSamFile, and it gave no error. I'm confused. Below is the output of the HaplotypeCaller run.

Using GATK jar /opt/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar HaplotypeCaller --input BV-15-186_mapto_genom_sortedRG.bam --output BV-15-186_bwa_vars.vcf --reference hg19.fa
17:31:19.630 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:31:29.729 INFO HaplotypeCaller - ------------------------------------------------------------
17:31:29.730 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.8.1
17:31:29.730 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:31:29.731 INFO HaplotypeCaller - Executing as root@bioinfo on Linux v4.15.0-20-generic amd64
17:31:29.731 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v10.0.2+13-Ubuntu-1ubuntu0.18.04.2
17:31:29.732 INFO HaplotypeCaller - Start Date/Time: 15 September 2018 at 5:31:19 PM IST
17:31:29.732 INFO HaplotypeCaller - ------------------------------------------------------------
17:31:29.732 INFO HaplotypeCaller - ------------------------------------------------------------
17:31:29.733 INFO HaplotypeCaller - HTSJDK Version: 2.16.0
17:31:29.734 INFO HaplotypeCaller - Picard Version: 2.18.7
17:31:29.734 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:31:29.734 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:31:29.734 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:31:29.734 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:31:29.734 INFO HaplotypeCaller - Deflater: IntelDeflater
17:31:29.735 INFO HaplotypeCaller - Inflater: IntelInflater
17:31:29.735 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:31:29.735 INFO HaplotypeCaller - Using google-cloud-java fork https://github.com/broadinstitute/google-cloud-java/releases/tag/0.20.5-alpha-GCS-RETRY-FIX
17:31:29.735 INFO HaplotypeCaller - Initializing engine
17:31:29.884 INFO HaplotypeCaller - Done initializing engine
17:31:29.891 INFO HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
17:31:29.899 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/opt/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:31:29.901 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/opt/gatk-4.0.8.1/gatk-package-4.0.8.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
17:31:29.939 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
17:31:29.940 INFO IntelPairHmm - Available threads: 4
17:31:29.940 INFO IntelPairHmm - Requested threads: 4
17:31:29.940 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
17:31:29.965 INFO ProgressMeter - Starting traversal
17:31:29.965 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
17:31:29.971 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
17:31:29.971 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
17:31:29.972 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
17:31:29.972 INFO HaplotypeCaller - Shutting down engine
[15 September 2018 at 5:31:29 PM IST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.17 minutes.
Runtime.totalMemory()=314572800
Exception in thread "main" java.lang.IncompatibleClassChangeError: Inconsistent constant pool data in classfile for class org/broadinstitute/hellbender/transformers/ReadTransformer. Method lambda$identity$d67512bf$1(Lorg/broadinstitute/hellbender/utils/read/GATKRead;)Lorg/broadinstitute/hellbender/utils/read/GATKRead; at index 65 is CONSTANT_MethodRef and should be CONSTANT_InterfaceMethodRef
    at org.broadinstitute.hellbender.transformers.ReadTransformer.identity(ReadTransformer.java:30)
    at org.broadinstitute.hellbender.engine.GATKTool.makePreReadFilterTransformer(GATKTool.java:290)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:262)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:979)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:182)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:201)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

Is GATK4 HaplotypeCaller in evaluation phase?

Hi GATK team,

Congratulations on the release! I just found this public method in FireCloud that notes that HaplotypeCaller in GATK4 should not be used in production yet since it is still in the evaluation phase. This post was last updated on January 9th, the day of the GATK4 release. Is this statement true? Could you provide more details about the HaplotypeCaller evaluation?

Thanks!

ALT * no deletion

Hi:
In my VCF there is a "*", but I can't find the deletion in the OF2-M BAM in IGV. I can't understand why.
GATK is version 3.7. Haplotyper --emit_mode gvcf + GVCFtyper.
The genome is not human.
Thank you!

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT OF1-M OF2-M OF3-M OF4-M OM1-M OM2-M OM3-M OM4-M

chr3 3619228 . G A,* 624.28 . AC=1,1;AF=0.071,0.071;AN=14;BaseQRankSum=1.514;ClippingRankSum=0.000;DP=211;ExcessHet=3.3579;FS=2.114;MLEAC=1,1;MLEAF=0.071,0.071;MQ=43.88;MQRankSum=0.000;QD=9.18;ReadPosRankSum=0.103;SOR=1.077 GT:AD:DP:GQ:PGT:PID:PL 0/0:21,0,0:21:60:.:.:0,60,771,60,771,771 0/2:18,0,5:23:99:0|1:3619156_C_A:157,211,967,0,756,740 0/0:24,0,0:24:66:.:.:0,66,990,66,990,990 0/0:27,0,0:27:81:.:.:0,81,873,81,873,873 0/0:27,0,0:27:81:.:.:0,81,749,81,749,749 0/1:25,20,0:45:99:0|1:3619228_G_A:505,0,588,579,648,1227 0/0:26,0,0:26:36:.:.:0,36,779,36,779,779 ./.:17,0,0:17:.:.:.:.
SNP_Filter=-e 'QUAL<30 || QD<2.0 || FS>60.0 || SOR>4.0 || MQ<40.0 || MQRankSum<-12.5 || ReadPosRankSum<-8.0'
chr3 3619287 . CCG C,* 1215.79 PASS AC=2,1;AF=0.125,0.062;AN=16;BaseQRankSum=0.691;ClippingRankSum=0;DP=211;ExcessHet=0.4576;FS=1.097;MLEAC=2,1;MLEAF=0.125,0.062;MQ=45.88;MQRankSum=0;QD=22.51;ReadPosRankSum=0.785;SOR=0.793 GT:AD:DP:GQ:PGT:PID:PL 0/0:22,0,0:22:63:.:.:0,63,671,63,671,671 0/2:18,0,5:23:99:0|1:3619156_C_A:157,211,967,0,756,740 0/0:28,0,0:28:81:.:.:0,81,1215,81,1215,1215 0/0:26,0,0:26:75:.:.:0,75,1125,75,1125,1125 0/0:31,0,0:31:90:.:.:0,90,1350,90,1350,1350 1/1:1,30,0:31:41:1|1:3619228_G_A:1117,41,0,1120,93,1172 0/0:30,0,0:30:84:.:.:0,84,1260,84,1260,1260 0/0:17,0,0:17:51:.:.:0,51,462,51,462,462

[IGV screenshots at chr3:3619228 and chr3:3619287]

Assigning per-sample genotypes (HaplotypeCaller)

This document describes the procedure used by HaplotypeCaller to assign genotypes to individual samples based on the allele likelihoods calculated in the previous step. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation. See also the documentation on the QUAL score as well as the one on PL and GQ.

This procedure is NOT applied by Mutect2 for somatic short variant discovery. See this article for a direct comparison between HaplotypeCaller and Mutect2.


Contents

  1. Overview
  2. Preliminary assumptions / limitations
  3. Calculating genotype likelihoods using Bayes' Theorem
  4. Selecting a genotype and emitting the call record

1. Overview

The previous step produced a table of per-read allele likelihoods for each candidate variant site under consideration. Now, all that remains to do is to evaluate those likelihoods in aggregate to determine what is the most likely genotype of the sample at each site. This is done by applying Bayes' theorem to calculate the likelihoods of each possible genotype, and selecting the most likely. This produces a genotype call as well as the calculation of various metrics that will be annotated in the output VCF if a variant call is emitted.

Note that this describes the regular mode of HaplotypeCaller, which does not emit an estimate of reference confidence. For details on how the reference confidence model works and is applied in GVCF modes (-ERC GVCF and -ERC BP_RESOLUTION) please see the reference confidence model documentation.


2. Preliminary assumptions / limitations

Quality

Keep in mind that we are trying to infer the genotype of each sample given the observed sequence data, so the degree of confidence we can have in a genotype depends on both the quality and the quantity of the available data. By definition, low coverage and low quality will both lead to lower confidence calls. The GATK only uses reads that satisfy certain mapping quality thresholds, and only uses “good” bases that satisfy certain base quality thresholds (see documentation for default values).

Ploidy

Both the HaplotypeCaller and GenotypeGVCFs assume that the organism of study is diploid by default, but the desired ploidy can be set using the -ploidy argument. The ploidy is taken into account in the mathematical development of the Bayesian calculation using a generalized form of the genotyping algorithm that can handle ploidies other than 2. Note that using ploidy for pooled experiments is subject to some practical limitations due to the number of possible combinations resulting from the interaction between ploidy and the number of alternate alleles that are considered. There are some arguments that aim to mitigate those limitations but they are not fully documented yet.
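
To see where those practical limitations come from, here is a small illustration (standard combinatorics, not GATK code): the number of possible genotypes is the number of multisets of size P (the ploidy) drawn from A alleles, i.e. C(P + A - 1, P), which grows quickly as both increase.

from math import comb

def genotype_count(ploidy, n_alleles):
    """Number of unordered genotypes for a given ploidy and allele count."""
    return comb(ploidy + n_alleles - 1, ploidy)

print(genotype_count(2, 2))   # diploid, ref + 1 alt: 3 genotypes
print(genotype_count(4, 2))   # tetraploid, ref + 1 alt: 5 genotypes
print(genotype_count(20, 7))  # pooled ploidy 20, ref + 6 alts: 230230 genotypes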

Paired end reads

Reads that are mates in the same pair are not handled together in the reassembly, but if they overlap, there is some special handling to ensure they are not counted as independent observations.

Single-sample vs multi-sample

We apply different genotyping models when genotyping a single sample as opposed to multiple samples together (as done by HaplotypeCaller on multiple inputs or GenotypeGVCFs on multiple GVCFs). The multi-sample case is not currently documented for the public but is an extension of previous work by Heng Li and others.


3. Calculating genotype likelihoods using Bayes' Theorem

We use the approach described in Li 2011 to calculate the posterior probabilities of non-reference alleles (Methods 2.3.5 and 2.3.6) extended to handle multi-allelic variation.

The basic formula we use for all types of variation under consideration (SNPs, insertions and deletions) is:

$$ P(G|D) = \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

If that is meaningless to you, please don't freak out -- we're going to break it down and go through all the components one by one. First of all, the term on the left:

$$ P(G|D) $$

is the quantity we are trying to calculate for each possible genotype: the conditional probability of the genotype G given the observed data D.

Now let's break down the term on the right:

$$ \frac{ P(G) P(D|G) }{ \sum_{i} P(G_i) P(D|G_i) } $$

We can ignore the denominator (bottom of the fraction) because it ends up being the same for all the genotypes, and the point of calculating this likelihood is to determine the most likely genotype. The important part is the numerator (top of the fraction):

$$ P(G) P(D|G) $$

which is composed of two things: the prior probability of the genotype and the conditional probability of the data given the genotype.

The first one is the easiest to understand. The prior probability of the genotype G:

$$ P(G) $$

represents how likely we expect this genotype to be based on previous observations, studies of the population, and so on. By default, the GATK tools use a flat prior (always the same value), but you can input your own set of priors if you have information about the frequency of certain genotypes in the population you're studying.

The second one is a little trickier to understand if you're not familiar with Bayesian statistics. It is called the conditional probability of the data given the genotype, but what does that mean? Assuming that the genotype G is the true genotype,

$$ P(D|G) $$

is the probability of observing the sequence data that we have in hand. That is, how likely would we be to pull out a read with a particular sequence from an individual that has this particular genotype? We don't have that number yet, so this requires a little more calculation, using the following formula:

$$ P(D|G) = \prod_{j} \left( \frac{P(D_j | H_1)}{2} + \frac{P(D_j | H_2)}{2} \right) $$

You'll notice that this is where the diploid assumption comes into play, since here we decomposed the genotype G into:

$$ G = H_1H_2 $$

which allows for exactly two possible haplotypes. In future versions we'll have a generalized form of this that will allow for any number of haplotypes.

Now, back to our calculation, what's left to figure out is this:

$$ P(D_j|H_n) $$

which as it turns out is the conditional probability of the data given a particular haplotype (or specifically, a particular allele), aggregated over all supporting reads. Conveniently, that is exactly what we calculated in Step 3 of the HaplotypeCaller process, when we used the PairHMM to produce the likelihoods of each read against each haplotype, and then marginalized them to find the likelihoods of each read for each allele under consideration. So all we have to do at this point is plug the values from that table into the equation above, and we can work our way back up to obtain:

$$ P(G|D) $$

for the genotype G.


4. Selecting a genotype and emitting the call record

We go through the process of calculating a likelihood for each possible genotype based on the alleles that were observed at the site, considering every possible combination of alleles. For example, if we see an A and a T at a site, the possible genotypes are AA, AT and TT, and we end up with 3 corresponding probabilities. We pick the largest one, which corresponds to the most likely genotype, and assign that to the sample.
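
The toy sketch below (made-up per-read numbers, not GATK code) walks through sections 3 and 4 for a diploid site with observed alleles A and T: compute P(D|G) for each genotype G = H1H2 from the per-read allele likelihoods, apply a flat prior, and pick the genotype with the highest value.

from itertools import combinations_with_replacement

# per-read likelihoods of each allele, as produced by the marginalization step
read_allele_lik = [
    {"A": 0.9,   "T": 0.001},  # read strongly supports A
    {"A": 0.001, "T": 0.8},    # read strongly supports T
    {"A": 0.85,  "T": 0.002},
]

def data_given_genotype(genotype, read_allele_lik):
    """P(D|G) = product over reads j of (P(Dj|H1)/2 + P(Dj|H2)/2)."""
    h1, h2 = genotype
    p = 1.0
    for lik in read_allele_lik:
        p *= lik[h1] / 2 + lik[h2] / 2
    return p

alleles = ["A", "T"]
flat_prior = 1.0  # the same value for every genotype, so it cancels out
scores = {g: flat_prior * data_given_genotype(g, read_allele_lik)
          for g in combinations_with_replacement(alleles, 2)}
best = max(scores, key=scores.get)
print(scores)
print("called genotype:", "/".join(best))  # A/T is the most likely here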

Note that depending on the variant calling options specified in the command-line, we may only emit records for actual variant sites (where at least one sample has a genotype other than homozygous-reference) or we may also emit records for reference sites. The latter is discussed in the reference confidence model documentation.

Assuming that we have a non-ref genotype, all that remains is to calculate the various site-level and genotype-level metrics that will be emitted as annotations in the variant record, including QUAL as well as PL and GQ. For more information on how the other variant context metrics are calculated, please see the corresponding variant annotations documentation.


Why does HaplotypeCaller (HC) use a flat prior in joint calling?

In genotyping, P(G|D) = P(G)P(D|G)/P(D). Why does HC use a flat P(G) instead of computing one based on cohort allele frequencies?

Is it possible to suppress the NON_REF tag on variant calls?

Hello,

In GVCF output from HaplotypeCaller, each line contains the <NON_REF> allele, including the lines with explicit variant calls. Is there a simple way to suppress the <NON_REF> allele on variant calls?

Also, what is the reason to have a <NON_REF> allele on a variant where a specific alternate allele is called?

Thanks for your help.

Missing chr in gVCF

Hi,

I have generated a gVCF file using GATK 3.7 and it completed without any error. However, I notice only a few chromosome names in the first column of the gVCF file:

 java -Xmx10G -jar GenomeAnalysisTK.jar -R species.fa -T HaplotypeCaller -I Input.bam -stand_call_conf 30 -ERC GVCF --min_base_quality_score 20 --variant_index_parameter 128000 --variant_index_type LINEAR --genotyping_mode DISCOVERY -o GATK.g.vcf.gz

  less GATK.g.vcf.gz | grep -v "##" | cut -f1| sort | uniq
  chr1
  chr10
  chr11
  chr12
  chr13
 #CHROM

Does it mean that the run is incomplete? Or am I missing some property of the gVCF format?

Why is there a difference in variants between the after-BQSR BAM and the after-HaplotypeCaller BAM?

Dear GATK team,

Hi, I have followed the Best Practices to call germline variants (GATK 3.7) in my samples from a case-control study, ~500 samples in total.
I have run BQSR, PrintReads, and then HaplotypeCaller as described below:

BQSR
java -jar $GATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R $Reference -knownSites $dbSNP138 -knownSites $Mills -knownSites $oneKGindels -nct 8 -I $Output/$1.sort.dup.ir.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov ContextCovariate -o $Output/$1.recal.data.grp -L $Interval -ip 100

Print Reads
java -jar $GATK/GenomeAnalysisTK.jar -T PrintReads -nct 8 -R $Reference -I $Output/$1.sort.dup.ir.bam -BQSR $Output/$1.recal.data.grp -o $Output/$1.sort.dup.ir.BQSR.bam

HaplotypeCaller (HC)
java -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R $Reference -I $Input/$1.sort.dup.ir.BQSR.bam -o $Output/$1.hc.vcf.gz -L chr14:92537200-92537700 -bamout $Output/$1.bamout.bam

When comparing the variants in the after-BQSR BAM with those in the after-HC BAM in the region chr14:92537200-92537700 using IGV, I noticed that the two BAMs looked different, especially for indels, like this:

So I have several questions:
1) Why is there a difference in variants between the after-BQSR BAM and the after-HC BAM in terms of indels? The indels at chr14:92,537,354 were not in the after-BQSR BAM, but they were in the after-HC BAM. Among my processed samples, some showed the same indels in both BAMs, but others showed different indels.
2) I noticed that some regions seem to be snapped in the after-HC BAM, but not in the after-BQSR BAM. I have no idea why this happened.
3) Some samples showed that variants in the whole region chr14:92537200-92537700 were not called in the after-HC BAM, even though reads were mapped in the same region in the after-BQSR BAM. How can I interpret this?

I don't know exactly, but I suspect there is quite a possibility of calling inaccurate variants, since the regions I am interested in have several repeat sequences and the variants are repeated indels. Is this right? I don't know what I can do, so I am asking for help regarding these issues.

Thanks in advance!

Best regards,
Soojin

Inference of genotype likelihoods for lower ploidy based on genotyping at higher ploidy using GATK

Let's say I have a bunch of mixed ploidy individuals (with biallelic markers) in my data. Some are tetraploid and some are diploid. But I choose to run GATK HaplotypeCaller (to get genotype likelihoods) with -ploidy set to 4 for all organisms since I know the highest ploidy level in the data to be 4.

My idea is to run the data and obtain genotype likelihoods with the highest resolution and then downscale those values obtained to a lower ploidy level post-hoc.

For instance, given that there are 5 genotype classes/dosage levels for tetraploid organisms (0 of the reference allele, 1 of the reference, 2 of the reference, 3 of the reference and 4 of the reference), I will get 5 phred-scaled scores for each locus in each individual. Each score represents the probability of having a certain count for the reference allele (0 through 4).
Now if I deduce that one of these individuals is a diploid but I've already run the analyses:

  • Can I just combine the genotype likelihoods of the 3 heterozygote classes in the tetraploid call (1/3, 2/2, 3/1) to get the genotype likelihood of the one heterozygote class (1/1) in a diploid individual?
  • If so, how do I do this quantitatively?

For example, at a locus in an individual that I assumed to be tetraploid during the GATK run, I get these phred-scaled genotype likelihoods:
0/4 1/3 2/2 3/1 4/0
6     67    0    4    60

But I now know that this individual is diploid, so I am now looking for just 3 phred-scaled genotype likelihoods instead of 5:
0/2 1/1 2/0
?     ?     ?

Would I keep the homozygote classes the same i.e. 6 and 60 and then just average the 3 dosage classes for the heterozygote of the diploid? Or would I perform another similar mathematical operation?

Thanks,
Vivaswat
