AlleleBalance and HomopolymerRun not working for HaplotypeCaller/gVCF

February 11, 2015, 1:42 pm

Hi,

We are running the best practices pipeline for variant discovery with GATK 3-2.2. When running the HaplotypeCaller with the flags -A AlleleBalance & -A HomopolymerRun to generate a gVCF, there are no ABHet/ABHom or HRun annotations showing up in the gVCF. I tried running VariantAnnotator on the gVCF and still no annotations.

The documentation for both of these annotations states that they only work for biallelic variants. I suspected that the ,<NON_REF> tag that is on every variant might be causing the tool to treat biallelic variants as multiallelic. So I stripped the ,<NON_REF> from the gVCF using sed and reran VariantAnnotator, and voila: I got the annotations.

Is there a way to either generate the gVCF without the ,<NON_REF> tag (and the probabilities that go with it), or instruct VariantAnnotator to ignore the ,<NON_REF> tag?
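
For reference, the workaround described above can be sketched roughly as follows. This is only an illustration, assuming the tag in question is the `,<NON_REF>` symbolic allele at the end of the ALT column; the function name `strip_non_ref` is made up, and awk is used instead of sed so that only the ALT column is touched. Note that it leaves the per-sample PL values (which still include entries for the removed allele) unchanged.

```shell
# Remove a trailing ",<NON_REF>" from the ALT column (column 5) only.
# Header lines (starting with '#') pass through untouched, and records
# whose ALT is just "<NON_REF>" (reference blocks) are left as they are.
strip_non_ref() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }
       { sub(/,<NON_REF>$/, "", $5); print }'
}

# Example on a single made-up, tab-separated record.
printf '1\t100\t.\tT\tA,<NON_REF>\t50\t.\tDP=10\tGT:PL\t0/1:60,0,70,90,100,120\n' | strip_non_ref
```

On a real file this would be used as `strip_non_ref < in.g.vcf > out.vcf` (file names hypothetical).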

Thanks for your help.

Tom Kolar

↧

Haplotype caller in GVCF mode still taking a very long time. Can I possibly speed up the process?

February 13, 2015, 9:15 am

I have 13 whole exome sequencing samples, and unfortunately, I'm having a hard time getting HaplotypeCaller to complete within the time frame the cluster I use allows (150 hours). I use 10 nodes at a time with 10gb ram with 8 cores per node. Is there any way to speed up this rate? I tried using HaplotypeCaller in GVCF mode with the following command:

java -d64 -Xmx8g -jar $GATKDIR/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $REF --dbsnp $DBSNP \
-I 7-27_realigned.bam \
-o 7-27_hg19.vcf \
-U ALLOW_UNSET_BAM_SORT_ORDER \
-gt_mode DISCOVERY \
-mbq 20 \
-stand_emit_conf 20 -G Standard -A AlleleBalance -nct 16 \
--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Am I doing something incorrectly? Is there anything I can tweak to minimize the runtime? What is the expected runtime for WES on a standard setup (a few cores and some ram)?

↧

Filtering Haplotype Caller calls for non-human data

February 18, 2015, 4:54 am

Hey GATK team,

Are the guidelines suggested in this forum page applicable for filtering calls made by GATK-3.3-0's HaplotypeCaller?

Cheers,
Mika

↧

How to add TI into INFO field in vcf

February 18, 2015, 7:22 am

Dear all,

I need to generate vcf file with GATK and I need to have TI (transcript information) in INFO field and VF and GQX in FORMAT field.

Could you please help me with the arguments?

My arguments for calling variants are:

java -jar $gatk -T $variant_caller -R $reference -I in.bam -D dbsnp -L bed_file -o .raw.vcf

I still have no TI info, and in FORMAT I have GT:AD:DP:GQ:PL but I need GT:AD:DP:GQ:PL:VF:GQX.

Thank you for any help with command line arguments.

Paul.

↧

Documents on Haplotype Caller

February 19, 2015, 3:51 am

Hello,

I was wondering whether there is any paper or document that describes the mathematical models of Haplotype caller?

Thanks in advance,
Homa

↧

Running HaplotypeCaller in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF

February 19, 2015, 6:29 am

Hi Sheila and Geraldine

When I run HaplotypeCaller (v3.3-0-g37228af) in GENOTYPE_GIVEN_ALLELES mode with --emitRefConfidence GVCF I get the error:

Invalid command line: Argument ERC/gt_mode has a bad value: you cannot request reference confidence output and GENOTYPE_GIVEN_ALLELES at the same time

It is however strange that GENOTYPE_GIVEN_ALLELES is mentioned in the --max_alternate_alleles section of the GenotypeGVCFs documentation.

Maybe I'm missing something?

Thanks,
Gerrit

↧

Odd distribution of Coverage for GATK HaplotypeCaller Variants

June 27, 2014, 2:45 am

Hi we've been looking at results of a recent run of GATK-HC (3.1-1) using the new N1 pipeline and we've been seeing something odd in the distribution of the Depth of Coverage (Using DP from the Genotype Fields) we're seeing for the raw unfiltered variants.

All our samples are sequenced using PCR-Free libraries and have two lanes of sequence (~24x mapped depth) and looking at depth of coverage from bedtools we see a nice clean distribution (red in graph) but when we look at the data from the HaplotypeCaller sites (Blue in graph) we see a bimodal distribution with an excess of variants called at lower coverage (~16x) vs the most common coverage of around 24x. We've seen this in all the samples we've looked at so far, so it's not just a one off.
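
A depth distribution like the one described for the variant sites can be tallied with a few lines of awk. This is only a rough sketch, not the script used for the graph: it assumes a single-sample VCF on standard input, and the function name `format_dp_histogram` and the example records are made up.

```shell
# Tally the sample-level DP values (FORMAT field) over all records on stdin
# and print "depth<TAB>count", sorted by depth. Assumes a single-sample VCF.
format_dp_histogram() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }
       {
         # Locate DP by its position in the FORMAT key list (column 9).
         n = split($9, keys, ":"); split($10, vals, ":")
         for (i = 1; i <= n; i++)
           if (keys[i] == "DP" && vals[i] != ".") count[vals[i]]++
       }
       END { for (d in count) printf "%s\t%d\n", d, count[d] }' | sort -n
}

# Example with three made-up records.
printf '1\t10\t.\tA\tG\t50\t.\t.\tGT:DP\t0/1:24\n1\t20\t.\tC\tT\t50\t.\t.\tGT:AD:DP\t0/1:8,8:16\n1\t30\t.\tG\tA\t50\t.\t.\tGT:DP\t1/1:24\n' | format_dp_histogram
```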

I've had a quick look at read depth from another variant caller (Platypus) and there we see no evidence of this bimodal distribution in the variants it has called.

Is this expected behaviour?
If so, why does it occur?
If not, any idea what is going on here? Is it a bug in the variant caller or in the depth statistics?
Do you see the same thing in other datasets?

Thanks!

↧

Producing GVCF with Haplotype Caller 3.3.0 - Incompatibility with stand_emit_conf & stand_call_conf?

February 23, 2015, 5:25 am

I am using HC 3.3-0-g37228af to generate GVCFs, including the parameters (full command below):

stand_emit_conf 10
stand_call_conf 30

The process completes fine, but when I look at the header of the gvcf produced, they are shown as follows:

standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0

After trying various tests, it appears that setting these values is incompatible with -ERC GVCF (which requires "-variant_index_type LINEAR" and "-variant_index_parameter 128000" )
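
The header check described above can be scripted along these lines. This is a minimal sketch that only assumes the settings appear as `name=value` pairs somewhere in the `##` header lines; the function name `show_conf_thresholds` and the example header line are made up.

```shell
# Print the confidence-threshold settings recorded in the header of a VCF
# read from stdin, one "name=value" pair per line.
show_conf_thresholds() {
  grep '^##' | grep -o 'standard_min_confidence_threshold_for_[a-z]*=[^ ]*'
}

# Example with a simplified, made-up header line.
printf '%s\n' '##example standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0' | show_conf_thresholds
```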

1) Can you confirm if this is expected behaviour, and why this should be so?
2) Is this another case where the GVCF is an intermediate file, and hence every possible variant is emitted initially?
3) Regardless of the answers above, is stand_call_conf equivalent to requiring a GQ of 30?

     java -Xmx11200m -Djava.io.tmpdir=$TMPDIR -jar /apps/GATK/3.3-0/GenomeAnalysisTK.jar \
     -T HaplotypeCaller \
     -I /E000007/target_indel_realignment/E000007.6.bqsr.bam \
     -R /project/production/Indexes/samtools/hsapiens.hs37d5.fasta \
     -et NO_ET \
     -K /project/production/DAT/apps/GATK/2.4.9/ourkey \
     -dt NONE \
     -L 10 \
     -A AlleleBalance \
     -A BaseCounts \
     -A BaseQualityRankSumTest \
     -A ChromosomeCounts \
     -A ClippingRankSumTest \
     -A Coverage \
     -A DepthPerAlleleBySample \
     -A DepthPerSampleHC \
     -A FisherStrand \
     -A GCContent \
     -A HaplotypeScore \
     -A HardyWeinberg \
     -A HomopolymerRun \
     -A ClippingRankSumTest \
     -A LikelihoodRankSumTest \
     -A LowMQ \
     -A MappingQualityRankSumTest \
     -A MappingQualityZero \
     -A MappingQualityZeroBySample \
     -A NBaseCount \
     -A QualByDepth \
     -A RMSMappingQuality \
     -A ReadPosRankSumTest \
     -A StrandBiasBySample \
     -A StrandOddsRatio \
     -A VariantType \
     -ploidy 2 \
     --min_base_quality_score 10 \
     -ERC GVCF \
     -variant_index_type LINEAR \
     -variant_index_parameter 128000 \
     --GVCFGQBands 20 \
     --standard_min_confidence_threshold_for_calling 30 \
     --standard_min_confidence_threshold_for_emitting 10

↧

genotyping sex chromosomes

February 24, 2015, 5:42 am

Hi GATK team,

I have exome samples, some males and some females.
I mapped the female samples to a reference genome without the Y chromosome, and continued the Best Practices steps with each sample.
The reason for that mapping is that we don't want to lose some of the females' reads to the homologous regions of chromosome Y.
Will I be able to run GenotypeGVCFs on those samples?
Is this a good way to do the genotyping on the sex chromosomes?

Thanks in advance,
Maya

↧

importance of known sites/resources

February 24, 2015, 6:12 pm

Hi,
I have a general question about the importance of known VCFs (for BQSR and HC) and resources file (for VQSR). I am working on rice for which the only known sites are the dbSNP VCF files which are built on a genomic version older than the reference genomic fasta file which I am using as basis.
How does this affect the quality/accuracy of variants? How important is it to have the exact same build of the genome as the one on which the known VCF is based? Is it better to leave out the known sites for some of the steps than to use a version built on a different version of the genome for the same species? In other words, which steps (BQSR, HC, VQSR etc) can be performed without the known sites/resource file?
If the answers to the above questions are too detailed, can you please point me to any document, if available, which might address this issue?

Thanks,
NB

↧

How does HaplotypeCaller treat Ns in the reference genome

March 2, 2015, 3:20 am

Hi there,

I am fairly new to using the GATK and was hoping someone could answer a little question I have.

I have been using HaplotypeCaller to call two variants at a locus (A and B). As I understand it, my calls may be subject to reference bias if the reference genome I use to call variants is based on individuals carrying allele A and allele B is fairly diverged from A. This will result in an underestimate of GQ and perhaps undercalling of variants for B. My question is, how does HaplotypeCaller treat Ns in the reference genome? Does it still estimate genotypes for these? Or does it treat the site as invariant?

Cheers

↧

GenotypeGVCFs on one sample

March 2, 2015, 1:39 am

Hi GATK team,

In the documentation of GenotypeGVCFs it is written:
"Input - One or more Haplotype Caller gVCFs to genotype"
I have 3 questions regarding this tool:
1. I wonder, what is the meaning of running it on one sample?
2. I tried to run it on one sample, and noticed that the genotype quality is different than the one in the original gvcf file from HC. What is causing this difference? I'm asking this since running the tool on one sample means that there are no other samples to consider in recalculating the quality.
3. Last question - From your experience, what is the best way to analyze one exome sample? Should I run HC with the default genotype_mode parameter and do hard filtering? Should I run HC in GVCF mode, run GenotypeGVCFs and then do hard filtering? Any other suggestion?

Thank you for the answer,
Maya

↧

Trimming a GVCF with "-L"

March 3, 2015, 1:03 pm

GATK team,

I currently have many WES gVCFs called with GATK 3.x HaplotypeCaller, and I'm now looking to combine them and run GenotypeGVCFs. Unfortunately, I forgot to add the "-L" argument to HC to reduce the size of the resulting gVCFs, and CombineGVCFs looks like it's taking much longer than I expect it to.

Is there any potential problem with using the "-L" argument to SelectVariants to reduce the size of my gVCFs and then use those smaller gVCFs in the CombineGVCFs stage (and beyond), or do I have to re-call HaplotypeCaller again? Would it be better to extend the boundaries of the target file by a certain amount to avoid recalling HaplotypeCaller?

Thanks,

John Wallace

↧

HaplotypeCaller MappingQualityZero always is 0

March 4, 2015, 5:47 am

Hi,

I am using the standard HaplotypeCaller/GenotypeGVCFs pipeline with the --includeNonVariantSites option.
I always get MQ0=0 for all SNPs and don't get it reported for non-variants, even if I add "--annotation MappingQualityZero" to both commands.
With UnifiedGenotyper, MQ0 values are reported correctly.

Is this a bug?

Also, with HaplotypeCaller, MQ is only reported for variants, not for non-variants. Is there a way to get this annotation reported for non-variants, as with UnifiedGenotyper?

I am using version 3.3-0, but colleagues have a similar problem with version 3.2-2.

Thanks,
Hannes

↧

How the HaplotypeCaller's reference confidence model works

April 10, 2014, 2:57 pm

This document describes the reference confidence model applied by HaplotypeCaller to generate genomic VCFs (gVCFs), invoked by -ERC GVCF or -ERC BP_RESOLUTION (see the FAQ on gVCFs for format details).

Please note that this document may be expanded with more detailed information in the near future.

How it works

The mode works by assembling the reads to create potential haplotypes, realigning the reads to their most likely haplotypes, and then projecting these reads back onto the reference sequence via their haplotypes to compute alignments of the reads to the reference. For each position in the genome we have either an ALT call (via the standard calling mechanism) or we can estimate the chance that some (unknown) non-reference allele is segregating at this position by examining the realigned reads that span the reference base. At this base we perform two calculations:

1. Estimate the confidence that no SNP exists at the site by contrasting all reads with the ref base vs all reads with any non-reference base.
2. Estimate the confidence that no indel of size < X (determined by command line parameter) could exist at this site by calculating the number of reads that provide evidence against such an indel, and from this value estimate the chance that we would not have seen the allele confidently.

Based on this, we emit the genotype likelihoods (PL) and compute the GQ (from the PLs) for the less confident of these two models.
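
As a small worked example of the "GQ from the PLs" step: with normalized PLs (best genotype at 0), GQ corresponds to the gap between the two smallest PL values, capped at 99. The sketch below illustrates that relationship only and is not the HaplotypeCaller implementation; the function name `gq_from_pl` is made up, and the two PL strings are taken from genotype records quoted in a later post on this page.

```shell
# Derive GQ from a comma-separated PL string: sort the PLs numerically and
# report the gap between the two smallest values, capped at 99.
gq_from_pl() {
  printf '%s\n' "$1" | tr ',' '\n' | sort -n | awk '
    NR == 1 { best = $1 }
    NR == 2 { gq = $1 - best; if (gq > 99) gq = 99; print gq }'
}

gq_from_pl 135,9,0     # prints 9
gq_from_pl 702,0,38    # prints 38
```

Both results match the GQ values reported alongside those PLs in the quoted records.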

We use a symbolic allele pair, <NON_REF>, to indicate that the site is not homozygous reference, and because we have an ALT allele we can provide allele-specific AD and PL field values.

For details of the gVCF format, please see the document that explains what is a gVCF.

↧

a strange variant from HC

March 8, 2015, 1:52 am

Hello GATK team,

I've noticed a strange variant in the gvcf output of HC:

Raw vcf:
6 32552140 rs17882663 T A,<NON_REF> 108.18 . DB;DP=0;MLEAC=2,0;MLEAF=1.00,0.00;MQ=0.00;MQ0=0 GT:GQ:PGT:PID:PL:SB 1/1:9:0|1:32552059_G_T:135,9,0,135,9,135:0,0,0,0

After GenotypeGVCFs:
6 32552140 rs17882663 T A 107.28 . AC=2;AF=1.00;AN=2;DB;DP=0;FS=0.000;GQ_MEAN=9.00;MLEAC=2;MLEAF=1.00;MQ=0.00;MQ0=0;NCC=0;SOR=0.693 GT:AD:GQ:PGT:PID:PL 1/1:0,0:9:1|1:32552059_G_T:135,9,0

The DP and AD are 0, but there is a variant (A).
What do you think? Why did this happen?

What is the difference between the DP in the FORMAT part and in the INFO part? I looked for an answer in the documentation, but couldn't find one. Below is an example of a big difference between the values of these two.
3 195511214 . G GACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC 673.77 . AC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693 GT:AD:DP:GQ:PL 0/1:0,0:0:38:702,0,38
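
To compare the two values across a whole file, something like the following could be used. It is a rough sketch that only pulls both numbers out for side-by-side inspection and does not explain why they differ; it assumes a tab-separated, single-sample VCF, and the function name `info_vs_format_dp` is made up.

```shell
# For each record, print CHROM, POS, the site-level INFO DP and the
# sample-level FORMAT DP ("NA" if a field is absent). Assumes one sample.
info_vs_format_dp() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }
       {
         info_dp = "NA"; fmt_dp = "NA"
         # INFO (column 8) is a semicolon-separated list of KEY=VALUE entries.
         n = split($8, info, ";")
         for (i = 1; i <= n; i++)
           if (info[i] ~ /^DP=/) info_dp = substr(info[i], 4)
         # FORMAT (column 9) names the fields of the sample column (column 10).
         n = split($9, keys, ":"); split($10, vals, ":")
         for (i = 1; i <= n; i++)
           if (keys[i] == "DP") fmt_dp = vals[i]
         print $1, $2, info_dp, fmt_dp
       }'
}

# The second record quoted above, tab-separated; prints "3 195511214 169 0".
printf '3\t195511214\t.\tG\tGACCTGTGGATGCTGAGGAAGTGTCGGTGACAGGAAGAGGGGTGGTGTC\t673.77\t.\tAC=1;AF=0.500;AN=2;DP=169;FS=0.000;GQ_MEAN=38.00;MLEAC=1;MLEAF=0.500;MQ=56.57;MQ0=0;NCC=0;SOR=0.693\tGT:AD:DP:GQ:PL\t0/1:0,0:0:38:702,0,38\n' | info_vs_format_dp
```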

Maya

↧

Combine GVCF files problem

March 9, 2015, 11:47 am

I used the following command to combine 3 VCF files which are outputs of HaplotypeCaller:

java -jar data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
-R data/ucsc.hg19.fasta \
-T CombineGVCFs \
--variant data/47V_post.ERC.vcf \
--variant data/48V_post.ERC.vcf \
--variant data/49V_post.ERC.vcf \
--out data/Combined_3files.vcf

However, after combining all 3 files, I can only see ./. genotypes in the final output VCF. What is the problem? How can I fix this?
Thanks

↧

GenotypeGVCFs problem with rsID

March 10, 2015, 11:27 am

I ran the following "GenotypeGVCFs" command on 3 VCF files output by HaplotypeCaller:

java -jar data/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar \
-R data/ucsc.hg19.fasta \
-T GenotypeGVCFs \
--variant data/47V_post.ERC.vcf \
--variant data/48V_post.ERC.vcf \
--variant data/49V_post.ERC.vcf \
--out data/Combined_geno_3files.vcf

but in the final VCF output there is no rsID information and the ID column is "." in every row.
What is the problem? I am really confused. Could you please advise how to get the SNP IDs in the output VCF?

Thanks

↧

HaplotypeCaller - treatment of scaffolds

March 11, 2015, 6:48 am

Hi Team,

1 BAM = 1 individual

my question is regarding the HaplotypeCaller and scaffolds in a BAM file.
When I want to do the individual SNP-calling procedure (--emitRefConfidence GVCF) before the Joint Genotyping,
I found that with my number of scaffolds the process is computationally quite costly.
I now ran HaplotypeCaller for every BAM one scaffold at a time (using -L).

Question is: Do you see any downside in this approach regarding the result quality?
Or are the scaffolds treated independently anyways and my approach is fine?

The next step would be to combine the gvcfs to a single one again (corresponding to the original BAM)
and then do joint genotyping on a cohort of gvcfs (-> cohort of individuals)
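
The per-scaffold approach described above could be scripted roughly like this. It is a dry-run sketch that only prints the commands instead of running them; the function name, the file names and the output naming scheme are made up, and the scaffold names are read from the first column of a FASTA index (.fai).

```shell
# Print (not run) one HaplotypeCaller command per scaffold listed in a
# FASTA index (.fai); the first column of a .fai file is the sequence name.
print_per_scaffold_commands() {
  fai=$1; ref=$2; bam=$3
  cut -f1 "$fai" | while IFS= read -r scaffold; do
    printf 'java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R %s -I %s -L %s --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o %s.%s.vcf\n' \
      "$ref" "$bam" "$scaffold" "${bam%.bam}" "$scaffold"
  done
}

# Example with a made-up two-scaffold index.
tmp_fai=$(mktemp)
printf 'scaffold_1\t5000\t12\t60\t61\nscaffold_2\t3000\t5100\t60\t61\n' > "$tmp_fai"
print_per_scaffold_commands "$tmp_fai" ref.fa sample.bam
rm -f "$tmp_fai"
```

Printing the commands first makes it easy to inspect them or hand them to a cluster scheduler, one job per scaffold.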

Thanks a lot!
Alexander

↧

GenotypeGVCFs long estimated runtime

March 19, 2015, 7:50 am

Hello!

I would like to run GenotypeGVCFs on 209 WES, called with HC (--emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000).

When I run GenotypeGVCFs, with this command (computing nodes have 8 cores and 24G of memory) :

java -Xmx24g -jar $GATK_JAR \
-R Homo_sapiens.GRCh37_decoy.fa \
-T GenotypeGVCFs \
-nt 8 \
-V gvcf.all.list \
-o calls.vcf

It estimates a huge runtime and just dies hanging:

INFO  10:27:00,790 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:00,795 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22 
INFO  10:27:00,796 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  10:27:00,796 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  10:27:00,800 HelpFormatter - Program Args: -R Homo_sapiens.GRCh37_decoy.fa -T GenotypeGVCFs -nt 8 -V gvcf.all.list -o calls.vcf 
INFO  10:27:00,810 HelpFormatter - Executing as emixaM@r107-n50 on Linux 2.6.32-504.12.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-ea-b07. 
INFO  10:27:00,810 HelpFormatter - Date/Time: 2015/03/19 10:27:00 
INFO  10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:00,811 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  10:27:04,719 GenomeAnalysisEngine - Strictness is SILENT 
INFO  10:27:04,882 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  10:27:41,565 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 1 CPU thread(s) for each of 8 data thread(s), of 8 processors available on this machine 
INFO  10:27:43,169 GenomeAnalysisEngine - Preparing for traversal 
INFO  10:27:43,179 GenomeAnalysisEngine - Done preparing for traversal 
INFO  10:27:43,179 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
INFO  10:27:43,180 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining 
INFO  10:27:43,180 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime 
INFO  10:27:44,216 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files 
INFO  10:28:15,035 ProgressMeter -       1:1000201         0.0    31.0 s      52.7 w        0.0%    27.0 h      27.0 h 
INFO  10:29:17,386 ProgressMeter -       1:1068701         0.0    94.0 s     155.8 w        0.0%    76.7 h      76.6 h 
INFO  10:30:18,055 ProgressMeter -       1:1115101         0.0     2.6 m     256.1 w        0.0%     5.0 d       5.0 d

What did I do wrong?

Cheers!

↧