i_variant_quality_by_depth/i_genotype_quality interpretation

July 5, 2018, 8:45 am

When interpreting the output of HaplotypeCaller, what do the i_variant_quality_by_depth and i_genotype_quality columns represent, and which of them is a good basis for assessing confidence in the variant call and its quality? What scale are they on? Or is there a different column that would be better?
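On the scale question, a minimal sketch may help. It assumes that i_genotype_quality corresponds to the VCF GQ field, which the VCF specification defines as Phred-scaled; that mapping is an assumption, since the column names in this post are not standard VCF names. A quality-by-depth value, by contrast, is a QUAL score normalized by read depth and is not itself a probability. The function names below are hypothetical.

```python
import math

def phred_to_error_probability(q):
    """Probability that the call is wrong for a Phred-scaled score q."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred-scaled score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# A few reference points on the Phred scale.
for q in (10, 20, 30, 99):
    print(q, phred_to_error_probability(q))
```

Under this reading, a score of 20 means a 1 in 100 chance that the call is wrong, and 30 means 1 in 1000.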

↧

a question about running HaplotypeCaller with intervals

July 16, 2018, 2:16 pm

Hi,

I have a question about running HaplotypeCaller with intervals on exome-seq data.
Here is the command I used:
java -jar gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /espresso/share/genomes/hg38/genome.fa -I recal_reads.bam -O variants.g.vcf -ERC GVCF -L capture.bed

However, when I ran the command, I got the following message:
17:13:14.439 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:13:14.591 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.591 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.6.0
17:13:14.591 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:13:14.591 INFO HaplotypeCaller - Executing as ... on Linux v2.6.32-431.29.2.el6.x86_64 amd64
17:13:14.592 INFO HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_121-b13
17:13:14.592 INFO HaplotypeCaller - Start Date/Time: July 16, 2018 5:13:14 PM EDT
17:13:14.592 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO HaplotypeCaller - HTSJDK Version: 2.16.0
17:13:14.592 INFO HaplotypeCaller - Picard Version: 2.18.7
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:13:14.593 INFO HaplotypeCaller - Deflater: IntelDeflater
17:13:14.593 INFO HaplotypeCaller - Inflater: IntelInflater
17:13:14.593 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:13:14.593 INFO HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:13:14.593 INFO HaplotypeCaller - Initializing engine
17:13:15.037 INFO FeatureManager - Using codec BEDCodec to read file file:///capture.bed
17:13:16.883 INFO IntervalArgumentCollection - Processing 64190747 bp from intervals
17:13:17.009 INFO HaplotypeCaller - Shutting down engine
[July 16, 2018 5:13:17 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2041053184
java.lang.NullPointerException
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:325)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:202)
at java.util.Arrays.sort(Arrays.java:1312)
at java.util.Arrays.sort(Arrays.java:1506)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:141)
at org.broadinstitute.hellbender.utils.IntervalUtils.sortAndMergeIntervals(IntervalUtils.java:459)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:956)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:971)
at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.<init>(MultiIntervalLocalReadShard.java:59)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.makeReadShards(AssemblyRegionWalker.java:195)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:175)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:133)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

I did not see any error message, but it seems HaplotypeCaller did not run, and there is no output.
I would really appreciate any help.

Thank you!

Best,
Siyu
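The trace fails while sorting the intervals read from the BED file, so one inexpensive thing to rule out is a mismatch between the BED rows and the reference. Nothing in the log confirms that this is the cause here; the following is only a sanity-check sketch. It assumes a tab-separated BED file and a standard FASTA index (.fai) whose first two columns are contig name and length. The function names are hypothetical.

```python
def read_fai(fai_lines):
    """Map contig name -> length from the lines of a FASTA .fai index."""
    rows = (line.rstrip("\n").split("\t") for line in fai_lines if line.strip())
    return {row[0]: int(row[1]) for row in rows}

def check_bed_intervals(bed_lines, contig_lengths):
    """Return (line_number, reason) for BED rows that look problematic."""
    problems = []
    for number, line in enumerate(bed_lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append((number, "fewer than 3 tab-separated fields"))
            continue
        contig, start_text, end_text = fields[:3]
        if contig not in contig_lengths:
            problems.append((number, f"contig {contig!r} not in reference"))
            continue
        try:
            start, end = int(start_text), int(end_text)
        except ValueError:
            problems.append((number, "non-integer coordinates"))
            continue
        # BED is 0-based, half-open: 0 <= start < end <= contig length.
        if start < 0 or end <= start or end > contig_lengths[contig]:
            problems.append((number, "coordinates outside contig or empty interval"))
    return problems

# Hypothetical data for illustration only.
fai = read_fai(["chr1\t1000\t6\t60\t61\n"])
bed = ["chr1\t0\t100\n", "1\t0\t100\n", "chr1\t900\t1100\n"]
for problem in check_bed_intervals(bed, fai):
    print(problem)
```

With real files, the same functions could be fed `open("genome.fa.fai")` and `open("capture.bed")`.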

↧

can VariantsToTable output the raw genotype call (i.e., 0/1) rather than the actual basecall (A/T)?

October 31, 2016, 4:17 pm

I'm interested in getting simple "heterozygous" or "homozygous" designations for all of the samples/SNPs in my multisample VCF file. In the past, I have been using the -GF GT option in VariantsToTable, and then annotating my basecalls in Excel as either heterozygous or homozygous. This takes forever since Excel isn't really built for big data like this. Is there a simple way to output all of the SNPs as 0/1, 0/0, 0/1, or 1/1 instead of C/A, A/A, G/T, C/C?
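The Excel step described above can be scripted. The sketch below is one possible approach, not a VariantsToTable feature: it assumes the table holds basecall genotypes such as C/A, and, for the index form, that REF and ALT columns are also exported (with ALT comma-separated at multi-allelic sites). All function names are hypothetical.

```python
import re

def split_genotype(gt):
    """Split a genotype such as 'C/A' or 'C|A' into its alleles."""
    return re.split(r"[/|]", gt)

def zygosity(gt):
    """Label a basecall genotype as homozygous, heterozygous or missing."""
    alleles = split_genotype(gt)
    if "." in alleles:
        return "missing"
    return "homozygous" if len(set(alleles)) == 1 else "heterozygous"

def to_index_genotype(gt, ref, alt):
    """Convert 'C/A' to '0/1' style, given REF and the ALT column."""
    order = [ref] + alt.split(",")
    separator = "|" if "|" in gt else "/"
    return separator.join(
        "." if allele == "." else str(order.index(allele))
        for allele in split_genotype(gt)
    )

# The four example genotypes from the question, with assumed REF/ALT values.
for gt, ref, alt in [("C/A", "C", "A"), ("A/A", "A", "G"),
                     ("G/T", "G", "T"), ("C/C", "A", "C")]:
    print(gt, zygosity(gt), to_index_genotype(gt, ref, alt))
```

Note that zygosity can be decided from the basecalls alone, whereas the 0/1 form needs REF and ALT to tell 0/0 from 1/1.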

↧

HaplotypeCaller calls incorrect genotype at several sites

July 24, 2018, 6:39 am

Hi,
I found that HaplotypeCaller made heterozygous calls where there is no support for them in the BAM. We used IGV to compare the input BAM with the HaplotypeCaller output BAM, and the region shown in the figure confused us. The top track of the figure is the input BAM; the other is the HaplotypeCaller output BAM. The HaplotypeCaller output GVCF also suggests that this site is heterozygous.
It seems to be the same issue as https://gatkforums.broadinstitute.org/gatk/discussion/2319/haplotypecaller-incorrectly-making-heterozygous-calls-again. In that thread, your suggested solution was to update GATK. However, we used GATK 3.8 and GATK 4.0.6 and got the same results.
The command line we used is:
~/software/gatk-4.0.6.0/gatk --java-options "-Xmx30G" HaplotypeCaller -L chr01:9550000-9850000 -ERC GVCF -R -I -O <output_g.vcf> -bamout

↧

Short read data in highly repetitive genomic region for heterozygous individuals

July 25, 2018, 1:37 pm

Hello GATK team,

This might be a very general and perhaps overrated question, but I would appreciate your input. I am working with natural populations of plants (individuals are expected to be highly heterozygous) and an enriched genomic region that contains some promoters of interest together with transposons, duplications, and many expected indels and SVs, including a potential paralog for one of our BACs. Unfortunately, the long-read sequencing is not ready yet, so I am using the 2×75 bp data and our BAC sequences as references to test how close we can get with HaplotypeCaller to SNP and short indel calls for an association analysis. Our coverage distribution seems to be heavily biased towards areas with duplications and potential TEs, and most assemblers based on local assembly are thrown off by our data. I have used very strict mapping parameters to avoid this problem with misaligned reads, given that we can't rule out hyper-variable regions.

I understand that aiming for genotype calls is dangerous given our kind of data and the lack of a reference genome, so I am aiming to include the genotype likelihoods in the association analysis. With HaplotypeCaller I get a VCF file for my population with associated PL values. My question is basically whether, given our type of data, you think the local assembly inherent to HaplotypeCaller will give us false-positive variants in the final output. Do you have any suggestions for alternative tools to get genotype likelihoods (without local assembly?) and feed those into an association analysis tool?

I really appreciate your insight.

Best,

↧

Distribution of RGQ scores

September 11, 2015, 5:29 am

I work with non-human genomes and commonly need the confidence of the reference sites, so I was happy to see the inclusion of the RGQ score in the format field of GenotypeGVCFs. However, I am a little confused as to what this score means (how it is calculated). Out of curiosity I plotted the distribution of RGQ and GQ scores over ~1Mbp. A few things jumped out that I was hoping you could explain:

(1) There are two peaks of GQ and RGQ scores: one at 99, which is obviously just the highest confidence score, and another at exactly GQ/RGQ = 45. You can see this in the GQ/RGQ distribution below, from which I've excluded the sites where RGQ/GQ = 0 or 99 (RGQ = blue, GQ = red). Is there some reason why so many GT calls have a score of exactly 45?

(2) There are very few GQ = 0 calls and ~96% are GQ = 99, but for RGQ ~42% are 0 and 54% are 99. Is there any explanation for why so many RGQ scores are 0? I fear that filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls.
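A tabulation like the one described above can be reproduced with a few lines of Python. This is a minimal sketch that assumes an uncompressed, tab-separated VCF whose FORMAT column carries GQ at variant sites and RGQ at reference sites; the function names and the example records are hypothetical.

```python
from collections import Counter

def tally_quality_scores(vcf_lines):
    """Count GQ and RGQ values over all samples in the given VCF lines."""
    counts = {"GQ": Counter(), "RGQ": Counter()}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        keys = fields[8].split(":")
        for sample in fields[9:]:
            values = dict(zip(keys, sample.split(":")))
            for tag in counts:
                value = values.get(tag, ".")
                if value != ".":
                    counts[tag][int(value)] += 1
    return counts

def fraction_at(counter, score):
    """Fraction of tallied values equal to the given score."""
    total = sum(counter.values())
    return counter[score] / total if total else 0.0

# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "chr1\t10\t.\tA\tG\t50\t.\t.\tGT:GQ\t0/1:99\n",
    "chr1\t11\t.\tA\t.\t.\t.\t.\tGT:RGQ\t0/0:0\n",
    "chr1\t12\t.\tA\t.\t.\t.\t.\tGT:RGQ\t0/0:45\n",
]
counts = tally_quality_scores(example)
print(fraction_at(counts["RGQ"], 0), fraction_at(counts["GQ"], 99))
```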

↧

Issue of Haplotype call on a large chromosome (>536 Mb)

August 1, 2018, 2:57 am

Hi,
I tried to run HaplotypeCaller in GVCF mode. My reference genome is over 5 Gb in size. My command and the error are below:

Using GATK jar /source/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -XX:+UseSerialGC -Xmx100g -jar /source/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /data/Pseudomolecule_v3.fasta -L /IntervalFiles/0003-scattered.intervals -I WGS_FTNO.cram -O result/0003-scattered.vcf.gz -mbq 20 --native-pair-hmm-threads 4 -ERC GVCF --verbosity ERROR
[August 1, 2018 11:32:11 AM CEST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2076049408
htsjdk.samtools.SAMException: Exception creating BAM index for slice slice: seqID 1, start 536834320, span 457789, records 259850.
at htsjdk.samtools.CRAMBAIIndexer.processSingleReferenceSlice(CRAMBAIIndexer.java:194)
at htsjdk.samtools.cram.CRAIIndex.openCraiFileAsBaiStream(CRAIIndex.java:180)
at htsjdk.samtools.SamIndexes.asBaiSeekableStreamOrNull(SamIndexes.java:78)
at htsjdk.samtools.CRAMFileReader.initWithStreams(CRAMFileReader.java:228)
at htsjdk.samtools.CRAMFileReader.<init>(CRAMFileReader.java:219)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:422)
at htsjdk.samtools.SamReaderFactory.open(SamReaderFactory.java:105)
at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:227)
at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:162)
at org.broadinstitute.hellbender.engine.GATKTool.initializeReads(GATKTool.java:387)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:636)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:156)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:133)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32770
at htsjdk.samtools.CRAMBAIIndexer$BAMIndexBuilder.processSingleReferenceSlice(CRAMBAIIndexer.java:354)
at htsjdk.samtools.CRAMBAIIndexer$BAMIndexBuilder.access$100(CRAMBAIIndexer.java:227)
at htsjdk.samtools.CRAMBAIIndexer.processSingleReferenceSlice(CRAMBAIIndexer.java:192)
... 17 more

Does GATK4 handle a single large chromosome? Is there any solution?
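One possible reading of the trace, offered as an assumption rather than a diagnosis: the BAI index format described in the SAM specification only addresses coordinates up to 2^29 (512 Mbp) per reference sequence, and the trace passes through a CRAI-to-BAI conversion (CRAIIndex.openCraiFileAsBaiStream). The arithmetic below checks the slice reported in the exception against that limit. Treating "span" as the slice length in bases is also an assumption, and the function names are hypothetical.

```python
# Largest coordinate the BAI binning scheme can address (2^29 = 536,870,912).
BAI_MAX_COORDINATE = 1 << 29

def slice_end(start, span):
    """Last position covered by a slice, assuming span is its length in bases."""
    return start + span - 1

def exceeds_bai_limit(start, span):
    """True if any part of the slice lies beyond the BAI coordinate limit."""
    return slice_end(start, span) > BAI_MAX_COORDINATE

# Values copied from the exception message above.
start, span = 536834320, 457789
print(BAI_MAX_COORDINATE, slice_end(start, span), exceeds_bai_limit(start, span))
```

Under these assumptions the slice starts just below the limit and ends beyond it, which is consistent with the ">536 Mb" in the thread title.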

↧

Mutect2 missed variant called by HaplotypeCaller

August 1, 2018, 5:50 am

Hi,

I am running GATK 3.5.0 with Java version 1.8.0. I have two cell line samples that I paired with a Promega baseline reference (it's essentially a mixed germline sample) to run Mutect2 (which I am aware is not part of the Best Practices). I also ran the tumour sample alone using HaplotypeCaller and noticed a very clear ALK variant that was missed by Mutect2 but called by HaplotypeCaller in both samples. Due to the nature of the cell line, we also expected to see an ALK variant, which is why it was detected.

What I find odd is that the local reassembly of Mutect2 seems to have discarded the variant, as the bamout does not contain the variant (C > T) at locus chr2:29443695, whereas the HaplotypeCaller call does for both samples. I have read through the documentation and the specifics of the local reassembly and would be very interested in knowing at what stage this occurs and what you suggest can be done.

I will be trying GATK v4.0 as well as some of the things mentioned here: https://software.broadinstitute.org/gatk/documentation/article?id=1235. In the meantime, I would be very grateful if someone could look into this. I will post updates on my new tests as well. See details below on various metrics and IGV screenshots.

The chemistry is a DNA capture Kapa hyperplus kit, 75 paired end reads.

Sample 945

Entire ALK covered up to 80X
Mean/min coverage 1013/378
BWA bam shows 50% allele frequency

HaplotypeCaller line Sample 945

chr2 29443695 . G T 8496.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=5.863;ClippingRankSum=-0.368;DP=601;ExcessHet=3.0103;FS=0.536;MLEAC=1;MLEAF=0.500;MQ=62.46;MQRankSum=1.113;QD=14.21;ReadPosRankSum=0.502;SOR=0.76 GT:AD:DP:GQ:PL 0/1:300,298:598:99:8525,0,8240

Sample 946

Entire ALK covered up to 80x
Mean/min coverage 523/204
BWA bam shows 49% allele frequency

HaplotypeCaller line Sample 946

chr2 29443695 . G T 5056.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.569;ClippingRankSum=-0.212;DP=397;ExcessHet=3.0103;FS=2.133;MLEAC=1;MLEAF=0.500;MQ=63.61;MQRankSum=-1.274;QD=13.00;ReadPosRankSum=0.063;SOR=0.595 GT:AD:DP:GQ:PL 0/1:199,190:389:99:5085,0,5319

Promega control sample

Same control sample used as pair for both 945 and 946 using Mutect
Coverage around ALK region ~200+
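As a cross-check on the two HaplotypeCaller records above, the allele fractions quoted from the BWA BAMs can be recomputed from the AD field. The sketch below also divides QUAL by the sample DP; using the sample-level DP as the denominator for QD is an assumption, although it can be compared directly with the QD values reported in the records. The helper names are hypothetical.

```python
def parse_sample(format_keys, sample_values):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(format_keys.split(":"), sample_values.split(":")))

def alt_allele_fraction(ad_field):
    """Fraction of AD reads that support the first ALT allele."""
    depths = [int(x) for x in ad_field.split(",")]
    return depths[1] / sum(depths)

def summarize(qual, format_keys, sample_values):
    """Return the ALT allele fraction and QUAL per unit of sample depth."""
    sample = parse_sample(format_keys, sample_values)
    return {
        "alt_fraction": alt_allele_fraction(sample["AD"]),
        "qual_per_depth": qual / int(sample["DP"]),
    }

# QUAL, FORMAT and sample columns copied from the two records above.
records = {
    "945": (8496.77, "GT:AD:DP:GQ:PL", "0/1:300,298:598:99:8525,0,8240"),
    "946": (5056.77, "GT:AD:DP:GQ:PL", "0/1:199,190:389:99:5085,0,5319"),
}
for name, record in records.items():
    summary = summarize(*record)
    print(name, round(summary["alt_fraction"], 3), round(summary["qual_per_depth"], 2))
```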

Please see the IGV images of the various cases below. The --bamout run (together with disabling optimizations and forcing output) used 500 bp of padding upstream and downstream of the target that contains the variant (i.e., the actual padding upstream and downstream of the variant itself at locus 29443695 is slightly more than 500 bp). I also ran Mutect with the adjusted 500 bp padding but included all the targets on chr2, without adding padding to any target other than the one that contains the variant.

Sample945_bwaBAM - Bam output from BWA

Sample946_bwaBAM - Bam output from BWA

Sample945_GATKForcedBamOut

Sample946_GATKForcedBamOut

Sample945_MutectForcedBamOutChr2

Sample946_MutectForcedBamOutChr2

Sample945_MutectForcedBamOutALKOnly

Sample946_MutectForcedBamOutALKOnly

Thank you very much. I look forward to hearing your thoughts on this.
Sabri

↧

HaplotypeCaller and Reduced BAMs

August 3, 2012, 6:19 am

Hi,

I would like to ask if the most recent version of GATK is stable enough for HaplotypeCaller to work well with Reduced BAMs. If not, can you give an estimate of when would that be in place? Thanks in advance.

↧

Release notes for GATK version 2.1

August 20, 2012, 11:52 am

Base Quality Score Recalibration

Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
Implemented support for SOLiD no call strategies other than throwing an exception.
Fixed smoothing in the BQSR bins.
Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.

Unified Genotyper

Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.

Haplotype Caller

Added LowQual filter to the output when appropriate.
Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
Fixed bug where non-standard bases from the reference would cause errors.
Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.

Reduce Reads

Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.

Variant Eval

Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
Fixed incorrect allele counting in IndelSummary evaluation.

Combine Variants

Now outputs the first non-MISSING QUAL, instead of the maximum.
Now supports multi-threaded running (with the -nt argument).

Select Variants

Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).

Miscellaneous

Updated and improved the BadCigar read filter.
GATK now generates a proper error when a gzipped FASTA is passed in.
Various improvements throughout the BCF2-related code.
Removed various parallelism bottlenecks in the GATK.
Added support of X and = CIGAR operators to the GATK.
Catch NumberFormatExceptions when parsing the VCF POS field.
Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
Added support for handling complex events in ValidateVariants.
Picard jar remains at version 1.67.1197.
Tribble jar remains at version 110.

↧

Error in Haplotype Caller

August 7, 2012, 8:02 am

Hi,

I am trying to run the latest version (GenomeAnalysisTK-2.0-35-g2d70733) of the HaplotypeCaller on some .bam files that I had prepared according to the Best Practice v.3. Now GATK reports the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: T
at org.broadinstitute.sting.utils.variantcontext.VariantContext.makeAlleles(VariantContext.java:1328)
at org.broadinstitute.sting.utils.variantcontext.VariantContext.<init>(VariantContext.java:304)
at org.broadinstitute.sting.utils.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:518)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.generateVCsFromAlignment(GenotypingEngine.java:604)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoodsAndCallIndependentEvents(GenotypingEngine.java:198)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:414)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:104)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:246)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:202)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:177)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:134)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:27)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.0-35-g2d70733):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Duplicate allele added to VariantContext: T
ERROR ------------------------------------------------------------------------------------------

Now I am assuming my old bam files are not compatible with the new HaplotypeCaller. Is that correct?

Thank you for your help,
K

↧

Does HaplotypeCaller cost a lot of RAM?

August 24, 2012, 7:55 am

I tried to run several HaplotypeCaller jobs simultaneously on a node. Sometimes, though not always, they caused the node to crash. I noticed that HaplotypeCaller sometimes occupied tens of GB of memory. Is this normal? The average depth for my samples is 30x-50x.

↧

haplotypecaller indel format

September 6, 2012, 1:05 am

Hi there,
I've finished a run of HaplotypeCaller on my samples. I'm now analysing everything with snpEff, although I'm doing this "outside" GATK. I had to stop the analysis because of a huge number of errors, all dealing with indels, such as:

Error while processing VCF entry (line 580649) :
    chr21   26718345    .   TAATCCTGAGTTTAA TATCCTAAATGTTTAC    943.26  […]
java.lang.RuntimeException: Insertion '-A+AT' does not start with '+'. This should never happen!
    chr21   35260360    .   CATAACAGTTCAT   AGAGACAGAG  425.22  […]
java.lang.RuntimeException: Deletion '+G-TTC' does not start with '-'. This should never happen!

Of course, this is a snpEff error, nevertheless the Indel format looks quite different from what I've ever seen. Consider the first line above: shouldn't it be like

chr21   26718345    .   AT  T   943.26  […]

(I can't resolve the second right now).
Any hint is appreciated at this point. I'm writing to the snpEff developer for the same reason...
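To see whether the two records above reduce to simple indels, one can trim the bases that REF and ALT share. The sketch below is a generic trimming routine written for illustration, not the logic of snpEff or GATK: it removes shared trailing bases, then shared leading bases, always leaving at least one base in each allele. The function name is hypothetical.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT; return the adjusted (pos, ref, alt)."""
    # Remove shared trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Remove shared leading bases, shifting the position to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# The two records quoted in the snpEff errors above.
print(trim_alleles(26718345, "TAATCCTGAGTTTAA", "TATCCTAAATGTTTAC"))
print(trim_alleles(35260360, "CATAACAGTTCAT", "AGAGACAGAG"))
```

After trimming, both records still have multi-base REF and ALT alleles that differ at their first and last bases, so under this routine they look like complex substitutions rather than a single insertion or deletion.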

↧

HaplotypeCaller with Queue

September 13, 2012, 8:03 am

Is there an example of a Queue script that calls variants with HaplotypeCaller?
Thanks!
Francesco

↧

HaplotypeCaller --fullHaplotype not outputting full haplotype

August 13, 2012, 10:33 am

I am also having a problem with HaplotypeCaller - when I set the flag to print out full haplotypes, no haplotype files are created. Here is my command:

java -jar /home/chodon/applications/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/chodon/comb_eny/abyss/final_contigs/A2/Assembly.fasta -I /home/chodon/comb_eny/abyss/final_contigs/A2/bowtie2_out/A2.sorted.bam --fullHaplotype --ignoreLaneInfo -o /home/chodon/comb_eny/abyss/final_contigs/A2/GATK_haplo2.out

Any help is much appreciated! Thanks, C

↧

Expected file size - Haplotype Caller

October 17, 2012, 8:56 am

Hi All,
I've been attempting to use the haplotype caller on my 50x coverage exome data. The bam being parsed is about 12G. Each time, the caller runs for many hours and then the output is only the header of the VCF - no errors seen. I'm wondering if this is due to limited space on my drives or if the expected file size is much larger than I am anticipating.

Command:

GenomeAnalysisTK.jar -T HaplotypeCaller -R  Homo_sapiens_assembly19.fasta -I input.bam --dbsnp dbsnp_132.b37.nochr.vcf  -stand_call_conf 30 -stand_emit_conf 10 -o output.Haplotypes.vcf

↧

HaplotypeScore not annotated?

October 23, 2012, 3:00 am

Hi there,
I'm running into an issue with my Queue pipeline, with variants called with HaplotypeCaller.
Once I get the raw VCF file, I use VariantAnnotator before VQSR: however, no HaplotypeScore annotation is being produced, resulting in a failure of the VariantRecalibrator step where 'HaplotypeScore' was indicated as an annotation.

I tried to correct the issue by telling VariantAnnotator to use all annotations:

  class AnnotationArguments (t: Target) extends VariantAnnotator with UNIVERSAL_GATK_ARGS {
    this.reference_sequence = qscript.referenceFile
    // Set the memory limit to 7 gigabytes on each command.
    this.memoryLimit = 7
    this.input_file :+= qscript.bamFile
    this.useAllAnnotations
    this.D = qscript.dbSNP_b37
  }

But I still can't get that annotation in the annotated VCF. Here is an example of a variant:

 AC=5;AF=0.078;AN=64;ActiveRegionSize=179;ClippingRankSum=-0.568;DB;DP=2025;EVENTLENGTH=0;FS=4.139;InbreedingCoeff=-0.0847;MLEAC=5;MLEAF=0.078;MQ=69.98;MQRankSum=-1.428;NVH=1;NumHapAssembly=15;NumHapEval=13;QD=17.20;QDE=17.20;ReadPosRankSum=-1.762;TYPE=SNP;extType=SNP

Any suggestions on what I might be doing wrong?

thanks very much for your help,
Francesco

↧

Variant Recalibration - Number of Whole Exome Samples Needed and Where?

October 27, 2012, 12:32 am

Hello,

I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.

The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?

Is there any difference or reason to choose one of the following two workflows over the other?

Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller and then with my multi-sample VCF perform the variant recalibration procedure and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Take 30+ individual vcf files, merge them together, and then perform variant recalibration on the merged vcf and then split the individual call sets out of the multi-sample vcf with SelectVariants?
Or some third option I'm missing

Any help is appreciated.

Thanks

↧

EMIT_ALL_CONFIDENT_SITES for bacteria

September 6, 2012, 5:39 pm

Hello,
I am running the variant caller to identify SNPs and Reference Calls in a bacterial genome, which means I am running with -ploidy 1, -glm POOLSNP and -prnm POOL as suggested in other regions of this forum. The tool does an excellent job when just looking for Variants, but when I attempt to EMIT_ALL_CONFIDENT_SITES, it just spits out the SNPs and not the reference calls. When I remove the arguments stating that it is ploidy of 1, it works fine but calls SNPs that shouldn't be there since it's assuming diploid. Thus, I would really like to be able to emit all sites in ploidy=1 mode. Any reason why this is not possible?
Thanks for your help!
John

↧

haplotypecaller doesn't emit genotype at all sites

October 11, 2012, 11:54 am

I've tried to get genotypes for all sites provided in an interval file using HaplotypeCaller. With UnifiedGenotyper, I can get this result by setting "output_mode EMIT_ALL_SITES", but HaplotypeCaller doesn't report as expected with "output_mode EMIT_ALL_SITES". Even when I set "genotypeFullActiveRegion" or "fullHaplotype", HaplotypeCaller doesn't seem to emit genotypes at all sites. How can I get the desired result using HaplotypeCaller?
Thanks!

↧