When interpreting the output of HaplotypeCaller, what do the i_variant_quality_by_depth and i_genotype_quality columns represent, and which of these would be a good value on which to base an assessment of confidence in the variant call and its quality? What scale are they on? Or is there a different column that would be better?
i_variant_quality_by_depth/i_genotype_quality interpretation
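Assuming these two columns are derived from the standard VCF annotations QD (variant quality divided by depth) and GQ (genotype quality), both build on the Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below is only illustrative: the function names are hypothetical, and the QUAL and depth figures are borrowed from the Sample 945 record quoted later in this listing.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled score to the probability that the call is wrong."""
    return 10 ** (-q / 10)


def quality_by_depth(qual, depth):
    """Normalise a variant QUAL score by the read depth supporting the site."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return qual / depth


# GQ is capped at 99 in GATK output, so 99 means "at least this confident".
for gq in (20, 45, 99):
    print(gq, phred_to_error_prob(gq))

# QUAL=8496.77 with 598 informative reads, as in the Sample 945 record.
print(round(quality_by_depth(8496.77, 598), 2))  # 14.21, matching QD=14.21
```

Under that assumption, the genotype quality speaks to confidence in the genotype assigned to one sample, while quality by depth speaks to the site-level call normalised for coverage, so the two answer different questions.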
a question about running HaplotypeCaller with intervals
Hi,
I have a question about running HaplotypeCaller with intervals on exome-seq data.
Here is the command I used:
java -jar gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /espresso/share/genomes/hg38/genome.fa -I recal_reads.bam -O variants.g.vcf -ERC GVCF -L capture.bed
However, when I ran the command, I got the following message:
17:13:14.439 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:13:14.591 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.591 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.6.0
17:13:14.591 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
17:13:14.591 INFO HaplotypeCaller - Executing as ... on Linux v2.6.32-431.29.2.el6.x86_64 amd64
17:13:14.592 INFO HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_121-b13
17:13:14.592 INFO HaplotypeCaller - Start Date/Time: July 16, 2018 5:13:14 PM EDT
17:13:14.592 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO HaplotypeCaller - ------------------------------------------------------------
17:13:14.592 INFO HaplotypeCaller - HTSJDK Version: 2.16.0
17:13:14.592 INFO HaplotypeCaller - Picard Version: 2.18.7
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:13:14.592 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:13:14.593 INFO HaplotypeCaller - Deflater: IntelDeflater
17:13:14.593 INFO HaplotypeCaller - Inflater: IntelInflater
17:13:14.593 INFO HaplotypeCaller - GCS max retries/reopens: 20
17:13:14.593 INFO HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
17:13:14.593 INFO HaplotypeCaller - Initializing engine
17:13:15.037 INFO FeatureManager - Using codec BEDCodec to read file file:///capture.bed
17:13:16.883 INFO IntervalArgumentCollection - Processing 64190747 bp from intervals
17:13:17.009 INFO HaplotypeCaller - Shutting down engine
[July 16, 2018 5:13:17 PM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2041053184
java.lang.NullPointerException
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:325)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:202)
at java.util.Arrays.sort(Arrays.java:1312)
at java.util.Arrays.sort(Arrays.java:1506)
at java.util.ArrayList.sort(ArrayList.java:1454)
at java.util.Collections.sort(Collections.java:141)
at org.broadinstitute.hellbender.utils.IntervalUtils.sortAndMergeIntervals(IntervalUtils.java:459)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:956)
at org.broadinstitute.hellbender.utils.IntervalUtils.getIntervalsWithFlanks(IntervalUtils.java:971)
at org.broadinstitute.hellbender.engine.MultiIntervalLocalReadShard.<init>(MultiIntervalLocalReadShard.java:59)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.makeReadShards(AssemblyRegionWalker.java:195)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:175)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:133)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
I did not see any error but it seems HaplotypeCaller did not run and there is no output.
I would really appreciate any help you can give.
Thank you!
Best,
Siyu
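The thread does not establish what triggers the NullPointerException, but since the trace passes through the interval-handling code, a generic sanity check of the BED file against the reference index is a cheap first step. The sketch below is an assumption-laden helper, not part of GATK: the function name is hypothetical, and it only checks that each BED record names a contig listed in the .fai index and has sensible coordinates.

```python
def check_bed_against_fai(bed_lines, fai_lines):
    """Return (line_number, message) pairs for BED records that look suspicious."""
    # A .fai index has one tab-separated line per contig: name, length, offset, ...
    contig_lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            contig_lengths[fields[0]] = int(fields[1])

    problems = []
    for number, line in enumerate(bed_lines, start=1):
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            problems.append((number, "fewer than three columns"))
            continue
        contig = fields[0]
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append((number, "non-numeric coordinates"))
            continue
        if contig not in contig_lengths:
            problems.append((number, f"contig {contig!r} not in reference index"))
        elif start < 0 or end <= start:
            problems.append((number, "start/end out of order"))
        elif end > contig_lengths[contig]:
            problems.append((number, "end beyond contig length"))
    return problems


# Tiny made-up example; in practice the lines would come from the BED and .fai files.
fai = ["chr1\t1000\t6\t60\t61\n"]
bed = ["chr1\t0\t100\n", "1\t0\t100\n", "chr1\t900\t1100\n"]
for number, message in check_bed_against_fai(bed, fai):
    print(number, message)
```

If the check reports nothing, the BED file is at least consistent with the reference; the same comparison against the contigs in the BAM header would be the natural next thing to try.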
can VariantsToTable output the raw genotype call (i.e., 0/1) rather than the actual basecall (A/T)?
I'm interested in getting simple "heterozygous" or "homozygous" designations for all of the samples/SNPs in my multisample VCF file. In the past, I have been using the -GF GT option in VariantsToTable, and then annotating my basecalls in Excel as either heterozygous or homozygous. This takes forever since Excel isn't really built for big data like this. Is there a simple way to output all of the SNPs as 0/0, 0/1, or 1/1 instead of C/A, A/A, G/T, C/C?
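Independently of what VariantsToTable can emit, the heterozygous/homozygous labelling itself can be scripted instead of done in Excel. A minimal sketch, assuming a tab-delimited table whose genotype columns end in ".GT" (that naming and the helper names are assumptions): a genotype is homozygous when all of its alleles are identical, whether they are written as bases or as indices.

```python
import csv
import io


def zygosity(gt):
    """Classify a genotype string such as 'C/A', 'A/A', '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return "missing"
    return "homozygous" if len(set(alleles)) == 1 else "heterozygous"


def annotate_table(handle):
    """Yield each row with every '*.GT' column replaced by its zygosity label."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield {k: (zygosity(v) if k.endswith(".GT") else v) for k, v in row.items()}


# Made-up two-sample table for illustration.
table = "CHROM\tPOS\ts1.GT\ts2.GT\nchr1\t100\tC/A\tA/A\nchr1\t200\tG/T\t./.\n"
for row in annotate_table(io.StringIO(table)):
    print(row)
```

Because the comparison only asks whether the alleles match, it distinguishes heterozygous from homozygous calls but does not by itself tell homozygous-reference from homozygous-alternate; that would need the REF column as well.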
HaplotypeCaller calls incorrect genotypes at several sites
Hi,
I found that HaplotypeCaller made heterozygous calls where there is no support for them in the BAM. We used IGV to compare the input BAM with the HaplotypeCaller output BAM, and the region shown in the figure confused us. The input BAM is at the top of the figure and the HaplotypeCaller output BAM is below it. The HaplotypeCaller output GVCF also suggests that this site is heterozygous.
It seems to be the same issue as https://gatkforums.broadinstitute.org/gatk/discussion/2319/haplotypecaller-incorrectly-making-heterozygous-calls-again. In that thread, your suggested solution was to update GATK. However, we used both GATK 3.8 and GATK 4.0.6 and got the same results.
The command line we used is:
~/software/gatk-4.0.6.0/gatk --java-options "-Xmx30G" HaplotypeCaller -L chr01:9550000-9850000 -ERC GVCF -R -I -O <output_g.vcf> -bamout
Short read data in highly repetitive genomic region for heterozygous individuals
Hello GATK team,
This might be a very general question, but I would appreciate your input. I am working with natural populations of plants (expected to be highly heterozygous individuals) and an enriched genomic region which contains some promoters of interest together with transposons, duplications and a lot of expected indels and SVs, including a potential paralog for one of our BACs. Unfortunately the long-read sequencing is not yet ready, so I am using the 2x75 bp data and our BAC sequences as references to test how close we can get with HaplotypeCaller to SNP and short indel calls for an association analysis. Our coverage distribution seems to be heavily biased towards areas with duplications and potential TEs, and most of the tools based on local assembly are thrown off by our data. I have used very strict mapping parameters to avoid problems with misaligned reads, given that we can't discard the possibility of having hyper-variable regions.
I understand that aiming for genotype calls is dangerous given our kind of data and the lack of a reference genome, so I am aiming to include the genotype likelihoods in the association analysis. With HaplotypeCaller I get a VCF file for my population with an associated PL value for each genotype. My question is basically this: given our type of data, do you think the local assembly inherent to HaplotypeCaller will give us false-positive variants in the final output? Do you have any suggestions or alternative tools for getting genotype likelihoods (without local assembly?) and feeding them into an association analysis tool?
I really appreciate your insight.
Best,
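For the part of the question about feeding likelihoods into an association tool: PL values are Phred-scaled, normalised genotype likelihoods, so they can be converted back with 10^(-PL/10). A minimal sketch, assuming a diploid biallelic site with PLs ordered 0/0, 0/1, 1/1 and a flat prior over genotypes; the function names are hypothetical, and which input format a given association tool expects is not covered here.

```python
def pl_to_probs(pl):
    """Turn Phred-scaled likelihoods into probabilities that sum to 1.

    Assumes a flat prior, so the result is just the rescaled likelihoods.
    """
    raw = [10 ** (-p / 10) for p in pl]
    total = sum(raw)
    return [r / total for r in raw]


def expected_alt_dosage(pl):
    """Expected number of ALT alleles for PLs ordered 0/0, 0/1, 1/1."""
    return sum(n_alt * p for n_alt, p in enumerate(pl_to_probs(pl)))


# A weakly supported heterozygous call (illustrative PL values).
pl = [30, 0, 12]
print(pl_to_probs(pl))
print(expected_alt_dosage(pl))
```

Using the expected dosage rather than the hard genotype keeps the uncertainty of low-coverage or ambiguous sites in the analysis, which is the motivation given in the post.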
Distribution of RGQ scores
I work with non-human genomes and commonly need the confidence of the reference sites, so I was happy to see the inclusion of the RGQ score in the format field of GenotypeGVCFs. However, I am a little confused as to what this score means (how it is calculated). Out of curiosity I plotted the distribution of RGQ and GQ scores over ~1Mbp. A few things jumped out that I was hoping you could explain:
(1) There are two peaks in the GQ and RGQ scores: one at 99, which is obviously just the highest confidence score, and another at exactly GQ/RGQ = 45. You can see this in the GQ/RGQ distribution below, from which I've excluded the sites where RGQ/GQ = 0 or 99 (RGQ = blue, GQ = red). Is there some reason why so many calls have a score of exactly 45?
(2) There are very few GQ = 0 calls and ~96% are GQ = 99, but for RGQ ~42% are 0 and 54% are 99. Is there any explanation for why so many RGQ scores are 0? I fear that filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls.
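The tallies described in (1) and (2) can be reproduced directly from the VCF text. A minimal sketch, assuming an uncompressed VCF in which GQ or RGQ appears in the FORMAT column; the function name is hypothetical and the two records are made up.

```python
from collections import Counter


def tally_format_field(vcf_lines, keys=("GQ", "RGQ")):
    """Count how often each integer value of the given FORMAT keys occurs."""
    counts = {key: Counter() for key in keys}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # no FORMAT/sample columns on this record
        fmt = cols[8].split(":")
        for sample in cols[9:]:
            values = dict(zip(fmt, sample.split(":")))
            for key in keys:
                value = values.get(key)
                if value not in (None, "."):
                    counts[key][int(value)] += 1
    return counts


records = [
    "chr1\t10\t.\tA\tG\t50\t.\t.\tGT:GQ\t0/1:45\n",
    "chr1\t11\t.\tA\t.\t.\t.\t.\tGT:RGQ\t0/0:0\n",
]
counts = tally_format_field(records)
print(counts["GQ"].most_common(3))
print(counts["RGQ"].most_common(3))
```

Listing the most common values this way makes spikes such as the ones at 0, 45 and 99 easy to quantify and to compare between variant and reference sites.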
Issue with HaplotypeCaller on a large chromosome (>536 Mb)
Hi,
I tried to run HaplotypeCaller in GVCF mode. My reference genome is over 5 Gb in size. Below are my command and the error:
Using GATK jar /source/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -XX:+UseSerialGC -Xmx100g -jar /source/gatk-4.0.6.0/gatk-package-4.0.6.0-local.jar HaplotypeCaller -R /data/Pseudomolecule_v3.fasta -L /IntervalFiles/0003-scattered.intervals -I WGS_FTNO.cram -O result/0003-scattered.vcf.gz -mbq 20 --native-pair-hmm-threads 4 -ERC GVCF --verbosity ERROR
[August 1, 2018 11:32:11 AM CEST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2076049408
htsjdk.samtools.SAMException: Exception creating BAM index for slice slice: seqID 1, start 536834320, span 457789, records 259850.
at htsjdk.samtools.CRAMBAIIndexer.processSingleReferenceSlice(CRAMBAIIndexer.java:194)
at htsjdk.samtools.cram.CRAIIndex.openCraiFileAsBaiStream(CRAIIndex.java:180)
at htsjdk.samtools.SamIndexes.asBaiSeekableStreamOrNull(SamIndexes.java:78)
at htsjdk.samtools.CRAMFileReader.initWithStreams(CRAMFileReader.java:228)
at htsjdk.samtools.CRAMFileReader.<init>(CRAMFileReader.java:219)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:422)
at htsjdk.samtools.SamReaderFactory.open(SamReaderFactory.java:105)
at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:227)
at org.broadinstitute.hellbender.engine.ReadsDataSource.<init>(ReadsDataSource.java:162)
at org.broadinstitute.hellbender.engine.GATKTool.initializeReads(GATKTool.java:387)
at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:636)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.onStartup(AssemblyRegionWalker.java:156)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:133)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32770
at htsjdk.samtools.CRAMBAIIndexer$BAMIndexBuilder.processSingleReferenceSlice(CRAMBAIIndexer.java:354)
at htsjdk.samtools.CRAMBAIIndexer$BAMIndexBuilder.access$100(CRAMBAIIndexer.java:227)
at htsjdk.samtools.CRAMBAIIndexer.processSingleReferenceSlice(CRAMBAIIndexer.java:192)
... 17 more
Does GATK4 handle very large single chromosomes? Is there any solution?
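One thing worth checking, although the thread does not confirm it as the cause: the BAI index format defined in the SAM specification only addresses coordinates below 2^29 (536,870,912), and the failing slice in the trace starts at 536834320 with a span of 457789, which ends just past that boundary. The sketch below lists contigs in a .fai index that exceed the limit; the function names and the contig names in the demo are hypothetical.

```python
BAI_MAX_COORDINATE = 2 ** 29  # 536,870,912; upper bound of the BAI binning scheme


def contigs_beyond_bai_limit(fai_lines):
    """Return names of contigs whose length exceeds what a BAI index can address."""
    too_long = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and int(fields[1]) > BAI_MAX_COORDINATE:
            too_long.append(fields[0])
    return too_long


def slice_crosses_limit(start, span):
    """True if a slice starting at `start` with length `span` passes the limit."""
    return start + span > BAI_MAX_COORDINATE


# Slice coordinates copied from the exception message above.
print(slice_crosses_limit(536834320, 457789))

# Made-up index lines for illustration.
print(contigs_beyond_bai_limit(["chrA\t600000000\t6\t60\t61\n", "chrB\t1000\t7\t60\t61\n"]))
```

If that turns out to be the problem, the usual directions to investigate are an index format without this limit (such as CSI) or splitting the long chromosomes into shorter pieces; whether GATK 4.0.6.0 supports either for CRAM input is not something this sketch can confirm.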
Mutect2 missed variant called by HaplotypeCaller
Hi,
I am running GATK 3.5.0 with Java version 1.8.0. I have two cell line samples that I paired with a Promega baseline reference (it's essentially a mixed germline sample) to run Mutect2 (which I am aware is not part of the Best Practices). I also ran the tumour samples alone using HaplotypeCaller and noticed a very clear ALK variant that was missed by Mutect2 but called by HaplotypeCaller in both samples. Due to the nature of the cell line we also expected to see an ALK variant, which is why it was detected.
What I find odd is that the local reassembly of Mutect2 seems to have discarded the variant, as the bamout does not contain the variant (C > T) at locus chr2:29443695, whereas the HaplotypeCaller bamout does for both samples. I have read through the documentation and the specifics of the local reassembly and would be very interested in knowing at what stage this occurs and your suggestions on what can be done.
I will be trying GATK v4.0 as well as some of the things mentioned here: https://software.broadinstitute.org/gatk/documentation/article?id=1235. In the meantime I would be very grateful if someone could look into this. I will be posting updates on my new tests as well. See details below on various metrics and IGV screenshots.
The chemistry is a DNA capture Kapa hyperplus kit, 75 paired end reads.
Sample 945
- Entire ALK covered up to 80X
- Mean/min coverage 1013/378
- BWA bam shows 50% allele frequency
HaplotypeCaller line Sample 945
- chr2 29443695 . G T 8496.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=5.863;ClippingRankSum=-0.368;DP=601;ExcessHet=3.0103;FS=0.536;MLEAC=1;MLEAF=0.500;MQ=62.46;MQRankSum=1.113;QD=14.21;ReadPosRankSum=0.502;SOR=0.76 GT:AD:DP:GQ:PL 0/1:300,298:598:99:8525,0,8240
Sample 946
- Entire ALK covered up to 80x
- Mean/min coverage 523/204
- BWA bam shows 49% allele frequency
HaplotypeCaller line Sample 946
- chr2 29443695 . G T 5056.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=3.569;ClippingRankSum=-0.212;DP=397;ExcessHet=3.0103;FS=2.133;MLEAC=1;MLEAF=0.500;MQ=63.61;MQRankSum=-1.274;QD=13.00;ReadPosRankSum=0.063;SOR=0.595 GT:AD:DP:GQ:PL 0/1:199,190:389:99:5085,0,5319
Promega control sample
- Same control sample used as pair for both 945 and 946 using Mutect
- Coverage around ALK region ~200+
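As a cross-check of the roughly 50% allele fractions reported for the two samples, the ALT fraction can be recomputed from the AD field of the HaplotypeCaller records above. A minimal sketch; the helper name is hypothetical, and it assumes a single-sample, biallelic record.

```python
def alt_allele_fraction(format_keys, sample_values):
    """Fraction of informative reads supporting the first ALT allele, from AD."""
    fields = dict(zip(format_keys.split(":"), sample_values.split(":")))
    depths = [int(d) for d in fields["AD"].split(",")]  # REF depth first, then ALT
    total = sum(depths)
    if total == 0:
        return None
    return depths[1] / total


samples = {
    "945": "0/1:300,298:598:99:8525,0,8240",
    "946": "0/1:199,190:389:99:5085,0,5319",
}
for name, values in samples.items():
    print(name, round(alt_allele_fraction("GT:AD:DP:GQ:PL", values), 3))
```

This gives about 0.498 for Sample 945 and 0.488 for Sample 946, consistent with the fractions seen in the BWA alignments.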
Please see the IGV images of the various cases below. The --bamout command (run together with disabling optimization and forcing output) was run with 500 bp of padding downstream and upstream of the target that contains the variant (i.e. the actual padding upstream and downstream of the variant itself at locus 29443695 will be slightly more than 500 bp). I also ran Mutect with the same 500 bp padding but included all the targets in chr2, without adding any padding to targets other than the one that contains the variant.
Sample945_bwaBAM - Bam output from BWA
Sample946_bwaBAM - Bam output from BWA
Sample945_GATKForcedBamOut
Sample946_GATKForcedBamOut
Sample945_MutectForcedBamOutChr2
Sample946_MutectForcedBamOutChr2
Sample945_MutectForcedBamOutALKOnly
Sample946_MutectForcedBamOutALKOnly
Thank you very much. I look forward to hearing your thoughts on this.
Sabri
HaplotypeCaller and Reduced BAMs
Hi,
I would like to ask if the most recent version of GATK is stable enough for HaplotypeCaller to work well with Reduced BAMs. If not, can you give an estimate of when that will be in place? Thanks in advance.
Release notes for GATK version 2.1
Base Quality Score Recalibration
- Multi-threaded support in the BaseRecalibrator tool has been temporarily suspended for performance reasons; we hope to have this fixed for the next release.
- Implemented support for SOLiD no call strategies other than throwing an exception.
- Fixed smoothing in the BQSR bins.
- Fixed plotting R script to be compatible with newer versions of R and ggplot2 library.
Unified Genotyper
- Renamed the per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarified the description in the VCF header.
- UG now makes use of base insertion and base deletion quality scores if they exist in the reads (output from BaseRecalibrator).
- Changed the -maxAlleles argument to -maxAltAlleles to make it more accurate.
- In pooled mode, if haplotypes cannot be created from given alleles when genotyping indels (e.g. too close to contig boundary, etc.) then do not try to genotype.
- Added improvements to indel calling in pooled mode: we compute per-read likelihoods in reference sample to determine whether a read is informative or not.
Haplotype Caller
- Added LowQual filter to the output when appropriate.
- Added some support for calling on Reduced Reads. Note that this is still experimental and may not always work well.
- Now does a better job of capturing low frequency branches that are inside high frequency haplotypes.
- Updated VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller.
- Made fixes to the likelihood based LD calculation for deciding when to combine consecutive events.
- Fixed bug where non-standard bases from the reference would cause errors.
- Better separation of arguments that are relevant to the Unified Genotyper but not the Haplotype Caller.
Reduce Reads
- Fixed bug where reads were soft-clipped beyond the limits of the contig and the tool was failing with a NoSuchElement exception.
- Fixed divide by zero bug when downsampler goes over regions where reads are all filtered out.
- Fixed a bug where downsampled reads were not being excluded from the read window, causing them to trail back and get caught by the sliding window exception.
Variant Eval
- Fixed support in the AlleleCount stratification when using the MLEAC (it is now capped by the AN).
- Fixed incorrect allele counting in IndelSummary evaluation.
Combine Variants
- Now outputs the first non-MISSING QUAL, instead of the maximum.
- Now supports multi-threaded running (with the -nt argument).
Select Variants
- Fixed behavior of the --regenotype argument to do proper selecting (without losing any of the alternate alleles).
- No longer adds the DP INFO annotation if DP wasn't used in the input VCF.
- If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the output VC (since they are no longer accurate).
Miscellaneous
- Updated and improved the BadCigar read filter.
- GATK now generates a proper error when a gzipped FASTA is passed in.
- Various improvements throughout the BCF2-related code.
- Removed various parallelism bottlenecks in the GATK.
- Added support of X and = CIGAR operators to the GATK.
- Catch NumberFormatExceptions when parsing the VCF POS field.
- Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions.
- Fixed AlignmentUtils bug for handling Ns in the CIGAR string.
- We now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them.
- Added support for handling complex events in ValidateVariants.
- Picard jar remains at version 1.67.1197.
- Tribble jar remains at version 110.
Error in Haplotype Caller
Hi,
I am trying to run the latest version (GenomeAnalysisTK-2.0-35-g2d70733) of the HaplotypeCaller on some .bam files that I had prepared according to the Best Practice v.3. Now GATK reports the following error:
ERROR ------------------------------------------------------------------------------------------
ERROR stack trace
java.lang.IllegalArgumentException: Duplicate allele added to VariantContext: T
at org.broadinstitute.sting.utils.variantcontext.VariantContext.makeAlleles(VariantContext.java:1328)
at org.broadinstitute.sting.utils.variantcontext.VariantContext.<init>(VariantContext.java:304)
at org.broadinstitute.sting.utils.variantcontext.VariantContextBuilder.make(VariantContextBuilder.java:518)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.generateVCsFromAlignment(GenotypingEngine.java:604)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.GenotypingEngine.assignGenotypeLikelihoodsAndCallIndependentEvents(GenotypingEngine.java:198)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:414)
at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:104)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegion(TraverseActiveRegions.java:246)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.callWalkerMapOnActiveRegions(TraverseActiveRegions.java:202)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.processActiveRegions(TraverseActiveRegions.java:177)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:134)
at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:27)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:62)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:269)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:93)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.0-35-g2d70733):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Duplicate allele added to VariantContext: T
ERROR ------------------------------------------------------------------------------------------
Now I am assuming my old bam files are not compatible with the new HaplotypeCaller. Is that correct?
Thank you for your help,
K
Does HaplotypeCaller use a lot of RAM?
I tried to run several HaplotypeCaller jobs simultaneously on one node. Sometimes, but not always, they caused the node to crash. I noticed that HaplotypeCaller sometimes occupied tens of GB of memory. Is this normal? The average depth for my samples is 30x-50x.
haplotypecaller indel format
Hi there,
I've finished a run of HaplotypeCaller on my samples. I'm now analysing everything with snpEff, although I'm doing this "outside" GATK. I had to stop the analysis because of a huge number of errors, all dealing with indels, such as:
Error while processing VCF entry (line 580649) :
chr21 26718345 . TAATCCTGAGTTTAA TATCCTAAATGTTTAC 943.26 […]
java.lang.RuntimeException: Insertion '-A+AT' does not start with '+'. This should never happen!
chr21 35260360 . CATAACAGTTCAT AGAGACAGAG 425.22 […]
java.lang.RuntimeException: Deletion '+G-TTC' does not start with '-'. This should never happen!
Of course, this is a snpEff error. Nevertheless, the indel format looks quite different from anything I've seen before. Consider the first record above: shouldn't it be like
chr21 26718345 . AT T 943.26 […]
(I can't resolve the second right now).
Any hint is appreciated at this point. I'm also writing to the snpEff developer about this.
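To see whether the two records reduce to simple indels, the REF and ALT alleles can be trimmed of any bases they share at either end. The sketch below is a generic allele-trimming routine, not something GATK or snpEff is known to run; the function name is hypothetical, and it keeps at least one base in each allele as VCF requires.

```python
def trim_alleles(pos, ref, alt):
    """Strip bases shared by REF and ALT, suffix first, then prefix.

    At least one base is kept in each allele, and POS is advanced by the
    number of leading bases removed.
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The two records quoted in the snpEff error messages.
print(trim_alleles(26718345, "TAATCCTGAGTTTAA", "TATCCTAAATGTTTAC"))
print(trim_alleles(35260360, "CATAACAGTTCAT", "AGAGACAGAG"))
```

For the first record only the leading "TA" is shared, leaving ATCCTGAGTTTAA versus TCCTAAATGTTTAC at position 26718347; the second record shares nothing at either end. Neither reduces to a pure insertion or deletion, which suggests these are complex substitutions rather than the simple AT to T change proposed above.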
HaplotypeCaller with Queue
Is there an example of a Queue script that calls variants with HaplotypeCaller?
thanks!
Francesco
HaplotypeCaller --fullHaplotype not outputting full haplotype
I am also having a problem with HaplotypeCaller - when I set the flag to print out full haplotypes, no haplotype files are created. Here is my command:
java -jar /home/chodon/applications/GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R /home/chodon/comb_eny/abyss/final_contigs/A2/Assembly.fasta -I /home/chodon/comb_eny/abyss/final_contigs/A2/bowtie2_out/A2.sorted.bam --fullHaplotype --ignoreLaneInfo -o /home/chodon/comb_eny/abyss/final_contigs/A2/GATK_haplo2.out
Any help is much appreciated! Thanks, C
Expected file size - Haplotype Caller
Hi All,
I've been attempting to use the haplotype caller on my 50x coverage exome data. The bam being parsed is about 12G. Each time, the caller runs for many hours and then the output is only the header of the VCF - no errors seen. I'm wondering if this is due to limited space on my drives or if the expected file size is much larger than I am anticipating.
Command:
GenomeAnalysisTK.jar -T HaplotypeCaller -R Homo_sapiens_assembly19.fasta -I input.bam --dbsnp dbsnp_132.b37.nochr.vcf -stand_call_conf 30 -stand_emit_conf 10 -o output.Haplotypes.vcf
HaplotypeScore not annotated?
Hi there,
I'm running into an issue with my Queue pipeline, with variants called with HaplotypeCaller.
Once I get the raw VCF file, I use VariantAnnotator before VQSR: however, no HaplotypeScore annotation is being produced, resulting in a failure of the VariantRecalibrator step where 'HaplotypeScore' was indicated as an annotation.
I tried to correct the issue by indicating to VariantAnnotator to use all annotations
class AnnotationArguments (t: Target) extends VariantAnnotator with UNIVERSAL_GATK_ARGS {
  this.reference_sequence = qscript.referenceFile
  // Set the memory limit to 7 gigabytes on each command.
  this.memoryLimit = 7
  this.input_file :+= qscript.bamFile
  // A bare `this.useAllAnnotations` only reads the field; assign it so the flag is actually set.
  this.useAllAnnotations = true
  this.D = qscript.dbSNP_b37
}
But I still can't get that annotation in the annotated VCF. Here is an example of a variant:
AC=5;AF=0.078;AN=64;ActiveRegionSize=179;ClippingRankSum=-0.568;DB;DP=2025;EVENTLENGTH=0;FS=4.139;InbreedingCoeff=-0.0847;MLEAC=5;MLEAF=0.078;MQ=69.98;MQRankSum=-1.428;NVH=1;NumHapAssembly=15;NumHapEval=13;QD=17.20;QDE=17.20;ReadPosRankSum=-1.762;TYPE=SNP;extType=SNP
Any suggestions on what I might be doing wrong?
thanks very much for your help,
Francesco
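A quick way to confirm which annotations a record actually carries, before VariantRecalibrator is asked to use them, is to parse the INFO column. A minimal sketch with a hypothetical helper name, applied to the INFO string quoted in the post above.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag entries (no '=') map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


info = (
    "AC=5;AF=0.078;AN=64;ActiveRegionSize=179;ClippingRankSum=-0.568;DB;DP=2025;"
    "EVENTLENGTH=0;FS=4.139;InbreedingCoeff=-0.0847;MLEAC=5;MLEAF=0.078;MQ=69.98;"
    "MQRankSum=-1.428;NVH=1;NumHapAssembly=15;NumHapEval=13;QD=17.20;QDE=17.20;"
    "ReadPosRankSum=-1.762;TYPE=SNP;extType=SNP"
)
annotations = parse_info(info)
wanted = ["QD", "FS", "MQRankSum", "ReadPosRankSum", "HaplotypeScore"]
print({name: name in annotations for name in wanted})
```

Running the same check over a whole VCF shows at a glance which of the annotations listed for the recalibration step are present in every record and which are missing.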
Variant Recalibration - Number of Whole Exome Samples Needed and Where?
Hello,
I've just made a long needed update to the most recent version of GATK. I had been toying with the variant quality score recalibrator before but now that I have a great deal more exomes at my disposal I'd like to fully implement it in a meaningful way.
The phrase I'm confused about is "In our testing we've found that in order to achieve the best exome results one needs to use an exome callset with at least 30 samples." How exactly do I arrange these 30+ exomes?
Is there any difference or reason to choose one of the following two workflows over the other?
(1) Input 30+ exomes in the "-I" argument of either the UnifiedGenotyper or HaplotypeCaller, then perform the variant recalibration procedure on my multi-sample VCF, and then split the individual call sets out of the multi-sample VCF with SelectVariants?
(2) Take 30+ individual VCF files, merge them together, perform variant recalibration on the merged VCF, and then split the individual call sets out of the multi-sample VCF with SelectVariants?
(3) Or some third option I'm missing?
Any help is appreciated.
Thanks
EMIT_ALL_CONFIDENT_SITES for bacteria
Hello,
I am running the variant caller to identify SNPs and Reference Calls in a bacterial genome, which means I am running with -ploidy 1, -glm POOLSNP and -prnm POOL as suggested in other regions of this forum. The tool does an excellent job when just looking for Variants, but when I attempt to EMIT_ALL_CONFIDENT_SITES, it just spits out the SNPs and not the reference calls. When I remove the arguments stating that it is ploidy of 1, it works fine but calls SNPs that shouldn't be there since it's assuming diploid. Thus, I would really like to be able to emit all sites in ploidy=1 mode. Any reason why this is not possible?
Thanks for your help!
John
haplotypecaller doesn't emit genotype at all sites
I've tried to get genotypes for all sites in my interval file using HaplotypeCaller. With UnifiedGenotyper, I can get this result by setting "output_mode EMIT_ALL_SITES", but HaplotypeCaller doesn't report all sites as expected with "output_mode EMIT_ALL_SITES". Even when I set "genotypeFullActiveRegion" or "fullHaplotype", HaplotypeCaller doesn't seem to emit genotypes at all sites. How can I get the desired result with HaplotypeCaller?
Thanks!