FILTER column in vcf file from HaplotypeCaller

October 28, 2013, 8:30 am

I ran HaplotypeCaller on a cohort. Some of the values in the FILTER column of the resulting VCF file are "."; what does that mean?

Here is one example (not all columns included):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER
22      16084134        .       A       C       152.03  .
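As a point of reference, the VCF specification uses "." as the missing-value marker, and in the FILTER column it indicates that no filters have been applied to the record yet (as opposed to PASS or a semicolon-separated list of failed filters). A minimal sketch of that interpretation follows; the function name `describe_filter` is hypothetical and not part of GATK.

```python
# Sketch: interpret the FILTER column (7th column) of a VCF data line.
# describe_filter is a hypothetical helper, not a GATK or htsjdk function.

def describe_filter(vcf_line: str) -> str:
    fields = vcf_line.split()
    filter_value = fields[6]  # columns: CHROM POS ID REF ALT QUAL FILTER
    if filter_value == ".":
        return "unfiltered (no filters applied yet)"
    if filter_value == "PASS":
        return "passed all applied filters"
    # Otherwise FILTER lists the names of the filters that failed.
    return "failed: " + ", ".join(filter_value.split(";"))


record = "22 16084134 . A C 152.03 ."
print(describe_filter(record))  # prints "unfiltered (no filters applied yet)"
```

Under this reading, the record in the example has simply not been through any filtering step.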

↧

In haplotype variant caller for the Influenza virus, do I need to mention or remove any parameters?

November 7, 2017, 2:33 pm

I have sequenced influenza virus and am interested in finding variants (SNVs and indels). I am planning to use HaplotypeCaller.

↧

Possible inconsistency in GATK 4.beta.6 source code

November 7, 2017, 3:06 pm

Hi GATK Team,

We are porting GATK4 to run on GPUs. We have found an inconsistency in the behavior of GATK 4.beta.6 in clipRead() functionality in ReadClipper.java while using HaplotypeCaller.

If the read does not require clipping (ops == null), clipRead() returns the original read; otherwise it returns a clipped copy. This leads to inconsistent behavior for callers of this function, such as finalizeRegions() in AssemblyBasedCallerUtils.java: sometimes the clippedRead variable in that function is a copy of the original read, and sometimes it is the original read itself. The base qualities of clippedRead are sometimes modified later in the function, and if clipRead() returned the original read, that modification changes the original read. If the original read is then used in another assembly region, it carries the adjusted quality scores from the previous region. For reads where copies are created, by contrast, changes do not propagate from one region to another.

In our test cases that found this issue, the base qualities of the same read were different in different regions at the start of the processing of the regions. This behavior can impact the final vcf output.
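The aliasing pattern described above can be reproduced in a few lines. The following is a toy Python analogue, not the GATK code: `Read`, `clip_read`, `clip_read_always_copy` and `finalize_region` are hypothetical stand-ins, and the "quality adjustment" is reduced to overwriting one value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Read:
    name: str
    quals: List[int]


def clip_read(read: Read, ops: Optional[int]) -> Read:
    """Return the SAME object when nothing needs clipping, else a clipped copy."""
    if ops is None:
        return read
    return Read(read.name, read.quals[ops:])


def clip_read_always_copy(read: Read, ops: Optional[int]) -> Read:
    """Variant that always returns a fresh object, clipped or not."""
    start = 0 if ops is None else ops
    return Read(read.name, list(read.quals[start:]))


def finalize_region(read: Read, ops: Optional[int],
                    clipper: Callable[[Read, Optional[int]], Read]) -> Read:
    clipped = clipper(read, ops)
    clipped.quals[0] = 0  # stand-in for a later base-quality adjustment
    return clipped


shared = Read("r1", [30, 30, 30])
finalize_region(shared, None, clip_read)
print(shared.quals)  # the original read was modified through the alias

safe = Read("r2", [30, 30, 30])
finalize_region(safe, None, clip_read_always_copy)
print(safe.quals)  # the original read is untouched
```

In this sketch, always returning a copy makes the behavior uniform at the cost of an extra allocation per read; whether that is the right trade-off in the real code is for the developers to judge.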

Please let us know if this is the intended behavior. We would be happy to help with a minimal test case if required.

-- Ankit Sethia
Parabricks

↧

HaplotypeCaller does not filter duplicate reads, why?

November 9, 2017, 6:56 am

Hi,
I'm running HaplotypeCaller on a server this way:
java -XX:ParallelGCThreads=8 -Xmx80g -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I a2tl1_14_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o a2tl1_14_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350

I cannot figure out why no duplicate reads are being filtered out, although they were marked by Picard (with the option TAGGING_POLICY=All) and I also see around 20% duplicates in the corresponding samtools flagstat output.

The beginning of the stdout looks like this:

INFO 04:37:54,634 HelpFormatter - --------------------------------------------------------------------------------
INFO 04:37:54,986 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
INFO 04:37:54,986 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 04:37:54,986 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO 04:37:54,986 HelpFormatter - [Tue Nov 07 04:37:54 CET 2017] Executing on Linux 3.16.0-4-amd64 amd64
INFO 04:37:54,986 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27
INFO 04:37:54,990 HelpFormatter - Program Args: -T HaplotypeCaller -I h2al1_21_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o h2al1_21_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350
INFO 04:37:55,002 HelpFormatter - Executing as vlkofly@zigur17.cerit-sc.cz on Linux 3.16.0-4-amd64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27.
INFO 04:37:55,003 HelpFormatter - Date/Time: 2017/11/07 04:37:54
INFO 04:37:55,003 HelpFormatter - --------------------------------------------------------------------------------
INFO 04:37:55,003 HelpFormatter - --------------------------------------------------------------------------------
WARN 04:37:55,009 GATKVCFUtils - Creating Tabix index for h2al1_21_HC1.g.vcf.gz, ignoring user-specified index type and parameter
INFO 04:37:55,237 GenomeAnalysisEngine - Strictness is SILENT
INFO 04:37:56,044 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 04:37:56,051 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
WARNING: BAM index file /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bai is older than BAM /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bam
INFO 04:37:56,221 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.17
INFO 04:37:56,244 HCMappingQualityFilter - Filtering out reads with MAPQ < 25
INFO 04:37:56,289 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 8 processors available on this machine
INFO 04:37:57,903 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 04:37:58,090 GenomeAnalysisEngine - Done preparing for traversal
INFO 04:37:58,090 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 04:37:58,091 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 04:37:58,091 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 04:37:58,091 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO 04:37:58,092 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output
WARN 04:37:58,411 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO 04:37:58,510 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 04:37:58,511 PairHMM - Performance profiling for PairHMM is disabled because the program is being run with multiple threads (-nct>1) option

And the info lines showing no duplicates removed:

INFO 16:53:05,155 ProgressMeter - Total runtime 130507.06 secs, 2175.12 min, 36.25 hours
INFO 16:53:05,155 MicroScheduler - 46705813 reads were filtered out during the traversal out of approximately 149962396 total reads (31.15%)
INFO 16:53:05,155 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 16:53:05,156 MicroScheduler - -> 13334530 reads (8.89% of total) failing BadMateFilter
INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 16:53:05,156 MicroScheduler - -> 31823278 reads (21.22% of total) failing HCMappingQualityFilter
INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 16:53:05,157 MicroScheduler - -> 1548005 reads (1.03% of total) failing NotPrimaryAlignmentFilter
INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
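One way to narrow this down is to check whether the duplicate bit (0x400 in the SAM FLAG field) is actually set on the reads in the BAM that HaplotypeCaller receives. The sketch below counts flagged reads in SAM-formatted text; it assumes the duplicate filter keys on that flag bit rather than on any optional tag, and `count_duplicates` and the example records are hypothetical.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def count_duplicates(sam_lines):
    """Return (duplicate_reads, total_reads) for an iterable of SAM text lines."""
    duplicates = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])  # FLAG is the 2nd column
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return duplicates, total


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t1123\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
]
print(count_duplicates(example))  # prints (1, 2)
```

The same function could be fed the text output of `samtools view` for the exact BAM named in the command line, to confirm that it is the file in which duplicates were marked.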

↧

HaplotypeCaller BP_RESOLUTION: more AD values than alleles called for

August 1, 2016, 11:38 pm

My intention is to find the different bases called at a particular chromosome location, irrespective of whether it is assigned as a SNP or a bad base. I used the command below:
java -jar 3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R Reference.fa -I Sample1.bam -o Sample1.BPR.vcf -ERC BP_RESOLUTION -L 1

I got the result I intended, but I am confused by it. For example:

1 1222274 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,1:93:99:0,120,1800
1 8333303 . A AG,AT,G,T,<NON_REF> 0 . BaseQRankSum=0.913;ClippingRankSum=-0.141;DP=57;ExcessHet=3.0103;MLEAC=0,0,0,0,0;MLEAF=0.00,0.00,0.00,0.00,0.00;MQRankSum=0.445;RAW_MQ=205200.00;ReadPosRankSum=-0.85 GT:AD:DP:GQ:PL:SB 0/0:39,2,5,2,2,0:50:36:0,79,1151,36,1041,1086,88,985,919,1611,88,985,919,1546,1611,110,942,900,957,957,928:10,29,3,8

Why are there more AD values than alleles called? Please note that I am working with an RNA-seq dataset after BQSR.
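A quick way to sanity-check the counts: AD carries one value per allele (REF plus every ALT), and PL carries one value per possible genotype. The sketch below assumes that the ALT column of the second record ends with the symbolic `<NON_REF>` allele that reference-confidence output carries, i.e. five ALT alleles in total, which matches the five MLEAC values; `expected_counts` is a hypothetical helper.

```python
from math import comb


def expected_counts(alt_field: str, ploidy: int = 2):
    """Return (number of AD values, number of PL values) expected for a record."""
    n_alleles = 1 + len(alt_field.split(","))  # REF plus each ALT allele
    # Number of unordered genotypes of the given ploidy over n_alleles alleles.
    n_genotypes = comb(n_alleles + ploidy - 1, ploidy)
    return n_alleles, n_genotypes


alt = "AG,AT,G,T,<NON_REF>"  # assumed full ALT column of the second record
ad = "39,2,5,2,2,0"
pl = "0,79,1151,36,1041,1086,88,985,919,1611,88,985,919,1546,1611,110,942,900,957,957,928"

print(expected_counts(alt))                   # prints (6, 21)
print(len(ad.split(",")), len(pl.split(",")))  # prints 6 21
```

Under that assumption the six AD values correspond to A, AG, AT, G, T and `<NON_REF>`, so the record is internally consistent even though the genotype is 0/0.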

↧

HC not calling variants at the edges after clipping probes

November 14, 2017, 7:41 am

Hi,
I am facing a strange issue with GATK 3.4. I have a set of PE fastq files. I first ran the variant calling pipeline using the following steps:

bwa mem --> sam to bam --> mark duplicates using picard --> RealignerTargetCreator GATK 3.4 --> IndelRealigner --> HaplotypeCaller --> GenotypeGVCFs

Then I saw that the probe sequences that were used to design the region of interest (it's a custom-designed panel) overlapped with some of the variants of interest. As a result, the genotypes of some of these variants were 0/1 (due to the probe sequences at those positions) instead of the true 1/1 (I am using a gold standard sample).

So I clipped the first 27 bases of each read (the probes are all in the range of 22-31 nt, with 27 being the most common length) and ran the same pipeline again, but some of the variants were not called even though all reads carry those variants.

I am attaching a snapshot of the BAM file alignment with the variant in orange in the center. The top track is the first case (before clipping the probes), where the upper block holds the probes, which match the reference and lack the variant, and the lower block holds the inserts that carry the variant, hence the 0/1 call.

https://us.v-cdn.net/5019796/uploads/editor/dv/fagssh9xfzxb.png

The lower track is after clipping the probe sequences, so only the inserts with the variant are left. But the variant is still not called in the VCF file.

I read in another thread that this happens because the tool needs 50 bp on either side to do proper reassembly. Is that correct? If so, how do I get around it? Should I skip indel realignment? Any other suggestions to solve this issue?

Thanks a bunch in advance!!

↧

Bug in HaplotypeCaller: GT is called "./.", but AD and DP aren't 0

November 22, 2017, 6:48 pm

Hi, I'd like to report a weird result from HaplotypeCaller.
We have a patient sample from targeted sequencing. We expected to see no heterozygous variants called at this locus. HaplotypeCaller found the insertion, but the GT is "./." and the other information is missing.
I'm therefore confused about why the substitution is called "./." by HaplotypeCaller and why it passed the filter.

Many Thanks
Minghui

↧

HaplotypeCaller 4.beta.6 gVCF performance

December 1, 2017, 3:06 am

Hi, ever since the 4.beta.4 release, I've noticed a significant increase in the memory requirements and execution time of HaplotypeCaller in gVCF mode. I tested the 4.beta.2 and 4.beta.6 versions of HaplotypeCaller with an NA12878 BAM with approximately 30x coverage, aligned with BWA 0.7.13. 4.beta.2 completed after roughly 5h with 2GB of memory, while 4.beta.6 completed after roughly 30h with 15GB of memory. 4.beta.6 failed with an out-of-memory exception when given less memory.

Both versions were run with the same settings (--interval_set_rule UNION --genotyping_mode DISCOVERY --createOutputVariantIndex --emitRefConfidence GVCF) and parallelized on intervals from a custom BED file.

From my understanding of the release notes, the versions from 4.beta.4 onwards have a bug fix that corrects the results of HaplotypeCaller in gVCF mode. Is the performance difference to be expected?

Thank you,
Teodora

↧

Minor Allele Frequency filter in GATK

December 3, 2017, 11:27 pm

Hi all,
I'm working on a resequencing dataset that contains 60 individuals from 5 different populations. I used HaplotypeCaller for variant calling and, after applying hard filters, have all individuals in one big dataset. Can I apply a minor allele frequency filter in GATK to one specific population, given that I have 5 different populations in one file?
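To make the question concrete, the per-population quantity can be sketched as follows: restrict a site's genotypes to the samples of one population and compute the minor allele frequency from those calls only. This is an illustration of the calculation for a biallelic site, not a GATK feature; the function names, sample names and population assignments are all hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF for a biallelic site from GT strings such as '0/1'; missing calls are skipped."""
    ref = alt = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref += 1
            elif allele == "1":
                alt += 1
    called = ref + alt
    if called == 0:
        return None  # no called alleles in this subset
    return min(ref, alt) / called


def maf_for_population(site_genotypes, population_samples):
    """Compute MAF using only the samples that belong to one population."""
    subset = [site_genotypes[s] for s in population_samples if s in site_genotypes]
    return minor_allele_frequency(subset)


# Made-up genotypes at one site, keyed by sample name.
site = {"s1": "0/1", "s2": "0/0", "s3": "1/1", "s4": "./."}
print(maf_for_population(site, ["s1", "s2"]))  # prints 0.25
print(maf_for_population(site, ["s3", "s4"]))  # prints 0.0
```

The same site can pass a frequency threshold in one population and fail it in another, which is why the subset has to be chosen before the frequency is computed.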

↧

Insert length filtering problem

December 4, 2017, 8:43 am

Hello, I'm following the HaplotypeCaller pipeline for SNP and indel calling, but I'm facing the following problem. The sequence I'm analyzing has two copies of the same gene in reverse orientation, separated by a sequence of approximately 2 kb. When I map the reads from the FASTQ files, some of the reads map to the wrong copy of the gene, which is quite obvious because the insert length is much greater than the expected insert length.

I tried to filter the SAM files using custom bash scripts on the 9th column (the insert length column in the SAM file), but since the alignment is done with bwa mem, all of the values in the 9th column are set to zero. I also tried the GATK MaxInsertSizeFilter read filter, but it didn't seem to influence the output of HaplotypeCaller.

I am aware that HaplotypeCaller realigns the reads when necessary and also determines the likelihoods of the haplotypes, but in my case I seem to be missing some of the SNPs in the final VCF file, and I'm pretty convinced it has to do with the wrong mapping of the reads. Does anybody have any idea how I can resolve this? I would really appreciate any help.
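For the column-based filtering attempt described above, a small sketch may help pin down what is being tested. It reads TLEN from the 9th SAM column and, when that value is 0, falls back to the start-to-start distance between a read and its mate (columns 4 and 8) as a rough proxy, provided both map to the same reference. The fallback is an assumption made for illustration, and `mate_distance`, `keep_read` and the example records are hypothetical.

```python
def mate_distance(sam_line: str):
    """Return |TLEN|, or a start-to-start proxy when TLEN is 0; None if unknown."""
    fields = sam_line.rstrip("\n").split("\t")
    pos = int(fields[3])    # POS: 1-based leftmost mapping position
    rnext = fields[6]       # RNEXT: "=" means mate is on the same reference
    pnext = int(fields[7])  # PNEXT: mate's mapping position
    tlen = int(fields[8])   # TLEN: observed template length (0 if unavailable)
    if tlen != 0:
        return abs(tlen)
    if rnext == "=" and pnext > 0:
        return abs(pnext - pos)  # rough proxy, ignores read lengths
    return None


def keep_read(sam_line: str, max_distance: int) -> bool:
    distance = mate_distance(sam_line)
    return distance is not None and distance <= max_distance


# Made-up records for illustration only.
near = "r1\t99\tchr1\t100\t60\t4M\t=\t300\t0\tACGT\tIIII"
far = "r2\t99\tchr1\t100\t60\t4M\t=\t5100\t5004\tACGT\tIIII"
print(keep_read(near, 1000), keep_read(far, 1000))  # prints True False
```

A threshold somewhat below the roughly 2 kb spacer mentioned in the post would be the natural starting point, but the right value depends on the library's actual insert-size distribution.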

↧

Conversion of vcf to gvcf

December 4, 2017, 3:03 pm

I am trying to combine and convert two VCF 4.1 files (a SNP VCF and an INDEL VCF) to VCF 4.2 (which, if I understood correctly, is the same thing as gVCF). The resulting file would then be used as input to a third-party analysis software. The files are based on the GRCh37 reference genome.

However, despite searching the forum, I was unable to find a solution that would directly combine and convert both files into gVCF. Therefore, one possibility would be to first merge the two files with CombineVariants, second, convert the resulting file to a BAM file through SimulateReadsForVariants and, third, to derive the gVCF file through HaplotypeCaller.

Would this approach work or would you rather suggest a different and maybe simpler approach?

Thanks

↧

Detecting MNP variants with HaplotypeCaller

April 16, 2014, 3:31 am

Hi!

I used HaplotypeCaller 3.1 to detect MNP variants.
java -Xmx6g -Djava.io.tmpdir=$PWD -jar $GATK -R $hg19 -T HaplotypeCaller -I $bamlist --dbsnp $dpsnp135 -o $call/$sample.all.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 200 -nct $nct -A SpanningDeletions -A TandemRepeatAnnotator -A HomopolymerRun -A AlleleBalance -l INFO -baqGOP 30 --max_alternate_alleles 2 -rf BadCigar --minPruning 5

The result had no MNP-type variants, but I can find some consecutive SNP variants. These SNP variants were not combined into MNP variants.
Is this normal, or did I really not get any MNP variants?
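To see how many candidate sites are involved, one can list runs of SNPs at directly consecutive positions in the output. The sketch below does only that; it does not check whether the alternate alleles sit on the same haplotype, which would be needed before treating a run as a true MNP. `adjacent_snp_runs` and the example records are hypothetical.

```python
def adjacent_snp_runs(records):
    """Group position-sorted (chrom, pos, ref, alt) records into runs of 2+ adjacent SNPs."""
    runs = []
    current = []
    for record in records:
        chrom, pos, ref, alt = record
        is_snp = len(ref) == 1 and len(alt) == 1
        extends = (
            is_snp
            and current
            and current[-1][0] == chrom
            and current[-1][1] + 1 == pos
        )
        if extends:
            current.append(record)
        else:
            if len(current) >= 2:
                runs.append(current)
            current = [record] if is_snp else []
    if len(current) >= 2:
        runs.append(current)
    return runs


# Made-up calls for illustration only.
calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 101, "C", "T"),
    ("chr1", 150, "G", "A"),
    ("chr1", 200, "T", "TA"),
    ("chr1", 201, "C", "G"),
]
for run in adjacent_snp_runs(calls):
    print([pos for _, pos, _, _ in run])  # prints [100, 101]
```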

↧

Why is HaplotypeCaller slower in the most recent GATK4 beta versions?

December 6, 2017, 8:40 am

Because it's saving its strength for the 4.0 general release

Many of the "early adopters" who have been testing out the GATK4 during its beta phase have pointed out that they saw significant speed improvements in early beta versions (yay!), but then when they upgraded to more recent betas (starting with 4.beta.4), they observed a return to the slowness seen in GATK3 versions (boo!). This has understandably caused some concern to those who were attracted to the GATK4 beta version of HaplotypeCaller because of its promised speed improvements -- so, basically everyone.

The good news is that this is only a temporary artifact of some of our development and evaluation constraints, which forced us to remove some key improvements while we refine and evaluate the equivalence of results with the older version. We should be able to restore the HaplotypeCaller's speed improvements in the very near future -- in time for the GATK 4.0 release planned for January 9, 2018.

If you're interested in understanding why we had to hobble the HaplotypeCaller in this way, please read on! Otherwise feel free to take our word for it.

There are two opposing forces in play when we migrate tools from the older GATK to the new 4.x framework. One is that we want to streamline the program's operation to make it run faster and cheaper. The other is that we have been asked by our internal stakeholders to produce an exact "tie-out" for the germline variant discovery pipeline that we run in production at the Broad (i.e. for a subset of tools including HaplotypeCaller). This means that the HaplotypeCaller we release in GATK 4.0 needs to produce exactly the same output (modulo some margins) as the one from version 3.8, to minimize disruption when the pipelines are migrated. That's a very high standard, and it's the right thing to do both from an operations standpoint and from a software engineering standpoint.

However, these two directives came into conflict because we realized, somewhere in the early beta stages, that some of the optimizations that were introduced to make HaplotypeCaller faster also created output differences that were outside of the acceptable margins. We believe that those differences may actually be improvements on the "old" results, but for the sake of the tie-outs we had to take them out temporarily -- hence the HaplotypeCaller went back to being slower than we'd like in the later beta releases.

We're confident we have a solution that will allow us to put the efficiency optimizations back in as soon as the final tie-out test results have been approved, which appears to be imminent. So by the time GATK4 is released into general availability in January, the new HaplotypeCaller should have all its superpowers back.

↧

What is a good number of samples that can be used to detect a variant - I have 15K GVCFs with 1000DP

December 9, 2017, 1:18 pm

Hi,

I have 15k GVCFs. To call variants, I understand I can run the CombineGVCFs step to combine the GVCFs in batches. I would like to know what a good sample-set size is, for BAMs with coverage of over 800-1000X, to detect a variant. Would variants called from batches of 500 samples have the same power to detect a variant as a call across all 15k samples together?

↧

Phantom indels from HaplotypeCaller?

December 12, 2017, 1:16 pm

Dear GATK users and developers,

I am running HaplotypeCaller followed by ValidateVariants, and the latter complains about variants that list an alternative allele without any observation of it.

ERROR MESSAGE: File /storage/rafal.gutaker/NEXT_test/work/4f/6f8738a66d1c9d12651b76b7ef8819/IRIS_313-15896.g.vcf fails strict validation: one or more of the ALT allele(s) for the record at position LOC_Os01g01010:6190 are not observed at all in the sample genotypes |

ERROR ------------------------------------------------------------------------------------------

Here is an example of a site that ValidateVariants complains about:

LOC_Os01g01010 6190 . GT G,<NON_REF> 0 . DP=4;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0.00,0.00;RAW_MQ=14400.00 GT:AD:DP:GQ:PL:SB 0/0:4,0,0:4:12:0,12,135,12,135,135:4,0,0,0
LOC_Os01g01010 6192 . T <NON_REF> . . END=6192 GT:DP:GQ:MIN_DP:PL 0/0:8:0:8:0,0,254

In general, this does not seem dangerous, so I am thinking of removing this check, but why HaplotypeCaller is finding phantom variants is a mystery to me.
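The condition in the error message can be checked by hand: collect the allele indices used in the sample genotypes and compare them with the ALT list. The sketch below does this for the first example record, assuming its ALT column is `G,<NON_REF>` and skipping the symbolic allele; it mirrors the wording of the message and is not the ValidateVariants implementation. `unobserved_alts` is a hypothetical helper.

```python
def unobserved_alts(alt_field: str, gt_fields):
    """Return ALT alleles whose index never appears in any sample genotype."""
    alts = alt_field.split(",")
    seen = set()
    for gt in gt_fields:
        for allele in gt.replace("|", "/").split("/"):
            if allele.isdigit():  # skips missing alleles written as "."
                seen.add(int(allele))
    return [
        alt
        for index, alt in enumerate(alts, start=1)  # index 0 is REF
        if index not in seen and alt != "<NON_REF>"
    ]


print(unobserved_alts("G,<NON_REF>", ["0/0"]))  # prints ['G']
```

For the example site, the deletion allele G is listed in ALT while the only genotype is 0/0 with AD 4,0,0, which is exactly the situation the strict validation flags.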

Thank you and

Best!
Rafal

↧

9 Things You've Been Dying To Know About The HaplotypeCaller Paper

December 12, 2017, 8:07 pm

Q: What, there's a HaplotypeCaller paper?

A: Yes! We are super pumped to announce the long-awaited release of The HaplotypeCaller Paper -- or rather, the preprint in bioRxiv. (Actually we announced it on Twitter a while back but we understand not everyone enjoys such an old-school way of keeping up with the news). Hopefully you’re as excited as we are, if not more so, but we understand that this probably raises a few questions for some of you, so we tried to address some of those below.

Q: Why did it take so long?!

A: Our mission is to develop the tools that get used by others to do groundbreaking scientific research. Benchmarking and validation are important parts of our prototyping and development cycle, but given that we’re not subject to the “publish or perish” culture of a research lab, submitting manuscripts presenting those results wasn’t a high priority for us.

Q: Are you going to submit it to a peer-reviewed journal?

A: Probably not.

Q: Why not?

A: Our main motivation for posting the HaplotypeCaller manuscript to bioRxiv was to provide something recent/reasonable to cite and to make more details of the methods public. Submitting to a peer-reviewed journal usually involves a lot of time working on revisions that we’d rather put towards working on further improvements to the tools.

Q: Is it still a preprint if it's never intended to go to print?

A: You tell us.

Q: What version of HaplotypeCaller does the paper describe?

A: The paper describes the GATK 3.4 version of the HaplotypeCaller (yes we started this a while back) but the HaplotypeCaller has not changed significantly in later 3.x versions so it's fair to say the paper covers up to version 3.8 completely.

Q: How do these results compare to GATK4?

A: At time of writing, the GATK4 version of HaplotypeCaller is still considered a beta version. The team is actively working on validating the GATK4 version to make sure that it’s guaranteed to be as good as or better than the GATK3 version described in the paper.

Q: How does the methodology compare to GATK4?

A: The GATK engine that parses the BAM and “shards” the data to pass to the tools has been rewritten for improved efficiency over GATK3, and the HaplotypeCaller code has been refactored for better organization and readability. So there's a lot that is different in terms of software implementation. However the algorithms and equations presented in the manuscript remain the same, so overall the paper's description of how the HaplotypeCaller operates also applies to the GATK4 beta version, and it is appropriate to use it as a citation for results derived from versions up to the current beta (4.beta.6).

Q: Does the release of this paper hint at a change in how the team prioritizes publication?

A: To some extent. The developers of the somatic variant caller Mutect2 and related tools have put in a lot of effort to prepare white papers on the methods involved (Mutect2 itself, the assembly process and the pairHMM algorithm), some of which are shared with the HaplotypeCaller. They hope to release a manuscript featuring Mutect2 somatic SNV and INDEL variant calling results in the near future. Additionally, the GATK development team as a whole aims to make more of our internal benchmarking and validation efforts more transparent and available to other tool developers; an effort that our colleague Yossi Farjoun kicked off in style in his blog post about the new "SynDip" benchmark last week.

↧

HaplotypeCaller/VariantAnnotator: no allele balance tag for all SNPs

April 28, 2014, 8:11 am

Version 3.1.1. Human normal samples.

I couldn't find the AlleleBalance and AlleleBalanceBySample tags in my VCF outputs. The tags are not present for even a single variant.
I tried HaplotypeCaller with -all, or directly with -A AlleleBalance or -A AlleleBalanceBySample.
I also tried VariantAnnotator with -all, -A AlleleBalance, or -A AlleleBalanceBySample.

Any help will be appreciated.

↧

GATK4beta6 HaplotypeCaller doesn't index g.vcf.gz output

January 2, 2018, 7:24 am

With the following command:

java -Xmx7g -jar gatk-package-4.beta.6-local.jar HaplotypeCaller -ERC GVCF -G StandardAnnotation -G AS_StandardAnnotation --maxReadsPerAlignmentStart 0 -GQB 5 -GQB 10 -GQB 15 -GQB 20 -GQB 25 -GQB 30 -GQB 35 -GQB 40 -GQB 45 -GQB 50 -GQB 55 -GQB 60 -GQB 65 -GQB 70 -GQB 75 -GQB 80 -GQB 85 -GQB 90 -GQB 95 -GQB 99 -I example.bam -O example.g.vcf.gz -R /path/to/GRCh38.d1.vd1.fa

the output file example.g.vcf.gz does not get indexed, despite the default value for --createOutputVariantIndex being True. The command finishes successfully without error, but never creates the index.

Ben

↧

GATK4beta6 annotation incompatibility between HaplotypeCaller and GenomicsDBImport

January 2, 2018, 7:15 am

Happy New Year!

I'm attempting to joint genotype ~1000 exomes using GATK4. I've run HC per sample with the following command:

I then attempted to create a GenomicsDB workspace per chromosome with the following command:

java -Xmx70g -jar gatk-package-4.beta.6-local.jar GenomicsDBImport -genomicsDBWorkspace chrX_db --overwriteExistingGenomicsDBWorkspace true --intervals chrX -V gvcfs.list

I get the following error:

Exception:
[January 2, 2018 9:36:26 AM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=2238185472
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at chrX:251751. Field had 4 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
    at htsjdk.variant.variantcontext.VariantContext.fullyDecodeAttributes(VariantContext.java:1571)
    at htsjdk.variant.variantcontext.VariantContext.fullyDecodeInfo(VariantContext.java:1546)
    at htsjdk.variant.variantcontext.VariantContext.fullyDecode(VariantContext.java:1530)
    at htsjdk.variant.variantcontext.writer.BCF2Writer.add(BCF2Writer.java:176)
    at com.intel.genomicsdb.GenomicsDBImporter.add(GenomicsDBImporter.java:1232)
    at com.intel.genomicsdb.GenomicsDBImporter.importBatch(GenomicsDBImporter.java:1282)
    at org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport.traverse(GenomicsDBImport.java:443)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:119)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:176)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:137)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:158)
    at org.broadinstitute.hellbender.Main.main(Main.java:239)

This refers to the following line in one of the GVCFs:

chrX 251751 . G A,<NON_REF> 46.56 . AS_RAW_BaseQRankSum=|30,1,33,1|;AS_RAW_MQ=0.00|7200.00|0.00;AS_RAW_MQRankSum=|60,2|;AS_RAW_ReadPosRankSum=|5,1,20,1|;AS_SB_TABLE=0,0|0,0|0,0;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:GQ:PL:SB 1/1:0,2,0:6:73,6,0,73,6,73:0,0,1,1
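The "4 values" in the error message can be reproduced from the record itself if one assumes the decoder splits each INFO value on commas: the raw value `|5,1,20,1|` then yields four pieces, while the header declares Number=1. The sketch below shows that count; `info_value_count` is a hypothetical helper and the comma-splitting behavior is an assumption about the parser, not a statement about its code.

```python
def info_value_count(info_field: str, key: str) -> int:
    """Count comma-separated values for one key in a VCF INFO string (0 if absent)."""
    for entry in info_field.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return len(value.split(","))
    return 0


info = (
    "AS_RAW_BaseQRankSum=|30,1,33,1|;AS_RAW_MQ=0.00|7200.00|0.00;"
    "AS_RAW_MQRankSum=|60,2|;AS_RAW_ReadPosRankSum=|5,1,20,1|;"
    "AS_SB_TABLE=0,0|0,0|0,0;DP=2"
)
print(info_value_count(info, "AS_RAW_ReadPosRankSum"))  # prints 4
print(info_value_count(info, "AS_RAW_MQ"))              # prints 1
```

Seen this way, the mismatch is between the header's declared Number=1 and a value that uses commas inside its allele-specific encoding.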

I haven't found a way to get past this error. I found this post from a while back with a very similar error:

https://gatkforums.broadinstitute.org/gatk/discussion/comment/43382#Comment_43382

But they seemed to indicate that it was fixed for them in GATK4beta6.

Any help or insight into how to resolve this, or, if it's an unimportant annotation, how to ignore it, would be greatly appreciated. Thanks!

Ben

↧

GenotypeGVCFs: Long runtime exclusively with a single sample

January 4, 2018, 1:45 am

I have been having some trouble with long runtimes in several of the GATK utilities.
However, it was manageable.
I managed to produce a g.vcf file (I used HaplotypeCaller instead of UnifiedGenotyper, following a suggestion made in a separate thread).

Now I have two different g.vcf files for two different samples, and for one of them I could get a VCF file using GenotypeGVCFs within 45 minutes or so.
However, with the other sample I am getting a 40-week-long runtime.
The samples are Aedes aegypti and Aedes albopictus (the latter is the one giving trouble).

The walker starts walking instantly with the Aedes aegypti sample and gives me the VCF without any errors. However, with the Aedes albopictus sample the walker itself starts only after an hour or so.

The command used is:

java -jar GenomeAnalysisTK-3.7-0-gcfedb6 -T GenotypeGVCFs -nt 12 -R ref-ab/GCA_001444175.2_A.albopictus_v1.1_genomic.fasta --variant output-AB.raw.snps.indels.g.vcf -o genotyped-ab.vcf

It should be noted that this exact command has worked for the other sample (except that the necessary files were changed).

The log is as follows:
INFO 19:56:34,300 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 19:56:34,301 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 22:49:04,685 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.4 w 37.4 w
INFO 22:50:04,687 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.7 w 37.6 w
INFO 22:51:04,689 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.9 w 37.9 w
INFO 22:52:04,690 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 38.1 w 38.1 w
INFO 22:53:04,694 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 38.3 w 38.3 w
(the run time is increasing instead of decreasing)

IMPORTANT NOTES:

1) The genome sizes are 1.9 G for A. albopictus and 1.4 G for A. aegypti.
2) Resources are not to blame: I have around 48 usable threads at the moment and enough RAM. I have also tried different numbers of threads; it makes no difference.
3) I have tried re-running the A. aegypti sample in parallel (to rule out that the computation was faster due to some transient condition at the time), and it reproduces its behaviour, i.e. it finishes in 45 minutes or so. The A. albopictus sample still shows the same problem.

↧