Channel: haplotypecaller — GATK-Forum

Can GenotypeGVCFs be run without filtering?

For bacterial genomes I follow the "Best Practices" and use HaplotypeCaller to call variants. I would like to output a VCF containing all positions, which I can then parse on my own. I'm using -ERC BP_RESOLUTION to output such a VCF. However, to get INFO annotations such as AC and MQ I need to follow up with GenotypeGVCFs, and there doesn't seem to be an option to keep all positions when using it. In the end I must have a VCF which contains all possible variants and any position with zero coverage. Is there a way to generate a VCF with every reference position that includes AC and MQ values?
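
For the "parse on my own" part, here is a minimal Python sketch that walks a single-sample gVCF written with -ERC BP_RESOLUTION and reports positions whose per-sample DP is zero or missing. It assumes one record per reference position, tab-separated columns and a single sample column. The function name and the example records are made up for illustration, and the sketch does not add AC or MQ to reference positions.

```python
def zero_coverage_positions(vcf_lines):
    """Yield (chrom, pos) for records whose sample DP is 0 or missing."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        # Pair the FORMAT keys (column 9) with the sample values (column 10).
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if sample.get("DP", ".") in (".", "0"):
            yield chrom, pos


# Toy records, not real GATK output.
example = [
    "##fileformat=VCFv4.2",
    "chr\t1\t.\tA\t<NON_REF>\t.\t.\t.\tGT:AD:DP:GQ:PL\t0:12,0:12:99:0,360",
    "chr\t2\t.\tC\t<NON_REF>\t.\t.\t.\tGT:AD:DP:GQ:PL\t0:0,0:0:0:0,0",
    "chr\t3\t.\tG\t<NON_REF>\t.\t.\t.\tGT:DP\t.:.",
]
print(list(zero_coverage_positions(example)))
```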


Recommended protocol for bootstrapping HaplotypeCaller and BaseRecalibrator outputs?

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecalibrator, the following is recommended when a reliable set of sequence variants is lacking:
"You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecalibrator in a loop should be fairly straightforward. What is a good strategy for comparing VCF files and assessing convergence?
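
One possible way to compare successive rounds, sketched in Python: treat each VCF as a set of (chromosome, position, REF, ALT) sites and compute the Jaccard similarity between the call sets of two consecutive rounds. This is only an illustration, not an official GATK recommendation. The function names, the 0.99 threshold and the toy records are assumptions.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for every non-header VCF record."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def jaccard(a, b):
    """Shared sites divided by all sites seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def has_converged(previous_round, current_round, threshold=0.99):
    """threshold is an arbitrary illustrative cutoff."""
    return jaccard(variant_keys(previous_round), variant_keys(current_round)) >= threshold


# Toy records standing in for two bootstrapping rounds.
round1 = ["chr1\t100\t.\tA\tG\t50\t.\t.", "chr1\t200\t.\tC\tT\t60\t.\t."]
round2 = round1 + ["chr1\t300\t.\tG\tA\t40\t.\t."]
print(jaccard(variant_keys(round1), variant_keys(round2)))
print(has_converged(round1, round2))
```

Site overlap ignores genotype and quality changes, so it is a coarse measure. A stricter comparison could also check genotypes at the shared sites.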

Meaning of error: expected haplotypes.size() >= eventsAtThisLoc.size() + 1

Hi, I am running HaplotypeCaller (GATK 4.0.0.0) in genotype-given-alleles mode using a VCF of common coding germline variants. Please see the error below. Can anyone help to point me in the right direction? Thanks!

java.lang.IllegalArgumentException: expected haplotypes.size() >= eventsAtThisLoc.size() + 1
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:681)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerGenotypingEngine.createAlleleMapper(AssemblyBasedCallerGenotypingEngine.java:159)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:123)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:566)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:218)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:295)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:271)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:152)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
at org.broadinstitute.hellbender.Main.main(Main.java:275)

Any ploidy goes!

Until now, HaplotypeCaller was only capable of calling variants in diploid organisms due to some assumptions made in the underlying algorithms. I'm happy to announce that we now have a generalized version that is capable of handling any ploidy you specify at the command line!

This new feature, which we're calling "omniploidy", is technically still under development, but we think it's mature enough for the more adventurous to try out as a beta test ahead of the next official release. We'd especially love to get some feedback from people who work with non-diploids on a regular basis, so we're hoping that some of you microbiologists and assorted plant scientists will take it out for a spin and let us know how it behaves in your hands.

It's available in the latest nightly builds; just use the -ploidy argument to give it a whirl. If you have any questions or feedback, please post a comment on this article in the forum.

Caveat: the downstream tools involved in the new GVCF-based workflow (GenotypeGVCFs and CombineGVCFs) are not yet capable of handling non-diploid calls correctly -- but we're working on it.

UPDATE:

We have added omniploidy support to the GVCF handling tools, with the following limitations:

  • When running, you need to indicate the sample ploidy that was used to generate the GVCFs with -ploidy. As usual, 2 is the default ploidy.

  • The system does not support mixed ploidy across samples or positions. An error message will be thrown if you attempt to genotype GVCFs that have a mixture, or that have some genotype whose ploidy does not match the -ploidy argument.
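
To make the second limitation concrete, here is a small Python sketch that scans VCF records and lists genotypes whose ploidy differs from the value you intend to pass with -ploidy. It only illustrates the constraint and is not how the GATK tools perform the check. It assumes tab-separated records with a GT key in the FORMAT column; the function names and example records are hypothetical.

```python
def genotype_ploidy(gt):
    """Number of allele slots in a GT string such as '0/1', '1' or '0|0|1'."""
    return len(gt.replace("|", "/").split("/"))


def ploidy_mismatches(vcf_lines, expected_ploidy=2):
    """Return (chrom, pos, sample_index, ploidy) for every mismatching genotype."""
    mismatches = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        gt_index = fields[8].split(":").index("GT")
        for i, sample in enumerate(fields[9:]):
            ploidy = genotype_ploidy(sample.split(":")[gt_index])
            if ploidy != expected_ploidy:
                mismatches.append((fields[0], int(fields[1]), i, ploidy))
    return mismatches


# Toy records: the second sample at position 20 is haploid.
records = [
    "chr1\t10\t.\tA\tG\t.\t.\t.\tGT:DP\t0/1:20\t1/1:18",
    "chr1\t20\t.\tC\tT\t.\t.\t.\tGT:DP\t0/0:20\t1:9",
]
print(ploidy_mismatches(records, expected_ploidy=2))
```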

LATEST UPDATE:

As of GATK version 3.3-0, the GVCF tools are capable of ad-hoc ploidy detection, and can handle mixed ploidies. See the release highlights for details.

running haplotypeCaller using Queue

I wrote my first Scala script to run the HaplotypeCaller walker of GATK. However, I am running into an error when I execute the *.scala script. I am unable to figure out the source of the error; any help will be appreciated.

package org.broadinstitute.sting.queue.qscripts

import org.broadinstitute.sting.queue.QScript
import org.broadinstitute.sting.queue.extensions.gatk._

class haplotypeCaller extends QScript {

  @Input(doc="Reference file for the bam files", shortName="R")
  var referenceFile: File = _

  @Input(doc="One or more bam files", shortName="I")
  var bamFiles: List[File] = Nil

  @Argument(doc="the interval string", shortName="L")
  var intervalString: String = ""

  @Argument(doc="heterozygosity to be considered", shortName="heterozygosity")
  var het: Double = 0.006

  @Argument(doc="which sites to emit", shortName="output_mode")
  var outmode: String = "EMIT_ALL_SITES"

  @Output(doc="outFile to write snps and indels", shortName="out")
  var outFile: File = _

  def script() {
    val hc = new HaplotypeCaller
    // Only the memory limit is set on the walker here; none of the script
    // arguments declared above (reference, bam files, interval, ...) are
    // assigned to hc.
    hc.memoryLimit = 20
    add(hc)
  }

}

command line arguments passed:
java -jar ./Queue-2.7-4-g6f46d11/Queue.jar -S haplotypeCaller.scala -R ./reference/xx.fasta -I x1.realigned.bam / -I x2.realigned.bam -I x3.realigned.bam -I x4.realigned.bam -I x5.realigned.bam -I x6.realigned.bam -I x7.realigned.bam / -I x8.realigned.bam -I x9.realigned.bam -I x10.realigned.bam -I x11.realigned.bam -I x12.realigned.bam / -L chr1 -heterozygosity 0.006 -output_mode EMIT_ALL_SITES -out gatk.hc.chr1.raw.snps.indels.vcf -run

error:
##### ERROR MESSAGE: Walker requires a reference but none was provided.

The reference file exists in the path given above. I even tried running with the absolute path to the reference file, but without success. Any help will be appreciated.

I expect to see a variant at a specific site, but it's not getting called

This can happen when you expect a call to be made based on the output of other variant calling tools, or based on examination of the data in a genome browser like IGV.

There are several possibilities, and among them, it is possible that GATK may be missing a real variant. But we are generally very confident in the calculations made by our tools, and in our experience, most of the time, the problem lies elsewhere. So, before you post this issue in our support forum, please follow these troubleshooting guidelines, which hopefully will help you figure out what's going on.

In all cases, to diagnose what is happening, you will need to look directly at the sequencing data at the position in question.

1. Generate the bamout and compare it to the input bam

If you are using HaplotypeCaller to call your variants (as you nearly always should) you'll need to run an extra step first to produce a file called the "bamout file". See this tutorial for step-by-step instructions on how to do this.

What often happens is that when you look at the reads in the original bam file, it looks like a variant should be called. However, once HaplotypeCaller has performed the realignment, the reads may no longer support the expected variant. Generating the bamout file and comparing it to the original bam will allow you to elucidate such cases.

In the example below, you see the original bam file on the top, and on the bottom is the bam file after reassembly. In this case, there seem to be many SNPs present; however, after reassembly, we find there is really a large deletion!

[Image: original bam file (top) and bam file after reassembly (bottom)]

2. Check the base qualities of the non-reference bases

The variant callers apply a minimum base quality threshold, under which bases will not be counted as supporting evidence for a variant. This is because low base qualities mean that the sequencing machine was not confident that it called the right bases. If your expected variant is only supported by low-confidence bases, it is probably a false positive.

Keep in mind that the depth reported in the DP field of the VCF is the unfiltered depth. You may believe you have good coverage at your site of interest, but since the variant callers ignore bases that fail the quality filters, the actual coverage seen by the variant callers may be lower than you think.
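
The gap between unfiltered depth and the depth the caller actually uses can be illustrated with a short Python sketch. It takes a toy pileup at one position and counts only the bases at or above a minimum base quality. The threshold value, the function name and the pileup are assumptions chosen for illustration; check the tool documentation for the real default.

```python
from collections import Counter


def summarize_pileup(pileup, ref_base, min_base_quality):
    """pileup is a list of (base, phred_quality) pairs at one position."""
    usable = [base for base, qual in pileup if qual >= min_base_quality]
    alt_support = Counter(base for base in usable if base != ref_base)
    return {
        "raw_depth": len(pileup),      # comparable to an unfiltered DP
        "usable_depth": len(usable),   # what remains after the quality filter
        "alt_support": dict(alt_support),
    }


# Toy pileup: three reads show G, but two of them have very low base quality.
pileup = [("A", 35), ("G", 8), ("G", 6), ("A", 30), ("G", 32)]
summary = summarize_pileup(pileup, ref_base="A", min_base_quality=10)
print(summary)
```

Here the raw depth is 5 with three non-reference bases, but only one of those bases survives the filter.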

3. Check the mapping qualities of the reads that support the non-reference allele(s)

The quality of a base is capped by the mapping quality of the read that it is on. This is because low mapping qualities mean that the aligner had little confidence that the read was mapped to the correct location in the genome. You may be seeing mismatches because the read doesn't belong there -- in fact, you may be looking at the sequence of some other locus in the genome!

Keep in mind also that reads with mapping quality 255 ("unknown") are ignored.
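
As a rough illustration of the two rules above, this Python sketch caps a base quality at the mapping quality of its read and treats mapping quality 255 as unusable. The function name and the example values are made up; this is a simplified model, not GATK code.

```python
UNKNOWN_MAPQ = 255  # mapping quality value meaning "unknown"


def effective_base_quality(base_quality, mapping_quality):
    """Return the capped base quality, or None if the read is ignored."""
    if mapping_quality == UNKNOWN_MAPQ:
        return None
    return min(base_quality, mapping_quality)


print(effective_base_quality(35, 60))   # well-mapped read: base quality unchanged
print(effective_base_quality(35, 12))   # poorly mapped read: capped at 12
print(effective_base_quality(35, 255))  # unknown mapping quality: ignored
```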

4. Check how many alternate alleles are present

By default the variant callers will only consider a certain number of alternate alleles. This parameter can be relaxed using the --max_alternate_alleles argument (see the HaplotypeCaller documentation page to find out what the default value is for this argument). Note however that genotyping sites with many alternate alleles increases the computational cost of the processing, scaling exponentially with the number of alternate alleles, which means it will use more resources and take longer. Unless you have a really good reason to change the default value, we highly recommend that you not modify this parameter.
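
One contributor to that cost is the number of candidate genotypes, each of which needs a likelihood (one PL entry per genotype). For a given ploidy and number of alleles, the count is the number of unordered allele combinations with repetition. The Python sketch below computes it; the function name is illustrative, and this counts genotypes only rather than modelling GATK's full runtime.

```python
from math import comb


def genotype_count(ploidy, n_alt_alleles):
    """Unordered genotypes for the reference allele plus n_alt_alleles alternates."""
    n_alleles = n_alt_alleles + 1
    return comb(ploidy + n_alleles - 1, ploidy)


for n_alt in (1, 2, 3, 6):
    print(n_alt, genotype_count(2, n_alt))
```

For a diploid sample this gives 3, 6, 10 and 28 genotypes for 1, 2, 3 and 6 alternate alleles.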

5. When using UnifiedGenotyper, check for overlapping deletions

The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what the default value is for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

6. Check for systematic biases introduced by your sequencing technology

Some sequencing technologies introduce particular sources of bias. For example, in data produced by the SOLiD platform, alignments tend to have reference bias and it can be severe in some cases. If the SOLiD reads have a lot of mismatches (no-calls count as mismatches) around the site, you are probably seeing false positives.

7. Try fiddling with graph arguments (ADVANCED)

This is highly experimental, but if all else fails, worth a shot (with HaplotypeCaller and MuTect2).

Fiddle with kmers

In some difficult sequence contexts (e.g. repeat regions), when some default-sized kmers are non-unique, cycles get generated in the graph. By default the program increases the kmer size automatically to try again, but after several attempts it will eventually quit trying and fail to call the expected variant (typically because the variant gets pruned out of the read-threading assembly graph, and is therefore never assembled into a candidate haplotype). We've seen cases where it's still possible to force a resolution using -allowNonUniqueKmersInRef and/or increasing the --kmerSize (or range of permitted sizes: 10, 25, 35 for example).

Note: While --allowNonUniqueKmersInRef allows missed calls to be made in repeat regions, it should not be used in all regions as it may increase false positives. We have plans to improve variant calling in repeat regions, but for now please try this flag if you notice calls being missed in repeat regions.
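
To see why non-unique kmers matter, the Python sketch below lists the kmers that occur more than once in a reference window and finds the smallest candidate kmer size at which every kmer is unique. It is a toy illustration of the idea, not the HaplotypeCaller assembly code. The function names, the sequence and the candidate sizes are made up.

```python
from collections import Counter


def non_unique_kmers(sequence, k):
    """Return the set of kmers of length k that occur more than once."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer for kmer, n in counts.items() if n > 1}


def smallest_unique_kmer_size(sequence, candidate_sizes):
    """Smallest k among the candidates for which all kmers are unique, else None."""
    for k in sorted(candidate_sizes):
        if k <= len(sequence) and not non_unique_kmers(sequence, k):
            return k
    return None


window = "ACGTACGTTTGCA"  # toy reference window containing a short repeat
print(non_unique_kmers(window, 4))
print(smallest_unique_kmer_size(window, [3, 4, 5]))
```

In this window the repeat ACGT makes kmer sizes 3 and 4 ambiguous, while size 5 is unique throughout.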

Fiddle with pruning

Decreasing the value of -minPruning and/or -minDanglingBranchLength (i.e. decreasing the amount of evidence necessary to keep a path in the graph) can recover variants, at the risk of taking on more false positives.

haplotypecaller_gvcf_gatk4 failed to delocalize files

Hello,

I was running haplotypecaller_gvcf_gatk4 for four samples in the same workspace. All but one finished the analysis. The message for the failed sample (all 50 shards) is as follows:

Task HaplotypeCallerGvcf_GATK4.HaplotypeCaller:5:1 failed. Job exit code 1. Check gs://fc-8da20bb3-0689-423f-b94b-8c196afd7a82/17ad2991-dab2-45c7-a68f-d5a31995f9c4/HaplotypeCallerGvcf_GATK4/5c9a1b93-7f39-424f-950e-70542e19e0b1/call-HaplotypeCaller/shard-5/HaplotypeCaller-5-stderr.log for more information. PAPI error code 5. Message: 10: Failed to delocalize files: failed to copy the following files: "/mnt/local-disk/HB3hg382.g.vcf.gz -> gs://fc-8da20bb3-0689-423f-b94b-8c196afd7a82/17ad2991-dab2-45c7-a68f-d5a31995f9c4/HaplotypeCallerGvcf_GATK4/5c9a1b93-7f39-424f-950e-70542e19e0b1/call-HaplotypeCaller/shard-5/HB3hg382.g.vcf.gz (cp failed: gsutil -q -m cp -L /var/log/google-genomics/out.log /mnt/local-disk/HB3hg382.g.vcf.gz gs://fc-8da20bb3-0689-423f-b94b-8c196afd7a82/17ad2991-dab2-45c7-a68f-d5a31995f9c4/HaplotypeCallerGvcf_GATK4/5c9a1b93-7f39-424f-950e-70542e19e0b1/call-HaplotypeCaller/shard-5/HB3hg382.g.vcf.gz, command failed: CommandException: No URLs matched: /mnt/local-disk/HB3hg382.g.vcf.gz\nCommandException: 1 file/object could not be transferred.\n)"

Please help.

Thanks a lot.

Ming-yi

ReadBackedPhasing of somatic and germline variants?

Hi all,

I'd like to know which germline variants are proximal to and on the same chromosome as detected somatic variants. Is there an out-of-the-box way to phase germline and somatic variants to one another? If not, do you see a role for GATK ReadBackedPhasing?

Thanks for your help and my apologies if this has been answered already, I couldn't find anything related.

-M


Why did I obtain a g.vcf with wrong variant DPs, and too few variants given the coverage?

Hi,
I extracted exome regions from public bam files so that I could apply the same pipeline that I used for my samples and merge them. It is weird that, looking in the g.vcf files, I have very few "variants", and those have very few reads (these are high-coverage samples). Here are two examples. As you can see, this region has a coverage of 22-40, so how can I have DP=1?

1 146639579 . C <NON_REF> . . END=146639580 GT:DP:GQ:MIN_DP:PL 0/0:38:99:37:0,99,1485
1 146639581 . C <NON_REF> . . END=146639581 GT:DP:GQ:MIN_DP:PL 0/0:38:96:38:0,96,1440
1 146639582 . C <NON_REF> . . END=146639587 GT:DP:GQ:MIN_DP:PL 0/0:40:99:38:0,99,1485
1 146639588 rs55797044 A G,<NON_REF> 0.01 . DB;DP=1;MLEAC=0,0;MLEAF=0.00,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,10,3,10,10:0,1,0,0
1 146639589 . T <NON_REF> . . END=146639612 GT:DP:GQ:MIN_DP:PL 0/0:40:99:37:0,105,1575
1 146643468 . G <NON_REF> . . END=146643471 GT:DP:GQ:MIN_DP:PL 0/0:23:54:22:0,54,810
1 146643472 . G <NON_REF> . . END=146643472 GT:DP:GQ:MIN_DP:PL 0/0:22:57:22:0,57,855

1 1263129 . C <NON_REF> . . END=1263129 GT:DP:GQ:MIN_DP:PL 0/0:26:59:26:0,59,461
1 1263130 . T <NON_REF> . . END=1263143 GT:DP:GQ:MIN_DP:PL 0/0:24:60:22:0,60,900
1 1263144 rs307350 G A,<NON_REF> 0.14 . DB;DP=1;MLEAC=1,0;MLEAF=0.500,0.00;MQ=254.00 GT:AD:DP:GQ:PL:SB 1/1:0,1,0:1:3:10,3,0,10,3,10:0,0,1,0
1 1263145 . G <NON_REF> . . END=1263163 GT:DP:GQ:MIN_DP:PL 0/0:25:63:23:0,63,945
1 1263164 . A <NON_REF> . . END=1263164 GT:DP:GQ:MIN_DP:PL 0/0:22:49:22:0,49,378
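
A quick way to find every site showing this pattern is sketched below in Python: it flags variant records whose sample DP is far below the DP of the preceding reference block. It assumes whitespace-separated records, a single sample, and reference blocks carrying <NON_REF> alone in the ALT column, as in standard GATK gVCF output. The function names and the 0.25 ratio are arbitrary choices for illustration.

```python
def parse_record(line):
    """Extract the fields needed here from one whitespace-separated gVCF record."""
    f = line.split()
    sample = dict(zip(f[8].split(":"), f[9].split(":")))
    return {"chrom": f[0], "pos": int(f[1]), "alt": f[4], "dp": int(sample["DP"])}


def suspicious_sites(lines, max_ratio=0.25):
    """Flag variant sites with DP below max_ratio times the previous block DP."""
    flagged = []
    last_block_dp = None
    for line in lines:
        rec = parse_record(line)
        if rec["alt"] == "<NON_REF>":  # reference block
            last_block_dp = rec["dp"]
        elif last_block_dp and rec["dp"] < max_ratio * last_block_dp:
            flagged.append((rec["chrom"], rec["pos"], rec["dp"], last_block_dp))
    return flagged


records = [
    "1 146639582 . C <NON_REF> . . END=146639587 GT:DP:GQ:MIN_DP:PL 0/0:40:99:38:0,99,1485",
    "1 146639588 rs55797044 A G,<NON_REF> 0.01 . DB;DP=1;MLEAC=0,0;MLEAF=0.00,0.00;MQ=254.00 "
    "GT:AD:DP:GQ:PL:SB 0/0:1,0,0:1:3:0,3,10,3,10,10:0,1,0,0",
]
print(suspicious_sites(records))
```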

When I compare this g.vcf file with what I obtained by extracting the same regions from individuals in other public bam files (the 1000GP, at 10X), I get:

1000GP bam (10X): Total reads: 13924960, wc raw.g.vcf = 20819861, grep 0/1 raw.g.vcf = 57,362
Public bam (39X): Total reads: 43168049, wc raw.g.vcf = 8136676, grep 0/1 raw.g.vcf = 9

So why, with high-coverage bam files, do I end up with so few variants?

Could it be the result of some error I had in the IndelRealignment step (GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T IndelRealigner) ?

ERROR MESSAGE: Exception when processing alignment for BAM index HS2000-1017_290:2:2106:4486:29552 2/2 100b unmapped read.

The only way to solve that error (according to ValidateSamFile there were no errors) was to remove those unmapped reads:
samtools view -b -F 4 file.bam > mapped.bam
index again
IndelRealignment
BaseRecalibrator step1
BaseRecalibrator Round 2
Print Reads
HaplotypeCaller

Thank you very much,

magda

Using GATK: create a F0 SNP library and then genotype F2 sample using it

Hello GATK community,

I would like your comments/suggestions for my strategy.

I have F0 samples with two different phenotypes.
I have F2 samples with unknown phenotype.
I would like to create a library with the F0 genotypes and then genotype my F2 samples using the previously created library.

STRATEGY:
I already pre-processed BAM files (I have all raw data if required).

Create genotype library with F0 samples:

  • GATK HaplotypeCaller for both F0 phenotype samples : java -Xmx30g -jar GenomeAnalysisTK_3-8.jar -nct 16 -T HaplotypeCaller -R GENOME --emitRefConfidence GVCF -I INPUT.bam -o OUTPUT.g.vcf

  • Merge the results: java -Xmx16g -jar GenomeAnalysisTK_3-8.jar -nt 16 -T GenotypeGVCFs -R GENOME --variant F0Variant1.g.vcf --variant F0Variant2.g.vcf -o Results_Merge_F0.vcf

  • then I used a homemade script to select only positions with a homozygous genotype that differs between the two F0 phenotype samples (like 1/1 for one F0 sample and 0/0 for the other one): Results_Merge_F0_filtered.vcf

Genotype F2 sample with the library:

  • GATK HaplotypeCaller : java -Xmx30g -jar GenomeAnalysisTK_3-8.jar -nct 16 -T HaplotypeCaller -R GENOME --emitRefConfidence GVCF -I INPUT.bam -o OUTPUT.g.vcf -L $4 Results_Merge_F0_filtered.vcf

  • then I used a homemade script to identify genotypes related to one (or the other) F0 phenotype.

BUUUUUUT :o
At this last step I mostly got homozygous SNPs for my F2 samples...
I should get around 25% phenotype1 -- 25% phenotype2 -- 50% phenotype 1/2
I am missing something but I don't know where.
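
To quantify how far the observed genotype counts are from the expected 25% / 50% / 25% split, a chi-square goodness-of-fit statistic can be computed, as in this Python sketch. The counts below are invented for illustration and the function name is hypothetical. 5.991 is the standard chi-square critical value for 2 degrees of freedom at the 5% level.

```python
CRITICAL_VALUE_DF2_P05 = 5.991  # chi-square critical value, 2 degrees of freedom, alpha = 0.05


def chi_square_1_2_1(n_hom_a, n_het, n_hom_b):
    """Chi-square statistic against a 1:2:1 (25% / 50% / 25%) expectation."""
    total = n_hom_a + n_het + n_hom_b
    if total == 0:
        raise ValueError("no genotypes to test")
    observed = (n_hom_a, n_het, n_hom_b)
    expected = (0.25 * total, 0.5 * total, 0.25 * total)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Invented counts: mostly homozygous calls versus a near-Mendelian sample.
mostly_homozygous = chi_square_1_2_1(450, 100, 450)
near_expected = chi_square_1_2_1(26, 48, 26)
print(mostly_homozygous, mostly_homozygous > CRITICAL_VALUE_DF2_P05)
print(near_expected, near_expected > CRITICAL_VALUE_DF2_P05)
```

A statistic above the critical value means the counts deviate from the 1:2:1 expectation more than chance alone would explain at that significance level.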

Haplotypecaller calls variants at a deletion region

Hi,
I'm having a confusing problem when using haplotypecaller.

Basically, I'm using haplotypecaller calling variants among more than 400 M. tuberculosis samples, sequenced with Hiseq2500 platform. I followed the workflow for calling variants on cohort samples as described here: https://gatkforums.broadinstitute.org/gatk/discussion/3893/calling-variants-on-cohorts-of-samples-using-the-haplotypecaller-in-gvcf-mode

I found a problem with some samples when checking the SNPs called by this procedure. For example, in Sample1, as shown in this figure [image], there seems to be a deletion at position 2866805. However, GATK 3.8 called a SNP at this position, as shown in the excerpt from the VCF file below:

NC_000962.3 2866805 . C G 8160 . AC=1;AF=1.00;AN=1;DP=182;FS=0.000;GQ_MEAN=8190.00;MLEAC=1;MLEAF=1.00;MQ=50.38;MQ0=0;NCC=0;QD=31.09;SOR=0.917 GT:AD:GQ:PL 1:0,176:99:8190,0

In total, HaplotypeCaller called 11 SNPs in this deletion region.

So I'm confused about why HaplotypeCaller called a SNP when the bam file shows a deletion. I would really appreciate it if you could help me figure this out. Thank you in advance!

P.S. after finding this problem, we also tried UnifiedGenotyper on Sample1, and the variants at the deletion region were not called this time.

SampleList annotation returns only one sample while being used in a variant call on two samples

Hi there,

I have used HaplotypeCaller from GATK4 to call variants on two affected siblings. I put SampleList annotation in the command but only one sample is seen in all of the variants in the resulting VCF file.
Mind you, I have done this before with GATK 3.x versions and never had such a problem. I could use the "set" column to see variants regarding each sample easily.

gatk HaplotypeCaller \
-R ~/Arvand/hg19/ucsc.hg19.fasta \
-I ~/Arvand/5137D_recalibrated.bam \
-I ~/Arvand/5137E_recalibrated.bam -O affected_raw.vcf.gz \
-bamout affected_bamout.bam -A BaseQuality -A ChromosomeCounts -A Coverage -A DepthPerAlleleBySample -A RMSMappingQuality -A OxoGReadCounts -A QualByDepth -A FisherStrand -A StrandOddsRatio -A SampleList --genotyping-mode DISCOVERY -D ~/Arvand/hg19/dbsnp_147.hg19.vcf.gz --native-pair-hmm-threads 20

Inconsistent results with HaplotypeCaller on haploid organism

Hello GATK team,

I would appreciate some help in understanding how GATK works in GVCF mode on my data.
Here is an example of my data (I'm using GATK v3.8):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 328-16 983-16
NC_018661.1 859953 . C T 31035.30 SnpCluster AC=15;AF=0.536;AN=28;BaseQRankSum=-1.134e+00;ClippingRankSum=0.00;DP=7961;FS=3.292;MLEAC=15;MLEAF=0.536;MQ=56.52;MQRankSum=0.00;QD=7.76;ReadPosRankSum=-5.480e-01;SOR=0.486 GT:AD:DP:GQ:PL 0:354,0:354:99:0,106 1:157,157:314:99:361,0

  • The first odd thing is that the variant appears heterozygous, with the highest GQ (99), even though we are analyzing a haploid sample; this differs from the explanation given in this post.

  • The second issue appears when we look at this position in IGV in our reads aligned with bwa mem (bam format). Here both samples seem to have this site at an AD of 50%, but HaplotypeCaller calls them totally differently.

These are the parameters we use for one sample:
java -Djava.io.tmpdir=/processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/TMP/ -Xmx10g -jar /opt/gatk/gatk-3.8.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/REFERENCES/GCF_000299475.1_ASM29947v1_genomic_NoPlasmid.fna -I /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/Alignment/BAM/328-16/328-16.woduplicates.bam -o /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/variants/328-16.g.vcf -stand_call_conf 30 --emitRefConfidence GVCF -ploidy 1 -S LENIENT -log /processing_Data/bioinformatics/services_and_colaborations/CNM/bacteriologia/SRVCNM062_20180214_ECOLI07_SS_S/ANALYSIS/20180205_ECOLI0601/variant_calling/variants_gatk/snp_indels.vcf-HaplotypeCaller.log

How is this even possible? (I have checked over and over that the bam files used in IGV are the same ones passed to GATK; you never know...)

Could the effect referred to in this thread somehow be affecting the variant calling? Should we use BP_RESOLUTION? What is the main difference between GVCF and BP_RESOLUTION mode?

Our first idea was to select from our GVCFs by AD using JEXL expressions, but since a GVCF has reference blocks with no AD, the command fails:

ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![35,47]: 'vc.getGenotype('328-16').getAD().1.floatValue() / vc.getGenotype('328-16').getDP() > 0.90;' attempting to call method on null
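
Outside of JEXL, the same ratio can be computed while skipping records that have no AD. The Python sketch below does this for one FORMAT/sample pair and returns None for reference blocks. It mirrors the expression above (first alternate AD divided by DP) but is only an illustration; the function name is made up, and whether filtering on AD is advisable is a separate question.

```python
def alt_allele_fraction(format_keys, sample_field):
    """First-alternate AD divided by DP, or None when AD or DP is unavailable."""
    sample = dict(zip(format_keys.split(":"), sample_field.split(":")))
    ad = sample.get("AD")
    dp = sample.get("DP")
    if ad in (None, ".") or dp in (None, ".", "0"):
        return None  # e.g. a gVCF reference block without AD
    depths = ad.split(",")
    if len(depths) < 2 or "." in depths:
        return None
    return int(depths[1]) / int(dp)


print(alt_allele_fraction("GT:AD:DP:GQ:PL", "1:157,157:314:99:361,0"))
print(alt_allele_fraction("GT:AD:DP:GQ:PL", "0:354,0:354:99:0,106"))
print(alt_allele_fraction("GT:DP:GQ:MIN_DP:PL", "0:38:99:37:0,99"))
```

For the two samples in the record above this gives 0.5 and 0.0, and the reference-block style record yields None instead of raising an error.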

I could filter them manually before GenotypeGVCFs, but is that good practice? As I read in this thread, it is not recommended, obviously because we would override the GATK model, which takes many more variables into account...
Any ideas? We are kind of struggling. Maybe it is something trivial but we can't see it; any help will be much appreciated.

Thanks very much in advance,
Best Regards
Sara

HaplotypeCaller in Gatk4 vs Gatk3.5

HaplotypeCaller (gvcf mode) on whole genome vs chromosome by chromosome

I'm currently running my first real use of GATK. I was worried about running HaplotypeCaller on whole genomes given some of the reports I've seen on these forums about how long it can take to run. In contrast, I was pleasantly surprised that with the current GATK it is proceeding well (~7 day estimate on dog WGS). But it seems it could be much faster if I divided it up by chromosome with the -L flag.

I see that the advice is to not use the -L flag for whole genome analysis [1]. But the wording in that link seems open: it is not necessary, but if it would help efficiency it might be worthwhile.

I've found a related question on the forums here [2], but it seems the discrepancy discussed in that thread is suspected to be due to downsampling and not actually the result of a chromosome-by-chromosome use of HaplotypeCaller.

Again, I'm content with a ~7 day run time in order to take proper care of our data. I wouldn't want to sacrifice power or accuracy for a shorter runtime, but if there is really no trade-off, a chromosomal approach would be even better. So I'm curious if there is a downside to partitioning the HaplotypeCaller step by chromosome?
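
For reference, a per-chromosome split could be prepared along these lines. The Python sketch reads contig names from the text of a FASTA index (.fai, where the first two tab-separated columns are contig name and length) and builds one -L argument and one output name per contig. The output naming scheme, the function names and the toy index are assumptions; the rest of the HaplotypeCaller command line is left to whatever you already run. It does not answer whether splitting changes the results.

```python
def contigs_from_fai(fai_text):
    """Return (name, length) pairs from the text of a .fai index."""
    contigs = []
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs.append((name, int(length)))
    return contigs


def per_contig_jobs(contigs, sample_name):
    """One job description per contig: the -L argument and a hypothetical output name."""
    return [
        {"interval_args": ["-L", name], "output": f"{sample_name}.{name}.g.vcf"}
        for name, _length in contigs
    ]


# Toy index with made-up lengths, not a real dog reference.
toy_fai = "chr1\t1000\t6\t60\t61\nchr2\t800\t1030\t60\t61\n"
for job in per_contig_jobs(contigs_from_fai(toy_fai), "sampleA"):
    print(" ".join(job["interval_args"]), job["output"])
```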

[1] http://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals
[2] http://gatkforums.broadinstitute.org/gatk/discussion/5008/haplotypecaller-on-whole-genome-or-chromosome-by-chromosome-different-results


HaplotypeCaller on sliced BAMs

Hello
I'm trying to do germline calling using HaplotypeCaller. However, I'm only interested in obtaining germline variants for a subset of the genome. To save space and compute resources, I was hoping to use a sliced BAM that includes reads spanning only my regions of interest. Can HaplotypeCaller handle such sliced BAM files and correctly infer germline SNPs for those regions ?

SNP calling on inverted repeats

Dear GATK team,

I have encountered a problem when using HaplotypeCaller for variant detection on about 100 plastid genomes. The plastid genome is haploid and contains two large inverted repeats (which are presumably almost 100% identical, though inverted). However, no variants are detected in either of these regions, and the SNPs/indels are only reported in the non-repeated regions.
I would expect that intra-individual polymorphisms in the inverted repeats would not be detected, since the mapping algorithm of BWA or similar can't assign the reads accurately to either of the repeated regions. However, there are variants between samples that are present in both inverted repeats and I would expect that HaplotypeCaller should find these. I used VarScan on the same set of bam files and had no problem detecting variants in the inverted repeats.

I ran the following command:
java -jar ~/Prog/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -ploidy 1 -I "$i".recal.bam -o ../plastid_snp/gvcf_hap/"$i".g.vcf -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000

The resulting gvcf files contain only polymorphisms in the non-repeated DNA, thus it's not a problem of the variant filtering step.

I was wondering whether you have an idea why HaplotypeCaller doesn't call the variants in the inverted repeats. Did you ever encounter similar problems? Any ideas/inputs would be highly appreciated. I could imagine that the problem has to do with BWA assigning a lower mapping score to reads that are not uniquely mapped.

Of course, a simple workaround is to delete one of the repeats from the reference before read mapping.

Kind regards,
Marco

NullPointerException in HaplotypeCaller 4.0.1.1

Dear GATK team
I am calling variants using HaplotypeCaller on both WGS data from a normal tissue sample and RNA-seq data from tumor tissue. Settings for HC are slightly different for the RNA-seq data, but the problem only arises when running HC on the WGS data. We are following Best Practices.
I am using Oracle JDK 1.8.0 144 Java HotSpot(TM) 64-Bit Server VM, but also tried Open JDK 64-Bit Server VM v1.8.0 161 and GATK version 4.0.1.1.
I am running using the WDL/Cromwell setup and scatter-gather so as you can see in the following command, I am not using --native-pair-hmm-threads (I saw in some previous posts that the old -nct could produce some errors).

It could be related to memory, so I tried playing around with the Java settings, like changing them from -Xmx4g to -Xms8000m, which I saw was used here: https://github.com/gatk-workflows/gatk4-germline-snps-indels/blob/master/haplotypecaller-gvcf-gatk4.hg38.wgs.inputs.json. It doesn't make any difference; the error is still produced... I also tried deleting the GC limits. Should I try something else? The -Duser.country setting is there because of some confusion between using ',' and '.' for floats: our server is set to Danish (for no reason) and we use commas for decimals.

This is my command (some of the paths have been abbreviated for clarity):

$gatk4.0.1.1 --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -Duser.country=en_US.UTF-8 -Duser.language=en_US.UTF-8" HaplotypeCaller \
-R $longpath/gatk-legacy-bundles/b37/human_g1k_v37_decoy.fasta \
-O Normal-056-WGS.vcf.gz \
-I $longpath/call-GatherBamFiles_normal/execution/Normal-056-WGS.bam \
--max-alternate-alleles 3 \
--contamination-fraction-to-filter 0.00172 \
--read-filter OverclippedReadFilter \
--standard-min-confidence-threshold-for-calling 30 \
-L $longpath/gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0024_of_50/scattered.interval_list

Stacktrace (sorry for the long paths but hopefully only the last part is important):

Using GATK jar /services/tools/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx4g -Duser.country=en_US.UTF-8 -Duser.language=en_US.UTF-8 -jar /services/tools/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar HaplotypeCaller -R /home/projects/dp_00005/apps/bonkolab_cromwell/tmp_wdir/Sample_021-056/cromwell-executions/WGS_normal_RNAseq_tumor_SNV_wf/4da5f4da-0dfc-4d55-a3bb-865eb51d6838/call-HaplotypeCaller_normal/shard-23/inputs/home/databases/gatk-legacy-bundles/b37/human_g1k_v37_decoy.fasta -O Normal-056-WGS.vcf.gz -I /home/projects/dp_00005/apps/bonkolab_cromwell/tmp_wdir/Sample_021-056/cromwell-executions/WGS_normal_RNAseq_tumor_SNV_wf/4da5f4da-0dfc-4d55-a3bb-865eb51d6838/call-HaplotypeCaller_normal/shard-23/inputs/home/projects/dp_00005/apps/bonkolab_cromwell/tmp_wdir/Sample_021-056/cromwell-executions/WGS_normal_RNAseq_tumor_SNV_wf/4da5f4da-0dfc-4d55-a3bb-865eb51d6838/call-GatherBamFiles_normal/execution/Normal-056-WGS.bam --max-alternate-alleles 3 --contamination-fraction-to-filter 0.00172 --read-filter OverclippedReadFilter --standard-min-confidence-threshold-for-calling 30 -L /home/projects/dp_00005/apps/bonkolab_cromwell/tmp_wdir/Sample_021-056/cromwell-executions/WGS_normal_RNAseq_tumor_SNV_wf/4da5f4da-0dfc-4d55-a3bb-865eb51d6838/call-HaplotypeCaller_normal/shard-23/inputs/home/databases/gatk-legacy-bundles/b37/scattered_wgs_intervals/scatter-50/temp_0024_of_50/scattered.interval_list
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/home/projects/dp_00005/apps/bonkolab_cromwell/tmp_wdir/Sample_021-056/cromwell-executions/WGS_normal_RNAseq_tumor_SNV_wf/4da5f4da-0dfc-4d55-a3bb-865eb51d6838/call-HaplotypeCaller_normal/shard-23/execution/tmp.47DEgM
11:29:03.730 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/services/tools/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:29:03.961 INFO  HaplotypeCaller - ------------------------------------------------------------
11:29:03.962 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.0.1.1
11:29:03.963 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
11:29:03.963 INFO  HaplotypeCaller - Executing as s143372@risoe-r03-cn026 on Linux v3.10.0-514.10.2.el7.x86_64 amd64
11:29:03.963 INFO  HaplotypeCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_144-b01
11:29:03.963 INFO  HaplotypeCaller - Start Date/Time: March 19, 2018 11:29:03 AM CET
11:29:03.963 INFO  HaplotypeCaller - ------------------------------------------------------------
11:29:03.963 INFO  HaplotypeCaller - ------------------------------------------------------------
11:29:03.964 INFO  HaplotypeCaller - HTSJDK Version: 2.14.1
11:29:03.964 INFO  HaplotypeCaller - Picard Version: 2.17.2
11:29:03.964 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 1
11:29:03.964 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:29:03.964 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:29:03.964 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:29:03.964 INFO  HaplotypeCaller - Deflater: IntelDeflater
11:29:03.964 INFO  HaplotypeCaller - Inflater: IntelInflater
11:29:03.964 INFO  HaplotypeCaller - GCS max retries/reopens: 20
11:29:03.964 INFO  HaplotypeCaller - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
11:29:03.965 INFO  HaplotypeCaller - Initializing engine
11:29:04.807 INFO  IntervalArgumentCollection - Processing 40724607 bp from intervals
11:29:04.833 INFO  HaplotypeCaller - Done initializing engine
11:29:04.863 INFO  HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
11:29:05.604 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/services/tools/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
11:29:05.618 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/services/tools/gatk/4.0.1.1/gatk-package-4.0.1.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
11:29:05.682 WARN  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
11:29:05.683 INFO  IntelPairHmm - Available threads: 1
11:29:05.683 INFO  IntelPairHmm - Requested threads: 4
11:29:05.683 WARN  IntelPairHmm - Using 1 available threads, but 4 were requested
11:29:05.683 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
11:29:05.759 INFO  ProgressMeter - Starting traversal
11:29:05.765 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
11:29:06.858 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.001355153
11:29:06.859 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.023359896
11:29:06.859 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.06 sec
11:29:06.860 INFO  HaplotypeCaller - Shutting down engine
[March 19, 2018 11:29:06 AM CET] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=2041511936
java.lang.NullPointerException
    at java.util.Collections$UnmodifiableMap.<init>(Collections.java:1446)
    at java.util.Collections.unmodifiableMap(Collections.java:1433)
    at org.broadinstitute.hellbender.tools.walkers.genotyper.StandardCallerArgumentCollection.getSampleContamination(StandardCallerArgumentCollection.java:89)
    at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:141)
    at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:566)
    at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:218)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:295)
    at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:271)
    at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:893)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:136)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:153)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:195)
    at org.broadinstitute.hellbender.Main.main(Main.java:277)

Thank you so much for your help!

  • Nanna

Skip "indel realignment" and "recalibration"

Hi all,
Can I skip the "indel realignment" and "recalibration" steps when I am using HaplotypeCaller?

Cromwell: dead letters encountered

I am using Cromwell to run haplotypecaller-gvcf-gatk4.wdl, but it doesn't work. Docker is also not invoked.

[2018-03-24 15:55:59,90] [info] Running with database db.url = jdbc:hsqldb:mem:0f146367-d1ec-42c5-b5a2-c495444c78a2;shutdown=false;hsqldb.tx=mvcc
[2018-03-24 15:56:06,05] [info] Running migration RenameWorkflowOptionsInMetadata with a read batch size of 100000 and a write batch size of 100000
[2018-03-24 15:56:06,07] [info] [RenameWorkflowOptionsInMetadata] 100%
[2018-03-24 15:56:06,16] [info] Running with database db.url = jdbc:hsqldb:mem:be02d29c-fa92-44ad-aaa6-5a1630cad08d;shutdown=false;hsqldb.tx=mvcc
[2018-03-24 15:56:06,54] [info] Slf4jLogger started
[2018-03-24 15:56:06,74] [info] Metadata summary refreshing every 2 seconds.
[2018-03-24 15:56:06,78] [info] WriteMetadataActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-03-24 15:56:06,78] [info] KvWriteActor configured to flush with batch size 200 and process rate 5 seconds.
[2018-03-24 15:56:06,79] [info] CallCacheWriteActor configured to flush with batch size 100 and process rate 3 seconds.
[2018-03-24 15:56:07,48] [info] JobExecutionTokenDispenser - Distribution rate: 50 per 1 seconds.
[2018-03-24 15:56:07,50] [info] SingleWorkflowRunnerActor: Submitting workflow
[2018-03-24 15:56:07,54] [info] WDL (Unspecified version) workflow 5382cf0c-16ae-44dc-a817-047c34161ae2 submitted
[2018-03-24 15:56:07,54] [info] SingleWorkflowRunnerActor: Workflow submitted 5382cf0c-16ae-44dc-a817-047c34161ae2
[2018-03-24 15:56:07,54] [info] 1 new workflows fetched
[2018-03-24 15:56:07,54] [info] WorkflowManagerActor Starting workflow 5382cf0c-16ae-44dc-a817-047c34161ae2
[2018-03-24 15:56:07,55] [info] WorkflowManagerActor Successfully started WorkflowActor-5382cf0c-16ae-44dc-a817-047c34161ae2
[2018-03-24 15:56:07,55] [info] Retrieved 1 workflows from the WorkflowStoreActor
[2018-03-24 15:56:08,57] [info] MaterializeWorkflowDescriptorActor [5382cf0c]: Call-to-Backend assignments: HaplotypeCallerGvcf_GATK4.HaplotypeCaller -> Local, HaplotypeCallerGvcf_GATK4.MergeGVCFs -> Local
[2018-03-24 15:56:08,67] [warn] Local [5382cf0c]: Key/s [memory, disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-03-24 15:56:08,67] [warn] Couldn't find a suitable DSN, defaulting to a Noop one.
[2018-03-24 15:56:08,68] [info] Using noop to send events.
[2018-03-24 15:56:08,69] [warn] Local [5382cf0c]: Key/s [memory, disks] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[2018-03-24 15:56:14,59] [info] Message [cromwell.docker.DockerHashActor$DockerHashSuccessResponse] from Actor[akka://cromwell-system/user/HealthMonitorDockerHashActor#1021524121] to Actor[akka://cromwell-system/deadLetters] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2018-03-24 15:56:47,76] [info] Message [cromwell.docker.DockerHashActor$DockerHashSuccessResponse] from Actor[akka://cromwell-system/user/HealthMonitorDockerHashActor#1021524121] to Actor[akka://cromwell-system/deadLetters] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2018-03-24 16:02:27,07] [info] Message [cromwell.docker.DockerHashActor$DockerHashSuccessResponse] from Actor[akka://cromwell-system/user/HealthMonitorDockerHashActor#1021524121] to Actor[akka://cromwell-system/deadLetters] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2018-03-24 16:03:02,06] [info] Message [cromwell.docker.DockerHashActor$DockerHashSuccessResponse] from Actor[akka://cromwell-system/user/HealthMonitorDockerHashActor#1021524121] to Actor[akka://cromwell-system/deadLetters] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
