Quantcast
Channel: haplotypecaller — GATK-Forum
Viewing all 1335 articles
Browse latest View live

DP is not reporting filtered read numbers

$
0
0

Hi Everyone,

I am using GATK version 3.3 and ran haplotype caller with -mmq 30 and -mbq 20. The output of one of the positions in the VCF is below.

chr13 58264978 . C . . . GT:AD:DP:GQ:PL 0/0:22,1:23:54:0,54,655

I used IGV to validate this position. IGV reports, like GATK does that there are 23 total reads covering that position. Although, there are also 4 reads that have a phred score of less than 20 at the position above. The DP statistic seems to be recording the number of unfiltered reads, but not the filtered ones. Do you know why this is?

Thank you.

Best,
Sam


Running the HaplotypeCaller

$
0
0

Hey there,

I cannot figure out why I did not get the HaplotypeCaller to work properly. Maybe someone can please help me.
I wanted to use only Chromsome20 for variant calling. I used a BAM and BAM.BAI file from 1000 genomes (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/alignment/). For the reference genome I used only Chromsome20 from this site (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/). I created a dict and index file with picard tools.

If I run the HaplotypeCaller like this:
java -jar GenomeAnalysisTK.jar -R chr20.fa -T HaplotypeCaller -I HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
I get the following error:
##### ERROR MESSAGE: Badly formed genome loc: Contig 1 given as location, but this contig isn't present in the Fasta sequence dictionary

If I run the same with the -L argument (-L 20) I get:
#### ERROR MESSAGE: Input files reads and reference have incompatible contigs: The following contigs included in the intervals to process have different indices in the sequence dictionaries for the reads vs. the reference: [20]. As a result, the GATK engine will not correctly process reads from these contigs. You should either fix the sequence dictionaries for your reads so that these contigs have the same indices as in the sequence dictionary for your reference, or exclude these contigs from your intervals. This error can be disabled via -U ALLOW_SEQ_DICT_INCOMPATIBILITY, however this is not recommended as the GATK engine will not behave correctly..
##### ERROR reads contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605, hs37d5]
##### ERROR reference contigs = [20]

Additionally, I adapted the fasta, fasta index and dictionary file from calling the chromsome "chr20" to "20" because previously it said:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: No overlapping contigs found.

I really don't know what I am doing wrong! Any help is appreciated!
Regards,
Kristina

Unexpected genotypes in GenotypeGVCF output

$
0
0

Hello GATK Team,

I have 21 bam files that I ran through HaplotypeCaller in GVCF mode followed by GenotypeGVCF, using Version=3.4-0-g7e26428. I found a few entries that I am having a difficult time understanding.

Here is one of the entries in question (quick note: each sample is from a pool of 3-5 animals, explaining the high ploidy):

chr1   88988835    rs396411987 C   G   226099.87   .   AC=61;AF=0.455;AN=134;BaseQRankSum=-1.176e+00;ClippingRankSum=0.054;DB;DP=29307;FS=0.000;MLEAC=61;MLEAF=0.455;MQ=59.98;MQRankSum=0.603;QD=11.32;ReadPosRankSum=1.30;SOR=0.696   GT:AD:DP:GQ:PL  0/0/0/0/0/1/1/1/1/1:396,449,0,0:845:28:9249,2087,965,413,120,0,28,219,649,1588,7879 0/1/1/1/1/1/1/1/1/1:275,2150,0,0:2425:99:52058,17674,11483,7903,5424,3571,2143,1051,287,0,3240  0/0/0/1/1/1/1/1:626,1134,0,0:1760:99:24237,5532,2600,1118,316,0,197,1292,11237  0/0/1/1/1/1/1/1:225,824,0,0:1049:99:18893,5121,2835,1577,772,257,0,116,3439 ./././././././././.:0,0 ./././././././././.:0,0 ./././././././.:0,0 0/0/0/0/0/0/0/1:824,105,0,0:929:99:1309,0,239,705,1369,2290,3644,6013,20091 0/0/0/0/0/1/1/1:786,499,0,0:1285:99:9169,1197,249,0,139,633,1609,3600,16598 0/0/1/1/1/1:557,1132,0,2:1691:99:24696,4545,1719,433,0,563,9784 0/0/0/0/0/0/0/1:922,140,0,0:1062:99:1825,0,202,685,1401,2410,3908,6544,22289    0/0/0/0/1/1/1/1/1/1:650,828,0,0:1478:26:16909,4072,1966,903,311,26,0,254,906,2400,12402 ./././././././.:0,0 0/0/0/0/0/0/0/0/1/1:846,179,0,0:1025:95:2546,95,0,178,520,1015,1689,2617,3986,6389,20368    0/0/0/0/1/1/1/1:875,986,0,0:1861:99:20237,3745,1410,380,0,136,885,2819,17626    0/0/0/0/0/0/0/0:570,0:570:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/1/1/1:798,604,0,0:1402:20:11395,1675,423,0,20,429,1344,3302,16391 0/0/0/0/1/1/1/1:644,730,0,0:1374:96:14884,2778,1049,285,0,96,643,2062,12779 0/0/0/0/0/1/1/1:613,417,0,0:1030:74:7703,1065,243,0,74,433,1174,2710,12915  0/0/0/1/1/1/1/1:239,513,0,0:752:13:11159,2667,1309,604,198,0,13,378,4191    ./././././././.:0,0

So, for the samples with genotypes present, AD is obviously quite high (in the 1000s, generally). It surprised me, then, that some samples don't have information (represented by ./././././././.). I went back to the original sample gvcf files and pulled the entries from a subset that were represented by ./././././././.:

group14.g.vcf:chr1 88988835    rs396411987 C   G,A,   16897.68    .   BaseQRankSum=-0.709;ClippingRankSum=0.843;DB;DP=1729;MLEAC=5,0,0;MLEAF=0.500,0.00,0.00;MQ=59.96;MQRankSum=0.621;ReadPosRankSum=1.987    GT:AD:DP:GQ:SB  0/0/0/0/0/1/1/1/1/1:876,844,2,0:1722:99:443,433,424,422
group15.g.vcf:chr1 88988835    rs396411987 C   G,T,   690.35  .   BaseQRankSum=-1.311;ClippingRankSum=-0.019;DB;DP=1168;MLEAC=1,0,0;MLEAF=0.125,0.00,0.00;MQ=59.98;MQRankSum=0.603;ReadPosRankSum=3.624   GT:AD:DP:GQ:SB  0/0/0/0/0/0/0/1:1082,78,4,0:1164:99:542,540,44,38

A consistent difference between samples that are included and those that are not is the presence of multiple alternative alleles in the excluded samples (G,A and G,T in the above example). Is this the source of my troubles? Is there a way of forcing GATK to include those samples in the VCF, especially given that the support for the third allele seems pretty weak (2 reads out of ~1722 in sample 14 above)? It seems to me that these samples should reasonably be included in the output of GenotypeGVCF.

Apologies if the answer is somewhere in the forums/guide - I did check, but didn't find anything.

Thanks,
Russ

GATK VQSR Errors GATK3.4/3.5 Valid VCF - but Allele not defined

$
0
0

We are having problems running GATK VQSR on recent VCF files generated using GATK HaplotypeCaller in the N+1 mode.

When running VariantRecalibrator we get the following error (with both 3.4 and 3.5), the error occurs.

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): 
…
##### ERROR MESSAGE: The allele with index 5 is not defined in the REF/ALT columns in the record
##### ERROR ------------------------------------------------------------------------------------------

JAVA=~/tools/jre1.8.0_66/bin/java
REF=~/refs/bosTau6.fasta
GATK=~/tools/GenomeAnalysisTK.jar

$JAVA -jar $GATK -T VariantRecalibrator -R $REF -input GATK-HC-3.4-DAMONA_10_Jan_2016.vcf.gz \
        -resource:ill770k,known=false,training=true,truth=true,prior=15.0 /home/aeonsim/refs/LIC-seqdHDs.AC1plus.DP10x.vcf.gz \
        -resource:VQSRMend,known=false,training=true,truth=false,prior=10.0  /home/projects/bos_taurus/damona/vcfs/recombination/GATK-HC-95.0tr-GQ40-10Dec-2015run.mend.Con.vcf.gz \
        -an DP -an QD -an FS -an SOR -an MQ -an MQRankSum -an ReadPosRankSum -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 97.5 -tranche 95.0 -tranche 92.5 -tranche 90.0 \
        -recalFile GATK-HC-DAMONA-Jan-2016.recal -tranchesFile GATK-HC-DAMONA-Jan-2016.tranches \
        -rscriptFile GATK-HC-DAMONA-Jan-2016.R

The VCF files were created using GenotypeGVCFs (v3.4-46-gbc02625) on GVCF files created with GATK HaplotypeCaller (N+1 mode, same version) which had been combined using GATK CombineGVCFs to merge ~750 whole bovine genome GVCFs to of 20-90 individuals. All stages occurred to finish successfully with out error until the VQSR stage. All processing of the GATK files was done using GATK tools (v3.4-46) and running the ValidateVariants tool on the files only gives warnings about “Reference allele is too long (116)” nothing about alleles not being defined in the Ref/Alt column.

java -jar GenomeAnalysisTK.jar -T ValidateVariants -R bosTau6.fasta  -V GATK.chr29.vcf.gz --dbsnp BosTau6_
dbSNP140_NCBI.vcf.gz -L chr29
ValidateVariants - Reference allele is too long (116) at position chr29:2513143; skipping that record. Set --referenceWindowStop >= 116
…
Successfully validated the input file.  Checked 531640 records with no failures.

Older versions of GATK worked successfully on part of this dataset last year, but now we've completed the dataset and rerun everything with the latest version of GATK (at the time 3.4-46) using only the GATK tools, GATK VQSR is unable to process the files that were created by it's own HaplotypeCaller (and from the same version), while validation tools supplied with GATK claim there is no problem and programs like bcftools are happily processing the files.

It appears to me that the problem may be around the introduction of the * in the Alt column for VCF4.2 (?) and a failure of VQSR to fully support this?

DP statistic matches bamfile before realignment and not the bamout file

$
0
0

Hi Everyone,

I am using GATK version 3.3 and I noticed that the numbers in the DP and AD fields actually match the original bam file before realignment and does not match the bamout file produced by haplotype caller. For example, at chromosome 1 position 6253228 the DP is 27 and the AD is 27,0. The number of reads (counted by IGV) in the original bam file at that position is also 27. However, in the "bamout" bam file the number of reads is 56. Below is the line from the VCF file.

chr1 6253228 . C . . . GT:AD:DP:GQ:PL 0/0:27,0:27:75:0,75,1125

Is the DP and AD fields supposed to match the original bamfile or the bamout bam file numbers? When I produce the "bamout" bamfile I use parameters -forceActive and -disableOptimizations.

Thank you.

Best,
Sam

Prevent HaplotypeCaller from writing the whole gvcf to standard out

$
0
0

Hey there,

I am using the haplotypecaller on our cluster and the stdout is written to a logfile.
Unfortunately HC is writing the whole gvcf to the standard out although I specified an output file.
This means really huge log files that I can not handle.
Is there a way to prevent this behaviour and only get errors/ warnings from the haplotypecaller?

All the best

IIEBR

...you did not provide enough memory to run this program,pooled,haplotypecaller

$
0
0

Hello,

I am using haplotypecaller version 3.4-46 with pooled data (ploidy 44) and I am running in to the error 'There was a failure because you did not provide enough memory to run this program'. I have a very high coverage (~10000x at each position). Searching in the forums, I have found suggestions of down sampling, (which haplotypecaller does itself in the version I am using) and changing the minPruning parameter (which I am afraid may further compromise the sensitivity of the variant calling given the samples are pooled). Previously I used unifiedgenotyper on the same data and I had no problems. Is there something else I could try to get rid of the memory issues or should I stick with unifiedgenotyper in my case? Thank you for your support.

Error in force calling variants using HaplotypeCaller

$
0
0

Hi,

I would like to force call a list of variants across my cohort using HaplotypeCaller to get more accurate QC metrics for each variant. I am using the following command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -et NO_ET -K my.key -I my.cohort.list --alleles my.vcf -L my.vcf -out_mode EMIT_ALL_SITES -gt_mode GENOTYPE_GIVEN_ALLELES -stand_call_conf 30.0 -stand_emit_conf 0.0 -dt NONE -o final_my.vcf

Here is a link to the input VCF file: VCF File

Unfortunately, I keep running into the following error (I've tried GATK ver3.3 and ver3.5):

INFO 18:49:21,288 ProgressMeter - chr1:11177077 21138.0 49.5 m 39.0 h 69.4% 71.3 m 21.8 m
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.3.0-mssm-0-gaa95802):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Index: 3, Size: 3
##### ERROR ------------------------------------------------------------------------------------------

Would appreciate your help in solving this issue.


Incorrect AD values in HC-called vcf and combined gvcf

$
0
0

Hi,

I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...) and while looking through suspicious variants I came across a few hetz with AD=X,0. Tracing them back I found two inconsistencies (bugs?);

1) Reordering of genotypes when combining gvcfs while the AD values are kept intact, which leads to an erronous AD for a heterozygous call. Also, I find it hard to understand why the 1bp insertion is emitted in the gvcf - there is no reads supporting it:

  • single sample gvcf
    1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9

  • combined gvcf
    1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]

  • vcf
    1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]

2) Incorrect AD is taken while genotyping gvcf files:

  • single sample gvcf:
    1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
  • combined gvcf
    1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]

  • vcf
    1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]

I have found multiple such cases here, and no errors nor warnings in the logs. I checked also with calls that I had done before on these samples, but in a smaller batch. There the AD values were correct, but there were plenty of other hetz with AD=X,0... I haven't looked closer into those.

Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I miss sth obvious?

Best regards,
Paweł

Mismatch in the read count/Allele count in bamout and g.vcf files and other bam files

$
0
0

Hello,
I am using GATK pipeline for variant calling in Targeted Exome samples.
While trying to view the variants in IGV at different levels of variant calling steps (aligning .bam files generated at BaseRecalibration and HaplotypeCaller steps against reference.fasta), i find that the total number of reads aligning at a particular variant location varies. ie.
after BaseRecalibration and before HaplotypeCalling total read count at a variant location (as per IGV- BR.bam alignment) is 274. It shows A count is 1 and C count is 273. But after HaplotypeCaller step, The total read count is 460 (as per IGV-bamout alignment). It also shows that A count is 195 and C count is 265. At the same time the variant record from the corresponding g.vcf file shows something like this:

17 48253171 . C A,<NON_REF> 1802.77 . BaseQRankSum=4.848;ClippingRankSum=-1.291;DP=388;MLEAC=1,0;MLEAF=0.500,0.00;MQ=36.72;MQ0=0;MQRankSum=-14.293;ReadPosRankSum=-14.235 GT:AD:DP:GQ:PL:SB 0/1:217,104,0:321:99:1831,0,6990,2478,7307,9785:217,0,104,0

I understand that HC reassembles the active regions. But then also I see a drastic different in the read/Allele counts.
My questions are:
1. Why the total read count(even Alternate Allele count) is increased suddenly in HC step compared to the previous step(s)?
2. Why the Allele counts and Read depth in the g.vcf file and that in the corresponding bamout files (IGV view) are different?
3. In the above situation, which allele count should be considered bamout(ie IGV) or g.vcf counts?

Thanking you in advance,
Aswathy

Several Annotations not working in GATK Haplotype Caller

$
0
0

I am using Genotype Given Allele with Haplotype Caller
I am trying to explicitely request all annotations that the documentation says are compatible with the Haplotype caller (and that make sense for a single sample .. e.g. no hardy weinberg ..)

the following annotations all have "NA"
GCContent(GC) HomopolymerRun(Hrun) TandemRepeatAnnotator (STR RU RPA)
.. but are valid requests because I get no errors from GATK.

This is the command I ran (all on one line)

java -Xmx40g -jar /data5/bsi/bictools/alignment/gatk/3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff

Log file is below( Notice "weird" WARNings about) "StrandBiasBySample annotation exists in input VCF header"..
which make no sense because the header is empty other than the barebone fields.

This is the barebone VCF
head /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf

fileformat=VCFv4.2

CHROM POS ID REF ALT QUAL FILTER INFO

chr1 723918 rs144434834 G A 30 PASS .
chr1 729632 rs116720794 C T 30 PASS .
chr1 752566 rs3094315 G A 30 PASS .
chr1 752721 rs3131972 A G 30 PASS .
chr1 754063 rs12184312 G T 30 PASS .
chr1 757691 rs74045212 T C 30 PASS .
chr1 759036 rs114525117 G A 30 PASS .
chr1 761764 rs144708130 G A 30 PASS .

This is the output

INFO 10:03:06,080 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,082 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-46-gbc02625, Compiled 2015/07/09 17:38:12
INFO 10:03:06,083 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 10:03:06,083 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 10:03:06,086 HelpFormatter - Program Args: -T HaplotypeCaller --input_file /data2/external_data/Weinshilboum_Richard_weinsh/s115343.beauty/Paired_analysis/secondary/Paired_10192014/IGV_BAM/pair_EX167687/s_EX167687_DNA_Blood.igv-sorted.bam --alleles:vcf /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/OMNI.vcf --phone_home NO_ET --gatk_key /projects/bsi/bictools/apps/alignment/GenomeAnalysisTK/3.1-1/Hossain.Asif_mayo.edu.key --reference_sequence /data2/bsi/reference/sequence/human/ncbi/hg19/allchr.fa --minReadsPerAlignmentStart 1 --disableOptimizations --dontTrimActiveRegions --forceActive --out /data2/external_data/Kocher_Jean-Pierre_m026645/s109575.ez/Sequencing_2016/EX167687.0.0375.chr22.vcf --logging_level INFO -L chr22 --downsample_to_fraction 0.0375 --downsampling_type BY_SAMPLE --genotyping_mode GENOTYPE_GIVEN_ALLELES --standard_min_confidence_threshold_for_calling 20.0 --standard_min_confidence_threshold_for_emitting 0.0 --annotateNDA --annotation BaseQualityRankSumTest --annotation ClippingRankSumTest --annotation Coverage --annotation FisherStrand --annotation GCContent --annotation HomopolymerRun --annotation LikelihoodRankSumTest --annotation MappingQualityRankSumTest --annotation NBaseCount --annotation QualByDepth --annotation RMSMappingQuality --annotation ReadPosRankSumTest --annotation StrandOddsRatio --annotation TandemRepeatAnnotator --annotation DepthPerAlleleBySample --annotation DepthPerSampleHC --annotation StrandAlleleCountsBySample --annotation StrandBiasBySample --excludeAnnotation HaplotypeScore --excludeAnnotation InbreedingCoeff
INFO 10:03:06,093 HelpFormatter - Executing as m037385@franklin04-213 on Linux 2.6.32-573.8.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26.
INFO 10:03:06,094 HelpFormatter - Date/Time: 2016/01/19 10:03:06
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,094 HelpFormatter - ---------------------------------------------------------------------------------
INFO 10:03:06,545 GenomeAnalysisEngine - Strictness is SILENT
INFO 10:03:06,657 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Fraction: 0.04
INFO 10:03:06,666 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 10:03:07,012 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.35
INFO 10:03:07,031 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 10:03:07,170 IntervalUtils - Processing 51304566 bp from intervals
INFO 10:03:07,256 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 10:03:07,595 GenomeAnalysisEngine - Done preparing for traversal
INFO 10:03:07,595 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 10:03:07,595 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 10:03:07,596 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 10:03:07,596 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN 10:03:07,709 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO 10:03:07,719 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 10:03:37,599 ProgressMeter - chr22:5344011 0.0 30.0 s 49.6 w 10.4% 4.8 m 4.3 m
INFO 10:04:07,600 ProgressMeter - chr22:11875880 0.0 60.0 s 99.2 w 23.1% 4.3 m 3.3 m
Using AVX accelerated implementation of PairHMM
INFO 10:04:29,924 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 10:04:29,925 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,938 AnnotationUtils - Annotation will not be calculated, genotype is not called
WARN 10:04:29,939 AnnotationUtils - Annotation will not be calculated, genotype is not called
INFO 10:04:37,601 ProgressMeter - chr22:17412465 0.0 90.0 s 148.8 w 33.9% 4.4 m 2.9 m
INFO 10:05:07,602 ProgressMeter - chr22:18643131 0.0 120.0 s 198.4 w 36.3% 5.5 m 3.5 m
INFO 10:05:37,603 ProgressMeter - chr22:20133744 0.0 2.5 m 248.0 w 39.2% 6.4 m 3.9 m
INFO 10:06:07,604 ProgressMeter - chr22:22062452 0.0 3.0 m 297.6 w 43.0% 7.0 m 4.0 m
INFO 10:06:37,605 ProgressMeter - chr22:23818297 0.0 3.5 m 347.2 w 46.4% 7.5 m 4.0 m
INFO 10:07:07,606 ProgressMeter - chr22:25491290 0.0 4.0 m 396.8 w 49.7% 8.1 m 4.1 m
INFO 10:07:37,607 ProgressMeter - chr22:27044271 0.0 4.5 m 446.4 w 52.7% 8.5 m 4.0 m
INFO 10:08:07,608 ProgressMeter - chr22:28494980 0.0 5.0 m 496.1 w 55.5% 9.0 m 4.0 m
INFO 10:08:47,609 ProgressMeter - chr22:30866786 0.0 5.7 m 562.2 w 60.2% 9.4 m 3.8 m
INFO 10:09:27,610 ProgressMeter - chr22:32908950 0.0 6.3 m 628.3 w 64.1% 9.9 m 3.5 m
INFO 10:09:57,610 ProgressMeter - chr22:34451306 0.0 6.8 m 677.9 w 67.2% 10.2 m 3.3 m
INFO 10:10:27,611 ProgressMeter - chr22:36013343 0.0 7.3 m 727.5 w 70.2% 10.4 m 3.1 m
INFO 10:10:57,613 ProgressMeter - chr22:37387478 0.0 7.8 m 777.1 w 72.9% 10.7 m 2.9 m
INFO 10:11:27,614 ProgressMeter - chr22:38534891 0.0 8.3 m 826.8 w 75.1% 11.1 m 2.8 m
INFO 10:11:57,615 ProgressMeter - chr22:39910054 0.0 8.8 m 876.4 w 77.8% 11.4 m 2.5 m
INFO 10:12:27,616 ProgressMeter - chr22:41738463 0.0 9.3 m 926.0 w 81.4% 11.5 m 2.1 m
INFO 10:12:57,617 ProgressMeter - chr22:43113306 0.0 9.8 m 975.6 w 84.0% 11.7 m 112.0 s
INFO 10:13:27,618 ProgressMeter - chr22:44456937 0.0 10.3 m 1025.2 w 86.7% 11.9 m 95.0 s
INFO 10:13:57,619 ProgressMeter - chr22:45448656 0.0 10.8 m 1074.8 w 88.6% 12.2 m 83.0 s
INFO 10:14:27,620 ProgressMeter - chr22:46689073 0.0 11.3 m 1124.4 w 91.0% 12.5 m 67.0 s
INFO 10:14:57,621 ProgressMeter - chr22:48062438 0.0 11.8 m 1174.0 w 93.7% 12.6 m 47.0 s
INFO 10:15:27,622 ProgressMeter - chr22:49363910 0.0 12.3 m 1223.6 w 96.2% 12.8 m 29.0 s
INFO 10:15:57,623 ProgressMeter - chr22:50688233 0.0 12.8 m 1273.2 w 98.8% 13.0 m 9.0 s
INFO 10:16:12,379 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.061128124000000006
INFO 10:16:12,379 PairHMM - Total compute time in PairHMM computeLikelihoods() : 22.846350295
INFO 10:16:12,380 HaplotypeCaller - Ran local assembly on 25679 active regions
INFO 10:16:12,434 ProgressMeter - done 5.1304566E7 13.1 m 15.0 s 100.0% 13.1 m 0.0 s
INFO 10:16:12,435 ProgressMeter - Total runtime 784.84 secs, 13.08 min, 0.22 hours
INFO 10:16:12,435 MicroScheduler - 727347 reads were filtered out during the traversal out of approximately 4410423 total reads (16.49%)
INFO 10:16:12,435 MicroScheduler - -> 2 reads (0.00% of total) failing BadCigarFilter
INFO 10:16:12,436 MicroScheduler - -> 669763 reads (15.19% of total) failing DuplicateReadFilter
INFO 10:16:12,436 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 10:16:12,436 MicroScheduler - -> 57582 reads (1.31% of total) failing HCMappingQualityFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 10:16:12,437 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 10:16:12,438 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

Incorrect missing genotypes using HaplotypeCaller

$
0
0

Dear GATK community,

I am using HaplotypeCaller for variant discovery and I have found some strange results in my VCF.
It appears that this walker is making no calls in positions with reads of support for them in the original BAM.

For instance this position:

chr7 302528 rs28436118 A G 31807.9 PASS . GT:AD:DP:GQ:PL 1/1:0,256:256:99:8228,767,0 ./.:98,0:98:.:. ./.:81,0:81:.:. 1/1:0,134:134:99:4287,401,0

was not called for 2 of the 4 samples available. However, in both samples where the genotype is missing there are many reads supporting an homozygous reference call (0/0).

Do you have any idea of why this is happening?
Cheers
JMFA> ********

HaplotypeCaller Scores that are multiple of 3 indicate much better concordance than those that are n

$
0
0

I was comparing the concordance of genotype calls as a function of Genotype Quality (GQ) for the GATK 3.5 Haplotyper.
(using genotype given allele and emiting all sites, even those with score of 0).

My "truth" are genotypes from Illumina OMNI 2.5M arrays .. and I am comparing the genotype calls from Illumina exome arrays.

I notice that the Genotype Quality (GQ) scores of the HaplotypeCaller are focussed at intervals of "3".
e.g. many many more scores at (orders of magnitude)
GQ=0,3,6,9,12,15, ...
than at
GQ=1,2, 4,5, 7,8, 10,11 , 13,14

This unequal distribution in itself is surprising, but not an issue.

I am noticing that the rate of concordance is NOT monotonically proportional to GQ.
Those genotypes with non-multiple of 3 scores have much worst concordance.

If I limit to multiple of 3, the GQ are monotonic.
Concordance(GQ=12)>Concordance(GQ=9)>Concordance(GQ=6)

but systematically, the non-multiple of 3 have worst concordance by orders of magnitude.
e.g. Concordance(10 or 11) <Concordance(3)

Why is that?
So .. trying to look at the source code, for PairHMM (where the likelyhoods are computed),
I am identifying this value "TRISTATE" correction (value 3.0) that is differentially applied to reads with "N".
..
but I don't see 'N" in alignments for those non-multiple of 3 alignment.

I looked at the "raw" bam file (not the output of haplotyper) for
number of examples for scores from 5 to 22 .. coverage 2..14

None has pathologic features.
- no N in alignment
- only example one overlapped soft-clipped bases
- usually only 1 read supporting the alternate allele (or one read supporting the reference)
- usually all the reads have 100M cigars.
- Variant is not next to the end of a soft-clipped reads.

Is that a bug in the HMMPair scoring?

p.s. The BAM has 32-offset qualities.

Accelerate HaplotypeCaller step

$
0
0

Hello everyone,

I am using GATK in a clinical context for NGS diagnosis. The issue is that the HaplotypeCaller take some time, too much time actually (2h per patient).
I tried this things :
- reduce the bam file size by keeping only the genomic regions of my diagnosis genes but it looks like it still run all the hg19 genome.
- ask "only variants" with the output_mode option but the output file is exactly the same than the default one.
- use several CPU thread, but 1 CPU = 147 min, 2 CPU = 89 min, 3 CPU = 80 min. And I don't have this much CPU available so it is not interesting above 2 CPU , and still not fast enough.

I can't use the data thread option right now, would it allow me to gain more time than the CPU option ?
There is the interval option but I don't think it would allow me to gain enough time since I have gene of interest on almost all chromosomes.

I would appreciate to have your guidance regarding this problem. How would you do to make this HaplotypeCaller step faster ?

Many thanks in advance.

Christ

Is new MQ jittering functionality recommended?

$
0
0

The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this:
1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero?
2) Is the use of MQCapForLogitJitterTransform recommended?
3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.


Should I use UnifiedGenotyper or HaplotypeCaller to call variants on my data?

$
0
0

Use HaplotypeCaller!

The HaplotypeCaller is a more recent and sophisticated tool than the UnifiedGenotyper. Its ability to call SNPs is equivalent to that of the UnifiedGenotyper, its ability to call indels is far superior, and it is now capable of calling non-diploid samples. It also comprises several unique functionalities such as the reference confidence model (which enables efficient and incremental variant discovery on ridiculously large cohorts) and special settings for RNAseq data.

As of GATK version 3.3, we recommend using HaplotypeCaller in all cases, with no exceptions.

Caveats for older versions

If you are limited to older versions for project continuity, you may opt to use UnifiedGenotyper in the following cases:

  • If you are working with non-diploid organisms (UG can handle different levels of ploidy while older versions of HC cannot)
  • If you are working with pooled samples (also due to the HC’s limitation regarding ploidy)
  • If you want to analyze more than 100 samples at a time (for performance reasons) (versions 2.x)

PCR indel model for HaplotypeCaller

$
0
0

Dear GATK team,

I'm a bit confused by --pcr_indel_model argument in HaplotypeCaller. As a can see from the docs, this argument is not required, but in its description I still read the following: "VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE". Does some PCR-bias-oriented filtration is performed by default (so, for PCR-free datasets I should set it to NONE), or actually I don't need to set this argument to NONE (even processing PCR-free datasets) if I simply don't use it?

Best regards,
Svyatoslav

Can HaplotypeCaller determine ploidy of an unknown sample?

$
0
0

I have NGS data from an organism that is sometimes found to be aneuploid (but usually diploid.) I am trying to determine the ploidy of the individual I have sequenced from my population of interest. I have ~100X coverage, but the best reference is so distantly-related that the normal SNP coverage plots that are used to estimate ploidy are too messy to interpret. I was hoping that I could use HaplotypeCaller here to help determine the number of haplotypes per chromosome (where 2 haplotypes/alleles per position indicates diploidy). I can't figure out how to interpret the resulting VCF file well enough to address this question.

If HaplotypeCaller will not be sufficient, are there any other frequently used options to determine ploidy (or number of alleles/haplotypes per position)?

Thanks for any assistance.

Alex

HaplotypeCaller - Code exception

$
0
0

Dear GATK Team,
I'm trying to run HaplotypeCaller, on a Bowtie2 bam file, but I'm having a Code exception.
Any idea how to fix it?
Best regards,
Miguel Machado

Command:
/scratch/mpmachado/NGStools/java_oracle/jre1.8.0_45/bin/java -jar /scratch/mpmachado/NGStools/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller -R /scratch/mpmachado/trabalho1diversidade/genomaReferencia/dadosGenoma/NC_004368.fna -I /scratch/mpmachado/trabalho1diversidade/genomaReferencia/outputs/bowtie/GBS_01.sorted_position.bam --out output.raw.snps.indels.vcf --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES --sample_ploidy 1

INFO 16:21:59,507 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:21:59,511 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41
INFO 16:21:59,511 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 16:21:59,511 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 16:21:59,516 HelpFormatter - Program Args: -T HaplotypeCaller -R /scratch/mpmachado/trabalho1diversidade/genomaReferencia/dadosGenoma/NC_004368.fna -I /scratch/mpmachado/trabalho1diversidade/genomaReferencia/outputs/bowtie/GBS_01.sorted_position.bam --out output.raw.snps.indels.vcf --genotyping_mode DISCOVERY --output_mode EMIT_ALL_SITES --sample_ploidy 1
INFO 16:21:59,521 HelpFormatter - Executing as mpmachado@dawkins on Linux 2.6.32-5-amd64 i386; Java HotSpot(TM) Server VM 1.8.0_45-b14.
INFO 16:21:59,521 HelpFormatter - Date/Time: 2015/07/07 16:21:59
INFO 16:21:59,521 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:21:59,521 HelpFormatter - --------------------------------------------------------------------------------
INFO 16:22:00,257 GenomeAnalysisEngine - Strictness is SILENT
INFO 16:22:00,360 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 16:22:00,370 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 16:22:00,401 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.03
INFO 16:22:00,410 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 16:22:01,869 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.NullPointerException
at java.util.TreeMap.compare(TreeMap.java:1290)
at java.util.TreeMap.put(TreeMap.java:538)
at java.util.TreeSet.add(TreeSet.java:255)
at org.broadinstitute.gatk.utils.sam.ReadUtils.getSAMFileSamples(ReadUtils.java:70)
at org.broadinstitute.gatk.engine.samples.SampleDBBuilder.addSamplesFromSAMHeader(SampleDBBuilder.java:66)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.initializeSampleDB(GenomeAnalysisEngine.java:846)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:296)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.4-0-g7e26428):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

RNA-seq variant calling and merging of sample replicates

$
0
0

Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel alignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).

I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?

If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ?

Viewing all 1335 articles
Browse latest View live