Why a true variant is not getting called by Haplotypecaller.

March 15, 2016, 6:20 am

I am using HaplotypeCaller to call variants for a targeted panel of 120 genes. In the gene PMS2 there is a variant at a position with coverage 21, where the alternate allele is supported by 5 reads (24%). The mapping quality of the reads is mostly 0, I guess because of the very similar pseudogene PMS2CL (mapping was done against hg19 using BWA-MEM). This variant is not called by HaplotypeCaller, but it is a true variant (confirmed by Sanger sequencing).
I compared the BAM file with the bamout file; the two are similar.
I also tried mapping the reads to a custom reference sequence restricted to the target regions. When I used that BAM file for variant calling, the variant was called (though this also increased the coverage depth and increased the number of variants many fold, which are false positives).

I wonder what the possible explanation for this is. What cutoff criteria is HaplotypeCaller using in this case? Why is the variant not called in the first place?
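One way to see how many reads at the site would survive a mapping-quality cutoff is to tally the MAPQ column of the SAM records covering it. The sketch below is illustrative only: the helper name `mapq_summary`, the cutoff of 20, and the toy records are assumptions, not values taken from this run.

```python
def mapq_summary(sam_lines, min_mapq=20):
    """Count SAM records at/above and below a MAPQ cutoff.

    min_mapq=20 is an assumed cutoff used for illustration only.
    """
    kept = dropped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        if mapq >= min_mapq:
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Toy records (hypothetical): two MAPQ-0 reads and one MAPQ-60 read.
records = [
    "r1\t0\tchr7\t100\t0\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchr7\t101\t0\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tchr7\t102\t60\t50M\t*\t0\t0\t*\t*",
]
print(mapq_summary(records))
```

The records could come from `samtools view` restricted to the variant position. If most reads fall below the cutoff, the depth the caller effectively works with may be much lower than the 21 seen in a viewer.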

Here is a link to a screenshot of the PMS2 variant with coverage 21 (also attached as a file):
https://drive.google.com/file/d/0Bwibh75M75p_bGJrNlpyRTVSNHVZRDMzUFB0UDFOV2gyM2Rj/view?usp=sharing

↧

Calling whole-genome haplotypes for Chloroplast-captured Pooled Samples

March 15, 2016, 5:20 pm

I'm trying to call whole-chloroplast genome haplotypes for a pooled chloroplast-captured DNA sample from a non-model organism (no well-established variants). The reads are Illumina 100 bp PE reads that have already undergone some clipping (adapter trimming and quality control) and have been aligned to a reference genome. The pool represents 20 individuals. I want to know whether there is a way in GATK to estimate the frequency of whole-genome haplotypes (or, failing that, whether another existing tool can do it). If necessary, I can generate a panel of known haplotypes.

Currently, I am using HaplotypeCaller to call SNPs and then filtering those by hand in Excel. I have already tried increasing the maximum active region size to larger than the whole reference genome (~150,000 bp), with a corresponding increase in the max reads per sample value, but this has not produced whole-genome haplotypes.

↧

HaplotypeCaller and reads mapped to multiple locations

March 16, 2016, 8:36 am

Dear GATK team,

I've been trying to use GATK to call SNPs from RNA-Seq data mapped to a transcriptome assembly. I used Bowtie2 to map the paired-end reads. I apologize if this has already been answered, but I could not find the information, so I hope to get some advice or be pointed to the right place: how does HaplotypeCaller handle reads mapped to multiple places?

Thank you very much for any feedback you might have.

Sincerely,

Xin

↧

missing SNPs after gvcf combine and slow combination step

March 17, 2016, 7:42 am

Hi, I'm using GATK version 3.4 for SNP calling and I have some questions about it. My data set has 500 samples, and I used the genome sequence as the reference for bowtie/GATK.

1) I called SNPs per sample (gVCF) with HaplotypeCaller and then combined the gVCFs. However, the combination takes a long time: GATK wants to recreate the gvcf.idx files (4 of my GATK jobs are stuck at this step), and the one combination that finished took about 20 days. I also tried using '-nct' to improve this, but it still gets stuck preparing the idx files.

2) For the combination that did finish, I also ran UnifiedGenotyper with Gr.sorted.bam as input to call SNPs. The UnifiedGenotyper output has 5 times as many SNPs as the combined-gVCF result, and most of the missing SNPs can be found in the individual gVCF files but are absent from the final result.

Could you help me with these? Thank you!

↧

I get very different MQ values when using GVCF vs BP_RESOLUTION

March 19, 2016, 10:20 am

Hello! I had a question about the difference between using HaplotypeCaller's --emitRefConfidence GVCF vs BP_RESOLUTION. Maybe the answer is obvious or in the forum somewhere already but I couldn't spot it...

First, some context: I'm working with GATK v. 3.5.0 on a haploid organism. I have 34 samples, 5 of which are very similar to the reference (they are backcrosses), while the rest are strains from a wild population. Originally I used --emitRefConfidence GVCF followed by GenotypeGVCFs. While checking the output VCF file, I realized that the five backcrosses had a much lower DP on average than the other samples (this cannot be explained by a difference in read numbers or anything like that, since they were run in the same lane, etc.). I assume this happened because those samples have long tracts without any variant compared to the reference, and the GVCF blocks end up assigning a lower depth to a large number of sites in those samples compared to the much more polymorphic ones. In any case, I figured I could just get all sites using BP_RESOLUTION so as to obtain the "true" DP values per site. However, when I tried that, the resulting VCF file had very low MQ values! Can you explain why this happened?

This is the original file with --emitRefConfidence GVCF:

$ bcftools view -H 34snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   309.4   .   AC=4;AF=0.235;AN=17;DP=582;FS=0;MLEAC=4;MLEAF=0.235;MQ=40;QD=34.24;SOR=2.303
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=603;FS=0;MLEAC=2;MLEAF=0.065;MQ=44.44;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-0.762;ClippingRankSum=0.762;DP=660;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=29.53;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

And this is with --emitRefConfidence BP_RESOLUTION:

$ bcftools view -H 34allgenome_snps.vcf | head -n3 | cut -f1-8
chromosome_1    57  .   A   G   307.28  .   AC=4;AF=0.211;AN=19;DP=602;FS=0;MLEAC=4;MLEAF=0.211;MQ=8.23;QD=34.24;SOR=2.204
chromosome_1    81  .   G   A   84.49   .   AC=2;AF=0.065;AN=31;DP=750;FS=0;MLEAC=2;MLEAF=0.065;MQ=5.53;QD=30.63;SOR=2.833
chromosome_1    88  .   T   C   190.75  .   AC=1;AF=0.091;AN=11;BaseQRankSum=-1.179;ClippingRankSum=0.762;DP=796;FS=7.782;MLEAC=1;MLEAF=0.091;MQ=4.8;MQRankSum=-1.179;QD=21.19;ReadPosRankSum=-1.666;SOR=1.414

I find it particularly strange since the mapping quality of the backcrosses should in fact be slightly better on average (around 59 in the original BAM file) than that of the other, more polymorphic samples (around 58)...
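To line the two runs up site by site, the INFO column can be split into key/value pairs and compared. A minimal sketch using the first record from each listing above; `parse_info` is a hypothetical helper, and it splits on whitespace only because the records are shown here with spaces instead of tabs.

```python
def parse_info(vcf_line):
    """Return the INFO column of a VCF record as a dict of strings."""
    info = vcf_line.split()[7]  # INFO is the 8th column
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value  # flag-style entries get an empty string
    return out


gvcf_line = ("chromosome_1 57 . A G 309.4 . "
             "AC=4;AF=0.235;AN=17;DP=582;FS=0;MLEAC=4;MLEAF=0.235;MQ=40;QD=34.24;SOR=2.303")
bp_line = ("chromosome_1 57 . A G 307.28 . "
           "AC=4;AF=0.211;AN=19;DP=602;FS=0;MLEAC=4;MLEAF=0.211;MQ=8.23;QD=34.24;SOR=2.204")

# Print the annotations that differ most visibly between the two runs.
for key in ("AN", "DP", "MQ"):
    print(key, parse_info(gvcf_line)[key], parse_info(bp_line)[key])
```

At this site AN and DP rise in the BP_RESOLUTION run while MQ drops from 40 to 8.23, so the comparison at least shows that more samples contribute to the record when MQ falls.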

Thank you very much!

↧

In term of algorithms, what is the difference between haplotypecaller and unifiedgenotyper

March 20, 2016, 12:55 pm

Under every GATK forum post asking which is better, HC or UG, Geraldine has always said HC.

So is there any documentation describing in detail the algorithms that HC and UG use, so that I can get a clear idea of why HC is better?

Thanks.

↧

Monitor Progress of Haplotype Caller

March 20, 2016, 2:47 pm

Hello, we have an automated GATK pipeline all set up. We had a run that looked finished, but then noticed that some SNP/indel calls were missing. We were able to see that HaplotypeCaller had died unexpectedly. Its progress meter said it had finished only a little more than 16%. The pipeline continued to process this incomplete VCF file.

Am I correct in assuming that the only way we could ever detect this error automatically is to monitor HaplotypeCaller's progress meter, because the VCF is written as the run proceeds? Is there any way to write the VCF all at once at the end? This doesn't seem like a great idea anyway because of resource issues, but I am just curious. We cannot depend solely on the existence of an output VCF file unless this is the last step in the process. Is there any phrase we can monitor for at the end of the HaplotypeCaller output that we can be sure won't appear elsewhere in the file? Like "Done", "Finished", etc.? Thank you!
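An alternative to scanning the log for a phrase is to gate the next step on the process exit status and only give the output its final name after a clean exit. The sketch below assumes that a crashed run returns a non-zero exit status, which should be verified for the failure seen here; `run_then_publish` and the stand-in commands are hypothetical and do not call GATK.

```python
import os
import subprocess
import sys
import tempfile


def run_then_publish(cmd, tmp_path, final_path):
    """Run cmd; rename tmp_path to final_path only if the exit code is 0."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        return False  # leave the partial file under its temporary name
    os.replace(tmp_path, final_path)
    return True


# Stand-in commands (hypothetical): both write the temporary file,
# but the second one exits with status 1 to mimic a crash.
workdir = tempfile.mkdtemp()
tmp_vcf = os.path.join(workdir, "calls.vcf.partial")
final_vcf = os.path.join(workdir, "calls.vcf")
ok_cmd = [sys.executable, "-c",
          "import sys; open(sys.argv[1], 'w').write('ok')", tmp_vcf]
bad_cmd = [sys.executable, "-c",
           "import sys; open(sys.argv[1], 'w').write('part'); sys.exit(1)", tmp_vcf]

print(run_then_publish(bad_cmd, tmp_vcf, final_vcf), os.path.exists(final_vcf))
print(run_then_publish(ok_cmd, tmp_vcf, final_vcf), os.path.exists(final_vcf))
```

With this pattern, downstream steps look only for the final file name, so a truncated VCF from a dead process is never picked up.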

↧

HaplotypeCaller on whole genome or chromosome by chromosome: different results

January 9, 2015, 6:07 am

Hi,

I'm working on targeted resequencing data and I'm doing a multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions by doing the calling at one time on a cluster. I thus specified all the targeted regions with the -L option.

Then, as it was taking too long, I decided to split my interval list chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCF files obtained from the per-chromosome calls.

Then, I compared this merged VCF file with the VCF file that I obtained by calling all the targeted regions at once. I noticed about 1% difference between the two variant lists, and I can't explain this stochasticity. Any suggestions?
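To pin down which records make up that 1%, the two call sets can be reduced to site keys and compared as sets. A minimal sketch, assuming tab-delimited VCF text; `site_keys` and the toy records are hypothetical.

```python
def site_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) for every non-header VCF record."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


# Toy call sets (hypothetical): one shared site, one private site each.
whole = [
    "##fileformat=VCFv4.1",
    "1\t100\t.\tA\tG\t50\t.\t.",
    "2\t200\t.\tC\tT\t60\t.\t.",
]
merged = [
    "1\t100\t.\tA\tG\t50\t.\t.",
    "2\t205\t.\tG\tA\t40\t.\t.",
]
only_whole = sorted(site_keys(whole) - site_keys(merged))
only_merged = sorted(site_keys(merged) - site_keys(whole))
print(only_whole, only_merged)
```

Looking at where the private sites fall (for example near interval boundaries, or at low-quality calls) may narrow down the cause.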

Thanks!

Maguelonne

↧

EMIT_ALL_SITES in HaplotypeCaller

August 16, 2013, 2:12 pm

If I run HaplotypeCaller with a VCF file as the intervals file, -stand_emit_conf 0, and -out_mode EMIT_ALL_SITES, should I get back an output VCF with all the sites from the input VCF, whether or not there was a variant call there? If not, is there a way to force output even if the calls are 0/0 or ./. for everyone in the cohort?

I have been trying to run HC with the above options, but I can't understand why some variants are included in my output file and others aren't. Some positions are output with no alternate allele and GTs of 0 for everyone. However, other positions that I know have coverage are not output at all.

Thanks,

Elise

↧

Can I apply the germline variant joint calling workflow to my RNAseq data?

April 1, 2016, 12:25 pm

We have not yet validated the joint genotyping methods (HaplotypeCaller in -ERC GVCF mode per-sample then GenotypeGVCFs per-cohort) on RNAseq data. Our standard recommendation is to process RNAseq samples individually as laid out in the RNAseq-specific documentation.

However, we know that a lot of people have been trying out the joint genotyping workflow on RNAseq data, and there do not seem to be any major technical problems. You are welcome to try it on your own data, with the caveat that we cannot guarantee correctness of results, and may not be able to help you if something goes wrong. Please be sure to examine your results carefully and critically.

If you do pursue this, you will need to pre-process your samples according to our RNA-specific documentation, then switch to the GVCF workflow at the HaplotypeCaller stage. For filtering, it will be up to you to determine whether hard filtering or VQSR produces the best results. We have not tested any of this, so we cannot provide a recommendation. Be prepared to do a lot of analysis to validate the quality of your results.

Good luck!

↧

Disable AVX for pairHMM

April 6, 2016, 11:29 am

Hi,

I want to switch back to the Java LOGLESS_CACHING implementation of PairHMM instead of AVX. How can I do this?
I think I may need to change some argument, but I don't know where to start.

Thanks,
Jay

↧

Counting number of reads affected by N's in CIGAR

April 6, 2016, 2:06 pm

Hi,
I work on plant species. I am using GATK for variant discovery in RNAseq data.
I am not able to decide whether I should use the option --filter_reads_with_N_cigar or
-U ALLOW_N_CIGAR_READS.
What do you suggest?

I would like to count the number of reads affected by N's in the CIGAR. Could you please suggest a tool?

Secondly, I am using HaplotypeCaller for variant discovery. It is running very slowly (on 12 CPUs).
Is it okay to use UnifiedGenotyper instead of HaplotypeCaller on RNAseq data?
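For the counting question above, the CIGAR string is the sixth column of a SAM record, so a tally needs very little code. A minimal sketch, assuming SAM text on input (for example from `samtools view`); `count_n_cigar_reads` and the toy records are hypothetical, and the count is per alignment record, not per unique read.

```python
def count_n_cigar_reads(sam_lines):
    """Return (records with an N operator in the CIGAR, total records)."""
    total = with_n = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        cigar = line.split("\t")[5]  # column 6 of a SAM record is CIGAR
        total += 1
        if "N" in cigar:
            with_n += 1
    return with_n, total


# Toy records (hypothetical): one spliced alignment, one unspliced.
records = [
    "@SQ\tSN:chr1\tLN:100000",
    "r1\t0\tchr1\t100\t60\t30M500N20M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t900\t60\t50M\t*\t0\t0\t*\t*",
]
print(count_n_cigar_reads(records))
```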

Thanks

↧

Memory heap size problem with Haplotype caller

May 29, 2015, 9:00 am

Hello,
I am trying to run HaplotypeCaller on my processed BAM file, but it keeps running out of memory.
I have tried increasing the heap size to 90G, but it still crashes. This might have to do with the type of analysis I am doing...
The samples are pools of pigs (6 individuals, so a ploidy of 12 for this particular sample) that have been sequenced on targeted regions. I use the BED file that came with the kit to narrow down the calling to the targeted regions. I have also reduced the number of alternative alleles from 6 to 3, but it still crashes after a while. Are there any other parameters I should try to modify to reduce the memory usage?
I have attached the log file if you want to have a look at all the parameters.
Cheers,

Julien

↧

Bug in HaplotypeCaller?

April 6, 2016, 6:28 pm

I was just running haplotype caller on a pool, and it gave an error message which (if I understand the options I gave it correctly) should not occur. My command was:

nice -n 5 java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -o Vars.vcf --forceActive -ploidy 18 -I input.bam --max_alternate_alleles 2 --genotyping_mode DISCOVERY

and the error stack trace I got was:

ERROR ------------------------------------------------------------------------------------------

ERROR stack trace

java.lang.IllegalArgumentException: the combination of ploidy (18) and number of alleles (17) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypeLikelihoodCalculator.<init>(GenotypeLikelihoodCalculator.java:214)
at org.broadinstitute.gatk.tools.walkers.genotyper.GenotypeLikelihoodCalculators.getInstance(GenotypeLikelihoodCalculators.java:327)
at org.broadinstitute.gatk.tools.walkers.genotyper.InfiniteRandomMatingPopulationModel.getLikelihoodsCalculator(InfiniteRandomMatingPopulationModel.java:145)
at org.broadinstitute.gatk.tools.walkers.genotyper.InfiniteRandomMatingPopulationModel.singleSampleLikelihoods(InfiniteRandomMatingPopulationModel.java:137)
at org.broadinstitute.gatk.tools.walkers.genotyper.InfiniteRandomMatingPopulationModel.calculateLikelihoods(InfiniteRandomMatingPopulationModel.java:115)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.calculateGLsForThisEvent(HaplotypeCallerGenotypingEngine.java:695)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:269)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:924)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:228)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------

ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):

ERROR

ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.

ERROR If not, please post the error message, with stack trace, to the GATK forum.

ERROR Visit our website and forum for extensive documentation and answers to

ERROR commonly asked questions http://www.broadinstitute.org/gatk

ERROR

ERROR MESSAGE: the combination of ploidy (18) and number of alleles (17) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus

ERROR ------------------------------------------------------------------------------------------

If I understand the options correctly, I told GATK to only use two possible alternates, so I'm not sure why this error message is showing up.
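The limit quoted in the message is consistent with a plain count of unordered genotypes. For ploidy p and a alleles in total, that count is the number of multisets of size p drawn from a alleles, C(p + a - 1, a - 1). The sketch below only checks the arithmetic; treating 17 as the total number of alleles at the locus is an assumption, and it does not explain why 17 alleles were considered despite --max_alternate_alleles 2.

```python
from math import comb

INT_MAX = 2**31 - 1  # 2147483647, the limit quoted in the error message


def genotype_count(ploidy, n_alleles):
    """Number of unordered genotypes for a given ploidy and allele count."""
    return comb(ploidy + n_alleles - 1, n_alleles - 1)


# Ploidy 18 with 3 alleles (reference + 2 alternates), 16 alleles, and 17 alleles.
for n_alleles in (3, 16, 17):
    n = genotype_count(18, n_alleles)
    print(n_alleles, n, n > INT_MAX)
```

Under this count, ploidy 18 with 17 alleles gives 2,203,961,430 genotypes, just over the limit, while 3 alleles would give only 190.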

Incidentally, earlier I got an error message saying GATK had run out of memory with a very similar command. If necessary, I can send you my input file and reference genome. (I'm working with pooled chloroplast data for this file).

↧

HaplotypeCaller says dict does not exist, but it does!

April 5, 2016, 12:39 pm

Hi All,

I am running HaplotypeCaller and getting the error:

ERROR MESSAGE: Fasta dict file /net/rcnfs02/srv/export/duraisingh_lab/share_root/data/Plasmodium_knowlesi/jva/PlasmoDB-26_PknowlesiH_Genome_02.dict for reference /net/rcnfs02/srv/export/duraisingh_lab/share_root/data/Plasmodium_knowlesi/jva/PlasmoDB-26_PknowlesiH_Genome_02.fasta does not exist

BUT... the dictionary DOES exist! I made it with CreateSequenceDictionary.jar and it looks OK.

The reference dict and fasta are symbolically linked to the working directory. I did some googling on this but no luck.
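Since the files are symbolic links, one thing worth ruling out (an assumption, not a diagnosis) is a link that exists but whose target cannot be reached from the machine where the job runs. A minimal sketch that tells a resolvable path from a broken link; `describe_path` and the temporary link are hypothetical.

```python
import os
import tempfile


def describe_path(path):
    """Classify a path as resolvable, a broken symlink, or missing."""
    if os.path.exists(path):  # follows symlinks
        return "ok: resolves to " + os.path.realpath(path)
    if os.path.islink(path):  # the link itself exists, its target does not
        return "broken symlink: target " + os.readlink(path) + " is not reachable"
    return "missing"


# Demonstration with a link that points at a file that does not exist.
workdir = tempfile.mkdtemp()
link = os.path.join(workdir, "ref.dict")
os.symlink(os.path.join(workdir, "no_such_target.dict"), link)
print(describe_path(link))
```

Running the same check on the compute node, against both the .dict and the .fasta links, would show whether the targets are visible there.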

Best,

Jon

↧

parallelizing HC on PBS with Queue

April 6, 2016, 2:42 am

Hi!!!
I'm attempting to use Queue on a PBSPro HPC cluster. I have tested a custom Scala script for HaplotypeCaller and it runs. However, following the discussion on the GATK forum, I understand that I need a job scheduler to dispatch Queue jobs across several nodes. Could you give me some advice or examples of the type of scheduler I need on a PBSPro system?
I tried running Queue on a single node and it seems to work faster. The question is: when I run Queue on a single node, am I actually multithreading HC, or am I wrong?
Thanks a lot
Best Regards
Marco

↧

vcf file generated using HaplotypeCaller does not contain data lines

April 6, 2016, 12:37 pm

Hi,

I typed the following command for finding snps and indels:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R chr21/chr21.fa -I chr21/alignments/human38chr21.sorted.bam -o chr21/humanoutput.raw.snps.indels.vcf

I did get the VCF file, but it contains only the header and no data lines. What may be wrong?
I have attached the file containing the stack trace.

↧

Best strategy to "fix" the Haplotype Caller - GenotypeGVCF "missing DP field" bug??

April 7, 2016, 9:31 am

Hi,

I've run into the (already reported http://gatkforums.broadinstitute.org/dsde/discussion/5598/missing-depth-dp-after-haplotypecaller ) bug of the missing DP format field in my callings.

I've run the following (relevant) commands:

Haplotype Caller -> Generate GVCF:

    java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
       -T HaplotypeCaller \
       -R ${ref} \
       -I ${NEWTMPDIR}/${prefix}.realigned.fixed.recal.bam \
       -L ${reg} \
       -ERC GVCF \
       -nct ${nct} \
       --genotyping_mode DISCOVERY \
       -stand_emit_conf 10 \
       -stand_call_conf 30  \
       -o ${prefix}.raw_variants.annotated.g.vcf \
       -A QualByDepth -A RMSMappingQuality -A MappingQualityRankSumTest -A ReadPosRankSumTest -A FisherStrand -A StrandOddsRatio -A Coverage

That generates GVCF files that DO HAVE the DP field for all reference positions, but DO NOT HAVE the DP format field for any called variant (but still keep the DP in the INFO field):

18      11255   .       T       <NON_REF>       .       .       END=11256       GT:DP:GQ:MIN_DP:PL      0/0:18:48:18:0,48,720
18      11257   .       C       G,<NON_REF>     229.77  .       BaseQRankSum=1.999;DP=20;MLEAC=1,0;MLEAF=0.500,0.00;MQ=60.00;MQRankSum=-1.377;ReadPosRankSum=0.489      GT:AD:GQ:PL:SB  0/1:10,8,0:99:258,0,308,288
18      11258   .       G       <NON_REF>       .       .       END=11260       GT:DP:GQ:MIN_DP:PL      0/0:17:48:16:0,48,530

Later, I ran Genotype GVCF joining all the samples with the following command:

java -Xmx${xmx} ${gct} -Djava.io.tmpdir=${NEWTMPDIR} -jar ${gatkpath}/GenomeAnalysisTK.jar \
   -T GenotypeGVCFs \
   -R ${ref} \
   -L ${pos} \
   -o ${prefix}.raw_variants.annotated.vcf \
   --variant ${variant} [...]

This generated vcf files where the DP field is present in the format description, it IS present in the Homozygous REF samples, but IS MISSING in any Heterozygous or HomoALT samples.

22  17280388    .   T   C   18459.8 PASS    AC=34;AF=0.340;AN=100;BaseQRankSum=-2.179e+00;DP=1593;FS=2.526;InbreedingCoeff=0.0196;MLEAC=34;MLEAF=0.340;MQ=60.00;MQRankSum=0.196;QD=19.76;ReadPosRankSum=-9.400e-02;SOR=0.523    GT:AD:DP:GQ:PL  0/0:29,0:29:81:0,81,1118    0/1:20,22:.:99:688,0,682    1/1:0,27:.:81:1018,81,0 0/0:22,0:22:60:0,60,869 0/1:20,10:.:99:286,0,664    0/1:11,17:.:99:532,0,330    0/1:14,14:.:99:431,0,458    0/0:28,0:28:81:0,81,1092    0/0:35,0:35:99:0,99,1326    0/1:14,20:.:99:631,0,453    0/1:13,16:.:99:511,0,423    0/1:38,29:.:99:845,0,1231   0/1:20,10:.:99:282,0,671    0/0:22,0:22:63:0,63,837 0/1:8,15:.:99:497,0,248 0/0:32,0:32:90:0,90,1350    0/1:12,12:.:99:378,0,391    0/1:14,26:.:99:865,0,433    0/0:37,0:37:99:0,105,1406   0/0:44,0:44:99:0,120,1800   0/0:24,0:24:72:0,72,877 0/0:30,0:30:84:0,84,1250    0/0:31,0:31:90:0,90,1350    0/1:15,25:.:99:827,0,462    0/0:35,0:35:99:0,99,1445    0/0:29,0:29:72:0,72,1089    1/1:0,32:.:96:1164,96,0 0/0:21,0:21:63:0,63,809 0/1:21,15:.:99:450,0,718    1/1:0,40:.:99:1539,120,0    0/0:20,0:20:60:0,60,765 0/1:11,9:.:99:293,0,381 1/1:0,35:.:99:1306,105,0    0/1:18,14:.:99:428,0,606    0/0:32,0:32:90:0,90,1158    0/1:24,22:.:99:652,0,816    0/0:20,0:20:60:0,60,740 1/1:0,30:.:90:1120,90,0 0/1:15,13:.:99:415,0,501    0/0:31,0:31:90:0,90,1350    0/1:15,18:.:99:570,0,480    0/1:22,13:.:99:384,0,742    0/1:19,11:.:99:318,0,632    0/0:28,0:28:75:0,75,1125    0/0:20,0:20:60:0,60,785 1/1:0,27:.:81:1030,81,0 0/0:30,0:30:90:0,90,1108    0/1:16,16:.:99:479,0,493    0/1:14,22:.:99:745,0,439    0/0:31,0:31:90:0,90,1252
22  17280822    .   G   A   5491.56 PASS    AC=8;AF=0.080;AN=100;BaseQRankSum=1.21;DP=1651;FS=0.000;InbreedingCoeff=-0.0870;MLEAC=8;MLEAF=0.080;MQ=60.00;MQRankSum=0.453;QD=17.89;ReadPosRankSum=-1.380e-01;SOR=0.695   GT:AD:DP:GQ:PL  0/0:27,0:27:72:0,72,1080    0/0:34,0:34:90:0,90,1350    0/1:15,16:.:99:528,0,491    0/0:27,0:27:60:0,60,900 0/1:15,22:.:99:699,0,453    0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:87:0,87,1305    0/0:40,0:40:99:0,108,1620   0/1:20,9:.:99:258,0,652 0/0:26,0:26:72:0,72,954 0/1:16,29:.:99:943,0,476    0/0:27,0:27:69:0,69,1035    0/0:19,0:19:48:0,48,720 0/0:32,0:32:81:0,81,1215    0/0:36,0:36:99:0,99,1435    0/0:34,0:34:99:0,99,1299    0/0:35,0:35:99:0,102,1339   0/0:38,0:38:99:0,102,1520   0/0:36,0:36:99:0,99,1476    0/0:31,0:31:81:0,81,1215    0/0:31,0:31:75:0,75,1125    0/0:35,0:35:99:0,99,1485    0/0:37,0:37:99:0,99,1485    0/0:35,0:35:90:0,90,1350    0/0:20,0:20:28:0,28,708 0/1:16,22:.:99:733,0,474    0/0:32,0:32:90:0,90,1350    0/0:35,0:35:99:0,99,1467    0/1:27,36:.:99:1169,0,831   0/0:28,0:28:75:0,75,1125    0/0:36,0:36:81:0,81,1215    0/0:35,0:35:90:0,90,1350    0/0:28,0:28:72:0,72,1080    0/0:31,0:31:81:0,81,1215    0/0:37,0:37:99:0,99,1485    0/0:31,0:31:84:0,84,1260    0/0:39,0:39:99:0,101,1575   0/0:37,0:37:96:0,96,1440    0/0:34,0:34:99:0,99,1269    0/0:30,0:30:81:0,81,1215    0/0:36,0:36:99:0,99,1485    0/1:17,17:.:99:567,0,530    0/0:26,0:26:72:0,72,1008    0/0:18,0:18:45:0,45,675 0/0:33,0:33:84:0,84,1260    0/0:25,0:25:61:0,61,877 0/1:9,21:.:99:706,0,243 0/0:35,0:35:81:0,81,1215    0/0:35,0:35:99:0,99,1485

I've just discovered this issue, and I need to run an analysis of differential depth of coverage in different regions, and of whether there is a DP bias between called and not-called samples.

I have thousands of files and I've spent almost 1 year generating all these calls, so redoing the calling is not an option.

What would be the best/fastest strategy to either fix my final VCFs with the DP data present in all intermediate gVCF files (preferably) or, at least, extract this data for all SNPs and samples?
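One stopgap that needs only the final VCFs is to fill a missing per-sample DP with the sum of that sample's AD values. This is an approximation and an assumption on my part: AD may count only reads that could be assigned to an allele, so the sum can be lower than the depth a DP field would have reported. `fill_dp_from_ad` is a hypothetical helper operating on one FORMAT string and one sample column.

```python
def fill_dp_from_ad(format_field, sample_field):
    """Replace a missing DP ('.') with the sum of AD; otherwise return unchanged."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "AD" not in keys or "DP" not in keys or len(values) != len(keys):
        return sample_field  # nothing to do, or trailing fields were dropped
    record = dict(zip(keys, values))
    if record["DP"] != "." or record["AD"] == ".":
        return sample_field  # DP already present, or AD unavailable
    depths = [int(x) for x in record["AD"].split(",") if x != "."]
    record["DP"] = str(sum(depths))
    return ":".join(record[k] for k in keys)


# A heterozygous sample column taken from the first record shown above.
print(fill_dp_from_ad("GT:AD:DP:GQ:PL", "0/1:20,22:.:99:688,0,682"))
```

Homozygous-reference columns, which already carry DP, pass through unchanged, so the function can be mapped over every sample column of a record.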

Thanks in advance,

Txema

PS: Recalling the individual samples from bamfiles is not an option. Fixing the individual gvcfs and redoing the joint GenotypeGVCFs could be.

↧

Keep "species" info from BAM to VCF

April 11, 2016, 2:55 am

Hello,
I am using HaplotypeCaller (GATK v3.5) with an input BAM file which has a header line like this (just a fake example):

@SQ SN:chr1 LN:100000 SP:Arabis thal AS:2 M5:8668a646eada2f4 UR:file:refgenome_Atha_v2.fa

But the output VCF only has a subset of this information:

##contig=<ID=chr1,length=100000>
##reference=file:///home/me/tmp/refgenome_Atha_v2.fa

Is there a way to obtain something like this instead? (i.e. also indicate species, assembly and MD5 sum)

##contig=<ID=chr1,length=100000,assembly=2,md5=8668a646eada2f4,species="Arabis thal">

The information in the BAM file initially comes from a "dict" file generated by Picard CreateSequenceDictionary. So I tried feeding this "dict" file, together with the VCF file, to Picard UpdateVcfSequenceDictionary, but it gave me neither the species nor the MD5 sum:

##contig=<ID=chr1,length=100000,assembly=2>
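If no tool carries these tags over, the desired header line can be built directly from the @SQ line and swapped into the VCF header afterwards. A minimal sketch, assuming a tab-delimited @SQ line as in a real BAM header; `sq_to_contig` is a hypothetical helper, and the key names (assembly, md5, species) simply follow the wished-for line above.

```python
def sq_to_contig(sq_line):
    """Build a VCF ##contig header line from a SAM @SQ header line."""
    tags = {}
    for field in sq_line.rstrip("\n").split("\t")[1:]:
        key, _, value = field.partition(":")  # split at the first colon only
        tags[key] = value
    parts = ["ID=" + tags["SN"], "length=" + tags["LN"]]
    if "AS" in tags:
        parts.append("assembly=" + tags["AS"])
    if "M5" in tags:
        parts.append("md5=" + tags["M5"])
    if "SP" in tags:
        parts.append('species="' + tags["SP"] + '"')  # quoted: may contain spaces
    return "##contig=<" + ",".join(parts) + ">"


# The example @SQ line from above, with tabs between fields.
sq = ("@SQ\tSN:chr1\tLN:100000\tSP:Arabis thal\tAS:2"
      "\tM5:8668a646eada2f4\tUR:file:refgenome_Atha_v2.fa")
print(sq_to_contig(sq))
```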

Thank you in advance,
Tim

↧

ERROR:The requested extended must fully contain the requested span

April 14, 2016, 7:50 pm

I got this error when running HaplotypeCaller:

ERROR MESSAGE: The requested extended must fully contain the requested span

My command is:
java -Xmx4g -jar ~/GATK_3.2/GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I Sample_55348_recal.bam --dbsnp dbsnp_138.hg19.vcf -L capture_targets_buffered10bases.bed --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o Sample_55348_rawcall_HCgvcf.vcf

What does the error mean?

↧