Adding dbSNP rs identifiers to VCF files

November 6, 2014, 1:08 pm

Dear GATK team,

I produced VCF files from RNA-Seq data as described in https://www.broadinstitute.org/gatk/guide/article?id=3891. My VCF files contain genomic coordinates but not dbSNP rs IDs. Is this because I omitted the optional --dbsnp parameter when running the -T HaplotypeCaller step? If so, can I add the IDs to the VCF files after they have been generated? If not, and I have to repeat the HaplotypeCaller step, which dbSNP VCF file for hg38 do you recommend?

Thanks, Joe
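
As a rough illustration of what adding IDs after the fact involves, here is a minimal Python sketch that fills the ID column of a VCF data line from a lookup keyed on (CHROM, POS, REF, ALT). The function name and the `known_ids` table are hypothetical; the single entry reuses the rs4403594 record quoted in a later post in this feed. A real table would have to be loaded from a dbSNP VCF, and multi-allelic records would need more careful matching.

```python
def add_rs_id(vcf_line, known_ids):
    """Return vcf_line with its ID column filled from known_ids when it is empty.

    known_ids is a hypothetical dict mapping (chrom, pos, ref, alt) -> rs ID.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, current_id, ref, alt = fields[:5]
    if current_id == ".":
        # Leave the record untouched when the lookup has no matching entry.
        fields[2] = known_ids.get((chrom, int(pos), ref, alt), ".")
    return "\t".join(fields)


# Toy lookup table; a real one would be built from a dbSNP VCF.
known_ids = {("chr1", 33957152, "T", "G"): "rs4403594"}

record = "chr1\t33957152\t.\tT\tG\t3166.96\t.\tDP=99"
print(add_rs_id(record, known_ids))
```

Records that already carry an ID are left as they are, so the function can be applied to a whole file without overwriting existing annotations.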

↧

SNP calling using pooled RNA-seq data

November 3, 2014, 5:43 am

Hello,

First of all, thank you for your detailed best practice pipeline for SNP calling from RNA-seq data.

I have pooled RNA-seq data from which I need to call SNPs. Each library consists of a pooled sample of 2-3 individuals of the same sex-tissue combination.

I was wondering whether HaplotypeCaller can handle SNP calling from pooled sequences, or whether it would be better to use FreeBayes.

I understand that these results come from experimenting with the data but it would be great if you could share your experiences with me on this.

Cheers, Homa

↧

What are the recommendations of VariantFiltration for variants derived by HaplotypeCaller?

November 12, 2014, 7:40 pm

I have 23 samples and I want to examine a region of 63,807,197 bp. Many thanks in advance.

Kind regards, Angelica

↧

same setting, different results (GATK realignment, recalibration, HaplotypeCaller, hard filtering)

March 12, 2014, 10:12 am

When I run GATK with identical settings on some amplicon sequencing data from MiSeq (150 kb region), I get different numbers of variants (approximately 10% difference), even after setting -dfrac to 1. What causes this variation? How can I make the results reproducible?

Thank you so much!

RESULT

#8.1.vcf and 8.2.vcf are the raw VCFs from two runs with filters applied
java -Xmx4g -jar gatk.jar -T CombineVariants -R human_g1k_v37.fasta -o 8merge_union.vcf -V 8.1.vcf -V 8.2.vcf
grep Intersection 8merge_union.vcf > 8merge_intersection.vcf
grep -v Intersection 8merge_union.vcf > 8merge_NOintersection.vcf
grep PASS 8merge_NOintersection.vcf > 8merge_NOintersection.PASS.vcf

wc -l *vcf
   711 8.1.vcf
   642 8.2.vcf
   308 8merge_intersection.vcf
    86 8merge_NOintersection.PASS.vcf
   462 8merge_NOintersection.vcf
   770 8merge_union.vcf

Please find 8.1.vcf and 8.2.vcf in attachment.
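
One caveat with the grep-based counts above: `wc -l` counts header lines as well as records, and `grep -v Intersection` keeps every header line that lacks the word. A minimal Python sketch that skips headers and tallies records by the `set=` INFO key is shown here. It assumes the word matched by the grep sits in a `set=` annotation written by CombineVariants; the function name and the toy records are illustrative and are not taken from the attached files.

```python
from collections import Counter


def count_by_set(lines):
    """Tally VCF data records by the value of the set= INFO key."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        info = line.rstrip("\n").split("\t")[7]
        # INFO is a ;-separated list of KEY=VALUE pairs and bare flags.
        pairs = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        counts[pairs.get("set", "unknown")] += 1
    return counts


# Toy records for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tDP=20;set=Intersection",
    "1\t200\t.\tC\tT\t40\tPASS\tDP=15;set=variant",
    "1\t300\t.\tG\tA\t30\tPASS\tDP=12;set=Intersection",
]
print(count_by_set(example))
```

Counting this way keeps the header out of the totals, so the per-category numbers add up to the number of records in the merged file.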

COMMAND:

##realign
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -o 8.intervals -nt 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T IndelRealigner -I 8.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -targetIntervals 8.intervals --out 8.realign.bam -known ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf --consensusDeterminationModel USE_READS -LOD 0.4 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
#
##base recal
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T BaseRecalibrator -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --default_platform ILLUMINA -knownSites ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -knownSites ~public/project/seqlib/gatk/Mills_and_1000G_gold_standard.indels.b37.vcf -nct 6 -o 8.recal_data -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx6g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T PrintReads -I 8.realign.bam -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -o 8.recal.bam -BQSR 8.recal_data -nct 1  -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#variant calling
java -Xmx8g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T HaplotypeCaller  -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta -I 8.recal.bam -o 8.raw.vcf --dbsnp ~public/project/seqlib/gatk/dbsnp_137.b37.vcf -nct 6 -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#variant filtering
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.indel.vcf -selectType INDEL -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T SelectVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.vcf -o 8.raw.snp.vcf -selectType SNP -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

#use hard filtering due to small input file
#SNP
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.snp.vcf --filterName QDFilter --filterExpression 'QD<2.0' --filterName FSFilter --filterExpression 'FS>60.0' --filterName MQFilter --filterExpression 'MQ<40.0' --filterName MaqQualRankSumFilter --filterExpression 'MappingQualityRankSum<-12.5' --filterName ReadPosFilter --filterExpression 'ReadPosRankSum<-8.0' -o 8.filter.snp.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
#indel
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T VariantFiltration -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.raw.indel.vcf --clusterWindowSize 10 --filterExpression 'QD<2.0' --filterName QDFilter --filterExpression 'ReadPosRankSum<-20.0' --filterName ReadPosFilter --filterExpression 'FS>200.0' --filterName FSFilter -o 8.filter.indel.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1
java -Xmx4g -jar /home/user/Downloads/gatk2.8/GenomeAnalysisTK.jar -T CombineVariants -R ~public/project/seqlib/g1k_v37/human_g1k_v37.fasta --variant 8.filter.snp.vcf --variant 8.filter.indel.vcf -o 8.filter.vcf -L /home/user/projects/data/collaborator_enzyme_compare/collaborator_capture2.bed -dfrac 1

↧

GATK version 3.2-2 - possible bug

November 13, 2014, 10:16 am

Hi, I called a VCF using HaplotypeCaller, following the general guidelines, with GATK version 3.2-2. I encountered a possible bug where a sample is shown to have a HET genotype, yet the AD field returns '0' for the alt allele. However, there is evidence of the alt allele in the original BAM file for the sample. This error is seen for some SNPs and indels alike. An example variant is pasted below (IGV screenshot of the BAM attached).

2 46839487 . G GA 45.66 PASS AC=3;AF=0.042;AN=72;BaseQRankSum=0.218;ClippingRankSum=0.894;DP=749;FS=0.000;GQ_MEAN=31.19;GQ_STDDEV=29.02;InbreedingCoeff=-0.1296;MLEAC=2;MLEAF=0.028;MQ=60.00;MQ0=0;MQRankSum=-4.700e-02;NCC=0;QD=3.04;ReadPosRankSum=0.00;VQSLOD=-6.710e-02;culprit=FS GT:AD:DP:GQ:PL 0/0:20,0:20:60:0,60,823 0/0:41,0:43:99:0,122,1117 0/0:22,0:22:63:0,63,945 0/0:20,0:20:60:0,60,774 0/0:22,2:27:26:0,26,590 0/0:20,0:20:60:0,60,833 0/0:30,0:30:62:0,62,1077 0/0:20,0:20:60:0,60,847 0/0:42,0:42:16:0,16,1323 0/0:25,0:25:51:0,51,765 0/0:27,0:27:63:0,63,945 0/0:27,0:27:63:0,63,945 0/0:16,0:18:0:0,0,373 0/0:24,0:24:0:0,0,610 0/1:15,0:17:23:23,0,607 0/1:12,3:18:28:28,0,312 0/0:8,0:8:0:0,0,145 0/0:13,0:16:38:0,38,348 0/0:11,0:11:21:0,21,315 0/0:13,0:13:27:0,27,405 0/0:13,1:16:23:0,23,469 0/0:19,0:21:8:0,8,418 0/0:23,0:23:33:0,33,589 0/1:8,0:12:71:71,0,187 0/0:19,0:19:0:0,0,438 0/0:7,0:7:21:0,21,248 0/0:9,0:9:0:0,0,199 0/0:8,0:8:21:0,21,315 0/0:18,0:18:0:0,0,396 0/0:26,0:26:0:0,0,585 0/0:24,0:24:0:0,0,469 0/0:14,0:16:41:0,41,375 0/0:6,0:6:0:0,0,133 0/0:17,0:19:51:0,51,460 0/0:15,0:15:11:0,11,373 0/0:12,0:12:0:0,0,333

Is this a bug?

-Uma
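
To find every affected sample in a record like the one above, the per-sample columns can be scanned for heterozygous calls whose AD shows no alt-supporting reads. The following Python sketch does that for four of the sample columns quoted in the post; the function name is an assumption made for illustration.

```python
def het_calls_without_alt_reads(format_keys, samples):
    """Return (index, sample) for het calls whose AD shows no alt-supporting reads."""
    keys = format_keys.split(":")
    flagged = []
    for i, sample in enumerate(samples):
        call = dict(zip(keys, sample.split(":")))
        alleles = set(call["GT"].replace("|", "/").split("/"))
        if len(alleles) < 2:
            continue  # homozygous or missing call
        depths = [int(x) for x in call["AD"].split(",")]
        # Flag the call when every alt allele has a depth of zero.
        if sum(depths[1:]) == 0:
            flagged.append((i, sample))
    return flagged


# Four of the sample columns from the record quoted above.
samples = [
    "0/0:20,0:20:60:0,60,823",
    "0/1:15,0:17:23:23,0,607",
    "0/1:12,3:18:28:28,0,312",
    "0/1:8,0:12:71:71,0,187",
]
print(het_calls_without_alt_reads("GT:AD:DP:GQ:PL", samples))
```

In both flagged samples DP is larger than the sum of AD (17 vs. 15 and 12 vs. 8), so some reads are counted in DP without being assigned to either allele in AD.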

↧

Homozygous SNP

November 10, 2014, 10:29 am

Hi,

I have the variant below from GATK HaplotypeCaller. It is annotated as 1/1, which means homozygous for the alternate allele.

chr1    10023229    .   G   A   101.03  .   AC=2;AF=1.00;AN=2;BaseQRankSum=-0.742;DP=49;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=8.21;MQ0=42;MQRankSum=0.742;QD=2.06;ReadPosRankSum=-0.742 GT:AD:DP:GQ:PL 1/1:42,7:47:12:129,12,0

However, the AD field shows 42 reads for the reference allele and 7 reads for the alternate allele. Could someone comment on why this SNP is reported as homozygous for the alternate allele despite having so few reads supporting it?
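
One thing that can be checked directly from the record is that GT agrees with the PL values rather than with the raw AD counts: for a diploid call with one alt allele, the three PL values are ordered 0/0, 0/1, 1/1, and the value 0 marks the most likely genotype. The sketch below (the function name is an assumption) applies this to the sample column above. The record also reports MQ=8.21 and MQ0=42, i.e. 42 reads with mapping quality zero, which is the same number as the reference count in AD; whether those are the same reads is an assumption that would need checking in the BAM.

```python
def most_likely_genotype(pl_field):
    """Return the diploid, biallelic genotype with the lowest PL (0 = most likely)."""
    genotypes = ["0/0", "0/1", "1/1"]  # VCF ordering for one alt allele
    pls = [int(x) for x in pl_field.split(",")]
    if len(pls) != len(genotypes):
        raise ValueError("expected three PL values for a biallelic diploid call")
    return genotypes[pls.index(min(pls))]


# Sample column from the record above, in GT:AD:DP:GQ:PL order.
sample = "1/1:42,7:47:12:129,12,0"
gt, ad, dp, gq, pl = sample.split(":")
print(gt, most_likely_genotype(pl))
```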

↧

what does this ERROR mean - "Problem detecting index type"

November 17, 2014, 3:15 am

I ran GenotypeGVCFs on all the VCF files generated by HC using the following command (all of these files were indexed by HC):

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R /ref/human_g1k_v37.fasta -o HC_raw.vcf -V gVCF_n1.vcf -V gVCF_n2.vcf -V gVCF_n3.vcf -V gVCF_n4.vcf -V gVCF_n5.vcf -V gVCF_n6.vcf

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Problem detecting index type
ERROR ------------------------------------------------------------------------------------------

↧

GenotypeGVCFs hangs on some positions

November 17, 2014, 3:59 am

Hi all,

I am attempting to use HaplotypeCaller / CombineGVCFs / GenotypeGVCFs to call variants on chromosomes X and Y of 769 samples (356 males, 413 females) sequenced at 12x coverage (whole-genome sequencing, but right now I am only calling X and Y).

I have called the samples according to the best practices using HaplotypeCaller, with ploidy = 1 for males on X and Y and ploidy = 2 for females on X, e.g.:

INFO 16:28:45,750 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T HaplotypeCaller -L X -ploidy 1 -minPruning 3 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -I /target/gpfs2/gcc/groups/gonl/projects/trio-analysis/rawdata_release2/A102.human_g1k_v37.trio_realigned.bam --sample_name A102a -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/A102a.chrX.hc.g.vcf

Then I used CombineGVCFs to combine my samples in batches of 100. Now I am attempting to genotype them, and I face the same issue on both X (males + females) and Y (males only): it starts running fine and then just hangs at a certain position. At first it crashed asking for additional memory, but now with 24 GB of memory it simply stays at a single position for 24 hours until my job eventually stops due to the walltime limit. Here is the chromosome X output:

INFO  15:00:39,501 HelpFormatter - Program Args: -R /gcc/resources/b37/indices/human_g1k_v37.fa -T GenotypeGVCFs -ploidy 1 --dbsnp /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf -stand_call_conf 10 -stand_emit_conf 10 --max_alternate_alleles 4 -o /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.vcf -L X -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.1.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.2.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.3.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf -V /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf
INFO  15:00:39,507 HelpFormatter - Executing as lfrancioli@targetgcc15-mgmt on Linux 3.0.80-0.5-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_51-b13.
INFO  15:00:39,507 HelpFormatter - Date/Time: 2014/11/12 15:00:39
INFO  15:00:39,508 HelpFormatter - --------------------------------------------------------------------------------
INFO  15:00:39,508 HelpFormatter - --------------------------------------------------------------------------------
INFO  15:00:40,951 GenomeAnalysisEngine - Strictness is SILENT
INFO  15:00:41,153 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  15:57:53,416 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.4.g.vcf.idx
INFO  17:09:39,597 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.5.g.vcf.idx
INFO  18:21:00,656 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.6.g.vcf.idx
INFO  19:30:46,624 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.7.g.vcf.idx
INFO  20:22:38,368 RMDTrackBuilder - Writing Tribble index to disk for file /gcc/groups/gonl/tmp01/lfrancioli/chromX/hc/results/gonl.chrX.hc.8.g.vcf.idx
WARN  20:26:45,716 FSLockWithShared$LockAcquisitionTask - WARNING: Unable to lock file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx because an IOException occurred with message: No locks available.
INFO  20:26:45,718 RMDTrackBuilder - Could not acquire a shared lock on index file /gcc/resources/b37/snp/dbSNP/dbsnp_138.b37.vcf.idx, falling back to using an in-memory index for this GATK run.
INFO  20:33:29,491 IntervalUtils - Processing 155270560 bp from intervals
INFO  20:33:29,628 GenomeAnalysisEngine - Preparing for traversal
INFO  20:33:29,635 GenomeAnalysisEngine - Done preparing for traversal
INFO  20:33:29,636 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  20:33:29,637 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  20:33:29,638 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
INFO  20:33:29,948 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
INFO  20:33:59,642 ProgressMeter -         X:65301         0.0    30.0 s      49.6 w        0.0%    19.8 h      19.8 h
INFO  20:34:59,820 ProgressMeter -         X:65301         0.0    90.0 s     149.1 w        0.0%    59.4 h      59.4 h
...
INFO  20:52:01,064 ProgressMeter -        X:177301         0.0    18.5 m    1837.7 w        0.1%    11.3 d      11.2 d
INFO  20:53:01,066 ProgressMeter -        X:177301         0.0    19.5 m    1936.9 w        0.1%    11.9 d      11.9 d
...
INFO  14:58:25,243 ProgressMeter -        X:177301         0.0    18.4 h   15250.3 w        0.1%    96.0 w      95.9 w
INFO  14:59:38,112 ProgressMeter -        X:177301         0.0    18.4 h   15250.3 w        0.1%    96.1 w      96.0 w
INFO  15:00:47,482 ProgressMeter -        X:177301         0.0    18.5 h   15250.3 w        0.1%    96.2 w      96.1 w
=>> PBS: job killed: walltime 86440 exceeded limit 86400

I would really appreciate it if you could give me some pointers on how to handle this situation.

Thanks! Laurent

↧

Parameter to set the min-freq

November 17, 2014, 6:44 pm

Hi,

Is there a parameter to set the minimum variant frequency? What is the parameter's name? I have read the documentation, but I cannot find it.

Thank you,

↧

Is there a paper describing the Haplotype Caller algorithm?

November 17, 2014, 6:40 pm

Hi,

I'd like to ask whether there is a paper describing the Haplotype Caller algorithm and, if so, whether you could please send me the reference. I have tried to find one, but I only found the paper on GATK, which is great but doesn't describe the Haplotype Caller algorithm in detail.

Thank you,

↧

Does the Haplotype Caller perform Linkage Disequilibrium as part of the variant calling algorithm?

November 17, 2014, 6:34 pm

Hi,

I'd like to know whether the Haplotype Caller uses linkage disequilibrium (LD) to decide whether or not to call a variant, as part of the variant calling algorithm.

I've read the documentation about the Haplotype Caller, but I cannot find anything about LD, so should I assume that it doesn't use LD?

Thank you,

↧

Why is HaplotypeCaller dropping half of my reads?

November 19, 2014, 6:29 am

Hi,

I have been trying HaplotypeCaller to find SNPs and indels in viral read data (haploid), but I am finding that it throws away around half of my reads and I don't understand why. A small proportion (8%) are filtered out as duplicates and 0.05% fail on mapping quality, but I can't account for the majority of the lost reads. I appreciate that GATK wasn't built for viral sequences, but would you have an idea of what could be causing this?

I use the following command after marking duplicates and realigning around indels:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Ref.fasta -I realigned_reads.bam --genotyping_mode DISCOVERY -ploidy 1 -bamout reassembled.bam -o rawvariants.vcf

I have also tried the same file with UnifiedGenotyper and I get the result I expect, i.e. most of my reads are retained and I have SNP calls that agree with a VCF constructed in a different program, so I assume the reads are lost as part of the local realignment?

Thanks Kirstyn

↧

HaplotypeCaller SB field for multiple-alts

November 22, 2014, 11:53 am

The SB field of HaplotypeCaller output is not described terribly well as far as I can find.
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">

What exactly happens when there are multiple alternate alleles? For example:
scaffold_1 2535 . T C,TTC,<NON_REF> 611.83 . DP=15;MLEAC=1,1,0;MLEAF=0.500,0.500,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PL:SB 1/2:0,5,10,0:15:99:630,397,379,210,0,180,607,394,210,604:0,0,12,3
It doesn't seem to be particularly informative in this case (a case which is rather common for our data).

If it isn't already part of the possible annotations, perhaps the most sensible approach would be to output a field with num-fwd, num-rev for each allele (ref, alt1, alt2, ...). SDP for "strand-depth" might be a reasonable name.
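
To make the limitation concrete, the sample column above can be compared field by field. The Python sketch below reads the four SB values as (ref forward, ref reverse, alt forward, alt reverse); that ordering is an assumption, since the header description does not spell it out, and the function name is illustrative.

```python
def strand_counts_vs_depth(format_keys, sample):
    """Compare the SB strand counts with the per-allele depths in AD."""
    call = dict(zip(format_keys.split(":"), sample.split(":")))
    ad = [int(x) for x in call["AD"].split(",")]
    sb = [int(x) for x in call["SB"].split(",")]
    return {
        "ref_depth": ad[0],
        "alt_depth_total": sum(ad[1:]),  # all alt alleles, including <NON_REF>
        "sb_first_pair": sum(sb[:2]),    # assumed: ref forward + ref reverse
        "sb_second_pair": sum(sb[2:]),   # assumed: alt forward + alt reverse
    }


sample = "1/2:0,5,10,0:15:99:630,397,379,210,0,180,607,394,210,604:0,0,12,3"
print(strand_counts_vs_depth("GT:AD:DP:GQ:PL:SB", sample))
```

For this record the second SB pair sums to 15, which equals the combined depth of both alt alleles (5 + 10). That is consistent with the alt alleles being pooled in SB, so the field cannot say how the strand counts split between C and TTC.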

↧

Split joint variants-calling files into SNP and Indel

November 25, 2014, 8:46 am

Hi. I used HaplotypeCaller for variant calling. After variant recalibration, I have a VCF that contains both SNPs and indels.

Is there a quick way to split it into two VCFs, one for SNPs and one for indels?

Thanks. Lei
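
Within GATK 3, another post in this feed does this with -T SelectVariants and -selectType SNP / -selectType INDEL. Outside GATK, the split can be sketched in a few lines of Python that classify each record by the lengths of its REF and ALT alleles. The function names are assumptions, and the rule is deliberately simple: anything that is not purely single-base goes to the indel file, so MNPs, mixed sites and symbolic alleles would need extra handling.

```python
def variant_class(ref, alt):
    """Classify a record as 'SNP' or 'INDEL' from its REF and ALT columns."""
    alleles = [ref] + alt.split(",")
    return "SNP" if all(len(a) == 1 for a in alleles) else "INDEL"


def split_vcf(lines):
    """Return (snp_lines, indel_lines); header lines are copied to both."""
    snps, indels = [], []
    for line in lines:
        if line.startswith("#"):
            snps.append(line)
            indels.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        target = snps if variant_class(fields[3], fields[4]) == "SNP" else indels
        target.append(line)
    return snps, indels


# Two records shortened from other posts in this feed, plus a header line.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10023229\t.\tG\tA\t101.03\t.\tDP=49",
    "2\t46839487\t.\tG\tGA\t45.66\tPASS\tDP=749",
]
snp_lines, indel_lines = split_vcf(example)
print(len(snp_lines), len(indel_lines))
```

Each output list can then be written to its own file with `"\n".join(...)`; both keep the header so the results remain valid VCFs.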

↧

HaplotypeCaller 3.3-0 Homozygous variant calls

November 26, 2014, 2:46 pm

Hi, I just finished running HaplotypeCaller version 3.3-0 separately on 6 exome samples with the new best practices.

java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19/hg19_Ordered.fa -I K87/HG19_Analysis/K87-929_final.recalibrated_final.bam --dbsnp dbsnp_138_hg19_Ordered.vcf --pair_hmm_implementation VECTOR_LOGLESS_CACHING -ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 --output_mode EMIT_VARIANTS_ONLY -gt_mode DISCOVERY --pcr_indel_model CONSERVATIVE -o ./Haplotypes_929.vcf

Many variant sites are called as homozygous alt (1/1), but none of the sites that were processed to infer haplotypes are called as homozygous alt in their PGT field; they are all called as hets (PGT=0|1). For example:

GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,29,0:29:93:0|1:121483392_C_G:1331,93,0,1331,93,1331:0,0,13,16

The allelic depths agree with the phased genotype but out of all 6 exomes processed, not a single 1/1 is also phased as 1|1.

This seemed odd, but I continued with GenotypeGVCFs:

java -Xmx32g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19/hg19_Ordered.fa -V Haplotypes_450.vcf -V Haplotypes_452.vcf -V Haplotypes_925.vcf -V Haplotypes_926.vcf -V Haplotypes_927.vcf -V Haplotypes_929.vcf -D dbsnp_138_hg19_Ordered.vcf -ped K87/HG19_Analysis/K87_6.ped -o Haplotypes_K87_GVCFs.vcf

I'm looking at the output vcf as it's being generated and now there are homozygous alt calls but they conflict with the associated Allelic Depths:

.... GT:AD:DP:GQ:PGT:PID:PL .... 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 .....

Full Line: chr1 33957152 rs4403594 T G 3166.96 . AC=12;AF=1.00;AN=12;DB;DP=99;FS=0.000;GQ_MEAN=48.50;GQ_STDDEV=27.55;MLEAC=12;MLEAF=1.00;MQ=39.65;MQ0=0;NCC=0;QD=32.32;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,9:9:27:.:.:330,27,0 1/1:0,5:5:15:.:.:141,15,0 1/1:0,29:29:85:1|1:33957151_G_T:948,85,0 1/1:0,20:20:60:.:.:722,60,0 1/1:0,24:24:71:.:.:685,71,0 1/1:0,11:11:33:.:.:366,33,0

Can you help me interpret what seem to me to be conflicting results?

Cheers,

Patrick

↧

haplotypecaller with VectorLoglessPairHMM without speedup.

November 27, 2014, 12:50 pm

Hi team!

I am testing HaplotypeCaller with VectorLoglessPairHMM on a single BAM. There are two weird things.

There is no speedup going from -nct 1 to -nct 10.
There is no speedup implementing VectorLoglessPairHMM.

I am very sorry, but here are the first lines of the log file. I hope you have a suggestion for what I can do to speed up HaplotypeCaller.

```text
INFO 21:37:58,043 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:37:58,045 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.2-2-gec30cee, Compiled 2014/07/17 15:22:03
INFO 21:37:58,045 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 21:37:58,045 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 21:37:58,048 HelpFormatter - Program Args: -T HaplotypeCaller -R /mnt/users/torfn/Projects/BosTau/Reference/Bos_taurus.UMD3.1.74.dna_rm.chromosome.ALL.fa -I /mnt/users/tikn/old_Backup2/cigene-pipeline-snp-detection/align_all/2052/2052_aln.posiSrt.withRG.dedup.bam --genotyping_mode DISCOVERY --dbsnp /mnt/users/torfn/Projects/BosTau/Reference/vcf_chr_ALL-dbSNP138.vcf -stand_emit_conf 10 -stand_call_conf 30 -minPruning 3 -o test.gatk.31.vcf -nct 10 --pair_hmm_implementation VECTOR_LOGLESS_CACHING
INFO 21:37:58,052 HelpFormatter - Executing as tikn@m620-7 on Linux 2.6.32-504.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_71-mockbuild_2014_10_17_22_23-b00.
INFO 21:37:58,052 HelpFormatter - Date/Time: 2014/11/27 21:37:58
INFO 21:37:58,052 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:37:58,053 HelpFormatter - --------------------------------------------------------------------------------
INFO 21:37:58,331 GenomeAnalysisEngine - Strictness is SILENT
INFO 21:37:58,521 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 21:37:58,538 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 21:37:58,866 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.33
INFO 21:37:58,892 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 21:37:59,211 MicroScheduler - Running the GATK in parallel mode with 10 total threads, 10 CPU thread(s) for each of 1 data thread(s), of 32 processors available on this machine
INFO 21:37:59,338 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 21:38:00,229 GenomeAnalysisEngine - Done preparing for traversal
INFO 21:38:00,230 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 21:38:00,231 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 21:38:00,232 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime
INFO 21:38:00,446 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO 21:38:00,448 PairHMM - Performance profiling for PairHMM is disabled because HaplotypeCaller is being run with multiple threads (-nct>1) option
Profiling is enabled only when running in single thread mode

Using AVX accelerated implementation of PairHMM
INFO 21:38:04,922 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 21:38:04,923 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
INFO 21:38:30,237 ProgressMeter - 1:656214 0.0 30.0 s 49.6 w 0.0% 33.8 h 33.8 h
INFO 21:39:30,239 ProgressMeter - 1:2160900 0.0 90.0 s 148.8 w 0.1% 30.8 h 30.8 h
INFO 21:40:30,241 ProgressMeter - 1:3789347 0.0 2.5 m 248.0 w 0.1% 29.3 h 29.2 h
INFO 21:41:30,242 ProgressMeter - 1:5347891 0.0 3.5 m 347.2 w 0.2% 29.0 h 29.0 h
```

Kind regards,

Tim Knutsen

↧

Effect of performing RNA-Seq with a highly fragmented reference genome: MAQ-values and genotyping Ha

November 30, 2014, 2:28 pm

Hi, I am performing RNA-Seq to identify new polymorphisms in a species of sea star. Our short-term goal is to generate novel DNA sequences of coding genes for phylogenetic analysis. It is therefore important that polymorphisms be called accurately and that they can be phased.

Our reference genome is poorly assembled and comprises over 60,000 scaffolds and contigs. Subsequently, when paired-end RNA-Seq reads are aligned to this reference genome (using TopHat), the two halves of the pair are often mapped to different scaffolds or contigs. This seems to greatly lower the MAQ score, which in turn leads to HaplotypeCaller missing well-supported polymorphisms, because the reads that support them have MAQ values between 1 and 3.

The obvious solution is to set --min_mapping_quality_score to 1 or 2, rather than the default of 20, and to raise --min_base_quality_score from the default value of 10 to perhaps 25 or 30. This does, however, increase the risk of calling false positives from poorly aligned regions.

Has this situation been considered by the GATK development team, and is there a recommended way to account for it?

↧

Calling complex pedigrees with HaplotypeCaller

December 2, 2014, 8:32 am

Hi,

I want to use HaplotypeCaller to call families together. I have bam files for each individual in the 4 families I am studying, as well as a ped file describing the pedigree information. The problem is that these families have complex pedigrees, with the parents (mother and father), the children, and then one grandchild for each child (do not have information about the other parent of the grandchild). I would like to call these families with their complex pedigrees together, and I would like to call all the 4 families together to maximize the power of the calling. However, I'm not sure how to do that with just the -ped option. -ped seems to be designed for only one family or cohort, and I'm not sure it would work for me to feed it all my bams as inputs. Are there any other tools for GATK that I could use to call complex pedigrees?

The other possibility would be to call the 4 trios separately and each child-grandchild pair separately, but I am not sure how to do that with just -ped either. What would you recommend?

And finally, I get an error message saying that --emitRefConfidence only works for single sample mode. It seems that I should omit this option when I run HaplotypeCaller on my families, but are there any other options that I should use for cohort calling besides the standard ones (which I know to be --variant_index_type and --variant_index_parameter)?

Thanks, Alva

↧

Arguments for HaplotypeCaller or GenotypeGVCFs

December 3, 2014, 8:57 am

If I want the variants to be called only if they fit the following criteria:

1) Min. total coverage for consideration of heterozygous is 10.

2) Min. coverage of each of the two observed major basecalls to be called heterozygous is 5.

3) Min. percentage of each of the two observed major basecalls in order to be called heterozygous is 20.

4) Min. coverage in order for a position to be called homozygous is 6.

which command-line arguments in which tools (HaplotypeCaller or GenotypeGVCFs) can I use to accomplish these? I cannot seem to find the proper arguments in the documentation. I apologize if I have overlooked them.

Thank you
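
Whether GATK exposes arguments for these thresholds is the open question here. Independently of that, the four criteria can be applied to the GT and AD fields after calling. The Python sketch below is one possible reading of them: the function name is hypothetical, total coverage is taken as the sum of AD (DP could be used instead), and "major basecalls" is interpreted as the two alleles with the highest depth.

```python
def passes_criteria(gt, ad):
    """Apply the four thresholds listed above to one genotype call.

    gt is a VCF GT string such as "0/1"; ad is the list of per-allele depths.
    """
    total = sum(ad)  # assumption: coverage = sum of AD
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return False  # no call
    if len(set(alleles)) == 1:
        return total >= 6  # criterion 4: homozygous call
    if total < 10:
        return False  # criterion 1: minimum coverage for a het call
    top_two = sorted(ad, reverse=True)[:2]
    if len(top_two) < 2:
        return False
    # Criteria 2 and 3: each of the two major alleles needs >= 5 reads and >= 20%.
    return all(d >= 5 and 100.0 * d / total >= 20 for d in top_two)


print(passes_criteria("0/1", [12, 3]))  # het call with only 3 alt reads
print(passes_criteria("1/1", [0, 9]))   # hom call with 9 reads
```

Calls that fail could be set to no-call or flagged in a custom filter field, depending on how the downstream analysis treats them.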

↧

What exactly does the --minReadsPerAlignmentStart flag specify in HaplotypeCaller?

December 4, 2014, 10:54 am

Specifically, what does the 'start' component of this flag mean? Do the reads all have to start in exactly the same location? Alternatively, does the flag specify the total number of reads that must overlap a putative variant before that variant will be considered for calling?

↧