Channel: haplotypecaller — GATK-Forum

Recommended protocol for bootstrapping HaplotypeCaller and BaseRecalibrator outputs?

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecalibrator, it is recommended that when lacking a reliable set of sequence variants: "You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecalibrator in a loop should be fairly straightforward. What is a good strategy for comparing VCF files and assessing convergence?
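One simple way to assess convergence, sketched below in Python, is to compare the confident call sets from successive bootstrap rounds and stop once their overlap stops changing. The file names and the 0.99 Jaccard cutoff are illustrative assumptions, not anything prescribed by GATK.

```python
# Minimal sketch: compare the confident calls from two bootstrap rounds.
# File names and the 0.99 cutoff below are assumptions for illustration.
import gzip

def load_sites(vcf_path):
    """Collect CHROM:POS:REF:ALT keys for PASS/unfiltered records."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    sites = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, _id, ref, alt, _qual, flt = line.split("\t")[:7]
            if flt in (".", "PASS"):
                sites.add(f"{chrom}:{pos}:{ref}:{alt}")
    return sites

previous = load_sites("round1.confident.vcf")  # hypothetical file names
current = load_sites("round2.confident.vcf")

jaccard = len(previous & current) / max(1, len(previous | current))
print(f"round 1: {len(previous)}  round 2: {len(current)}  Jaccard overlap: {jaccard:.4f}")

# One possible stopping rule: stop iterating once consecutive rounds agree
# almost perfectly (0.99 is an arbitrary example value).
if jaccard > 0.99:
    print("Call sets are essentially identical; the bootstrap has converged.")
```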


Missing reference allele in GVCF file after running HaplotypeCaller

Hi,

I used HaplotypeCaller in GVCF mode to generate a single sample GVCF, but when I checked my vcf file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22  16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050036    .   A   C,<NON_REF> 26.80   .   BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736    GT:AD:DP:GQ:PL:SB   0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was: java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance, Alva

HaplotypeCaller: --min_base_quality_score vs -stand_call_conf and -stand_emit_conf flags

HaplotypeCaller has a --min_base_quality_score flag to specify the minimum "base quality required to consider a base for calling". It also has the -stand_call_conf and -stand_emit_conf flags, both of which use a "minimum phred-scaled confidence threshold".

What is the difference between the base quality flag and the phred-scaled confidence threshold flags in deciding whether or not to call a variant?
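For reference, both kinds of thresholds are on the phred scale, where a score Q corresponds to an error probability of 10^(-Q/10); the small sketch below just illustrates that conversion (the example values are arbitrary).

```python
# Phred score <-> error probability, the scale used both by base qualities
# and by the -stand_call_conf / -stand_emit_conf confidence thresholds.
def phred_to_error_prob(q):
    """Probability that the underlying call is wrong, given phred score q."""
    return 10 ** (-q / 10.0)

for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
# Q10: 0.1000, Q20: 0.0100, Q30: 0.0010
```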

Also, why are there separate flags for calling variants and emitting variants? To my way of thinking, calling and emitting variants are one-and-the-same.

Is there any way to specify a minimum minor-allele frequency for calling variants in HaplotypeCaller? Under the default settings, the program expects a 1:1 ratio of each allele in a single diploid organism. However, stochastic variance in which RNA fragments are sequenced and retained could lead to departures from this ratio.

Finally, is there any way to specify the minimum level of coverage for a given variant for it to be considered for calling/genotyping?

Adding read filters to HC

Dear GATK Team, if I specify a read filter using the -rf option, is that read filter added to the filters applied by default, or will it then be the only filter applied (so I would also need to specify the defaults to ensure they were all run)?

e.g. I want to add a bad cigar filter...

-rf BadCigar

But I also want the default filters applied, namely:

  • NotPrimaryAlignmentFilter
  • FailsVendorQualityCheckFilter
  • DuplicateReadFilter
  • UnmappedReadFilter
  • MappingQualityUnavailableFilter
  • HCMappingQualityFilter
  • MalformedReadFilter

not all sites emitted with GENOTYPE_GIVEN_ALLELES

I am running HC3.3-0 with the following options (e.g. GENOTYPE_GIVEN_ALLELES):

$java7 -Djava.io.tmpdir=tmp -Xmx3900m \
 -jar $jar \
 --analysis_type HaplotypeCaller \
 --reference_sequence $ref \
 --input_file $BAM \
 --intervals $CHROM \
 --dbsnp $dbSNP \
 --out $out \
 -stand_call_conf 0 \
 -stand_emit_conf 0 \
 -A Coverage -A FisherStrand -A HaplotypeScore -A MappingQualityRankSumTest -A QualByDepth -A RMSMappingQuality -A ReadPosRankSumTest \
 -L $allelesVCF \
 -L 20:60000-70000 \
 --interval_set_rule INTERSECTION \
 --genotyping_mode GENOTYPE_GIVEN_ALLELES \
 --alleles $allelesVCF \
 --emitRefConfidence NONE \
 --output_mode EMIT_ALL_SITES

The file $allelesVCF contains these neighbouring SNPs:

20  60807   .   C   T   118.96  .
20  60808   .   G   A   46.95   .
20  61270   .   A   C   2870.18 .
20  61271   .   T   A   233.60  .

I am unable to call these neighbouring SNPs, despite reads being present in the file $BAM (which shouldn't matter anyway). I also tried adding --interval_merging OVERLAPPING_ONLY to the command line, but that didn't solve the problem. What am I doing wrong? I should probably add "GATK breaker/misuser" to my CV...

Thank you as always.

P.S. The CommandLineGATK documentation does not say what the default value for --interval_merging is.

P.P.S. Iterative testing is a bit slow, because HC always has to do this step:

HCMappingQualityFilter - Filtering out reads with MAPQ < 20

HaplotypeCaller, StrandBiasBySample Annotation

Hello

I am using GATK 3.3-0 HaplotypeCaller for variant calling. When I run HaplotypeCaller with the command

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R all_chr_reordered.fa -I 30_S30.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 50 -o 30_S30_control1.vcf -L brca.intervals

I get all the variants I want, however, I also want to get the number of forward and reverse reads that support REF and ALT alleles. Therefore I use StrandBiasBySample annotation when running HaplotypeCaller with the command:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R all_chr_reordered.fa -I 30_S30.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 50 -o 30_S30_control2.vcf -L brca.intervals -A StrandBiasBySample

The SB field is added, but a variant that was in 30_S30_control1.vcf is absent from 30_S30_control2.vcf. All the remaining variants are there. The only difference between the two variant calls was adding -A StrandBiasBySample. What I'm wondering is why this one variant is absent.

the missing variant: 17 41276152 . CAT C 615.73 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.147;ClippingRankSum=0.564;DP=639;FS=15.426;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-1.054;QD=0.96;ReadPosRankSum=0.698;SOR=2.181 GT:AD:DP:GQ:PL 0/1:565,70:635:99:653,0,18310

So I decided to run HaplotypeCaller without -A StrandBiasBySample and later add the annotations with VariantAnnotator. Here is the command:

java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R all_chr_reordered.fa -I 30_S30.bam --variant 30_S30_control1.vcf -L 30_S30_control1.vcf -o 30_S30_control1_SBBS.vcf -A StrandBiasBySample

However, the output VCF file 30_S30_control1_SBBS.vcf was not different from the input variant file 30_S30_control1.vcf except for the header; the SB field wasn't added. Why was the SB field not added? Is there any other way to get the number of forward and reverse reads?

Please find 30_S30_control1.vcf, 30_S30_control2.vcf and 30_S30_control1_SBBS.vcf attached.

Thanks

"GVCFs produced by HaplotypeCaller" vs "VCFs produced by UnifiedGenotyper (EMIT_ALL_SITES option)"

Hi GATK team, I would like your opinion on the workflow that best fits my data. I have been exploring both variant calling algorithms, UnifiedGenotyper and HaplotypeCaller, and I would prefer UnifiedGenotyper given its sensitivity and analysis runtime. Because my experimental cohort grows over time, I opted for single-sample calling followed by joint analysis using CombineVariants instead of multi-sample variant calling. However, this approach has a few drawbacks (an issue discussed in a few forums): for a particular SNP locus, we cannot tell whether the "./." reported for some samples is due to no reads covering that locus, to the site failing certain criteria during the earlier variant calling, or to a homozygous-reference genotype (the case I care about most and cannot afford to lose).

Then I found the GVCF format, which could potentially solve my problem (thanks, GATK team, for always understanding researchers' needs)! Again, I would rather use UnifiedGenotyper instead of HaplotypeCaller to generate the GVCF, and from the forum at https://www.broadinstitute.org/gatk/guide/tagged?tag=gvcf I assumed that VCFs produced by "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" would be something like the GVCF files produced by HaplotypeCaller. However, I could not join them up using either CombineVariants or CombineGVCFs; most probably "UnifiedGenotyper with --output_mode EMIT_ALL_SITES" does not generate GVCF format.

Can you please advise on the workflow that best fits my needs with minimum runtime (UnifiedGenotyper would be my preferred choice)? Is there any method for jointly merging the ALL_SITES VCF files produced by UnifiedGenotyper that I might have missed on the GATK pages?

Same setting, different SNPs being called

Hey,

Currently, I am working on 454 sequencing data. I am using GATK to call SNPs and indels, which works relatively well for the SNPs with the recommended filtering options. However, I was interested in raising sensitivity, which is why I started experimenting with the QD value. I restarted the whole GATK pipeline (including the genotyping) with the filtering option "QD<0.0" instead of "QD<2.0". As I did not change any other parameters, I was simply expecting to see more SNPs being called. However, I came across a list of SNPs which did contain some new SNPs, but which was also lacking one of the previously called SNPs (it was not even present in the raw HaplotypeCaller output). How could this have happened?

This is an extract from the vcf file containing the SNP that was previously called:

4 106156187 rs17253672 C T 490.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.341;ClippingRankSum=1.022;DB;DP=161;FS=4.193;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=-0.231;QD=3.05;ReadPosRankSum=1.347 GT:AD:DP:GQ:PL 0/1:10,12:22:99:519,0,262

I am very curious to hear whether anyone might have an explanation for my findings. Many thanks in advance!


Bug? PairHMM outputs to stdout instead of stderr

Hi,

I'm using 3.3-0-g37228af.

This command gets around a feature that looks like a bug to me:

 gatk -T HaplotypeCaller -R ref.fa -I in.bam -o /dev/stdout 2> log.txt | grep -v -e PairHMM -e "JNI call" | bgzip -s - > out.vcf.gz

As you see, one has to remove three lines of the stdout stream written by PairHMM. Like all other logging information, they should go to stderr.

The three lines look like this:

Using un-vectorized C++ implementation of PairHMM
Time spent in setup for JNI call : 3.3656100000000003E-4
Total compute time in PairHMM computeLikelihoods() : 0.008854789

Regards, Ari

Code exception at HaplotypeCaller step

Here is the command I actually used.

java -jar -Xmx39g $gatk -T HaplotypeCaller -l INFO -R $reference -I $sample.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 --dbsnp $dbsnp -o $sample.raw.snps.indels.g.vcf -S LENIENT > $sampleName""$processDay""HaplotypeCaller.log 2>&1 &&

I'm trying to use HaplotypeCaller on some whole-genome projects. There were no errors from mapping through BQSR. However, this error message happens randomly with respect to genome location and subject; some subjects finish without any error using the same script. Could anyone give me some idea of what is happening?

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.util.ConcurrentModificationException
    at java.util.LinkedHashMap$LinkedHashIterator.nextEntry(Unknown Source)
    at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
    at java.util.LinkedHashMap$EntryIterator.next(Unknown Source)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.addMiscellaneousAllele(HaplotypeCa
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(Haplotyp
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:941)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:218)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler$ReadMapReduceJob.run(NanoScheduler.java:471)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.2-2-gec30cee):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

HaplotypeCaller and ClippingRankSum

Is there any way to turn off the ClippingRankSum? When I attempt to use --excludeAnnotation ClippingRankSum it doesn't seem to do anything. I'm trying to locate the source of some inconsistencies between runs and think this may be the culprit.

ERROR : IndexOutOfBoundsException: Index: 3, Size: 3 in getAlternateAllele, when genotyping variant

Hi, I'm currently trying to use GATK to call variants from human RNA-seq data.

So far, I've managed to do variant calling in all my samples following the GATK best practice guidelines. (using HaplotypeCaller in DISCOVERY mode on each sample separately)

But I'd like to go further and try to get the genotype in every sample for each variant found in at least one sample. This is to differentiate, for each variant, samples where that variant is absent (homozygous for the reference allele) from samples where it is not covered (and therefore not genotyped).

To do so, I first used CombineVariants to merge the variants from all my samples and create the list of variants to be genotyped, ${ALLELES}.vcf.

I then try to regenotype my samples with HaplotypeCaller using the GENOTYPE_GIVEN_ALLELES mode and the same settings as before: my command is the following:

java -jar ${GATKPATH}/GenomeAnalysisTK.jar -T HaplotypeCaller -R ${GENOMEFILE}.fa -I ${BAMFILE_CALIB}.bam --genotyping_mode GENOTYPE_GIVEN_ALLELES -alleles ${ALLELES}.vcf -out_mode EMIT_ALL_SITES -dontUseSoftClippedBases -stand_emit_conf 20 -stand_call_conf 20 -o ${SAMPLE}_genotypes_all_variants.vcf -mbq 25 -L ${CDNA_BED}.bed --dbsnp ${DBSNP}.vcf


In doing so I invariably get the same error after calling 0.2% of the genome.

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IndexOutOfBoundsException: Index: 3, Size: 3
    at java.util.ArrayList.rangeCheck(ArrayList.java:635)
    at java.util.ArrayList.get(ArrayList.java:411)
    at htsjdk.variant.variantcontext.VariantContext.getAlternateAllele(VariantContext.java:845)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:248)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:1059)
    at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:221)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
    at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Index: 3, Size: 3
ERROR ------------------------------------------------------------------------------------------

Because the problem seemed to originate from getAlternateAllele, I tried to play with --max_alternate_alleles by setting it to 2 or 10, without success. I also checked my ${ALLELES}.vcf file for malformed alternate alleles in the region where GATK crashes (chr 1, somewhere after 78Mb), but I couldn't identify any (I searched for alternate alleles that would not match the extended regexp '[ATGC,]+').

I would be grateful for any help you can provide. Thanks.

HC step 1: Defining ActiveRegions by measuring data entropy

This document describes the procedure used by HaplotypeCaller to define ActiveRegions on which to operate as a prelude to variant calling. For more context information on how this fits into the overall HaplotypeCaller method, please see the more general HaplotypeCaller documentation.

Summary

To define active regions, the HaplotypeCaller operates in three phases. First, it computes an activity score for each individual genome position, yielding the raw activity profile, which is a wave function of activity per position. Then, it applies a smoothing algorithm to the raw profile, which is essentially a sort of averaging process, to yield the actual activity profile. Finally, it identifies local maxima where the activity profile curve rises above the preset activity threshold, and defines appropriate intervals to encompass the active profile within the preset size constraints.


1. Calculating the raw activity profile

Active regions are determined by calculating a profile function that characterizes “interesting” regions likely to contain variants. The raw profile is first calculated locus by locus.

In the normal case (no special mode is enabled) the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment.

If using the mode for genotyping given alleles (GGA) or the advanced-level flag -useAlleleTrigger, and the site is overlapped by an allele in the VCF file provided through the -alleles argument, the score is set to 1. If the position is not covered by a provided allele, the score is set to 0.

This operation gives us a single raw value for each position on the genome (or within the analysis intervals requested using the -L argument).
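A minimal Python sketch of what this first phase produces, assuming per-position variant probabilities are already available from the reference-confidence model (the positions and probabilities below are invented for illustration):

```python
# Sketch of phase 1: building the raw activity profile.
# variant_prob stands in for the reference-confidence model output;
# the numbers are invented for illustration only.
variant_prob = {100: 0.001, 101: 0.002, 102: 0.650, 103: 0.010, 104: 0.0}
provided_allele_positions = {103}   # hypothetical positions from -alleles (GGA mode)

def raw_activity(pos, gga_mode=False):
    if gga_mode:
        # In GGA (or -useAlleleTrigger) mode: 1 at positions overlapped by a
        # provided allele, 0 everywhere else.
        return 1.0 if pos in provided_allele_positions else 0.0
    # Normal case: probability that the position contains a variant.
    return variant_prob.get(pos, 0.0)

raw_profile = [raw_activity(p) for p in range(100, 105)]
print(raw_profile)   # [0.001, 0.002, 0.65, 0.01, 0.0]
```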


2. Smoothing the activity profile

The final profile is calculated by smoothing this initial raw profile in three steps (a Python sketch of the smoothing follows the list below). The first two steps consist of spreading each position's raw profile value out to contiguous bases. As a result, each position accumulates more than one raw profile value; these are added up in the third and last step to obtain a single, smoothed value per position.

  1. Unless one of the special modes is enabled (GGA or allele triggering), the position's profile value is copied over to adjacent positions if enough high-quality soft-clipped bases immediately precede or follow that position in the original alignment. At the time of writing, high-quality soft-clipped bases are those with a quality score of Q29 or more, and we consider there to be enough of them when the average number of high-quality bases per soft-clip is 7 or more. In this case the site's profile value is copied to all bases within a radius of that position as large as the average soft-clip length, without exceeding a maximum of 50bp.

  2. Each profile value is then divided and spread out using a Gaussian kernel covering up to 50bp radius centered at its current position with a standard deviation, or sigma, set using the -bandPassSigma argument (current default is 17 bp). The larger the sigma, the broader the spread will be.

  3. For each position, the final smoothed value is calculated as the sum of all its profile values after steps 1 and 2.
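Below is the sketch referred to above: a rough Python rendering of the Gaussian spreading in step 2 and the summation in step 3, using the 50bp radius and the default sigma of 17bp from the text (step 1, the soft-clip spreading, is omitted for brevity; the profile values are invented).

```python
import math

# Sketch of smoothing steps 2-3: spread each raw value with a Gaussian kernel
# (radius 50 bp, sigma taken from -bandPassSigma, default 17 bp) and sum the
# contributions landing on each position.
RADIUS = 50
SIGMA = 17.0

def gaussian_kernel(radius=RADIUS, sigma=SIGMA):
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in range(-radius, radius + 1)]
    total = sum(weights)
    # Normalised so each raw value is divided across positions, not inflated.
    return [w / total for w in weights]

def smooth(raw_profile):
    kernel = gaussian_kernel()
    smoothed = [0.0] * len(raw_profile)
    for i, value in enumerate(raw_profile):
        for offset, weight in zip(range(-RADIUS, RADIUS + 1), kernel):
            j = i + offset
            if 0 <= j < len(raw_profile):
                smoothed[j] += value * weight
    return smoothed

raw = [0.0] * 200
raw[100] = 0.65            # a single active-looking position (invented)
profile = smooth(raw)
print(profile[100], profile[150])   # the peak stays at position 100 and decays with distance
```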


3. Setting the ActiveRegion thresholds and intervals

The resulting profile line is cut into regions wherever it crosses the non-active to active threshold (currently set to 0.002). Then we make some adjustments to these boundaries so that regions to be considered active, i.e. with a profile running over that threshold, fall within the minimum (fixed at 50bp) and maximum region size (customizable using -activeRegionMaxSize).

  • If the region size falls within the limits we leave it untouched (it's good to go).

  • If the region size is shorter than the minimum, it is greedily extended forward, ignoring that cut point, and we come back to step 1. Only if this is not possible because we hit a hard limit (the end of the chromosome or of the requested analysis interval) do we accept the small region as it is.

  • If it is too long, we find the lowest local minimum between the minimum and maximum region size. A local minimum is a profile value preceded by a larger value immediately upstream (-1bp) and followed by an equal or larger value immediately downstream (+1bp). In case of a tie, the one further downstream takes precedence. If there is no local minimum, we simply force the cut so that the region has the maximum active region size.

Of the resulting regions, those with a profile that runs over this threshold are considered active regions and progress to variant discovery and/or calling, whereas regions whose profile runs under the threshold are considered inactive regions and are discarded, except if we are running HC in reference-confidence mode.
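A simplified Python sketch of this cutting and splitting logic follows; the 0.002 threshold and 50bp minimum come from the text, while the maximum size and the profile values are stand-ins, and the real implementation adjusts boundaries more carefully.

```python
# Sketch of phase 3: cut the smoothed profile at threshold crossings, then
# split over-long active stretches at the lowest local minimum.
THRESHOLD = 0.002   # non-active to active threshold from the text
MIN_SIZE = 50       # minimum region size from the text
MAX_SIZE = 300      # stand-in for -activeRegionMaxSize (not the real default)

def cut_at_threshold(profile):
    """Return (start, end, is_active) runs of positions above/below the threshold."""
    regions, start = [], 0
    for i in range(1, len(profile) + 1):
        if i == len(profile) or (profile[i] > THRESHOLD) != (profile[start] > THRESHOLD):
            regions.append((start, i - 1, profile[start] > THRESHOLD))
            start = i
    return regions

def split_long_region(profile, start, end):
    """Split an over-long active region at the lowest local minimum (ties go downstream)."""
    if end - start + 1 <= MAX_SIZE:
        return [(start, end)]
    candidates = [i for i in range(start + MIN_SIZE, min(start + MAX_SIZE, end))
                  if profile[i - 1] > profile[i] and profile[i + 1] >= profile[i]]
    if candidates:
        cut = min(candidates, key=lambda i: (profile[i], -i))  # lowest value, ties downstream
    else:
        cut = start + MAX_SIZE - 1                             # no local minimum: force max size
    return [(start, cut)] + split_long_region(profile, cut + 1, end)
```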

There is a final post-processing step to clean up and trim the ActiveRegion:

  • Remove bases at each end of the read (hard-clipping) until there is a base with a call quality equal to or greater than the minimum base quality score (customizable parameter -mbq, 10 by default).

  • Include or exclude remaining soft-clipped ends. Soft-clipped ends will be used for assembly and calling unless the user has requested their exclusion (using -dontUseSoftClippedBases), provided the read and its mate map to the same chromosome and are in the correct standard orientation (i.e. LR and RL).

  • Clip off adaptor sequences of the read if present.

  • Discard all reads that no longer overlap with the ActiveRegion after the trimming operations described above.

  • Downsample remaining reads to a maximum of 1000 reads per sample, but respecting a minimum of 5 reads starting per position. This is performed after any downsampling by the traversal itself (-dt, -dfrac, -dcov etc.) and cannot be overridden from the command line.
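As an illustration of the first trimming bullet above (hard-clipping low-quality read ends), here is a small Python sketch, assuming base qualities are already available as integers and using the default -mbq of 10; the example read is invented.

```python
# Sketch of the first clean-up step: hard-clip bases from each end of a read
# until a base with quality >= the minimum (-mbq, default 10) is reached.
MIN_BASE_QUALITY = 10

def trim_low_quality_ends(bases, quals, min_qual=MIN_BASE_QUALITY):
    start = 0
    while start < len(quals) and quals[start] < min_qual:
        start += 1
    end = len(quals)
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return bases[start:end], quals[start:end]

read_bases = "ACGTACGTAC"
read_quals = [2, 8, 30, 35, 33, 31, 30, 29, 9, 3]   # invented example qualities
print(trim_low_quality_ends(read_bases, read_quals))
# ('GTACGT', [30, 35, 33, 31, 30, 29])
```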

HaplotypeCaller Multisample Variant Calling

Hey there!

I've been using HaplotypeCaller as part of a new whole genome variant calling pipeline I'm working on and I had a question about the number of samples to use. From my tests, it seems like increasing the number of samples to run HaplotypeCaller on simultaneously improves the accuracy no matter how many samples I add (I've tried 1, 4, and 8 samples at a time). Before I tried 16 samples, I was wondering if you could tell me if there's a point of diminishing returns for adding samples to HaplotypeCaller. It seems like with every sample I add, the time/sample increases, so I don't want to keep adding samples if it's not going to result in an improved call set, but if it does improve the results I'll deal with it and live with the longer run times. I should note that I'm making this pipeline for an experiment where there will be up to 50 individuals, and of those, there are family groups of 3-4 people. If running HaplotypeCaller on all 50 simultaneously would result in the best call set, that's what I'll do. Thanks! (By the way, I love the improvements you made with 2.5!)

  • Grant

HaplotypeCaller error: "Index: unable to reposition file channel of index file"

Is this likely to be something wrong with my bam.bai file?

INFO 00:33:22,970 HelpFormatter - --------------------------------------------------------------------------------
INFO 00:33:22,975 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 00:33:22,975 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 00:33:22,975 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 00:33:22,980 HelpFormatter - Program Args: -T HaplotypeCaller -R /scratch2/vyp-scratch2/reference_datasets/human_reference_sequence/human_g1k_v37.fasta -I Project_1410AHP-0004_ElenaSchiff_211samples/align/data//61_sorted_unique.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -stand_call_conf 30.0 -stand_emit_conf 10.0 -L 22 --downsample_to_coverage 200 --GVCFGQBands 10 --GVCFGQBands 20 --GVCFGQBands 50 -o Project_1410AHP-0004_ElenaSchiff_211samples/gvcf/data//61_chr22.gvcf.gz
INFO 00:33:22,989 HelpFormatter - Executing as rmhanpo@groucho-1-2.local on Linux 2.6.18-371.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18.
INFO 00:33:22,989 HelpFormatter - Date/Time: 2014/12/28 00:33:22
INFO 00:33:22,989 HelpFormatter - --------------------------------------------------------------------------------
INFO 00:33:22,989 HelpFormatter - --------------------------------------------------------------------------------
WARN 00:33:23,015 GATKVCFUtils - Creating Tabix index for Project_1410AHP-0004_ElenaSchiff_211samples/gvcf/data/61_chr22.gvcf.gz, ignoring user-specified index type and parameter
INFO 00:33:23,649 GenomeAnalysisEngine - Strictness is SILENT
INFO 00:33:23,823 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 200
INFO 00:33:23,831 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 00:33:23,854 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02
INFO 00:33:23,862 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO 00:33:23,869 IntervalUtils - Processing 51304566 bp from intervals
INFO 00:33:24,114 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 00:33:25,977 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: Index: unable to reposition file channel of index file Project_1410AHP-0004_ElenaSchiff_211samples/align/data/61_sorted_unique.bam.bai
    at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.skipBytes(GATKBAMIndex.java:438)
    at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.skipToSequence(GATKBAMIndex.java:295)
    at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.readReferenceSequence(GATKBAMIndex.java:127)
    at org.broadinstitute.gatk.engine.datasources.reads.BAMSchedule.(BAMSchedule.java:104)
    at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.getNextOverlappingBAMScheduleEntry(BAMScheduler.java:295)
    at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.advance(BAMScheduler.java:184)
    at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.populateFilteredIntervalList(BAMScheduler.java:104)
    at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.createOverIntervals(BAMScheduler.java:78)
    at org.broadinstitute.gatk.engine.datasources.reads.IntervalSharder.shardOverIntervals(IntervalSharder.java:59)
    at org.broadinstitute.gatk.engine.datasources.reads.SAMDataSource.createShardIteratorOverIntervals(SAMDataSource.java:1173)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.getShardStrategy(GenomeAnalysisEngine.java:623)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Index: unable to reposition file channel of index file Project_1410AHP-0004_ElenaSchiff_211samples/align/data/61_sorted_unique.bam.bai
ERROR ------------------------------------------------------------------------------------------

Haplotype caller for SNP calling from RNA seq data

Hello, I have a general question. Why is it recommended to use a single sample for SNP calling from RNA-seq data when using HaplotypeCaller? Isn't it possible to provide all the samples together for SNP calling? Is it a technical aspect of HaplotypeCaller, or does it help to reduce errors? Thanks a lot in advance for your help, Homa

Does UG and HC downsample before or after exclusion of marked duplicates?

Do UG and HC downsample before or after the exclusion of marked duplicates? I guess code is king for the answer...

HaplotypeCaller on whole genome or chromosome by chromosome: different results

Hi,

I'm working on targeted resequencing data and I'm doing a multi-sample variant calling with the HaplotypeCaller. First, I tried to call the variants in all the targeted regions by doing the calling at one time on a cluster. I thus specified all the targeted regions with the -L option.

Then, as it was taking too long, I decided to cut my interval list, chromosome by chromosome and to do the calling on each chromosome. At the end, I merged the VCFs files that I had obtained for the callings on the different chromosomes.

Then, I compared this merged VCF file with the VCF file that I obtained by doing the calling on all the targeted regions at one time. I noticed 1% variation between the two variant lists, and I can't explain this stochasticity. Any suggestion?

Thanks!

Maguelonne

HaplotypeCaller

Hi

I am using GATK version 3.3 (latest release) and following the best practices for variant calling. The steps I am using are: 1. Run HaplotypeCaller and get the gVCF file; 2. Run GenotypeGVCFs across the samples to get the VCF file.

The question is about the genotype calling. Here is an example:

The GVCF file looks like this for a particular position:
chr1 14677 . G . . END=14677 GT:DP:GQ:MIN_DP:PL 0/0:9:0:9:0,0,203

The VCF file looks like this after genotyping with multiple samples:
chr1 14677 . G A 76.56 PASS . . GT:AD:DP:GQ:PL 0/1:8,1:9:4:4,0,180

In the GVCF file there was no support for the alternate allele, but the VCF file shows one read supporting the alternate call.
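For what it's worth, the PL fields already hint at why the genotype can change: PL values are phred-scaled genotype likelihoods normalized so the best genotype is 0, and GQ is the gap between the best and second-best PL. A small Python sketch using the two records above:

```python
# PLs are phred-scaled, normalized genotype likelihoods (best genotype = 0);
# GQ is the difference between the second-best and best PL, capped at 99.
GENOTYPES = ["0/0", "0/1", "1/1"]   # order of PL values at a biallelic site

def interpret_pl(pl):
    best = pl.index(min(pl))
    ranked = sorted(pl)
    gq = min(99, ranked[1] - ranked[0])
    return GENOTYPES[best], gq

print(interpret_pl([0, 0, 203]))   # ('0/0', 0): 0/0 and 0/1 equally likely in the GVCF record
print(interpret_pl([4, 0, 180]))   # ('0/1', 4): weak preference for the het call after joint genotyping
```

So the hom-ref call in the GVCF has GQ 0, meaning the evidence did not distinguish 0/0 from 0/1, which is consistent with joint genotyping later assigning 0/1.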

Thanks ! Saurabh

Workflow for fungal WGS

Hi all, I'm in a bit of a daze going through all the documentation and I wanted to do a sanity check on my workflow with the experts. I have ~120 WGS of a ~24Mb fungal pathogen. The end-product of my GATK workflow would be a high quality call set of SNPs, restricted to the sites for which we have confidence in the call across all samples (so sites which are not covered by sufficient high quality reads in one or more samples will be eliminated).

Therefore my workflow (starting from a sorted indexed BAM file of reads from a single sample, mapped to reference with bwa mem) is this:

  • 01- BAM: Local INDEL realignment (RealignerTargetCreator/IndelRealigner)
  • 02- BAM: MarkDuplicates
  • 03- BAM: Local INDEL realignment second pass (RealignerTargetCreator/IndelRealigner)
  • 04- BAM: Calling variants using HaplotypeCaller
  • 05- VCF: Hard filter variants for a truth set for BQSR (there are no known variant site databases, so we use our best variants from each VCF file for this). The filter settings are: "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" and we also filter out heterozygous positions using "isHet == 1".
  • 06- VCF: Calculate BQSR table using the high quality hard-filtered variants from step 05.
  • 07- BAM: Apply BQSR recalibration from previous step to BAM file from step 04.
  • 08- BAM: Calling variants on the recalibrated BAM file from the previous step using HaplotypeCaller, also emitting reference sites using --output_mode EMIT_ALL_SITES and --emitRefConfidence GVCF

Does this sound like a reasonable thing to do? What options should I use in step 8 in order for HC to tell me how confident it is, site by site, about its calls, including those that are homozygous reference? I notice that when using --output_mode EMIT_ALL_CONFIDENT_SITES and --emitRefConfidence GVCF I am missing a lot of the annotation I get when just outputting variant sites (e.g. QD).
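Regarding step 05, here is a small, purely illustrative Python sketch of what that hard-filter expression amounts to when applied to one record's INFO annotations (how missing annotations are handled below is an assumption of the sketch, not VariantFiltration's behaviour):

```python
# Sketch of the step-05 hard filter:
# "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
# plus removal of heterozygous calls ("isHet == 1"). Example values are invented.

def fails_hard_filter(info, is_het):
    return (info.get("QD", float("inf")) < 2.0
            or info.get("FS", 0.0) > 60.0
            or info.get("MQ", float("inf")) < 40.0
            or info.get("MQRankSum", 0.0) < -12.5
            or info.get("ReadPosRankSum", 0.0) < -8.0
            or is_het)

example = {"QD": 25.3, "FS": 1.7, "MQ": 59.8, "MQRankSum": 0.2, "ReadPosRankSum": -0.4}
print(fails_hard_filter(example, is_het=False))   # False: record kept as a BQSR "known site"
```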
