Channel: haplotypecaller — GATK-Forum

Ploidy and "Pooled" for the Haplotype Caller

Hello everyone,

I was reading the haplotype caller documentation and noticed the "--sample_ploidy/-ploidy" flag. The description reads "Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy)."

My question is: what exactly is a pooled experiment? Is it when I have multiple samples? I have separate files for each of my 8 samples, and the organism has only one chromosome. So would the number I set be 8*1? Or does the pooled number apply to multiple samples within a single file, in which case I would specify 1 instead of 8?
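
The formula in the quoted description can be written out as a small Python sketch. The helper name `pooled_ploidy` is hypothetical and not part of GATK; it only restates the documentation's arithmetic, on the reading that a "pool" means several individuals sequenced together as one sample.

```python
def pooled_ploidy(individuals_per_pool: int, ploidy_per_individual: int) -> int:
    """Value for -ploidy per the quoted rule: pool size times sample ploidy.

    Hypothetical helper for illustration only; not a GATK function.
    """
    if individuals_per_pool < 1 or ploidy_per_individual < 1:
        raise ValueError("both arguments must be positive integers")
    return individuals_per_pool * ploidy_per_individual


# One haploid individual per sample (each sample in its own file).
print(pooled_ploidy(1, 1))

# Eight haploid individuals mixed into a single sequenced pool.
print(pooled_ploidy(8, 1))
```

Under that reading, the value depends on how many individuals went into each sequenced sample, not on how many files are supplied.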

Thanks!
Raymosrunerx


Recommended protocol for bootstrapping HaplotypeCaller and BaseRecalibrator outputs?

I am identifying new sequence variants/genotypes from RNA-Seq data. The species I am working with is not well studied, and there are no available datasets of reliable SNP and INDEL variants.

For BaseRecalibrator, the recommendation when lacking a reliable set of sequence variants is:
"You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence."

Setting up a script to run HaplotypeCaller and BaseRecalibrator in a loop should be fairly straightforward. What is a good strategy for comparing VCF files and assessing convergence?
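
One possible way to compare the call sets of two consecutive rounds is to key each variant by chromosome, position, REF and ALT and measure the overlap. The Python sketch below is only an illustration: the records are made up, the function names are hypothetical, and the 0.99 cutoff is an arbitrary assumption rather than a GATK recommendation.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from tab-separated VCF data lines."""
    keys = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # one key per ALT allele
            keys.add((chrom, int(pos), ref, allele))
    return keys


def jaccard(a, b):
    """Shared variants divided by all variants seen in either round."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up records standing in for two consecutive bootstrap rounds.
round1 = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t40\t.\tG\tA",
]
round2 = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t75\t.\tT\tC",
]

similarity = jaccard(variant_keys(round1), variant_keys(round2))
CONVERGENCE_THRESHOLD = 0.99  # arbitrary illustrative cutoff
print(f"Jaccard similarity: {similarity:.2f}")
print("converged" if similarity >= CONVERGENCE_THRESHOLD else "not converged")
```

Tracking this similarity from round to round, and stopping once it no longer changes much, is one way to make "until convergence" concrete.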

DP for INDEL is more than 100 and AD is 0

Hi all,

I have the INDEL call below from GATK 3.3 HaplotypeCaller.

chr17   39190954    .   G   GCAGCAGCTTGGCTGGCAGCAGCTGGTCTCA 770.52  PASS    AC=1;AF=0.500;AN=2;DP=138;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=58.33;MQ0=0;QD=5.58;SOR=0.693   GT:AD:DP:GQ:PL  0/1:0,0:0:7:807,0,7

The command used:

java -Xmx10G -jar GenomeAnalysisTK.jar -R %s -T HaplotypeCaller -I %s -L %s -stand_emit_conf 10 -stand_call_conf 30 --genotyping_mode DISCOVERY -o %s

DP in the INFO field is 138, but AD in the FORMAT field is 0,0. I understand that DP and AD are the unfiltered and filtered depths, respectively. However, having 0 reads is alarming. Could someone help me understand the differing read depths?
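
To make clear which number comes from which field, here is a minimal Python sketch that splits the record above into its INFO and per-sample parts. The helper names are hypothetical, and the code only reads the values; it does not explain why they differ.

```python
def parse_info(info):
    """Turn 'AC=1;DB;DP=138' into a dict; flags without '=' map to True."""
    out = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_sample(fmt, sample):
    """Pair FORMAT keys with one sample's colon-separated values."""
    return dict(zip(fmt.split(":"), sample.split(":")))


record = (
    "chr17 39190954 . G GCAGCAGCTTGGCTGGCAGCAGCTGGTCTCA 770.52 PASS "
    "AC=1;AF=0.500;AN=2;DP=138;FS=0.000;MLEAC=1;MLEAF=0.500;"
    "MQ=58.33;MQ0=0;QD=5.58;SOR=0.693 "
    "GT:AD:DP:GQ:PL 0/1:0,0:0:7:807,0,7"
)

fields = record.split()
info = parse_info(fields[7])
sample = parse_sample(fields[8], fields[9])

site_dp = int(info["DP"])                        # INFO-level depth
ad = [int(x) for x in sample["AD"].split(",")]   # per-allele depths (ref, alt)
sample_dp = int(sample["DP"])                    # FORMAT-level depth

print(site_dp, ad, sample_dp)  # 138 [0, 0] 0
```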

Distribution of RGQ scores

I work with non-human genomes and commonly need the confidence of the reference sites, so I was happy to see the inclusion of the RGQ score in the format field of GenotypeGVCFs. However, I am a little confused as to what this score means (how it is calculated). Out of curiosity I plotted the distribution of RGQ and GQ scores over ~1Mbp. A few things jumped out that I was hoping you could explain:

(1) There are two peaks of GQ and RGQ scores: one at 99, which is obviously just the highest confidence score, and another at exactly GQ/RGQ=45. You can see this in the GQ/RGQ distribution below, from which I've excluded the sites where RGQ/GQ = 0 or 99 (RGQ = blue, GQ = red). Is there some reason why so many GT calls have a score of exactly 45?

(2) There are very few GQ = 0 calls and ~96% are GQ = 99, but for RGQ ~42% are 0 and 54% are 99. Is there any explanation for why so many RGQ scores are 0? I fear that filtering on RGQ will bias the data against reference calls and include a disproportionate number of variant calls.
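
A tally like the one described can be sketched in Python as below. The FORMAT and sample strings are made-up toy values, not data from this post, and the function names are hypothetical.

```python
from collections import Counter


def quality_values(records, key):
    """Extract integer values of one FORMAT key (e.g. 'GQ' or 'RGQ')."""
    values = []
    for fmt, sample in records:
        fields = dict(zip(fmt.split(":"), sample.split(":")))
        value = fields.get(key)
        if value not in (None, "."):
            values.append(int(value))
    return values


def fraction_at(values, score):
    """Fraction of values equal to a given score."""
    return values.count(score) / len(values) if values else 0.0


# Toy (FORMAT, sample) pairs: reference sites carry RGQ, variant sites carry GQ.
records = [
    ("GT:AD:DP:RGQ", "0/0:10:10:99"),
    ("GT:AD:DP:RGQ", "0/0:2:2:0"),
    ("GT:AD:DP:GQ:PL", "0/1:5,5:10:99:200,0,200"),
    ("GT:AD:DP:RGQ", "0/0:7:7:45"),
]

rgq = quality_values(records, "RGQ")
gq = quality_values(records, "GQ")

print("RGQ histogram:", dict(Counter(rgq)))
print("GQ histogram:", dict(Counter(gq)))
print("fraction RGQ == 0:", fraction_at(rgq, 0))
```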

./. genotype despite DP coverage equal to number of reads in reference AD field

I came across some unusual variants called by HaplotypeCaller running in gvcf mode while working on human WGS data (the example gvcf line can be seen below).
The genotype in almost all samples is undefined, i.e. "./.", despite the good coverage reported in the DP field (only one sample is identified as 0/1). Moreover, in the "./." samples all reads fall into the reference allele group of the AD field, so I would expect a "0/0" genotype rather than "./.".
I have also inspected several BAM files visually and did not find any obvious mapping problems. I have attached two IGV snapshots of the variant region: the first is from an example "./." patient and the second is from the only patient with the variant. The region seems to have good 25-30x coverage, with the majority of mapping qualities equal to 60. However, there is apparently another insertion nearby.
The GATK version I am using is 2015.1-3.4.0-1-ga5ca3fc and reference genome is GRCh38.

Could you please explain why the inferred genotype is "./." instead of "0/0"?

Best,

Ewa

chr1 100474610 rs568102277 T TG 358.91 .
AC=1;AF=0.500;AN=2;BaseQRankSum=2.54;ClippingRankSum=0.419;DB;DP=4026;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=1.36;QD=13.29;ReadPosRankSum=0.814;SOR=0.551
GT:AD:DP:GQ:PGT:PID:PL ./.:36,0:36 ./.:37,0:37
./.:33,0:33 ./.:30,0:30 ./.:36,0:36 ./.:32,0:32
./.:36,0:36 ./.:32,0:32 ./.:31,0:31 ./.:37,0:37
./.:27,0:27 ./.:34,0:34 ./.:38,0:38 ./.:28,0:28
./.:29,0:29 ./.:31,0:31 ./.:25,0:25 ./.:24,0:24
./.:19,0:19 ./.:41,0:41 ./.:24,0:24 ./.:27,0:27 ./.:26,0:26
./.:28,0:28 ./.:31,0:31 ./.:38,0:38 ./.:27,0:27
./.:22,0:22 ./.:31,0:31 ./.:27,0:27 ./.:29,0:29
./.:28,0:28 ./.:34,0:34 ./.:20,0:20 ./.:26,0:26
./.:33,0:33 ./.:26,0:26 ./.:26,0:26 ./.:31,0:31
./.:32,0:32 ./.:34,0:34 ./.:27,0:27 ./.:28,0:28
./.:37,0:37 ./.:38,0:38 ./.:25,0:25 ./.:31,0:31
./.:37,0:37 ./.:31,0:31 ./.:32,0:32 ./.:30,0:30
./.:38,0:38 ./.:36,0:36 ./.:32,0:32 ./.:40,0:40
./.:32,0:32 ./.:42,0:42 ./.:37,0:37 ./.:29,0:29
./.:42,0:42 ./.:31,0:31 ./.:36,0:36 ./.:35,0:35
./.:31,0:31 ./.:35,0:35 ./.:32,0:32 ./.:30,0:30
./.:30,0:30 ./.:36,0:36 ./.:34,0:34 ./.:28,0:28
./.:37,0:37 ./.:34,0:34 ./.:24,0:24 ./.:31,0:31
./.:33,0:33 ./.:36,0:36 ./.:37,0:37 ./.:48,0:48
./.:25,0:25 ./.:39,0:39 ./.:26,0:26 ./.:23,0:23
./.:39,0:39 ./.:29,0:29 ./.:33,0:33 ./.:37,0:37
./.:27,0:27 ./.:29,0:29 ./.:42,0:42 ./.:28,0:28
./.:29,0:29 ./.:30,0:30 ./.:39,0:39 ./.:39,0:39
./.:35,0:35 ./.:31,0:31 ./.:29,0:29 ./.:23,0:23 ./.:30,0:30
./.:24,0:24 ./.:29,0:29 ./.:26,0:26 ./.:19,0:19
./.:26,0:26 ./.:16,0:16 ./.:27,0:27 ./.:24,0:24
./.:34,0:34 ./.:28,0:28 ./.:41,0:41 ./.:41,0:41
./.:39,0:39 ./.:24,0:24
0/1:11,16:27:99:1|0:100474609_G_GT:381,0,245 ./.:36,0:36
./.:26,0:26 ./.:27,0:27 ./.:29,0:29 ./.:29,0:29
./.:28,0:28 ./.:24,0:24 ./.:19,0:19 ./.:31,0:31
./.:33,0:33 ./.:23,0:23 ./.:25,0:25 ./.:31,0:31
./.:34,0:34 ./.:26,0:26
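
For a record with this many sample columns, a short Python sketch can summarize the genotypes and flag no-calls that still report reads. Only four of the sample columns above are copied in, and the function name is hypothetical.

```python
from collections import Counter


def summarize_genotypes(fmt, samples):
    """Count GT values and no-call samples whose DP is non-zero."""
    keys = fmt.split(":")
    gt_counts = Counter()
    nocall_with_reads = 0
    for sample in samples:
        # zip() tolerates samples that omit trailing FORMAT fields
        fields = dict(zip(keys, sample.split(":")))
        gt = fields["GT"]
        gt_counts[gt] += 1
        if gt in ("./.", ".") and fields.get("DP", ".") not in (".", "0"):
            nocall_with_reads += 1
    return gt_counts, nocall_with_reads


fmt = "GT:AD:DP:GQ:PGT:PID:PL"
samples = [
    "./.:36,0:36",
    "./.:37,0:37",
    "./.:33,0:33",
    "0/1:11,16:27:99:1|0:100474609_G_GT:381,0,245",
]

gt_counts, nocall_with_reads = summarize_genotypes(fmt, samples)
print(dict(gt_counts))
print("no-calls with reads:", nocall_with_reads)
```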

Problems with dbSNP file using the HaplotypeCaller

Hi,

I am having the following problem:
I use the HaplotypeCaller (GATK 3.3.0) for variant calling. To identify variants that are known according to dbSNP, I use the "--dbsnp" argument and supply a dbSNP file (VCF file). I thought everything would work fine, but by coincidence I observed a problem that seems really serious to me: the same call is recognized as known in one sample, but not in another. These are the two relevant lines of the VCF files:

17 7579643 . CCCCCAGCCCTCCAGGT C 5066.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=4.819;ClippingRankSum=-1.054;DP=231;FS=78.565;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-0.994;QD=21.93;ReadPosRankSum=-5.473;SOR=1.639;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:23,207:230:99:5104,251,0

17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C 2868.73 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=3.120;ClippingRankSum=0.256;DB;DP=134;FS=1.120;MLEAC=2;MLEAF=1.00;MQ=59.91;MQ0=0;MQRankSum=1.849;QD=21.41;ReadPosRankSum=-1.285;SOR=0.704;set=variant;EFF=INTRON(MODIFIER||||393|TP53|protein_coding|CODING|ENST00000445888|3|1) GT:AD:DP:GQ:PL 1/1:13,121:134:96:2906,96,0

As we exclude known variants from our analysis, it is essential that this step works correctly, and I am quite unsure what to do now. The variant seems to be well known (according to information on the NCBI homepage). So why was it not identified as known in the other sample?
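
The discrepancy can be checked programmatically by comparing the ID column and the DB flag for records that describe the same allele. The Python sketch below uses abbreviated copies of the two lines above (most INFO fields and the sample column are omitted for brevity), and the function name is hypothetical.

```python
def dbsnp_annotation(record):
    """Return (variant key, ID column, whether INFO carries the DB flag)."""
    f = record.split()
    key = (f[0], int(f[1]), f[3], f[4])  # chrom, pos, ref, alt
    has_db_flag = "DB" in f[7].split(";")
    return key, f[2], has_db_flag


# Abbreviated versions of the two records from the post.
sample_a = "17 7579643 . CCCCCAGCCCTCCAGGT C 5066.73 PASS AC=2;AF=1.00;AN=2;DP=231"
sample_b = "17 7579643 rs59758982 CCCCCAGCCCTCCAGGT C 2868.73 PASS AC=2;AF=1.00;AN=2;DB;DP=134"

key_a, id_a, db_a = dbsnp_annotation(sample_a)
key_b, id_b, db_b = dbsnp_annotation(sample_b)

if key_a == key_b and (id_a, db_a) != (id_b, db_b):
    print("same allele, inconsistent dbSNP annotation:", (id_a, db_a), (id_b, db_b))
```

Running a check like this across all samples would at least show how often the annotation differs for identical alleles.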

It would be great if anyone could help me. Many thanks in advance!

Sarah

Haplotype caller for RNA-seq

Hi,
I have been running HaplotypeCaller on my RNA-seq data:

java -Xmx16g -jar GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R $ref \
-I INPUT.bam \
-stand_call_conf 50.0 \
-stand_emit_conf 10.0 \
-o output.vcf

My process was killed when it was 82% complete. Is there a way to resume the run without starting from the beginning?

Thanks
Best Regards
T. Hamdi Kitapci

The Splitting of BAM file (RNA-seq) before calling variants is throwing error

First of all, thank you for the tool. I am using GATK variant calling for my RNA-seq data. I have been following the commands given on the site, but it stops at the BAM splitting step with the following error:

ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Badly formed genome loc: Contig * given as location, but this contig isn't present in the Fasta sequence dictionary

The command I used is,
/opt/husar/bin/java-1.7 -jar /GenomeAnalysisTK-3.2-2.jar -T SplitNCigarReads -R /human_genome37_gatk.fa -I BM_ID_reorder.bam -o BM_ID_split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

I also tried variant calling on the duplicate-removed BAM file, which threw this error message:

ERROR
ERROR MESSAGE: SAM/BAM file BM_ID_reorder.bam is malformed: Reference index 1912602624 not found in sequence dictionary.
ERROR -

The command line I used for this,
/opt/husar/bin/java-1.7 -jar -Xincgc -Xmx1586M $NGSUTILDIR/java/GenomeAnalysisTK-3.2-2.jar -T HaplotypeCaller -R /human_genome37_gatk.fa -I BM_ID_reorder.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o BM_ID.vcf


HaplotypeCaller can not emit bamout in multi-threaded mode

Hi all,
I used multi-threading mode with HaplotypeCaller hoping to save some time, but it seems that bamout cannot be emitted in multi-threaded mode. I searched for answers, but I am still not sure whether the latest 3.4-46 version supports multi-threading with bamout. BTW, I am still using the old 3.3-0 version. If version 3.4 does support multi-threading with bamout, I will ask the computing core to update GATK for me. Otherwise I could just drop the bamout option to save some time, but I would really prefer not to, because I need to check the depth and coverage of the mapping results that were actually used for variant calling.
My command line:
java -Xmx12g -jar $GATK_JARS/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-nct 12 \
-R human_g1k_v37.fasta \
--dbsnp dbsnp_138.b37.vcf \
-I recal_realigned_b37.dedup.sorted.bam \
--genotyping_mode DISCOVERY \
-stand_emit_conf 10 \
-stand_call_conf 20 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-o raw_var_TKDOME.g.vcf \
-bamout force_bamout_TKDOME_b37.bam -forceActive -disableOptimizations
BTW, is it necessary to add --variant_index_type LINEAR and --variant_index_parameter 128000 in Version 3.3?
Thank you very much!

Variant calls being missed - how to improve variant detection using Sanger sequencing data?

@Geraldine_VdAuwera and @Sheila - Please help!

I've read through many of your posts/responses regarding HaplotypeCaller not calling variants, and tried many of the suggestions you've made to others, but I'm still missing variants. My situation is a little different (I'm trying to identify variants from Sanger sequence reads) but I'm hoping you might have additional ideas or can see something I've overlooked. I hope I haven't given you too much information below, but I've seen it mentioned that too much info is better than not enough.

A while back, I generated a variant call set from Illumina Next Gen Sequencing data using UnifiedGenotyper (circa v2.7.4), identifying ~46,000 discordant variants between the genomes of two haploid strains of S. cerevisiae. Our subsequent experiments included Sanger sequencing ~95 kb of DNA across 17 different loci in these two strains. I don't think any of the SNP calls were false positives, and there were very, very few false negatives.

Since then, we've constructed many strains by swapping variants at these loci between these two strains of yeast. To check if a strain was constructed successfully, we PCR the loci of interest and Sanger sequence the PCR product. I'm trying to use GATK (version 3.4-46) HaplotypeCaller (preferably, or alternatively UnifiedGenotyper) in a variant detection pipeline to confirm a properly constructed strain. I convert the .ab1 files to fastqs using EMBOSS seqret, map the Sanger reads using bwa mem ($ bwa mem -R $RG $refFasta $i > ${outDir}/samFiles/${fileBaseName}.sam), merge the sam files for each individual, and then perform the variant calling separately for each individual. I do not dedup (I actually intentionally leave out the -M flag in bwa), nor do I realign around indels (I plan to in the future, but there aren't any indels in any of the regions we are currently looking at), or do any BQSR in this pipeline. Also, when I do the genotyping after HaplotypeCaller, I don't do joint genotyping; each sample (individual) gets genotyped individually.

In general, this pipeline does identify many variants from the Sanger reads, but I'm still missing many variant calls that I can clearly see in the Sanger reads. Using a test set of 36 individuals, I examined the variant calls made from 364 Sanger reads that cover a total of 63 known variant sites across three ~5kb loci (40 SNPs in locus 08a-s02, 9 SNPs in locus 10a-s01, 14 SNPs in locus 12c-s02). Below are some example calls to HaplotypeCaller and UnifiedGenotyper, as well as a brief summary statement of general performance using the given command. I've also included some screenshots from IGV showing the alignments (original bam files and bamOut files) and SNP calls from the different commands.

Ideally, I'd like to use the HaplotypeCaller since not only can it give me a variant call with a confidence value, but it can also give me a reference call with a confidence value. And furthermore, I'd like to stay in DISCOVERY mode as opposed to Genotype Given Alleles, that way I can also assess whether any experimental manipulations we've performed might have possibly introduced new mutations.

Again, I'm hoping someone can advise me on how to make adjustments to reduce the number of missed calls.

Call 1:
The first call to HaplotypeCaller I'm showing produced the fewest variant calls at sites where I've checked the Sanger reads.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_hapcallRealigned.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hc_bp_raw.g.vcf

Call 2:
I tried a number of different -kmerSize values [(-kmerSize 10 -kmerSize 25), (-kmerSize 9), (-kmerSize 10), (-kmerSize 12), (-kmerSize 19), (-kmerSize 12 -kmerSize 19), and maybe some others]. I seemed to have the best luck when using -kmerSize 12 only; I picked up a few more SNPs (where I expected them), and only lost one SNP call compared to Call 1.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_kmer_hapcallRealigned.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    -kmerSize 12 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hc_bp_kmer_raw.g.vcf

Call 3:
I tried adjusting --minPruning 1 and --minDanglingBranchLength 1, which helped more than playing with kmerSize. I picked up many more SNPs compared to both Call 1 and Call 2 (but not necessarily the same SNPs I gained in Call 2).

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_adv_hapcallRealigned.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    --minPruning 1 \
    --minDanglingBranchLength 1 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hc_bp_adv_raw.g.vcf

Call 4:
I then tried adding both --minPruning 1 --minDanglingBranchLength 1 and -kmerSize 12 all at once, and I threw in a --min_mapping_quality_score 5. I maybe did slightly better than in Calls 1-3. I did actually lose 1 SNP compared to Calls 1-3, but I got most of the additional SNPs I got from using Call 3, as well as some of the SNPs I got from using Call 2.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T HaplotypeCaller \
    -I $inBam \
    -ploidy 1 \
    -nct 1 \
    -bamout ${inBam%.bam}_hailMary_raw.bam \
    -forceActive \
    -disableOptimizations \
    -dontTrimActiveRegions \
    --genotyping_mode DISCOVERY \
    --emitRefConfidence BP_RESOLUTION \
    --interval_padding 500 \
    --intervals $outDir/tmp.intervals.bed \
    --min_base_quality_score 5 \
    --min_mapping_quality_score 10 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    --minPruning 1 \
    --minDanglingBranchLength 1 \
    -kmerSize 12 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_hailMary_raw.g.vcf

Call 5:
As I mentioned above, I've experienced better performance (or at least I've done a better job executing) with UnifiedGenotyper. I actually get the most SNPs called at the known SNP sites, in individuals where manual examination confirms a SNP.

java -Xmx4g -jar $gatkJar \
    -R $refFasta \
    -T UnifiedGenotyper \
    -I $inBam \
    -ploidy 1 \
    --output_mode EMIT_ALL_SITES \
    -glm BOTH \
    -dt NONE -dcov 0 \
    -nt 4 \
    -nct 1 \
    --intervals $outDir/tmp.intervals.bed \
    --interval_padding 500 \
    --min_base_quality_score 5 \
    --standard_min_confidence_threshold_for_calling 0 \
    --standard_min_confidence_threshold_for_emitting 0 \
    -minIndelCnt 1 \
    -A VariantType \
    -A SampleList \
    -A AlleleBalance \
    -A BaseCounts \
    -A AlleleBalanceBySample \
    -o $outDir/vcfFiles/${fileBaseName}_ug_emitAll_raw.vcf

I hope you're still with me :)

None of the above commands calls all of the SNPs that I (maybe naively) would expect. "Examples 1-3" in the first attached screenshot are three individuals with reads (two reads each) showing the alternate allele. The mapping quality score for each read is 60; the base quality scores at this position are 36 and 38 for individual #11, and between 48 and 61 for the other individuals. The reads are very clean upstream of this position: the next upstream SNP is ~80 bp away, and the downstream SNP at the position marked for "Examples 4-6" is ~160 bp away. Commands 1 and 2 do not elicit a SNP call for Examples 1-6, Command 3 gets the calls at both positions for individual 10, and Command 4 gets both calls for individual 10 and the upstream SNP for individual 11. Command 5 (UnifiedGenotyper) gets the alt allele called in all 3 individuals at the upstream position, and the alt allele called for individuals 10 and 12 at the downstream position. Note that in individual 11, there is only one read covering the downstream variant position, where UnifiedGenotyper missed the call.

Here is the vcf output for those two positions from each command. Note that there are more samples in the per-sample breakdown for the FORMAT tags. The last three groups of FORMAT tags correspond to the three individuals I've shown in the screenshots.

Command 1 output

Examples 1-3    649036  .   G   .   .   .   AN=11;DP=22;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:84    0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:89    0:0:2:0 0:0:2:0 0:0:2:0
Examples 4-6    649160  .   C   .   .   .   AN=11;DP=21;VariantType=NO_VARIATION;set=ReferenceInAll GT:AD:DP:RGQ    .   .   .   .   .   .   .   .   .   .   .:0:0:0 0:0:2:0 0:2:2:89    0:0:2:0 0:2:2:0 0:0:2:0 0:2:2:71    0:0:2:0 0:2:2:44    0:0:2:0 0:0:1:0 0:0:2:0

Command 2 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hc_bp_kmer_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hc_bp_kmer_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hc_bp_kmer_raw.vcf   GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .       .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 3 output

Examples 1-3    649036  .   G   A   36.01   .   ABHom=1.00;AC=3;AF=0.273;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:66:66,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    0:0:2:.:.:0 0:2:2:.:.:89    0:0:2:.:.:0 0:0:2:.:.:0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   ABHom=1.00;AC=1;AF=0.091;AN=11;DP=20;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hc_bp_adv_raw.vcf    GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    0:0:2:.:.:0 0:2:2:.:.:44    0:0:2:.:.:0 0:0:1:.:.:0 0:0:2:.:.:0

Command 4 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=6;AF=0.545;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_hailMary_raw.vcf  GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 1:0,2:.:56:56,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:84    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 0:2:2:.:.:89    1:0,1:.:45:45,0 1:0,2:.:88:88,0 0:0:2:.:.:0
Examples 4-6    649160  .   C   A   13.22   .   AC=3;AF=0.273;AN=11;DP=18;MLEAC=1;MLEAF=1.00;MQ=60.00;OND=1.00;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_hailMary_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_hailMary_raw.vcf GT:AD:DP:GQ:PL:RGQ  .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .:0:0:.:.:0 0:0:2:.:.:0 0:2:2:.:.:89    1:0,1:.:43:43,0 0:2:2:.:.:0 0:0:2:.:.:0 0:2:2:.:.:71    1:0,0,1:.:37:37,0   0:2:2:.:.:44    1:0,1:.:34:34,0 0:0:1:.:.:0 0:0:2:.:.:0

Command 5 output

Examples 1-3    649036  .   G   A   26.02   .   ABHom=1.00;AC=7;AF=0.636;AN=11;DP=22;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=2.303;Samples=qHZT-12c-s02_r2657_p4096_dJ-002,qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-011,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-011_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 1:0,2:2:56:56,0 0:.:2   1:0,2:2:99:117,0    0:.:2   1:0,2:2:99:122,0    0:.:2   1:0,2:2:67:67,0 0:.:2   1:0,2:2:99:110,0    1:0,2:2:84:84,0 1:0,2:2:99:127,0
Examples 4-6    649160  .   C   A   46  .   ABHom=1.00;AC=5;AF=0.455;AN=11;DP=21;Dels=0.00;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;Samples=qHZT-12c-s02_r2657_p4096_dJ-004,qHZT-12c-s02_r2657_p4096_dJ-006,qHZT-12c-s02_r2657_p4096_dJ-008,qHZT-12c-s02_r2657_p4096_dJ-010,qHZT-12c-s02_r2657_p4096_dJ-012;VariantType=SNP;set=qHZT-12c-s02_r2657_p4096_dJ-004_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-006_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-008_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-010_merged_sorted_ug_emitAll_raw.vcf-qHZT-12c-s02_r2657_p4096_dJ-012_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. 0:.:2   0:.:2   1:0,2:2:76:76,0 0:.:2   1:0,2:2:70:70,0 0:.:2   1:0,1:2:37:37,0 0:.:2   1:0,2:2:60:60,0 0:.:1   1:0,2:2:75:75,0

There are many more examples of missed SNP calls. When using the HaplotypeCaller, I'm missing ~23% of the SNP calls. So...what can I do to tweak my variant detection pipeline so that I don't miss so many SNP calls?

As I mentioned, I'm currently getting better results with the UnifiedGenotyper walker. I'm only missing about 2% of all Alt SNP calls. Also, about half of that 2% are improperly being genotyped as Ref by Command #5. It appears to me that most of the variant calls I'm missing using the UnifiedGenotyper are at positions where I only have a single Sanger read covering the base, and the base quality score starts to fall below 25 (such as in individual #11 in the first attached screen shot, base quality score was 20). Attached is a second IGV screenshot of a different locus where I've also missed SNP calls using Command 5 (Examples 7-9). I've also included the read details for those positions, as well as the VCF file output from Command 5. I have seen at least one instance where I had two Sanger reads reporting an alternate allele, however, UG did not call the variant. In that case though, the base quality scores in both reads were very low (8); mapping quality was 60 for both reads.

Does anyone have any suggestions as to how I might alter any of the parameters to reduce (hopefully eliminate) the missed SNP calls? I think I would accept false positives over false negatives in this case. Or does anyone have any other idea as to what my problem might be?

Thanks so much!
Matt Maurer

Command 5 output for second screen shot file:

VCF output
The samples shown in the second attached screen shot correspond to the 11th and 12th groupings in the per-sample breakdown of the FORMAT tags.

Examples 7-8    163422  .   G   C   173 .   ABHom=1.00;AC=2;AF=0.182;AN=11;DP=14;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:54:54,0 0:.:1   ./. 0:.:1   0:.:1   ./. ./. ./. ./. ./. ./. ./. ./. ./../.  ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
Example 9   163476  .   A   G   173 .   ABHom=1.00;AC=2;AF=0.167;AN=12;DP=15;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=1.00;MQ0=0;SOR=1.609;Samples=qHZT-08a-s02_r2657_p4094_dJ-002,qHZT-08a-s02_r2657_p4094_dJ-008;VariantType=SNP;set=qHZT-08a-s02_r2657_p4094_dJ-002_merged_sorted_ug_emitAll_raw.vcf-qHZT-08a-s02_r2657_p4094_dJ-008_merged_sorted_ug_emitAll_raw.vcf  GT:AD:DP:GQ:PL  0:.:1   1:0,4:4:99:203,0    0:.:1   0:.:1   0:.:1   0:.:1   0:.:1   1:0,1:1:57:57,0 0:.:1   0:.:1   0:.:1   0:.:1   .   .   .   .   .   .   .   .   .   .   .

Also, why are the GTs sometimes "./." as they are for site 163422, and sometimes "." as they are for site 163476?

Read Details:

Example 7

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
Sample = qHZT-08a-s02_r2657_p4094_dJ-011
Read group = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
----------------------
Location = 163,422
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 23
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example 8

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
Sample = qHZT-08a-s02_r2657_p4094_dJ-011
Read group = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
----------------------
Location = 163,476
Alignment start = 163,293 (+)
Cigar = 34S833M1D72M1I50M1I9M
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = G
Base phred quality = 15
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-011_pcrP1_oMM575_2014-11-27_A11
NM = 20
AS = 858
XS = 0
-------------------

Example 9

Read name = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
Sample = qHZT-08a-s02_r2657_p4094_dJ-012
Read group = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
----------------------
Location = 163,422
Alignment start = 163,329 (+)
Cigar = 67S16M1D181M1D634M1D9M1I8M1I62M1D17M4S
Mapped = yes
Mapping quality = 60
Secondary = no
Supplementary = no
Duplicate = no
Failed QC = no
----------------------
Base = C
Base phred quality = 18
----------------------
RG = qHZT-08a-s02_L_r2657_p4094_dJ-012_pcrP1_oMM575_2014-11-27_A12
NM = 87
AS = 480
XS = 0
-------------------
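To double-check that a read really spans a site, the reference span can be derived from the alignment start and the CIGAR string shown in the read details above. This is a rough sketch under the SAM convention that M, D, N, = and X consume reference bases; the helper names `reference_end` and `covers` are invented here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference

def reference_end(start, cigar):
    """Return the last 1-based reference position covered by the alignment."""
    span = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return start + span - 1

def covers(start, cigar, pos):
    """True if pos lies within the aligned span (deleted or skipped bases included)."""
    return start <= pos <= reference_end(start, cigar)

# Alignment start and CIGAR taken from the Example 7 read details.
print(reference_end(163293, "34S833M1D72M1I50M1I9M"))
print(covers(163293, "34S833M1D72M1I50M1I9M", 163422))
```

Note that a position inside a deletion counts as "covered" here even though the read has no base there.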

AD in VCF doesn't match BAM

Hello,

I'm using GATK to call variants in my RNA-Seq data. I'm noticing something strange, perhaps someone can help? For a number of sites the VCF is reporting things I cannot replicate from BAMs. How can I recover the reads that contribute to a variant call? Here is an example for 1 site in 1 sample, but I've observed this at many sites/samples:

$ grep 235068463 file.vcf 
chr1    235068463   .   T   C   1795.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-3.530;ClippingRankSum=-0.535;DP=60;FS=7.844;MLEAC=1;MLEAF=0.500;MQ=60.00;MQ0=0;MQRankSum=0.401;QD=29.93;ReadPosRankSum=3.557   GT:AD:DP:GQ:PL  0/1:5,55:60:44:1824,0,44

60 reads, 5 T, 55 C.
But loading the bam in IGV, I do not see any T reads. Similarly:

$ samtools view -uh file.md.realn.bam chr1:235068463-235068463 |samtools mpileup - |grep 235068463
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1    235068463   N   60  cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC    >CA@B@>A>BA@BCABACCC:@@ACABBBCAACBBCABCB@CABBAB?>A?CBBAAAABA

There are just 60 C's at that location. How do I decide what the genotype here is? C/C or C/T ?
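For comparing against the AD field, the bases column of an mpileup line can be counted per allele. The following is a minimal sketch, assuming the standard pileup encoding (`.`/`,` for reference matches, `^` plus one character for a read start, `$` for a read end, `+n`/`-n` followed by n bases for indels); the function name `count_pileup_bases` is hypothetical.

```python
from collections import Counter

def count_pileup_bases(bases, ref_base):
    """Count the bases observed at one pileup column, ignoring strand."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":            # read start marker plus a mapping-quality char
            i += 2
            continue
        if ch == "$":            # read end marker
            i += 1
            continue
        if ch in "+-":           # indel: sign, length digits, then that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if ch in ".,":           # match to the reference base
            counts[ref_base.upper()] += 1
        elif ch not in "*<>#":   # skip deletion/ref-skip placeholders
            counts[ch.upper()] += 1
        i += 1
    return counts

# Bases column copied from the mpileup output above (reference shown as N).
column = "cCCccccCCCcccccCcccccccccCCCccCCCCCcCcccccCCCcCcCCccCCCCccCC"
print(count_pileup_bases(column, "N"))
```

A pileup counts raw aligned bases, so it is only a view of the BAM as given to samtools; it is not necessarily the set of reads a caller used after its own processing.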

For methodology I'm using gatk/3.2.0. I tried using HC from gatk/3.3.1 and got the same result. The bam and vcf files come from the final two lines:
-2 pass STAR
-Mark Dups
-SplitNCigarReads
-RealignerTargetCreator
-IndelRealigner
-BaseRecalibrator
-PrintReads
-MergeSamFiles.jar
-Mark Dups
-RealignerTargetCreator
-IndelRealigner
-HaplotypeCaller

Thanks,
Kipp

The VCF file generated by GATK HaplotypeCaller does not contain SOR information.

I am using GATK HaplotypeCaller to call variants with the following command:
java -Xmx20g -jar GenomeAnalysisTK.jar -l INFO -R hg19.fa -T HaplotypeCaller -nct 16 -I D-2.realigned.recal.bam -I D-3.realigned.recal.bam -I D-4.realigned.recal.bam --dbsnp hg19_GATK_snp137.vcf -o D-2_D-3_D-4.raw.vcf -A StrandOddsRatio -A AlleleBalance -A BaseCounts -A StrandBiasBySample -A FisherStrand

However, there is a problem in the VCF file generated by this command: SOR is defined in the header, but the annotation does not appear in the variant records.
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
chr1 13116 rs201725126 T G 94.57 . AC=1;AF=0.250;AN=4;BaseQRankSum=1.754;DB;DP=7;FS=0.000;MLEAC=1;MLEAF=0.250;MQ=29.47;MQ0=0;MQRankSum=-1.754;QD=23.64;ReadPosRankSum=-0.550 GT:AD:GQ:PL:SB 0/0:3,0:9:0,9,191:0,0,0,0 0/1:1,3:44:123,0,44:0,0,0,0 ./.

I have tried VariantAnnotator but still got the same problem.
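A quick way to confirm which annotations are declared in the header but absent from a given record is to compare the two sets. This is a sketch only: the function names are made up, the record is shortened from the one above, and it assumes the fixed VCF columns contain no embedded whitespace so that the INFO field is the eighth whitespace-separated token.

```python
import re

def declared_info_ids(header_lines):
    """Collect the IDs of all ##INFO header definitions."""
    ids = set()
    for line in header_lines:
        match = re.match(r"##INFO=<ID=([^,>]+)", line)
        if match:
            ids.add(match.group(1))
    return ids

def info_keys(record_line):
    """Return the INFO keys present in one record (flags such as DB included)."""
    info = record_line.split()[7]
    return {item.split("=", 1)[0] for item in info.split(";")}

def declared_but_absent(header_lines, record_line):
    return declared_info_ids(header_lines) - info_keys(record_line)

header = [
    '##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">',
    '##INFO=<ID=FS,Number=1,Type=Float,Description="placeholder description">',
]
record = "chr1 13116 rs201725126 T G 94.57 . AC=1;AF=0.250;DB;DP=7;FS=0.000 GT:AD:GQ:PL:SB 0/0:3,0:9:0,9,191:0,0,0,0"
print(declared_but_absent(header, record))
```

Running this over every record would show whether SOR is missing everywhere or only at some sites.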

Could you please tell me where the problem lies and how to solve it?

Thanks !

HaplotypeCaller bugs

Hi,

I have tried to solve several issues that came up while trying to run the HaplotypeCaller. For this one, I didn't find anything on Google; to be honest, pasting the error into a search doesn't even turn up anything similar.

ERROR MESSAGE: Badly formed genome loc: Contig NC_007605 given as location, but this contig isn't present in the Fasta sequence dictionary

Can anyone please tell me what's the problem here? The fasta file I got was the one downloaded from the bundle: human_g1k_v37.fasta.gz
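One way to narrow this down is to list the contig names the reference actually provides and compare them with the contig named in the error. The sketch below assumes a samtools-style `.fai` index (contig name in the first tab-separated column); the function names and the index rows are illustrative, not copied from a real index.

```python
def contigs_in_fai(fai_lines):
    """Return the set of contig names listed in a FASTA index (.fai)."""
    return {line.split("\t")[0] for line in fai_lines if line.strip()}

def missing_contigs(requested, fai_lines):
    """Return the requested contigs that the reference index does not contain."""
    known = contigs_in_fai(fai_lines)
    return [name for name in requested if name not in known]

# Illustrative index rows: name, length, offset, bases per line, bytes per line.
example_fai = [
    "1\t249250621\t52\t60\t61\n",
    "2\t243199373\t253404903\t60\t61\n",
]
print(missing_contigs(["1", "NC_007605"], example_fai))
```

With a real index, the lines would come from opening the `.fai` file that sits next to the FASTA; the requested names would come from wherever the contig is being supplied (for example the BAM header or an interval list).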

Any help would be really appreciated. Thank you!!

HaplotypeCaller DP reports low values

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values in comparison to reads visible in the igv browser and reported by the UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0
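The same tally can be sketched in Python, which makes the window and the columns explicit. It mirrors the awk filter above (RNAME in column 3, POS in column 4, MAPQ in column 5, strict inequalities); the function name `mapq_tally` and the SAM lines in the usage example are invented for illustration.

```python
from collections import Counter

def mapq_tally(sam_lines, contig, center, half_window=100):
    """Count alignments per MAPQ whose start lies within the window around center."""
    tally = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[3])
        if fields[2] == contig and center - half_window < pos < center + half_window:
            tally[int(fields[4])] += 1
    return tally

# Made-up SAM records, truncated after the MAPQ column.
sam = [
    "@HD\tVN:1.4",
    "read1\t0\tchr17\t41245400\t70",
    "read2\t16\tchr17\t41245500\t70",
    "read3\t0\tchr17\t41245450\t0",
    "read4\t0\tchr17\t41240000\t70",
]
for mapq, n in mapq_tally(sam, "chr17", 41245466).most_common():
    print(n, mapq)
```

As with the awk version, only the alignment start is tested, so reads that begin upstream of the window but still overlap the site are not counted.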

With other data generated in our lab, that has ~200x coverage and the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and igv.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,

Sam

Periodicity in variant calling quality - is this normal?

After applying the standard RNA-Seq pipeline (with STAR, etc.) I called variants with the command:

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R chromosome.fa \
    -I ./final.bam \
    -dontUseSoftClippedBases \
    --variant_index_type LINEAR \
    --variant_index_parameter 128000 \
    --emitRefConfidence GVCF -o ./final.gvcf

On the resultant gVCF file, I ran a little python script to see the distribution of calling quality across the different called genotypes:

  • x-axis is quality score rounded to the nearest integer
  • y-axis is the number of variants at that quality score

    [Plot not preserved: number of variants (y-axis) at each quality score (x-axis), by called genotype]

As you can see, it's mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice.
What I didn't expect, however, is the periodicity. Is that normal?
Presumably I now need to filter these variants at some quality score threshold, but from this plot I really don't know where. 0? 50? 75?

Code to generate this data:

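#!/usr/bin/env python2.7
from __future__ import print_function
import collections

# counts[genotype][QUAL truncated to an integer] -> number of records
counts = collections.defaultdict(lambda: collections.defaultdict(int))
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf') as f:
    for line in f:
        if line.startswith('#'): continue      # skip header lines
        fields = line.rstrip('\n').split('\t')
        if fields[5] == '.': continue          # skip records without a QUAL value
        gt = fields[9][:3]                     # first three characters of the sample column, e.g. 0/1
        # A nested defaultdict also counts the first record seen for each
        # genotype, which the earlier try/except version silently dropped.
        counts[gt][int(float(fields[5]))] += 1
for gt, qualities in counts.items():
    print('\n' + gt)
    for qual, count in sorted(qualities.items()):
        print(qual, count)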

Haplotype caller for RNA-seq

Hi,
I am trying to call SNPs from RNA-seq data. Each sample is a pool (thousands of shellfish larvae pooled together to get enough RNA), and I have 6 such samples. Can I use GATK to call SNPs in these pooled samples?

Hamdi

HaplotypeCaller Error: Mismatch between the reference haplotype and reference assembly graph path

Hi, I have been running HaplotypeCaller on >700 monkey alignments and came across this error in some intervals:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
java.lang.IllegalStateException: Mismatch between the reference haplotype and the reference assembly graph path. for graph BaseGraph{kmerSize=10} graph = GGAATAACTCCAGGCAACCA
GTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTT
CCCACAGGCACAGCCC haplotype = CCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTT
CCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.LocalAssemblyEngine.sanityCheckReferenceGraph(LocalAssemblyEngine.java:396)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.LocalAssemblyEngine.sanityCheckGraph(LocalAssemblyEngine.java:378)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.LocalAssemblyEngine.runLocalAssembly(LocalAssemblyEngine.java:135)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.assembleReads(HaplotypeCaller.java:751)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:672)
        at org.broadinstitute.sting.gatk.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:136)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:665)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:661)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
        at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:260)
        at org.broadinstitute.sting.gatk.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:80)
        at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:100)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:301)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version nightly-2013-05-17-g2c8b717):
##### ERROR
##### ERROR Please check the documentation guide to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Mismatch between the reference haplotype and the reference assembly graph path. for graph BaseGraph{kmerSize=10} graph = GGAATAACTCCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC haplotype = CCAGGCAACCAGTTCCAGCCGCCTCCTCCCTGTCTCCTTCAAGGTTCCCTTCCTCTACCTGCAATTTACAACCTCAGTGGTTCCCCAGGGCTCTGTCCTGCGCCCTCAGTGCTTCCCTTCTGCACGTTTTCCCAGGCAATCTCTTCCTGCCTCTGGGCACCAACTCCATCCGTATAGAGATAGTTCCCACAGGCACAGCCC
##### ERROR ------------------------------------------------------------------------------------------

My commandline looks like (omitting long list of bam files):

java -Xms6000m -Xmx8000m -XX:PermSize=1500m -XX:MaxPermSize=2000m -jar gatk2Jar/GenomeAnalysisTK.jar --reference_sequence reference/3280_vervet_ref_6.0.3.fasta -T HaplotypeCaller --unsafe --validation_strictness SILENT --read_filter BadCigar --num_threads 1 -L:bed folder/Scaffold84_line_1064463_1069462_bed.tsv --out NewCaller/Scaffold84_1064463_1069462.orig.vcf --heterozygosity 0.01 --minPruning 2 --downsample_to_coverage 40 --downsampling_type BY_SAMPLE -I ...

downsample_to_coverage in HaplotypeCaller

Hello,
I have a question about downsample_to_coverage in HaplotypeCaller. I found that -dcov cannot be used with HaplotypeCaller, so I tried changing the values of maxReadsInRegionPerSample and minReadsPerAlignStart to change the coverage level, but the coverage in the result files is still at the default level.
So I want to ask: which parameter in HaplotypeCaller changes the coverage level? If it is the two parameters above, how can I increase the downsampling coverage?

Haplotype Caller calls variants where there are no reads?

Basically, I'm seeing variants called where I have no reads. I'm not sure why, but maybe the developers know?

All the data needed to replicate this can be found at http://ac.gt/haplotypecaller.tar.gz

I saw this when I created the g.vcf on the full BAM file with the full genome.fa, but I also tried re-making the g.vcf with just the reads around the variant and a genome.fa of just chromosome 1. It gave the same result, so the above link contains just that data, with which you can re-run:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ./genome.fa -I ./input.bam --emitRefConfidence GVCF -o ./output.g.vcf

HaplotypeCaller doesn't call true variants which are located on the outside of duplicated reads

Hi,

I was running the haplotypeCaller for many samples, but some variants (validated as true positives by using other techniques) within these samples are not called by the haplotypeCaller. I saw in the bam files that most of these variants are located on the outside of duplicated reads (around 200 reads). Most of my data consists of duplicated reads. First I thought that the duplicated reads were filtered out by the read filters which are automatically applied (like duplicateReadFilter), but when I checked it this was not the case. I was wondering why my true variants are not called by the HaplotypeCaller and if there is an option to resolve this problem?

Thank you!
