Channel: haplotypecaller — GATK-Forum

HELP: haplotypecaller doesn't call the driver oncogenic variant!


Hi, I'm using GATK HaplotypeCaller to detect variants in a tumor sample analyzed by Illumina WES (2x100 bp paired-end). I know the KRAS p.G12 variant (chr12:25398284) is present in my sample (it was previously seen by Sanger sequencing, and KRAS is the oncogenic driver event in this kind of tumor). I aligned the fastq files with bwa after quality control and adapter trimming. In my bam file I can clearly see the variant at genomic coordinate chr12:25398284 when I view it in IGV.

However, GATK HaplotypeCaller does not call the variant. This is my basic command line:

java -Xmx4g -jar /opt/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37.fasta -I my.bam -stand_call_conf 0 -stand_emit_conf 0 -o output.file

I have also tried tuning a lot of parameters, such as -stand_call_conf, -stand_emit_conf, --heterozygosity, --min_base_quality_score and so on.

This is the pileup for coordinate chr12:25398284 in my bam file:

12 25398284 C 55 ,,T.,,...,,,,,,T.......,t,t.....,,,t..,,,,,,,,,,,,.^],^],^],^], BB@ECAFCECBCBBB@DBCCCDDCABADBDBCCDD@BBEADDEDBCADBB@@BAC

Both the base qualities and the mapping qualities are good, but HaplotypeCaller does not determine any active region around position chr12:25398284.
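One way to inspect what HaplotypeCaller is doing around that position, purely as a debugging sketch (the interval size and output file names below are my own placeholders, and the flags are the ones available in recent 3.x versions), is to restrict the run to a small window, force the region to be treated as active, and write out the reads as HaplotypeCaller realigns them:

java -Xmx4g -jar /opt/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R human_g1k_v37.fasta \
    -I my.bam \
    -L 12:25398184-25398384 \
    --forceActive \
    --disableOptimizations \
    -bamout debug_realigned.bam \
    -o debug.vcf

Comparing the original reads with debug_realigned.bam should show whether the site is lost before or after local reassembly.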

Any suggestion as to why the variant is not being called?

Many thanks, Valentina


Supplying a bed file to GATK haplotype caller


Hello,

I tried to find this option but was unsuccessful. I would like to call SNPs only on a subset of my bam files, that is, only on specific chromosomes. So far I have done this by subsetting the bam file first. Is there a way to provide HaplotypeCaller with a bed file listing the chromosomes I want for SNP calling, so I can skip the subsetting step?
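For illustration, this is the kind of command I have in mind (file names are placeholders), assuming -L can take a BED file or an interval list so that no subsetting of the BAM is needed:

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fa \
    -I my_sample.bam \
    -L my_chromosomes.bed \
    -o my_sample.raw.vcf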

Thanks, Lucia

vcf file from a pooled sample


Hello,

I am having difficulty understanding the vcf output of HaplotypeCaller when I use it on a pool of two individuals. I set the ploidy to 4 in this case.

When the ploidy is 2, one gets heterozygous genotypes as 0/1, which is understandable. With ploidy equal to 4, how can I interpret a genotype such as (GT:AD:DP:GQ:PL 0/0/0/1:8,3:11:5:95,0,5,24,232), given that this is a pooled sample and it is not possible to know the individual genotypes? I was just wondering what this means.

Thank you, Homa

HaplotypeCaller step takes an unusually long time


Hi, I have been using the Best Practices pipeline for Illumina samples for quite a long time and everything has been fine. Recently I started using the same pipeline for Ion Torrent samples, and everything was fine until the HaplotypeCaller step. I only have two samples to run through HaplotypeCaller: one bam file is around 9 GB and the other around 13 GB. But this step is really slow; it would need around 24 weeks to finish. That seems very strange, because when I run HaplotypeCaller on a family of 9 people (Illumina) it only takes around 9 days. How can these two files need 24 weeks? Does anyone have an idea what may cause this problem? The bam files of those 9 people are around the same size as the two Ion Torrent samples.

Thank you.

genotype calling in gatk


Excuse me:

After calling genotypes with HaplotypeCaller in GATK, I manually checked the reads covering the variant sites and found one exception: the genotype of one sample was 0/0 (wild type), but the pileup for this site was .,,,..,,c.,c.c,.cccc, whereas I would expect almost all of the reads to be . or , for a 0/0 call. Is this normal?

Many thanks in advance!

HaplotypeCaller on haploid genomes


I'm trying to run HaplotypeCaller on a haploid organism. Is this possible? What argument should I use for this? My first attempt produced diploid calls.
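For illustration, this is the kind of command I have in mind; I assume -ploidy (--sample_ploidy) is the relevant argument, and the file names are placeholders:

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fa \
    -I haploid_sample.bam \
    -ploidy 1 \
    -o haploid_calls.vcf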

Sorry for the silly question

When should I use HC in ERC mode?


Hi there, maybe this is a very basic question, but I've read several posts and sections of the GATK Best Practices and tutorials and I still don't get the point. What is the main difference between running GATK HC in GVCF mode versus multi-sample mode? I know that HC in GVCF mode is used for variant discovery on cohorts of samples, but what exactly is meant by "cohorts of samples"? If I have two groups of samples, one WT and the other mutant, should I use GVCF mode? I've read almost all of the GATK tutorials and howtos and still can't work it out.

And another question: can I give HC more than one bam file, perhaps by using -I several times? And if I pass in several files, what will I find in the raw vcf output: several columns, one per sample?
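To make the two scenarios I'm asking about concrete, here is a sketch of what I understand each mode to look like (sample and file names are placeholders, and this is only my reading of the documentation):

# multi-sample mode: all BAMs in one run, one genotype column per sample in the output VCF
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa \
    -I wildtype1.bam -I wildtype2.bam -I mutant1.bam -I mutant2.bam \
    -o joint_calls.vcf

# GVCF mode: one gVCF per sample, then joint genotyping of all the gVCFs
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa \
    -I wildtype1.bam --emitRefConfidence GVCF -o wildtype1.g.vcf
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fa \
    -V wildtype1.g.vcf -V wildtype2.g.vcf -V mutant1.g.vcf -V mutant2.g.vcf \
    -o joint_calls.vcf

(Depending on the GATK version, the gVCF indexing arguments used elsewhere in this thread may also be needed.)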

Thank you in advance.

HaplotypeCaller stopping midway without error, probably ram related


I'm running HaplotypeCaller on a series of samples using a while loop in a bash script, and for some samples HaplotypeCaller stops part way through the file. My command was:

java -Xmx18g -jar $Gpath/GenomeAnalysisTK.jar \
   -nct 8 \
    -l INFO \
    -R $ref \
    -log $log/$plate.$prefix.HaplotypeCaller.log \
    -T HaplotypeCaller \
    -I  $bam/$prefix.realign.bam \
    --emitRefConfidence GVCF \
    -variant_index_type LINEAR \
    -variant_index_parameter 128000 \
    -o $gvcf/$prefix.GATK.gvcf.vcf

Most of the samples completed and the output looks good, but for some I only have a truncated gvcf file with no index. When I look at the log it looks like this:


INFO  17:25:15,289 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:25:15,291 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.1-1-g07a4bf8, Compiled 2014/03/18 06:09:21
INFO  17:25:15,291 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO  17:25:15,291 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  17:25:15,294 HelpFormatter - Program Args: -nct 8 -l INFO -R /home/owens/ref/Gasterosteus_aculeatus.BROADS1.73.dna.toplevel.fa -log /home/owens/SB/C31KCACXX05.log/C31KCACXX05.sb1Pax102L-S2013.Hap
INFO  17:25:15,296 HelpFormatter - Executing as owens@GObox on Linux 3.2.0-63-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_17-b02.
INFO  17:25:15,296 HelpFormatter - Date/Time: 2014/06/10 17:25:15
INFO  17:25:15,296 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:25:15,296 HelpFormatter - --------------------------------------------------------------------------------
INFO  17:25:15,722 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:25:15,892 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO  17:25:15,898 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  17:25:15,942 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04
INFO  17:25:15,948 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO  17:25:15,993 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 12 processors available on this machine  
INFO  17:25:16,097 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  17:25:16,114 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:25:16,114 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:25:16,114 ProgressMeter -        Location processed.active regions  runtime per.1M.active regions completed total.runtime remaining
INFO  17:25:16,114 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
INFO  17:25:16,116 HaplotypeCaller - All sites annotated with PLs force to true for reference-model confidence output
INFO  17:25:16,278 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO  17:25:46,116 ProgressMeter - scaffold_1722:1180        1.49e+05   30.0 s        3.3 m      0.0%        25.6 h    25.6 h
INFO  17:26:46,117 ProgressMeter - scaffold_279:39930        1.37e+07   90.0 s        6.0 s      3.0%        50.5 m    49.0 m
INFO  17:27:16,118 ProgressMeter - scaffold_139:222911        2.89e+07  120.0 s        4.0 s      6.3%        31.7 m    29.7 m
INFO  17:27:46,119 ProgressMeter - scaffold_94:517387        3.89e+07    2.5 m        3.0 s      8.5%        29.2 m    26.7 m
INFO  17:28:16,121 ProgressMeter - scaffold_80:591236        4.06e+07    3.0 m        4.0 s      8.9%        33.6 m    30.6 m
INFO  17:28:46,123 ProgressMeter - groupXXI:447665        6.07e+07    3.5 m        3.0 s     13.3%        26.4 m    22.9 m
INFO  17:29:16,395 ProgressMeter -  groupV:8824013        7.25e+07    4.0 m        3.0 s     17.6%        22.7 m    18.7 m
INFO  17:29:46,396 ProgressMeter - groupXIV:11551262        9.93e+07    4.5 m        2.0 s     24.0%        18.7 m    14.2 m
WARN  17:29:52,732 ExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at groupX:1516679 has 8 alternate alleles so only the top alleles
INFO  17:30:19,324 ProgressMeter - groupX:14278234        1.15e+08    5.1 m        2.0 s     27.9%        18.1 m    13.0 m
INFO  17:30:49,414 ProgressMeter - groupXVIII:5967453        1.46e+08    5.6 m        2.0 s     33.0%        16.8 m    11.3 m
INFO  17:31:19,821 ProgressMeter - groupXI:15030145        1.63e+08    6.1 m        2.0 s     38.5%        15.7 m     9.7 m
INFO  17:31:50,192 ProgressMeter - groupVI:5779653        1.96e+08    6.6 m        2.0 s     43.8%        15.0 m     8.4 m
INFO  17:32:20,334 ProgressMeter - groupXVI:18115788        2.13e+08    7.1 m        1.0 s     50.1%        14.1 m     7.0 m
INFO  17:32:50,335 ProgressMeter - groupVIII:4300439        2.50e+08    7.6 m        1.0 s     55.1%        13.7 m     6.2 m
INFO  17:33:30,336 ProgressMeter - groupXIII:2378126        2.89e+08    8.2 m        1.0 s     63.1%        13.0 m     4.8 m
INFO  17:34:02,099 GATKRunReport - Uploaded run statistics report to AWS S3

It seems like it got about half way through and then stopped. I think it's a memory issue, because when I increased the RAM available to Java the problem happened less often, although I can't figure out why some samples work and others don't (nothing else on the machine is using RAM, and it isn't the biggest bam files that are failing). It's also strange that there doesn't seem to be an error message. Any insight into why this is happening and how to avoid it would be appreciated.
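One workaround I am considering (just a sketch; whether it actually helps with the memory behaviour is an assumption on my part) is to drop -nct and split the run by interval with -L, then combine the per-interval gVCFs afterwards:

java -Xmx18g -jar $Gpath/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R $ref \
    -I $bam/$prefix.realign.bam \
    -L groupI \
    --emitRefConfidence GVCF \
    -variant_index_type LINEAR \
    -variant_index_parameter 128000 \
    -o $gvcf/$prefix.groupI.GATK.gvcf.vcf

Here groupI stands for whichever contig or interval list is being processed in that run.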


What is the best practice for calling/combining variants across multiple RNA-Seq datasets


Hi, I am working with RNA-Seq data from 6 different samples. Part of my research is to identify novel polymorphisms. I have generated a filtered vcf file for each sample and would now like to combine these into a single vcf.

I am concerned about sites that were either not covered by the RNA-Seq data or that did not differ from the reference allele in some individuals but did in others. These sites will be 'missed' when HaplotypeCaller analyzes each sample individually and will not be represented in the downstream vcf files.

When the files are combined, what happens to these ‘missed’ sites? Are they automatically excluded? Are they treated as missing data? Is the absent data filled in from the reference genome?

Alternatively, can BaseRecalibrator and/or HaplotypeCaller analyze multiple bam files simultaneously?

Is it common practice to combine bam files for discovering sequence variants?
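For concreteness, the joint route I am wondering about would look roughly like this (a sketch only; file names are placeholders and I have left out any RNA-seq-specific options):

# one gVCF per sample
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa \
    -I sample1.recal.bam --emitRefConfidence GVCF -o sample1.g.vcf

# joint genotyping across all six samples, so sites variant in only some samples are still assessed in the others
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fa \
    -V sample1.g.vcf -V sample2.g.vcf -V sample3.g.vcf \
    -V sample4.g.vcf -V sample5.g.vcf -V sample6.g.vcf \
    -o cohort.vcf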

Workflow using HaplotypeCaller, GenotypeGVCFs, VQSR, and CalculateGenotypePosteriors


Hi,

I have recal.bam files for all the individuals in my study (these constitute 4 families), and each bam file contains data for one chromosome of one individual. I was wondering whether it is best to pass all the files for a single individual together when running HaplotypeCaller, whether doing so will increase the accuracy of the calling, or whether I can just run HaplotypeCaller on each bam file separately.

Also, I was wondering at which step I should use CalculateGenotypePosteriors, and whether it will clean up the calls substantially. VQSR already filters the calls, but I read that CalculateGenotypePosteriors takes pedigree files, which would be useful in my case. Should I run CalculateGenotypePosteriors after VQSR? Are there other relevant filtering or clean-up tools I should be aware of?
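To make the question concrete, the ordering I have in mind is the following (only a sketch; file names are placeholders and I am assuming the pedigree is supplied with -ped):

java -jar GenomeAnalysisTK.jar \
    -T CalculateGenotypePosteriors \
    -R reference.fa \
    -V vqsr_recalibrated.vcf \
    -ped families.ped \
    -o with_posteriors.vcf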

Thanks very much in advance,

Alva

Filtering a variant on SOR that looks otherwise well supported


Hi, I need to apply hard filters to my data. For lower-coverage calls I plan to use the FisherStrand (FS) annotation, and for higher-coverage calls SOR, using a JEXL expression to switch between them: DP < 20 ? FS > 50.0 : SOR > 3.

The variant call below (some annotations snipped), which is from a genotyped gVCF from HaplotypeCaller (using a BQSR'ed BAM file), looks well supported (high QD, high MQ, zero MQ0). However, there appears to be some strand bias (SOR=3.3):

788.77 . DP=34;FS=5.213;MQ=35.37;MQ0=0;QD=25.44;SOR=3.334 GT:AD:DP:GQ:PL 1/1:2,29:31:35:817,35,0

In this instance the filter example above would be applied.

My Question

Is this filtering out a true positive? And what kind of cut-offs should I be using for FS and SOR?

The snipped annotations ReadPosRankSum=-1.809 and BaseQRankSum=-0.844 also indicate that the evidence supporting this variant call is somewhat biased (the variant tends to appear near the ends of reads, in lower-quality bases, compared with the reads supporting the reference allele).

My goal

This is part of a larger hard filter I'm applying to a set of genotyped gVCFs called from HaplotypeCaller.

I'm filtering HomRef positions using this JEXL filter:

vc.getGenotype("%sample%").isHomRef() ? ( vc.getGenotype("%sample%").getAD().size == 1 ? (DP < 10) : ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || MQRankSum > 3.2905 || ReadPosRankSum > 3.2905 || BaseQRankSum > 3.2905 ) ) : false

And filtering HomVar positions using this JEXL:

vc.getGenotype("%sample%").isHomVar() ? ( vc.getGenotype("%sample%").getAD().0 == 0 ? ( ((DP - MQ0) < 10) || ((MQ0 / (1.0 * DP)) >= 0.1) || QD < 5.0 || MQ < 30.0 ) : ( BaseQRankSum < -3.2905 || MQRankSum < -3.2905 || ReadPosRankSum < -3.2905 || (MQ0 / (1.0 * DP)) >= 0.1 || QD < 5.0 || (DP < 20 ? FS > 60.0 : SOR > 3.5) || MQ < 30.0 || QUAL < 100.0 ) ) : false

My goal is a call set containing only true positive variants, and I have high-coverage data, so the filtering should be relatively stringent. Unfortunately I don't have a database I could use for VQSR, hence the comprehensive hard-filtering strategy.
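For completeness, a sketch of how such an expression would be applied with VariantFiltration (using the simpler site-level switch from above rather than the full per-sample expressions; file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R reference.fa \
    -V sample.genotyped.vcf \
    --filterExpression "DP < 20 ? FS > 50.0 : SOR > 3.0" \
    --filterName "StrandBiasSwitch" \
    -o sample.filtered.vcf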

HaplotypeCaller : Rod span ... isn't contained within the data shard without -nct option


Hello,

I have a problem with HaplotypeCaller.
I saw that this error has been reported as linked to the -nct option, but I did not use it.

Strangely enough, I think this command worked some time ago (before a server restart?).

All reference files I used are from the GATK bundle.

Sletort

Incorrect AD values in HC-called vcf and combined gvcf


Hi,

I am using GATK v3.2.2 following the recommended practices (... HC -> CombineGVCFs -> GenotypeGVCFs ...), and while looking through suspicious variants I came across a few hets with AD=X,0. Tracing them back, I found two inconsistencies (bugs?):

1) Reordering of genotypes when combining gvcfs while the AD values are kept intact, which leads to an erroneous AD for a heterozygous call. Also, I find it hard to understand why the 1 bp insertion is emitted in the gvcf at all; there are no reads supporting it:

  • single sample gvcf 1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9

  • combined gvcf 1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]

  • vcf
    1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]

2) Incorrect AD is taken while genotyping gvcf files:

  • single sample gvcf: 1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
  • combined gvcf 1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]

  • vcf
    1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]

I have found multiple such cases, and no errors or warnings in the logs. I also checked against calls I had made previously on these samples in a smaller batch. There the AD values were correct, but there were plenty of other hets with AD=X,0... I haven't looked closer into those.

Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I'm missing something obvious?

Best regards, Paweł

(howto) Call variants on a single diploid genome with the HaplotypeCaller


Objective

Call variants on a diploid genome with the HaplotypeCaller, producing a raw (unfiltered) VCF.

Caveat

This is meant only for single-sample analysis. To analyze multiple samples, see the Best Practices documentation on joint analysis.

Prerequisites

  • TBD

Steps

  1. Determine the basic parameters of the analysis
  2. Call variants in your sequence data

1. Determine the basic parameters of the analysis

If you do not specify these parameters yourself, the program will use default values. However we recommend that you set them explicitly because it will help you understand how the results are bounded and how you can modify the program's behavior.

  • Genotyping mode (--genotyping_mode)

This specifies how we want the program to determine the alternate alleles to use for genotyping. In the default DISCOVERY mode, the program will choose the most likely alleles out of those it sees in the data. In GENOTYPE_GIVEN_ALLELES mode, the program will only use the alleles passed in from a VCF file (using the -alleles argument). This is useful if you just want to determine whether a sample has a specific genotype of interest and you are not interested in other alleles (an example command is shown at the end of this step).

  • Emission confidence threshold (-stand_emit_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit sites that appear to be possibly variant.

  • Calling confidence threshold (-stand_call_conf)

This is the minimum confidence threshold (phred-scaled) at which the program should emit variant sites as called. If a site's associated genotype has a confidence score lower than the calling threshold, the program will emit the site as filtered and will annotate it as LowQual. This threshold separates high confidence calls from low confidence calls.

The terms called and filtered are tricky because they can mean different things depending on context. In ordinary language, people often say a site was called if it was emitted as variant. But in GATK's technical language, saying a site was called means that the site passed the confidence threshold test. Filtered is even more confusing: in ordinary language, saying that sites were filtered usually means they successfully passed a filtering test, whereas in GATK's technical language it means they failed the filtering test, i.e. they would be removed if the filter were used to actually exclude low-confidence calls from the callset instead of just tagging them. Both usages are valid depending on the point of view of the person reporting the results, so it is always important to check the context when interpreting results that use these terms.
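As an example of the GENOTYPE_GIVEN_ALLELES mode mentioned above, restricting genotyping to alleles from an existing VCF would look something like this (file names are placeholders):

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fa \
    -I preprocessed_reads.bam \
    --genotyping_mode GENOTYPE_GIVEN_ALLELES \
    -alleles known_alleles.vcf \
    -o genotyped_at_known_sites.vcf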


2. Call variants in your sequence data

Action

Run the following GATK command (the input BAM can be reduced or not):

java -jar GenomeAnalysisTK.jar \ 
    -T HaplotypeCaller \ 
    -R reference.fa \ 
    -I preprocessed_reads.bam \
    -L 20 \ 
    --genotyping_mode DISCOVERY \ 
    -stand_emit_conf 10 \ 
    -stand_call_conf 30 \ 
    -o raw_variants.vcf 

Note: This is an example command. Please look up what the arguments do and make sure they fit your analysis before copying it. To see how the -L argument works, refer to: http://gatkforums.broadinstitute.org/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals#latest

Expected Result

This creates a VCF file called raw_variants.vcf, containing all the sites that the HaplotypeCaller evaluated to be potentially variant. Note that this file contains both SNPs and Indels.

Although you now have a nice fresh set of variant calls, the variant discovery stage is not over. The distinction made by the caller itself between low-confidence calls and the rest is very primitive and should not be taken as a definitive guide for filtering. The GATK callers are designed to be very lenient in calling variants, so it is extremely important to apply one of the recommended filtering methods (variant recalibration or hard-filtering) in order to move on to downstream analyses with the highest-quality call set possible.
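For orientation only, a minimal hard-filtering sketch with VariantFiltration might look like the following; the thresholds shown are generic illustrations rather than recommendations, so refer to the filtering documentation for values appropriate to your data:

java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R reference.fa \
    -V raw_variants.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
    --filterName "basic_snp_filter" \
    -o filtered_variants.vcf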

HC ERC output question: "sample1"


Hi there,

This is maybe a silly question, but I want to know whether this is normal or not. I have just run HC in GVCF mode on each of my bam files. When I look at the output:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1

Is it normal to have "sample1" there instead of the bam file name or something like that? I noticed it because when I ran GenotypeGVCFs I only got one column, also named "sample1", instead of one column per input vcf (which I think is the normal output, isn't it?). So what I've done is change "sample1" in each vcf file to the corresponding sample name before running GenotypeGVCFs.
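In case it matters, my working assumption is that the sample column comes from the SM tag of the read groups in the BAM, so instead of editing the vcfs the name could be set before calling, for example with Picard (tag values below are placeholders):

java -jar picard.jar AddOrReplaceReadGroups \
    I=sample1.bam \
    O=sample1.rg.bam \
    RGID=lane1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=real_sample_name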

I hope I've explained myself more or less.

Thanks in advance


HaplotypeCaller output missing fields AC/AN


I'm a bit confused by the new GATK version and the new HC functions. I'm trying to call variants in a family of plants. I call variants with HaplotypeCaller, then I want to run Read-backed Phasing on the raw vcfs and then CalculateGenotypePosteriors to add pedigree information. The CalculateGenotypePosteriors walker seems to need the AC and AN fields, but they are not produced by HaplotypeCaller, although they used to be in earlier HC versions(?). How can I fix this? And is this a proper workflow at all? Is Read-backed Phasing still needed, or has it become redundant now that the new HC can do physical phasing? Would it be enough to run HC for phasing and CalculateGenotypePosteriors to add pedigree information? Either way, the problem of the missing AC and AN fields remains. I would be grateful for some help on this.
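For reference, the ordering I am now considering (an assumption on my part, not a confirmed fix) is to genotype the gVCF first, so that the usual site-level annotations are emitted, and only then add the pedigree information (file names are placeholders):

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fa \
    -V sample.g.vcf -o sample.genotyped.vcf

java -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R reference.fa \
    -V sample.genotyped.vcf -ped family.ped -o sample.posteriors.vcf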

This is what a raw gVCF produced by HC looks like:

fileformat=VCFv4.1

ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">

FILTER=<ID=LowQual,Description="Low quality">

FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">

FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">

FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another">

FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">

FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">

FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">

GATKCommandLine=<ID=HaplotypeCaller,Version=3.3-0-g37228af,Date="Fri Jan 30 12:04:00 CET 2015",Epoch=1422615840668,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/prj/gf-grape/project_FTC_in_crops/members/Nadia/test/GfGa4742_CGATGT_vs_candidategenes.sorted.readgroups.deduplicated.realigned.recalibrated.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/prj/gf-grape/project_FTC_in_crops/members/Nadia/amplicons_run3/GATK_new/RefSequences_all_candidate_genes.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=true bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 logging_level=INFO log_to_file=null help=false version=false likelihoodCalculationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamWriterType=CALLED_HAPLOTYPES disableOptimizations=false dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthPerSampleHC, StrandBiasBySample] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator, ChromosomeCounts, FisherStrand, StrandOddsRatio, QualByDepth] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GVCF annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=null exactcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true sample_name=null kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecoverDanglingBranches=false minDanglingBranchLength=4 consensus=false GVCFGQBands=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=2 gcpHMM=10 
includeUmappedReads=false useAllelesTrigger=false phredScaledGlobalReadMismappingRate=45 maxNumHaplotypesInPopulation=2 mergeVariantsViaLD=false doNotRunPhysicalPhasing=false pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrectKmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErrorCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE maxReadsInRegionPerSample=1000 minReadsPerAlignmentStart=5 activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSize=null bandPassSigma=null maxProbPropagationDistance=50 activeProbabilityThreshold=0.002 min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">

GVCFBlock=minGQ=0(inclusive),maxGQ=1(exclusive)

GVCFBlock=minGQ=1(inclusive),maxGQ=2(exclusive)

GVCFBlock=minGQ=10(inclusive),maxGQ=11(exclusive)

GVCFBlock=minGQ=11(inclusive),maxGQ=12(exclusive)

GVCFBlock=minGQ=12(inclusive),maxGQ=13(exclusive)

GVCFBlock=minGQ=13(inclusive),maxGQ=14(exclusive)

GVCFBlock=minGQ=14(inclusive),maxGQ=15(exclusive)

GVCFBlock=minGQ=15(inclusive),maxGQ=16(exclusive)

GVCFBlock=minGQ=16(inclusive),maxGQ=17(exclusive)

GVCFBlock=minGQ=17(inclusive),maxGQ=18(exclusive)

GVCFBlock=minGQ=18(inclusive),maxGQ=19(exclusive)

GVCFBlock=minGQ=19(inclusive),maxGQ=20(exclusive)

GVCFBlock=minGQ=2(inclusive),maxGQ=3(exclusive)

GVCFBlock=minGQ=20(inclusive),maxGQ=21(exclusive)

GVCFBlock=minGQ=21(inclusive),maxGQ=22(exclusive)

GVCFBlock=minGQ=22(inclusive),maxGQ=23(exclusive)

GVCFBlock=minGQ=23(inclusive),maxGQ=24(exclusive)

GVCFBlock=minGQ=24(inclusive),maxGQ=25(exclusive)

GVCFBlock=minGQ=25(inclusive),maxGQ=26(exclusive)

GVCFBlock=minGQ=26(inclusive),maxGQ=27(exclusive)

GVCFBlock=minGQ=27(inclusive),maxGQ=28(exclusive)

GVCFBlock=minGQ=28(inclusive),maxGQ=29(exclusive)

GVCFBlock=minGQ=29(inclusive),maxGQ=30(exclusive)

GVCFBlock=minGQ=3(inclusive),maxGQ=4(exclusive)

GVCFBlock=minGQ=30(inclusive),maxGQ=31(exclusive)

GVCFBlock=minGQ=31(inclusive),maxGQ=32(exclusive)

GVCFBlock=minGQ=32(inclusive),maxGQ=33(exclusive)

GVCFBlock=minGQ=33(inclusive),maxGQ=34(exclusive)

GVCFBlock=minGQ=34(inclusive),maxGQ=35(exclusive)

GVCFBlock=minGQ=35(inclusive),maxGQ=36(exclusive)

GVCFBlock=minGQ=36(inclusive),maxGQ=37(exclusive)

GVCFBlock=minGQ=37(inclusive),maxGQ=38(exclusive)

GVCFBlock=minGQ=38(inclusive),maxGQ=39(exclusive)

GVCFBlock=minGQ=39(inclusive),maxGQ=40(exclusive)

GVCFBlock=minGQ=4(inclusive),maxGQ=5(exclusive)

GVCFBlock=minGQ=40(inclusive),maxGQ=41(exclusive)

GVCFBlock=minGQ=41(inclusive),maxGQ=42(exclusive)

GVCFBlock=minGQ=42(inclusive),maxGQ=43(exclusive)

GVCFBlock=minGQ=43(inclusive),maxGQ=44(exclusive)

GVCFBlock=minGQ=44(inclusive),maxGQ=45(exclusive)

GVCFBlock=minGQ=45(inclusive),maxGQ=46(exclusive)

GVCFBlock=minGQ=46(inclusive),maxGQ=47(exclusive)

GVCFBlock=minGQ=47(inclusive),maxGQ=48(exclusive)

GVCFBlock=minGQ=48(inclusive),maxGQ=49(exclusive)

GVCFBlock=minGQ=49(inclusive),maxGQ=50(exclusive)

GVCFBlock=minGQ=5(inclusive),maxGQ=6(exclusive)

GVCFBlock=minGQ=50(inclusive),maxGQ=51(exclusive)

GVCFBlock=minGQ=51(inclusive),maxGQ=52(exclusive)

GVCFBlock=minGQ=52(inclusive),maxGQ=53(exclusive)

GVCFBlock=minGQ=53(inclusive),maxGQ=54(exclusive)

GVCFBlock=minGQ=54(inclusive),maxGQ=55(exclusive)

GVCFBlock=minGQ=55(inclusive),maxGQ=56(exclusive)

GVCFBlock=minGQ=56(inclusive),maxGQ=57(exclusive)

GVCFBlock=minGQ=57(inclusive),maxGQ=58(exclusive)

GVCFBlock=minGQ=58(inclusive),maxGQ=59(exclusive)

GVCFBlock=minGQ=59(inclusive),maxGQ=60(exclusive)

GVCFBlock=minGQ=6(inclusive),maxGQ=7(exclusive)

GVCFBlock=minGQ=60(inclusive),maxGQ=70(exclusive)

GVCFBlock=minGQ=7(inclusive),maxGQ=8(exclusive)

GVCFBlock=minGQ=70(inclusive),maxGQ=80(exclusive)

GVCFBlock=minGQ=8(inclusive),maxGQ=9(exclusive)

GVCFBlock=minGQ=80(inclusive),maxGQ=90(exclusive)

GVCFBlock=minGQ=9(inclusive),maxGQ=10(exclusive)

GVCFBlock=minGQ=90(inclusive),maxGQ=99(exclusive)

GVCFBlock=minGQ=99(inclusive),maxGQ=2147483647(exclusive)

INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">

INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">

INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">

INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">

INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">

INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">

INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">

INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">

INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">

INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">

INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">

INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">

INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">

contig=<ID=GSVIVT01012145001,length=8683>

contig=<ID=GSVIVT01012049001,length=18657>

contig=<ID=GSVIVT01012249001,length=14432>

contig=<ID=GSVIVT01011652001,length=6117>

contig=<ID=GSVIVT01011710001plu,length=4623>

contig=<ID=GSVIVT01012250001plu,length=27163>

contig=<ID=GSVIVT01011947001,length=3289>

contig=<ID=GSVIVT01011821001,length=7310>

contig=<ID=GSVIVT01011897001,length=5751>

contig=<ID=GSVIVT01022014001,length=6337>

contig=<ID=GSVIVT01011387001,length=11582>

contig=<ID=GSVIVT01036237001,length=18407>

contig=<ID=GSVIVT01036499001_CO,length=4568>

contig=<ID=GSVIVT01020232001,length=21274>

contig=<ID=GSVIVT01030735001,length=3570>

contig=<ID=GSVIVT01011433001,length=5349>

contig=<ID=GSVIVT01011939001,length=73679>

contig=<ID=GSVIVT01021854001,length=5609>

contig=<ID=GSVIVT01036549001plu,length=22905>

contig=<ID=GSVIVT01031112001,length=5884>

contig=<ID=GSVIVT01036551001plu,length=18328>

contig=<ID=GSVIVT01031354001,length=8603>

contig=<ID=GSVIVT01008655001_pl,length=4022>

contig=<ID=GSVIVT01031338001,length=6893>

contig=<ID=GSVIVT01019969001,length=5388>

contig=<ID=GSVIVT01032607001,length=8294>

contig=<ID=GSVIVT01010521001,length=19492>

contig=<ID=GSVIVT01036447001,length=6911>

contig=<ID=GSVIVT01010513001,length=23656>

contig=<ID=GSVIVT01033067001,length=28278>

reference=file:///prj/gf-grape/project_FTC_in_crops/members/Nadia/amplicons_run3/GATK_new/RefSequences_all_candidate_genes.fasta

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GfGa4742

GSVIVT01012145001 1 . G . . END=113 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
GSVIVT01012145001 114 . C . . END=164 GT:DP:GQ:MIN_DP:PL 0/0:172:99:164:0,120,1800
GSVIVT01012145001 165 . T C, 7732.77 . DP=175;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,173,0:173:99:0|1:165_T_C:7761,521,0,7761,521,7761:0,0,165,8
GSVIVT01012145001 166 . G . . END=166 GT:DP:GQ:MIN_DP:PL 0/0:174:72:174:0,72,1080
GSVIVT01012145001 167 . T . . END=175 GT:DP:GQ:MIN_DP:PL 0/0:174:66:174:0,60,900
GSVIVT01012145001 176 . T . . END=191 GT:DP:GQ:MIN_DP:PL 0/0:174:57:173:0,57,855
GSVIVT01012145001 192 . A . . END=194 GT:DP:GQ:MIN_DP:PL 0/0:173:54:173:0,54,810
GSVIVT01012145001 195 . T . . END=199 GT:DP:GQ:MIN_DP:PL 0/0:174:51:173:0,51,765

And this is the error message I get:

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Key AC found in VariantContext field INFO at GSVIVT01012145001:1 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.

HaplotypeCaller pooled sequence problem


Hi,

I have a number of samples, each consisting of multiple individuals from the same population pooled together, and I have been trying to use HaplotypeCaller to call the variants. I have set the ploidy to (2 * number of individuals), but I keep getting the same or a similar error message after the job has run for several hours or days:

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.3-0-g37228af):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: the combination of ploidy (180) and number of alleles (9) results in a very large number of genotypes (> 2147483647). You need to limit ploidy or the number of alternative alleles to analyze this locus
ERROR ------------------------------------------------------------------------------------------

and I'm not sure what I can do to rectify it... Obviously I can't limit the ploidy (it is what it is), and I thought that HC only allowed a maximum of six alternate alleles anyway?

My command is below; any help would be appreciated.

java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller \
-nct 6 \
-R ~/my_ref_sequence \
--intervals ~/my_intervals_file \
-ploidy 180 \
-log my_log_file \
-I ~/my_input_bam \
-o ~/my_output_vcf
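One variation I am considering (only a sketch; the error message suggests limiting the number of alternate alleles, but I don't know what value is actually sensible) is to lower --max_alternate_alleles:

java -Xmx24g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T HaplotypeCaller \
    -nct 6 \
    -R ~/my_ref_sequence \
    --intervals ~/my_intervals_file \
    -ploidy 180 \
    --max_alternate_alleles 2 \
    -log my_log_file \
    -I ~/my_input_bam \
    -o ~/my_output_vcf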

Illumina gVCFs


Hi GATK-ers,

I have been given ~2000 gVCFs generated by Illumina (one sample per gVCF). Though they are in standard gVCF format, they were generated by an Illumina pipeline (https://support.basespace.illumina.com/knowledgebase/articles/147078-gvcf-file if you're really curious) and not by HaplotypeCaller. As a result (I think...), GATK doesn't want to process them (I have tried CombineGVCFs and GenotypeGVCFs to no avail). Is there a GATK walker or some other tool that will make my gVCFs GATK-friendly? I need to be able to merge this data to make it analyzable, because in single-sample VCF form it's pretty useless at the moment.

My only other thought has been to expand all the reference blocks and then merge everything together, but this seems like it would create a massive amount of data.

Any suggestions you may have are greatly appreciated!!!

Sara

gVCF files look different for same sample


Hi,
I have noticed that every time I repeat a gVCF call on the same sample (same bam file), the output gVCF files are not exactly the same. They are almost identical, but there are a few differences here and there, and the Unix file sizes differ slightly as well. Is this something that is expected?

Shalabh Suman

Finding the exact reason for HaplotypeCaller reassembly


I'm working on an association mapping project in a non-model bird (so there is a reference genome, but it may have problems). We're looking for SNPs linked to a phenotype using paired-end GBS data, with HaplotypeCaller for SNP calling. We found a single SNP that explains a significant portion of the phenotypic variation, but when we actually look at it, it turns out to be an artifact of the HaplotypeCaller reassembly. There is a 10 bp insertion in the reference that is not present in this population, and when the reassembly happens the realignment produces a SNP. Looking at the sequence of the reads themselves, there is no SNP.

So the question is: what causes the reassembly in some samples and not others? I tried outputting the activity profile, but since it is smoothed it is hard to figure out exactly where the difference arises. Is it possible to output an unsmoothed activity profile? Are there other ways to figure out exactly how the active region is being picked?
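For reference, the kind of debugging run I have been doing looks like this (a sketch; file names and the interval are placeholders, and the flags are the ones exposed in recent 3.x versions), writing out both the activity profile and the realigned reads that come out of the reassembly:

java -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -L problem_scaffold:100000-110000 \
    --activityProfileOut activity_profile.igv \
    --activeRegionOut active_regions.igv \
    -bamout realigned_reads.bam \
    -o debug.vcf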

Thanks.
