Viewing all 1335 articles

per-sample DP is missing in called genotypes

I am trying to filter variant and non-variant sites together. As far as I can see, the only reasonable way to do this is to filter on the per-sample DP.
However, I noticed that a substantial fraction of called sites (~10%) have a missing value in the DP field (. instead of 0 or another value). Although many of the sites with missing DP values do have low coverage with bad mapping and are called ./., some sites have, in my view, good coverage. When I check the BAM file (the -bamout result) in IGV, I see many reads mapped to those sites with good quality (MQ=60). The genotype is usually called correctly, but for a reason that is unclear to me GATK doesn't report DP values. The AD values at such sites are usually 0,0. Interestingly, nearby sites do have DP values reported.
When I check the gVCF files, these sites usually have DP=0. Assuming a coverage of 0, I could discard these sites, but I can see in the BAM file that the coverage is not 0, and I do not want to throw away 10% of my data.
I notice that such sites tend to be homozygous for the ALT allele. I provide an example below: see the first sample at position 13388742 (1/1:0,0:.:3:1|1:13388738_T_C:45,3,0).
I generated the data using HaplotypeCaller in GVCF mode followed by GenotypeGVCFs.
Could you please tell me why this happens?

VCF:
scaffold_1 13388742 . A G 8529.41 . AC=35;AF=0.833;AN=42;BaseQRankSum=-1.730e-01;ClippingRankSum=-6.940e-01;DP=247;ExcessHet=0.4083;FS=9.162;InbreedingCoeff=0.1415;MLEAC=37;MLEAF=0.881;MQ=38.83;MQRankSum=2.51;QD=32.26;ReadPosRankSum=1.56;SOR=0.043 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,0:.:3:1|1:13388738_T_C:45,3,0 ./.:0,0:0 ./.:2,0:2 ./.:4,0:4 ./.:0,0:0 ./.:3,0:3 0/0:20,0:20:0:.:.:0,0,577 0/1:0,1:1:11:0|1:13388692_G_T:81,0,11 1/1:0,10:10:33:1|1:13388724_G_A:495,33,0 ./.:0,0:0 1/1:1,19:20:27:1|1:13388742_A_G:979,27,0 1/1:0,20:20:35:1|1:13388724_G_A:944,35,0 1/1:0,1:1:6:.:.:90,6,0 1/1:0,7:7:24:1|1:13388724_G_A:360,24,0 0/1:0,2:2:30:0|1:13388692_G_T:165,0,30 1/1:0,5:5:15:1|1:13388742_A_G:225,15,0 1/1:0,20:20:60:1|1:13388724_G_A:900,60,0 1/1:0,8:8:30:1|1:13388724_G_A:450,30,0 1/1:0,15:15:48:.:.:720,48,0 0/1:3,28:31:42:0|1:13388742_A_G:1167,0,42 ./.:3,0:3 1/1:0,1:1:3:1|1:13388687_C_T:45,3,0 1/1:0,2:2:12:1|1:13388692_G_T:180,12,0 ./.:4,0:4 ./.:6,0:6 1/1:0,6:6:18:1|1:13388724_G_A:270,18,0 ./.:4,0:4 1/1:0,14:14:45:1|1:13388724_G_A:675,45,0 1/1:0,3:3:9:1|1:13388692_G_T:135,9,0 0/0:21,0:21:0:.:.:0,0,533 1/1:0,14:14:45:1|1:13388724_G_A:655,45,0
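
A minimal Python sketch of how genotypes like the first one above could be listed, assuming whitespace- or tab-separated VCF columns with FORMAT in the ninth column. The function name `samples_missing_dp` is hypothetical, and the test record is the one above with the INFO field and the sample list shortened for readability.

```python
def samples_missing_dp(vcf_line):
    """Return 0-based indices of samples with a called GT but no DP value."""
    cols = vcf_line.split()
    keys = cols[8].split(":")  # FORMAT column, e.g. GT:AD:DP:GQ:PGT:PID:PL
    flagged = []
    for idx, sample in enumerate(cols[9:]):
        # Trailing FORMAT fields may be omitted (e.g. "./.:0,0:0"),
        # so pair the keys with whatever values are present.
        fields = dict(zip(keys, sample.split(":")))
        called = "." not in fields.get("GT", ".")
        if called and fields.get("DP", ".") == ".":
            flagged.append(idx)
    return flagged


# Samples 1, 2 and 7 of the record above; INFO shortened to DP=247.
record = (
    "scaffold_1 13388742 . A G 8529.41 . DP=247 GT:AD:DP:GQ:PGT:PID:PL "
    "1/1:0,0:.:3:1|1:13388738_T_C:45,3,0 ./.:0,0:0 0/0:20,0:20:0:.:.:0,0,577"
)
print(samples_missing_dp(record))  # [0]
```

Only the first sample is flagged: it has a called 1/1 genotype with DP reported as ".", while the ./. sample is uncalled and the 0/0 sample reports DP=20.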

gVCF:

scaffold_1 13388740    .   C   <NON_REF>   .   .   END=13388741    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,162
scaffold_1  13388742    .   A   G,<NON_REF> 31.82   .   DP=0;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;RAW_MQ=0.00    GT:GQ:PGT:PID:PL:SB 1/1:3:0|1:13388738_T_C:45,3,0,45,3,45:0,0,0,0
scaffold_1  13388743    .   A   <NON_REF>   .   .   END=13388743    GT:DP:GQ:MIN_DP:PL  0/0:5:15:5:0,15,214

Screenshot of a BAM:


What does one do with the raw VCF output of joint genotyping many gVCF files?

I have been following the best practices outline for calling SNPs on our samples, but I'm a little confused as to what to do with the VCF file produced following the joint genotyping/genotypeGVCFs step.

I understand the principle of gVCF calling for the most part, but my confusion is about what we are supposed to do with the VCF file once the joint genotyping step is done. We are looking at an F1 mapping population of a non-model organism, so does this VCF file have the individual progeny (BAM file names) indicated within it? I think not, since I can't find any of the sample names while scrolling through it.

Can this VCF file be used to construct a pedigree file to use during genotype refinement? Should it somehow be fed back into HaplotypeCaller to inform likely calls during a second round of variant calling? Do you use it to go back to the individual gVCF files to extract the high-confidence variants?

There seems to be a good amount of literature on the Broad websites about what a gVCF file is and how to perform joint genotyping, but not much direction about what to do with the joint genotyped VCF file once it is produced.

Any advice or referral to other walkthroughs/guides would be very appreciated.

Michael

[extra project information: My project involves calling SNPs across a mapping population for a non-model organism with the intent of mapping a trait. The goal is to produce robust SNP calls for each individual progeny (of which we have 30 currently, and >60 in the near future) and the two parents. We only have halfway-decent sequencing coverage of ~10-20x for each sample, which is why gVCF calling and joint genotyping sound attractive to us. Since we work on a non-model organism, we also lack previously produced "gold standard" SNP sets or other resources that would allow us to refine genotypes.]

What does each data thread stand for in HaplotypeCaller

Hi,
I'm using multi-threading for HaplotypeCaller by setting the nct option.
However, I found that the speedup it gains isn't proportional to the increase in the number of data threads.
I tried nct values of 8, 12, 16 and 24 on my machine and gained speedups of 4.1x, 4.2x, 4.2x and 4.2x. It seems that there is an upper bound on the performance gain from enabling multi-threading for HaplotypeCaller.
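
As a rough way to quantify this plateau, the measured speedups can be turned into per-thread efficiency and, under the assumption that Amdahl's law S = 1 / ((1 - p) + p / n) is a fair model here, an implied parallel fraction p. This is only a sketch: the function names are hypothetical, and nothing in it describes how HaplotypeCaller actually schedules work.

```python
def parallel_efficiency(speedup, threads):
    """Speedup obtained per thread (1.0 would be perfect scaling)."""
    return speedup / threads


def amdahl_parallel_fraction(speedup, threads):
    """Parallel fraction p implied by Amdahl's law S = 1 / ((1 - p) + p / n)."""
    return (1 - 1 / speedup) / (1 - 1 / threads)


# nct value -> measured speedup, as reported in the post
measured = {8: 4.1, 12: 4.2, 16: 4.2, 24: 4.2}

for n, s in measured.items():
    p = amdahl_parallel_fraction(s, n)
    ceiling = 1 / (1 - p)  # speedup limit as the thread count grows
    print(
        f"nct={n:2d} speedup={s} efficiency={parallel_efficiency(s, n):.2f} "
        f"implied p={p:.2f} ceiling={ceiling:.1f}x"
    )
```

If the implied p changes noticeably between thread counts, as it does with these numbers, a fixed serial fraction alone does not explain the plateau and some other shared bottleneck may be involved.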

I'm wondering what each data thread stands for in HaplotypeCaller. We need to use PairHMM to calculate the likelihood array in each active region. Is each read-haplotype pair in the region treated as one data thread and mapped to a CPU thread? Or is the whole calculation for each region treated as one data thread?

Thanks,

HaplotypeCaller taking too long

Hi,
When I run the pipeline according to best practices, HC on a fastq of 60MB (for a targeted panel) takes about 10 minutes, but then for the same pipeline/targeted region, on a fastq of 150MB, HC takes 6 hrs. Any idea what would explain such a stark jump in runtime? Also, is there any way to reduce it? Can I use -Xmx -Xms arguments to increase the speed if memory is not an issue?
Any help will be appreciated.
Thanks!

HaplotypeCaller calling error

Dear all,
I found many calling errors like this one: the bases all have Phred quality >30, but HaplotypeCaller misses them and reports the site as homozygous.
How should I deal with this?
Here is the command. Many thanks.
java -Xmx200g -jar /home/share/bin/GenomeAnalysisTK-3.5.jar \
-R /home/share/index/Prunus_persica.fa \
-T HaplotypeCaller -nct 8 \
-I $d'.sorted.uniqe.rg.dedup.realn.bam' \
-o $d'.gvcf' \
--genotyping_mode DISCOVERY \
-stand_emit_conf 30 \
-stand_call_conf 30 \
-ERC GVCF \
-variant_index_type LINEAR \
-variant_index_parameter 128000

Memory error - Downsampling unavailable (3.5)

Hi there,

I am working on mitochondrial genomes (250 samples) with a coverage of ~7000x.
HC (v 3.5) runs perfectly for the first 248 samples, but on reaching the last ones the following Java error appears:

##### ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.

All the other samples work perfectly with ~64G of mem and v_mem, so I tried 128G, but it is still not sufficient for my last samples to be processed. As you can probably understand, I can't extend the memory indefinitely, and I would like to know whether there is a way to force downsampling or at least to greatly reduce the memory used by HaplotypeCaller.

Thanks for your time,

Regards,

Alex H

How to run GATK directly on SRA files

Hello, I recently watched an NCBI webinar, "Advanced Workshop on SRA and dbGaP Data Analysis" (ftp://ftp.ncbi.nlm.nih.gov/pub/education/public_webinars/2016/03Mar23_Advanced_Workshop/). The presenters mentioned that they were able to run GATK directly on SRA files.

I downloaded the GenomeAnalysisTK-3.5 jar file to my computer and tried both of these commands:

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -R SRRFileName -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

java -jar /path/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T SRRFileName -R SRR1718738 -I SRRFileName -stand_call_conf 30 -stand_emit_conf 10 -o SRRFileName.vcf

For both these commands, I got this error:
ERROR MESSAGE: Invalid command line: The GATK reads argument (-I, --input_file) supports only BAM/CRAM files with the .bam/.cram extension and lists of BAM/CRAM files with the .list extension, but the file SRR1718738 has neither extension. Please ensure that your BAM/CRAM file or list of BAM/CRAM files is in the correct format, update the extension, and try again.

I don't see any documentation about this here, so I wanted to check whether you or anyone else has had any experience with it.

Thanks
K

Using BAMOUT files to check HaplotypeCaller

In your documentation for this you say "You can see that the bamout file, on top, contains data only for the ActiveRegion that was within the analysis interval specified by -L. The two blue reads represent the artificial haplotypes constructed by HaplotypeCaller (you may need to adjust your IGV settings to see the same thing on your machine). "

What are the IGV settings to see the artificial haplotypes?


Bug in HaplotypeCaller: GT is called 0/1, but AD is 206,0

Hi, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Neither of these two variants was called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in the HNF1B region were all homozygous, as we expected).

Please see the VCF table:

There are three things I want to highlight:
1. The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion and very low for the substitution. If there were a strand bias, it should affect both variants, given how close they are to each other.
2. The deletion didn't pass the ReadPosRankSumTest filter because it is always called near the ends of the reads. The ReadPosRankSumTest score for the substitution is missing.
3. The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 reads supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.
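
The mismatch described in the third point can be screened for with a small Python sketch. It assumes GT and AD strings laid out as in a VCF sample column; the function name `alleles_without_reads` is hypothetical. It is only a screening heuristic: it does not say which of the two fields is right.

```python
import re


def alleles_without_reads(gt, ad):
    """Return the allele indices that appear in GT but have an AD count of 0."""
    # A missing AD entry (".") is treated as 0 supporting reads.
    depths = [0 if x == "." else int(x) for x in ad.split(",")]
    # GT alleles are separated by "/" (unphased) or "|" (phased).
    called = {int(a) for a in re.split(r"[/|]", gt) if a != "."}
    return sorted(a for a in called if a < len(depths) and depths[a] == 0)


print(alleles_without_reads("0/1", "206,0"))  # [1]
```

For the substitution described above (GT 0/1 with AD 206,0) the sketch reports allele 1, the alt allele, as called without any supporting reads in AD.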

Then I looked into the BAM files. It turns out that all the alternative alleles of the substitution came from the ends of bad reads, and there are not many of them after all. So the reads in the BAM file also support a homozygous ref/ref genotype.


Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks

Betty

Variant Calling in NGS data of a specific locus.

Hello,

I'm rather new to variant calling and had a question regarding this process on NGS data performed on a specific locus.
I sequenced the PCR amplicon of a bunch of samples in a pool together. Let's say I have 20 samples which were all diploid.
I expect some SNPs in some of the alleles but it's also possible that there is more than 1 SNP in 1 allele.
Is it possible to run a variant calling process that will 'group' the two SNPs in 1 allele together? I looked into it a bit and came across HaplotypeCaller, but I'm unsure whether it will do the trick.

My apologies if this is a stupid question.

Kind regards

Is there a paper describing the Haplotype Caller algorithm?

Hi,

I'd like to ask whether there is a paper describing the Haplotype Caller algorithm. If there is, could you please send me the reference? I have tried to find it, but I only found the paper on GATK, which is great but doesn't describe the Haplotype Caller algorithm in detail.

Thank you,

An error in the genotype and mutation type of an HC-called mutation

Hi professor,
When I use HC to call mutations, the genotype and mutation are reported as follows:
HC_Gene.refGene Chr Start End Ref Alt III17
HAVCR1 chr5 156479568 156479568 C CGTT,* 0/1

But when I check the BAM file for the reads supporting the mutation, I see that almost all reads support the second type. Why does HC call it 0/1 and not 2/2?
Please see the attachment.

Is HaplotypeCaller suitable for use in a GWAS project?

Hi GATK Team,

I'm performing a Genome Wide Association Study and have just used HaplotypeCaller to call SNPs. Having read previous threads discussing the reference bias inherent in HC (at least relative to UnifiedGenotyper), I included the recommended arguments so my command line was as follows:

java -Xmx16000m -XX:+UseSerialGC -jar /GenomeAnalysisTK.jar -T HaplotypeCaller -L /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/New_Intervals_May16/UMD_1_2.intervals -R /lustre/scratch113/projects/cichlid/Mzebra_UMD1_assembly/UMD1_mzebra_nuclear_and_mtDNA.fa --emitRefConfidence GVCF --variant_index_type LINEAR --minPruning 1 --minDanglingBranchLength 1 --variant_index_parameter 128000 -nct 4 -I {} -o GATKgVCFs/{}_UMD_1_2.g.vcf.gz

I understand that the minPruning and minDanglingBranchLength arguments are expected to mitigate the reference bias? I would like to know how they do so: does HC just refuse to make calls at those sites rather than guessing, or does it take another approach? As I'm sure you appreciate, in a GWAS it could be very misleading to have certain sites incorrectly recorded as reference bases simply as a result of a lack of sequencing depth. Is HC's appropriateness for a GWAS comparable to that of UG, or would UG be recommended for this particular application?

Kind regards,

Ian

Why is Haplotype Caller not calling any variants?

Hi, I am following the BP guidelines for RNA-Seq variant calling and am trying to run HaplotypeCaller on the output .bam files from the Split & Trim step. All the MAPQ scores of 255 have been replaced with 60 as suggested. For all samples, not a single variant was called, yet I can see them clearly when viewing my .bam files in IGV. We are using GATK version 3.5-0-g36282e4.

I am a bit confused as more than 60% of the reads passed the filters and were processed (see data below).

What is going wrong here, and what should be done to obtain the variants that are clearly present?

Here is the command line (as given in the documentation):

java-1.8/jdk1.8.0_92/bin/java -jar -Xms16000m -Xmx16000m -Djava.io.tmpdir=/tmp/2171CB /gatk-1.0/GenomeAnalysisTK.jar -T HaplotypeCaller -R Peaxi162_genome.fa -I 2171CB_dedup_split_realigned.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o 2171CB.vcf

The job output included:

Successfully completed.

Resource usage summary:

CPU time   :   3492.42 sec.
Max Memory :      6118 MB
Max Swap   :     20473 MB

Max Processes  :         3

The job error included:
...
Using SSE4.1 accelerated implementation of PairHMM
INFO 18:40:12,995 VectorLoglessPairHMM - libVectorLoglessPairHMM unpacked successfully from GATK jar file
INFO 18:40:12,995 VectorLoglessPairHMM - Using vectorized implementation of PairHMM
INFO 18:40:12,997 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
INFO 18:40:12,997 PairHMM - Total compute time in PairHMM computeLikelihoods() : 0.0
INFO 18:40:12,997 HaplotypeCaller - Ran local assembly on 0 active regions
INFO 18:40:13,214 ProgressMeter - done 1.259220201E9 62.6 m 2.0 s 100.0% 62.6 m 0.0 s
INFO 18:40:13,214 ProgressMeter - Total runtime 3756.49 secs, 62.61 min, 1.04 hours
INFO 18:40:13,215 MicroScheduler - 48797065 reads were filtered out during the traversal out of approximately 127666865 total reads (38.22%)
INFO 18:40:13,215 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 18:40:13,215 MicroScheduler - -> 39887566 reads (31.24% of total) failing DuplicateReadFilter
INFO 18:40:13,215 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 18:40:13,216 MicroScheduler - -> 8909499 reads (6.98% of total) failing HCMappingQualityFilter
INFO 18:40:13,216 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 18:40:13,216 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO 18:40:13,216 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 18:40:13,217 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 18:40:16,079 GATKRunReport - Uploaded run statistics report to AWS S3

Thanks for your guidance,

Charles

RGQ returning 0 for high numbers of aligned reads

Hi,
Just wondering what the possible reasons could be for Haplotype Caller (version: 3.5-0-g36282e4) to declare a reference genotype quality of 0 for positions where the read depth is relatively high, such as the following region:

SC1 1628    .       T       .       .       PASS        DP=50;GC=33.33 GT:AD:DP:RGQ    0/0:50:50:99
SC1 1629    .       T       .       .       PASS        DP=50;GC=38.1  GT:AD:DP:RGQ    0/0:50:50:99
SC1 1630    .       T       .       .       FAIL_RGQ    DP=50;GC=38.1  GT:AD:DP:RGQ    0/0:45:50:5
SC1 1631    .       C       .       .       PASS        DP=51;GC=33.33 GT:AD:DP:RGQ    0/0:51:51:99
SC1 1632    .       A       .       .       PASS        DP=51;GC=33.33 GT:AD:DP:RGQ    0/0:51:51:96

In this instance the RGQ of the failed position above (at SC1:1630) was actually 5 (which I set as the threshold for filtering in this example), but I have plenty of instances where the read depth and resultant RGQ are like:

SC1 1630    .       T       .       .       FAIL_RGQ       DP=50;GC=38.1          GT:AD:DP:RGQ    ./.:45:50:5
SC1 1640    .       T       .       .       FAIL_RGQ       DP=48;GC=38.1          GT:AD:DP:RGQ    ./.:34:48:0
SC1 1805    .       T       .       .       FAIL_RGQ       DP=36;GC=33.33         GT:AD:DP:RGQ    ./.:32:36:0
SC1 2046    .       A       .       .       FAIL_RGQ       DP=37;GC=19.05         GT:AD:DP:RGQ    ./.:33:37:2
SC1 2345    .       A       .       .       FAIL_RGQ       DP=105;GC=23.81        GT:AD:DP:RGQ    ./.:90:105:0
SC1 2352    .       A       .       .       FAIL_RGQ       DP=116;GC=19.05        GT:AD:DP:RGQ    ./.:103:116:0
SC1 2356    .       C       .       .       FAIL_RGQ       DP=112;GC=23.81        GT:AD:DP:RGQ    ./.:100:112:0
SC1 2359    .       G       .       .       FAIL_RGQ       DP=111;GC=28.57        GT:AD:DP:RGQ    ./.:99:111:0

It feels like something funny is going on. Should it be possible for RGQ to be so low with such high depth? Also, I thought the AD format tag gave the count of unfiltered reads, whilst the format DP tag gave the filtered read depth (i.e. reads HC finds informative). Therefore shouldn't the AD count always be at least as high as the DP count?
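
To look at the relationship between AD, DP and RGQ across many such sites, the rows can be tabulated with a short Python sketch. It assumes the ten-column layout shown above with a single AD value per site; the function name `parse_site` is hypothetical. It only reports the gap between DP and AD and makes no claim about which reads each field counts.

```python
def parse_site(row):
    """Extract POS, AD, DP, RGQ and the DP - AD gap from one record."""
    cols = row.split()
    keys = cols[8].split(":")  # FORMAT column: GT:AD:DP:RGQ
    sample = dict(zip(keys, cols[9].split(":")))
    ad, dp = int(sample["AD"]), int(sample["DP"])
    return {
        "pos": int(cols[1]),
        "AD": ad,
        "DP": dp,
        "RGQ": int(sample["RGQ"]),
        "DP_minus_AD": dp - ad,
    }


# Two of the rows listed above
rows = [
    "SC1 2345 . A . . FAIL_RGQ DP=105;GC=23.81 GT:AD:DP:RGQ ./.:90:105:0",
    "SC1 1631 . C . . PASS DP=51;GC=33.33 GT:AD:DP:RGQ 0/0:51:51:99",
]
for row in rows:
    print(parse_site(row))
```

In the rows shown in this post, the sites with RGQ near 0 all have AD lower than DP, whereas the passing sites with RGQ of 96 or 99 have AD equal to DP.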


ERROR MESSAGE: Graph must have ref source and sink vertices

Hello,

I am running GATK 3.5-0 and encountered the following error:
ERROR MESSAGE: Graph must have ref source and sink vertices

The error occurs towards the end of the file processing and the resulting gvcf file seems rather complete to me, but I'd like to make sure.

Any help is greatly appreciated,
Susanne

INFO 15:31:52,917 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalStateException: Graph must have ref source and sink vertices
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.graphs.BaseGraph.removePathsNotConnectedToRef(BaseGraph.java:576)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.createGraph(ReadThreadingAssembler.java:211)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler.assemble(ReadThreadingAssembler.java:127)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.LocalAssemblyEngine.runLocalAssembly(LocalAssemblyEngine.java:169)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.assembleReads(HaplotypeCaller.java:1029)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:865)
at org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller.map(HaplotypeCaller.java:228)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:709)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$TraverseActiveRegionMap.apply(TraverseActiveRegions.java:705)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:274)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Graph must have ref source and sink vertices
ERROR ------------------------------------------------------------------------------------------

HaplotypeCaller DP reports low values

Dear GATK Team,

I've recently been exploring HaplotypeCaller and noticed that, for my data, it is reporting ~10x lower DP and AD values than the read counts visible in the IGV browser and those reported by UnifiedGenotyper.

I'm analyzing a human gene panel of amplicon data produced on a MiSeq, 150bp paired end. The coverage is ~5,000x.

My pipeline is:

Novoalign -> GATK (recalibrate quality) -> GATK (re-align) -> HaplotypeCaller/UnifiedGenotyper.

Here are the minimum commands that reproduce the discrepancy:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.HC.vcf \
-L ROI.bed \
-dt NONE \
-nct 8

Example variant from sample1.HC.vcf:

chr17 41245466 . G A 18004.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.411;ClippingRankSum=-1.211;DP=462;FS=2.564;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.250;QD=31.14;ReadPosRankSum=1.159 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0

... In comparison to using UnifiedGenotyper with exactly the same alignment file:

java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
--dbsnp /gatk_bundle/dbsnp_137.hg19.vcf \
-R /gatk_bundle/ucsc.hg19.fasta \
-I sample1.rg.bam \
-o sample1.UG.vcf \
-L ROI.bed \
-nct 4 \
-dt NONE \
-glm BOTH

Example variant from sample1.UG.vcf:

chr17 41245466 . G A 140732.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=5.488;DP=6382;Dels=0.00;FS=0.000;HaplotypeScore=568.8569;MLEAC=2;MLEAF=1.00;MQ=70.00;MQ0=0;MQRankSum=0.096;QD=22.05;ReadPosRankSum=0.104 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0
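
The size of the discrepancy between the two records can be computed with a short Python sketch. It assumes single-sample records with FORMAT in the ninth column; the helper name `sample_value` is hypothetical, and the INFO fields of the two records above are shortened to their DP entries.

```python
def sample_value(record, key, sample_index=0):
    """Return one FORMAT value (as a string) for the chosen sample."""
    cols = record.split()
    keys = cols[8].split(":")
    values = cols[9 + sample_index].split(":")
    return dict(zip(keys, values))[key]


# The HaplotypeCaller and UnifiedGenotyper records above, INFO shortened
hc = "chr17 41245466 . G A 18004.77 . DP=462 GT:AD:DP:GQ:PL 1/1:3,458:461:99:18033,1286,0"
ug = "chr17 41245466 . G A 140732.77 . DP=6382 GT:AD:DP:GQ:PL 1/1:56,6300:6378:99:140761,8716,0"

hc_dp = int(sample_value(hc, "DP"))
ug_dp = int(sample_value(ug, "DP"))
print(f"HC DP={hc_dp}, UG DP={ug_dp}, UG/HC ratio={ug_dp / hc_dp:.1f}")
```

For this site the sample-level DP is 461 in the HaplotypeCaller record and 6378 in the UnifiedGenotyper record, a ratio of about 13.8.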

I looked at the mapping quality and number of the alignments at the example region (200nt window) listed above and they look good:

awk '{if ($3=="chr17" && $4 > (41245466-100) && $4 < (41245466+100))  print}' sample1.rg.sam | awk '{count[$5]++} END {for(i in count) print count[i], i}' | sort -nr
8764 70
77 0

With other data generated in our lab, which have ~200x coverage and use the same assay principle [just more amplicons], the DP reported by HaplotypeCaller corresponds perfectly to UnifiedGenotyper and IGV.

Is there an explanation as to why I should see a difference between HaplotypeCaller and UnifiedGenotyper, using these kinds of data?

Many thanks in advance,

Sam

Which variant caller should be selected for my dataset?

Hi,

I am currently working on two different projects and am interested in finding common and rare variants.

Project 1:
Organism: Influenza virus
Number of samples: 18
Sequencing type: Exome sequencing
Alignment tool: BWA
Analysis-ready BAM files:
- BAM files generated using BWA,
- then sorted,
- then deduplicated using Picard (MarkDuplicates)
GATK variant caller: UnifiedGenotyper or HaplotypeCaller (which one should be used?)

Project 2:
Organism: Human
Number of samples: 24
Sequencing type: DNAseq (paired-end)
Analysis-ready BAM files:
- BAM files generated using BWA,
- then sorted,
- and finally deduplicated and recalibrated (recalibrated_reads.bam)
GATK variant caller: UnifiedGenotyper or HaplotypeCaller (which one should be used?)

What are the criteria for selecting the variant caller?

reference allele with Haplotype Caller

Hi
I'm trying to use Haplotype Caller for some WES data, and I used the --emitRefConfidence GVCF option in order to see only variants in my VCF output (no 0/0 homozygous wild-type calls). Unexpectedly, I obtained something like this:
[...]
chrX 155254672 . C <NON_REF> . . END=155254791 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chrX 155254853 . T <NON_REF> . . END=155254972 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0

What am I getting wrong?

HaplotypeCaller running for a long time

Hi @Geraldine_VdAuwera

I have many BAM files produced by following the best practices, and I want to generate a gVCF for each sample. What puzzles me is that most samples finished the HaplotypeCaller (version 3.5) step within three days (genome size about 3G, sequencing depth about 10X), but three samples have been running for half a month (see the log below). I do not know why; maybe something unexpected happened with these samples. Any suggestions will be appreciated!

INFO 20:35:24,351 ProgressMeter - 17:67437745 1.771981436E9 3.1 w 17.8 m 65.5% 4.8 w 11.5 d
INFO 20:36:24,352 ProgressMeter - 17:67487105 1.771981436E9 3.1 w 17.8 m 65.5% 4.8 w 11.5 d
INFO 20:37:24,355 ProgressMeter - 17:67547076 1.771981436E9 3.1 w 17.8 m 65.5% 4.8 w 11.5 d
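
To see how fast the traversal is actually moving, consecutive ProgressMeter lines can be compared with a short Python sketch. It assumes the line layout shown above, timestamps from the same day and positions on the same contig; the function name `progress_steps` and the regular expression are hypothetical helpers.

```python
import re

LINE_RE = re.compile(
    r"INFO\s+(\d+):(\d+):(\d+),(\d+)\s+ProgressMeter\s+-\s+(\S+):(\d+)"
)


def progress_steps(log_text):
    """Return (bases advanced, seconds elapsed) between consecutive lines."""
    points = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            h, mi, s, ms = (int(m.group(i)) for i in range(1, 5))
            seconds = h * 3600 + mi * 60 + s + ms / 1000
            points.append((seconds, m.group(5), int(m.group(6))))
    steps = []
    for (t0, c0, p0), (t1, c1, p1) in zip(points, points[1:]):
        if c0 == c1 and t1 > t0:  # skip contig changes and clock resets
            steps.append((p1 - p0, t1 - t0))
    return steps


log = """\
INFO 20:35:24,351 ProgressMeter - 17:67437745 1.771981436E9 3.1 w 17.8 m 65.5% 4.8 w 11.5 d
INFO 20:36:24,352 ProgressMeter - 17:67487105 1.771981436E9 3.1 w 17.8 m 65.5% 4.8 w 11.5 d
INFO 20:37:24,355 ProgressMeter - 17:67547076 1.771981436E9 3.1 w 17.8 m 65.5% 4.8 w 11.5 d
"""
for bases, seconds in progress_steps(log):
    print(f"{bases} bp in {seconds:.1f} s ({bases / seconds * 60:.0f} bp/min)")
```

Comparing these per-minute rates between a slow sample and one that finished normally could help show whether the slowdown is uniform or confined to particular regions.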

Best