Low coverage loci - GATK pipeline

July 19, 2017, 7:25 am

Hi GATK team,

I am posting this question for everyone's benefit as it will shed more light on how HaplotypeCaller and other GATK programs deal with low coverage positions.

For the sake of this example, let's assume we have a position, no. 1234, supported by 2 reads carrying C. Let's also assume that: there is enough evidence for the haplotype containing these reads for it to be mapped to the reference; we have set --minPruning to 1 so that these reads do not get tossed out during re-alignment with HaplotypeCaller; the reference is AC for this site; we filter on annotations other than depth or coverage during VQSR; and, finally, for some reason only 1 strand got sequenced at position 1234.

Questions:

1- Will the call at position 1234 likely be CC?
2- Will position 1234 likely not get called at all?
3- Will position 1234 get called as AC because that is what the reference has?
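
To get a feel for how little evidence two reads carry, here is a minimal sketch of a textbook diploid genotype-likelihood model. It is not HaplotypeCaller's actual implementation (which scores reads against assembled haplotypes with a PairHMM); the function name `simple_pls` and the fixed per-base error rate of 0.001 are assumptions made purely for illustration.

```python
import math

def simple_pls(ref_reads, alt_reads, error_rate=0.001):
    """Phred-scaled, normalized likelihoods for genotypes 0/0, 0/1 and 1/1.

    Toy model (assumption): reads are independent, share one error rate,
    and each read supports either REF or ALT.
    """
    # Probability that a single read shows ALT under each genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    log10_lik = {}
    for gt, p in p_alt.items():
        log10_lik[gt] = (alt_reads * math.log10(p)
                         + ref_reads * math.log10(1.0 - p))
    best = max(log10_lik.values())
    # PL = -10 * log10(L / L_best), rounded to the nearest integer.
    return {gt: int(round(-10.0 * (ll - best))) for gt, ll in log10_lik.items()}

# Two reads, both carrying the non-reference base.
print(simple_pls(ref_reads=0, alt_reads=2))
```

Under this toy model the homozygous-alternate genotype comes out ahead, but only by about 6 Phred units over the heterozygote (roughly 3 per supporting read), so any call made from two reads would carry a very low genotype quality.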

↧

Problem with HaplotypeCaller and GenotypeGVCFs

July 20, 2017, 11:20 am

Hi just wondering if you have any experience with this problem.

I am following GATK best practices for a targeted sequencing experiment.

After the GenotypeGVCFs phase, the majority of variants are marked "MQ=NaN".

Inspection of the g.vcf reveals that the only sites which have a numerical MQ are those which are homozygous for the alternative allele.

I've included below the three commands used to get to that point.

I have tried this with and without -A AS_RMSMappingQuality.

Thanks so much for your help!

for filename in *.bam; do $gatk -T HaplotypeCaller -nct 15 -R ucsc.hg19.fasta -I "${filename}" -L bedfile.bed -ERC GVCF -o "${filename}.g.vcf"; done

ls *.g.vcf > gee_vee_cee_eff.list

$gatk -T GenotypeGVCFs -R ucsc.hg19.fasta -nt 15 -V gee_vee_cee_eff.list -o GVCFs_jointcalls.vcf
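
To quantify "the majority of variants", one could count how many records carry MQ=NaN. The sketch below is a plain-Python illustration, not a GATK tool; `count_mq_nan` is a hypothetical helper and the two toy records are made up for the example.

```python
def count_mq_nan(vcf_lines):
    """Return (records_with_MQ_NaN, total_records) for an iterable of VCF lines."""
    nan_count = 0
    total = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        total += 1
        # INFO is the 8th column: semicolon-separated KEY=VALUE pairs and flags.
        info = dict(item.split("=", 1)
                    for item in fields[7].split(";") if "=" in item)
        if info.get("MQ", "").lower() == "nan":
            nan_count += 1
    return nan_count, total

# Hypothetical toy records (tab-delimited), for illustration only.
toy = [
    "##fileformat=VCFv4.2",
    "\t".join(["chr1", "100", ".", "A", "G", "50.0", ".", "AC=1;AN=2;DP=10;MQ=NaN"]),
    "\t".join(["chr1", "200", ".", "C", "T", "90.0", ".", "AC=2;AN=2;DP=12;MQ=60.00"]),
]
print(count_mq_nan(toy))
```

In practice the lines would come from reading GVCFs_jointcalls.vcf with `open(...)`.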

↧

Is GATK overestimating the heterozygous calls?

July 24, 2017, 1:21 am

Hi,
I have 24 genotypes distributed in 4 different populations.

I used HaplotypeCaller with the option -ERC GVCF and obtained a GVCF file for each genotype. I then combined all the genotypes into a single VCF file with GenotypeGVCFs.

Is there a way to tell GATK to label a variant site as "heterozygous" only if the alternate allele is present in >60% of the reads?

Example:
At position 82 (highlighted with a red box in the figure), the genotype field for this variant is 0/1, whereas, as seen in IGV, only 3 of the 10 reads contain the alternate allele "A". Which filter should I use in HaplotypeCaller, GenotypeGVCFs or VariantFiltration to label a variant site as heterozygous only if the alternate allele is present in, for example, 6 out of 10 reads?
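
The threshold described above can be sketched as a post-hoc check on the AD (allelic depth) field of each sample. This is plain Python, not a GATK option; the helper names are hypothetical, and the AD value "7,3" simply assumes the VCF mirrors the 3-of-10 reads seen in IGV.

```python
def alt_fraction(ad_field):
    """Fraction of reads supporting the first ALT allele, from an AD string like '7,3'."""
    depths = [int(x) for x in ad_field.split(",")]
    total = sum(depths)
    return depths[1] / total if total else None

def flag_low_balance_het(gt, ad_field, min_alt_fraction=0.6):
    """True when a heterozygous call has an ALT fraction below the threshold."""
    alleles = gt.replace("|", "/").split("/")
    is_het = "." not in alleles and len(set(alleles)) > 1
    frac = alt_fraction(ad_field)
    return is_het and frac is not None and frac < min_alt_fraction

print(alt_fraction("7,3"))                  # ALT fraction for 3 of 10 reads
print(flag_low_balance_het("0/1", "7,3"))   # het call below the 60% cutoff
```

Note that a true heterozygote is expected to show roughly 50% ALT reads, so a 60% cutoff would also flag most ordinary heterozygous calls.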

↧

Truth or control samples - Variant calling

July 24, 2017, 11:00 am

Are we able to incorporate truth/control samples in addition to dbSNP when calling variants with GVCFs (cohorts) or the traditional way with HaplotypeCaller? There are, for example, situations where the sequences are from Australian, East Asian, or African samples, and we would like to include truth/control samples for those populations, perhaps from 1000 Genomes or some other source.

If this is possible, what arguments do we use?

↧

Genotype Called by HaplotypeCaller

July 25, 2017, 12:06 am

Hello,

I'm running the GATK pipeline for variant calling, with the following steps:
Bowtie2->MarkDuplicates->AddOrReplaceReadGroups->RealignerTargetCreator->IndelRealigner->BaseRecalibrator->PrintReads->HaplotypeCaller->GenotypeGVCFs

I've got one variant with a confusing genotype call:
chr5 136954583 rs4835684 T A 2126.77 VQSRTrancheSNP90.00to99.00 AC=2;AF=1.00;AN=2;DB;DP=57;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=43.77;QD=32.89;SOR=2.428;VQSLOD=1.51;culprit=MQ GT:AD:GQ:PL 1/1:0,57:99:2155,171,0

It shows AD 0,57 and a 1/1 genotype, but when I checked manually in the samtools viewer there are 33 REF (T) calls and 22 ALT (A) calls. I don't understand why HaplotypeCaller called a 1/1 genotype.
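
As a sanity check, the sample column of the record above can be decoded to confirm that GT, GQ and PL agree with each other. The sketch assumes the usual VCF conventions (PL listed in the order 0/0, 0/1, 1/1 for a biallelic diploid site; GQ taken as the gap between the two smallest PLs, capped at 99); the helper names are hypothetical. It only shows that the record is internally consistent and says nothing about why the original pileup looks different.

```python
def parse_sample(format_col, sample_col):
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))

def most_likely_genotype_index(pl_field):
    """Index of the genotype with the lowest (best) PL."""
    pls = [int(x) for x in pl_field.split(",")]
    return pls.index(min(pls))

def gq_from_pl(pl_field, cap=99):
    """Gap between the two smallest PLs, capped (assumed GQ convention)."""
    pls = sorted(int(x) for x in pl_field.split(","))
    return min(pls[1] - pls[0], cap)

sample = parse_sample("GT:AD:GQ:PL", "1/1:0,57:99:2155,171,0")
genotypes = ["0/0", "0/1", "1/1"]  # PL order for a biallelic diploid site
print(sample["AD"], genotypes[most_likely_genotype_index(sample["PL"])],
      gq_from_pl(sample["PL"]))
```

The PLs point to 1/1 with GQ 99, matching the reported GT and GQ, so the call follows from the reads as HaplotypeCaller counted them (AD 0,57) rather than from the raw pileup.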

Then, to confirm this variant, I ran HaplotypeCaller with the -bamout option, as suggested in this thread: https://gatkforums.broadinstitute.org/gatk/discussion/5042/genotype-calling-in-gatk

When I viewed the -bamout output in samtools tview, it showed all ALT (A) calls. I think HaplotypeCaller is performing some steps that I don't understand. How did the number of ALT calls increase in the -bamout output? Am I missing something?

I'm totally confused.
Please help. Any help would be appreciated!! Thanks

↧

GENOTYPE_GIVEN_ALLELES mode not work in GATK4 beta

July 26, 2017, 8:00 pm

Hi guys, I tried to run GGA (GENOTYPE_GIVEN_ALLELES) in GATK 4.beta.1, but it failed with a NullPointerException. I'm sure that my input file and parameter settings are OK, because I have checked my settings against this post on our forum and the 4.beta.1 docs, and the equivalent parameters work fine for GATK 3.7.

I haven't tested GGA with 4.beta.2 or 4.beta.3, as the release notes show no update related to this function. I'm wondering whether GGA works in 4.beta or will work in the future general release, or whether I need to change my parameters to get it running. Below are my parameters and error log.

gatk-launch --javaOptions "-Xmx4g" HaplotypeCaller  \
   -R /reference/BWAIndex/genome.fa \
   -I  miseq_161113_PE75.bwa.sorted.filtered.recal.bam \
   -O miseq_161113_PE75_gatk4_pgkb.vcf \
   -L  /path/to/my.vcf.gz \
   --alleles /path/to/my.vcf.gz \
   --genotyping_mode GENOTYPE_GIVEN_ALLELES

[July 14, 2017 2:51:14 PM CST] Executing as jiecui@Neptune on Linux 4.4.0-83-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11; Version: 4.beta.1
14:51:14.936 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 1
14:51:14.936 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:51:14.936 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
14:51:14.936 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:51:14.936 INFO  HaplotypeCaller - Deflater: IntelDeflater
14:51:14.936 INFO  HaplotypeCaller - Inflater: IntelInflater
14:51:14.936 INFO  HaplotypeCaller - Initializing engine
......
14:51:15.342 INFO  IntervalArgumentCollection - Processing 43 bp from intervals
14:51:15.350 INFO  HaplotypeCaller - Done initializing engine
14:51:15.356 INFO  HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
14:51:15.594 WARN  PossibleDeNovo - Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.
14:51:15.737 WARN  PossibleDeNovo - Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.
14:51:15.964 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/media/home/jiecui/software/gatk/gatk-4.beta.1/gatk-package-4.beta.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
[INFO] Available threads: 40
[INFO] Requested threads: 4
[INFO] Using 4 threads
14:51:16.035 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
14:51:16.051 INFO  ProgressMeter - Starting traversal
14:51:16.051 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
log4j:WARN No appenders could be found for logger (org.broadinstitute.hellbender.utils.MathUtils$Log10Cache).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14:51:16.424 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.001359444
14:51:16.424 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.004590645
14:51:16.424 INFO  HaplotypeCaller - Shutting down engine
[July 14, 2017 2:51:16 PM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1598029824
java.lang.NullPointerException
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.AssemblyBasedCallerGenotypingEngine.createAlleleMapper(AssemblyBasedCallerGenotypingEngine.java:159)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:128)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:541)
        at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:221)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:244)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:217)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:838)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:189)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
        at org.broadinstitute.hellbender.Main.main(Main.java:230)

Java and GATK version:

Java version: openjdk version "1.8.0_131"
GATK version: 4.beta.1

↧

Heterozygous X variants observed in male samples called by HaplotypeCaller in normal VCF mode

July 27, 2017, 11:47 pm

Hello GATK team,

I have called variants with HaplotypeCaller in whole-exome data from 6 people (4 affected males and 2 unaffected). All 4 male patients are heterozygous for three variants in a gene located on the X chromosome. One of the unaffected, who is the father of one patient, is homozygous for the reference allele, while the other unaffected, who is the mother of two of the patients, is heterozygous for the same variants (which makes sense).
I have checked the bamout from HC and it confirms what I see in the VCF. You can see the command I used below.
Why did the caller decide to call the sons of the unaffected mother heterozygous? Can't HaplotypeCaller distinguish between male and female samples? Moreover, the number of reads showing the ALT allele is greater than the number showing REF at all three variant locations, so I am confused about what the actual status of the patients is.

GATK \
-T HaplotypeCaller \
-R ucsc.hg19.fasta \
-I recalibrated_reads_final.bam \
--genotyping_mode DISCOVERY \
-bamout bamout.bam \
--dbsnp dbsnp_138.hg19.vcf \
-A Coverage -A TandemRepeatAnnotator -A QualByDepth -A VariantType \
-o raw.vcf
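
As far as I know, HaplotypeCaller genotypes every sample at the ploidy it is given (2 unless --sample_ploidy is changed) and is not told the sex of a sample, so heterozygous chrX calls in males have to be caught afterwards. A minimal sketch of such a check follows; the helper names are hypothetical, and the pseudo-autosomal coordinates are the commonly quoted hg19 values, included here as an assumption to verify against your own reference.

```python
# Pseudo-autosomal regions on chrX, 1-based inclusive. Commonly quoted
# hg19/GRCh37 coordinates; treat them as an assumption and verify them.
HG19_PAR_X = [(60001, 2699520), (154931044, 155260560)]

def in_par(pos, par_intervals=HG19_PAR_X):
    """True when pos falls inside any pseudo-autosomal interval."""
    return any(start <= pos <= end for start, end in par_intervals)

def suspicious_male_x_het(chrom, pos, gt, sex, par_intervals=HG19_PAR_X):
    """True for a heterozygous diploid call on non-PAR chrX in a male sample."""
    if sex != "male" or chrom not in ("chrX", "X"):
        return False
    if in_par(pos, par_intervals):
        return False  # males are genuinely diploid in the PAR
    alleles = gt.replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) > 1

# Hypothetical position and genotype, for illustration only.
print(suspicious_male_x_het("chrX", 49119876, "0/1", "male"))
```

Calls flagged this way are worth inspecting for mapping artifacts (for example reads from paralogous sequence) before drawing conclusions about the patients' status.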

↧

DP differences between haploid and diploid mode

July 28, 2017, 6:11 am

I would appreciate some help in better understanding the differences between haploid and diploid mode when it comes to calling and joint genotyping (HaplotypeCaller + GenotypeGVCFs) in GATK, in particular the differences in the reported read depth (DP). Here is my example case.

HaplotypeCaller using haploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 1 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
--output ~/data/haploid_calls/sample1_chrY.g.vcf.gz

Y       2650345 .       A       <NON_REF>       .       .       END=2650345     GT:DP:GQ:MIN_DP:PL      0:13:99:13:0,429

HaplotypeCaller using diploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 2 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
--output ~/data/diploid_calls/sample1_chrY.g.vcf.gz

Y       2650345 .       A       <NON_REF>       .       .       END=2650345     GT:DP:GQ:MIN_DP:PL      0/0:13:33:13:0,33,495

The only difference between the last two calls to HaplotypeCaller is the parameter --sample_ploidy. In both cases (ploidy 1 and ploidy 2), the reference call is supported by 13 reads (DP field). Consistent with this, looking at this position in the BAM file in IGV (see image below) confirms that 14 reads cover the position and only one base in one of the reads is of low quality (QV 2), hence a DP of 13 makes sense.

What's more, the number of reads spanning this position even increases (up to 24 DP + 2 artificial haplotypes) when looking at the same sample/position using the locally re-aligned reads that GATK can output to a BAM file. See below.

HaplotypeCaller using haploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 1 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
-forceActive -disableOptimizations --bamOutput ~/data/sample1_RE-AL_HAP_chrY.bam

HaplotypeCaller using diploid mode.
(sample1_chrY.bam)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T HaplotypeCaller --intervals Y:2650345-2650345 \
--input_file ~/data/bam/sample1_chrY.bam --emitRefConfidence GVCF --max_alternate_alleles 3 \
--contamination_fraction_to_filter 0.05 --min_base_quality_score 20 \
--sample_ploidy 2 --pcr_indel_model NONE --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta \
-forceActive -disableOptimizations --bamOutput ~/data/sample1_RE-AL_DIP_chrY.bam

However, when I do multi-sample joint genotyping using GenotypeGVCFs, DP values and the number of supporting reads reported vary significantly between g.vcf files produced in haploid and those produced in diploid mode. DP values get significantly reduced, in particular for reference calls it seems. To simplify, I added only a second extra sample in the example here below.

GenotypeGVCFs
(Using g.vcf files produced with HaplotypeCaller in haploid mode)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs  --intervals Y:2650345-2650345 \
--standard_min_confidence_threshold_for_calling 10 --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta --max_alternate_alleles 3 \
--variant ~/data/haploid_calls/sample1_chrY.g.vcf.gz --variant ~/data/haploid_calls/sample2_chrY.g.vcf.gz \
--out ~/data/raw_vcfs/raw_haploid_calls.vcf.gz

Y       2650345 .       A       G       497.76  .       AC=1;AF=0.500;AN=2;DP=23;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;QD=31.11;SOR=0.941       GT:AD:DP:GQ:PL  0:6,0:6:99:0,109        1:0,16:16:99:529,0

GenotypeGVCFs
(Using g.vcf files produced with HaplotypeCaller in diploid mode)

java -jar ~/software/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs  --intervals Y:2650345-2650345 \
--standard_min_confidence_threshold_for_calling 10 --dbsnp ~/data/variations/dbsnp_138/dbsnp_138.b37.vcf \
--reference_sequence ~/data/fasta/Homo_sapiens_assembly19/Homo_sapiens_assembly19.fasta --max_alternate_alleles 3 \
--variant ~/data/diploid_calls/sample1_chrY.g.vcf.gz --variant ~/data/diploid_calls/sample2_chrY.g.vcf.gz \
--out ~/data/raw_vcfs/raw_diploid_calls.vcf.gz

Y       2650345 .       A       G       494.42  .       AC=2;AF=0.500;AN=4;DP=30;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=30.90;SOR=0.941      GT:AD:DP:GQ:PL  0/0:13,0:13:33:0,33,495 1/1:0,16:16:48:529,48,0

As can be seen, with g.vcf files produced in haploid mode the final DP value for sample1 drops to 6 reads, whereas it was previously 13. The value 13 is, however, reported when g.vcf files produced in diploid mode are used.
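
The per-sample comparison can be made explicit with a few lines of Python. The two records are copied from the GenotypeGVCFs output above with the INFO column trimmed for brevity; `sample_depths` is a hypothetical helper, and it splits on any whitespace because the records as pasted here are space-separated (a real VCF is tab-delimited).

```python
# Records from the joint-genotyped output above; INFO trimmed for brevity.
HAPLOID = ("Y 2650345 . A G 497.76 . AC=1;AF=0.500;AN=2;DP=23 "
           "GT:AD:DP:GQ:PL 0:6,0:6:99:0,109 1:0,16:16:99:529,0")
DIPLOID = ("Y 2650345 . A G 494.42 . AC=2;AF=0.500;AN=4;DP=30 "
           "GT:AD:DP:GQ:PL 0/0:13,0:13:33:0,33,495 1/1:0,16:16:48:529,48,0")

def sample_depths(record):
    """Per-sample FORMAT/DP values, in sample-column order."""
    fields = record.split()
    keys = fields[8].split(":")
    return [int(dict(zip(keys, col.split(":")))["DP"]) for col in fields[9:]]

haploid_dp = sample_depths(HAPLOID)
diploid_dp = sample_depths(DIPLOID)
for name, h, d in zip(["sample1", "sample2"], haploid_dp, diploid_dp):
    print(name, "haploid DP:", h, "diploid DP:", d, "difference:", d - h)
```

Only sample1 (the REF call) differs between the two modes, by 7 reads; sample2 (the ALT call) reports 16 in both.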

So, why?

I would be very grateful for some help understanding this. Additional information is given below.

.- I'm using gatk v3.7-0-gcfedb67, Java 1.8.0_40-b26
.- In the case of sample2, there is no difference in the final DP values reported using haploid vs diploid g.vcf files. In this case it is an ALT call, but the issue does occur in REF calls, as in the example for sample1.
.- This is an example with one SNP, but the issue is widespread across the call set, at least when it comes to REF calls.
.- I've checked with the help of IGV -> all the 13 reads/base_positions that I think should be reported in haploid mode (only 6 reported) pass -mmq 20 and -mbq 20
.- there is no significant strand bias
.- data is Illumina, PCR-free, 150 bp paired reads, aligned with bwa-mem, with Picard used for marking duplicates.

Best,
Rodrigo

↧

Variant calling using a phased genome as reference

July 31, 2017, 8:13 am

Hello,

I want to do variant calling in a diploid organism using a phased genome as a reference. Therefore, in the reference we have both chromosomes represented. For variant calling with Haplotype Caller should I consider this genome as a diploid (as it is) or a haploid (as the reference has the 2 homolog chromosomes)? What do you think?

Thanks in advance!

↧

HC listing depth one read less

August 1, 2017, 1:33 pm

Hi,
I ran HaplotypeCaller on a bunch of samples using the following commands:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -drf DuplicateRead -R hg19.fa -I SAMPLE.bam -o SAMPLE.g.vcf -L target_region.bed -ERC GVCF
java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hg19.fa -V SAMPLE.g.vcf -o SAMPLE.hc.vcf

For many variants, DP is listed to be one read less than it actually is. I load the bam file in IGV and count the reads manually (also appears when I hover over the bar plot). Moreover, the correct depth is listed by the output of the DepthOfCoverage tool:

java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -drf DuplicateRead -R hg19.fa -I SAMPLE.bam --omitDepthOutputAtEachBase -o SAMPLE.coverage

Here is an example from hc.vcf:

chrX 49119876 . T C 531.77 . AC=2;AF=1.00;AN=2;DP=19;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=27.99;SOR=1.022 GT:AD:DP:GQ:PL 1/1:0,19:19:57:560,57,0

and from DepthOfCoverage:
chrX:49119876 20 20.00 20 20.00 21 21 21 100.0

So, depth matches between DepthOfCoverage and bam+IGV (DP=20), but it is one less in hc.vcf (DP=19).

Has anybody else seen this issue or know how to fix it? This is giving me problems for the variants that are right at my threshold.

Thanks a lot in advance!

↧

Usage of "--dontUseSoftClippedBases" HaplotypeCaller option for exome enrichment data

August 3, 2017, 11:48 pm

Hi GATK Team,
HaplotypeCaller does not call structural variants from soft-clipped bases, so "--dontUseSoftClippedBases" should mainly reduce false positives (e.g. from incomplete adapter trimming). Is this reasoning correct, or am I wrong?

Greetings from Munich

↧

Joint genotyping exomes is extremely slow (part of the germline haplotypecaller GVCF pipeline)

August 7, 2017, 7:52 pm

I am experiencing an incredible slowdown during the genotyping stage of the HaplotypeCaller GVCF command series. It is my understanding from the documentation that this step should be rather fast: "This step runs very fast and can be rerun at any point when samples are added to the cohort, thereby solving the so-called N+1 problem."

However, given 50 - 100 exomes, the command estimates several weeks until completion, despite being given 64 cores and 256 GB of RAM with unlimited disk space. I'm concerned because this seems unrealistically long, especially given that once a pool of several hundred training exomes is created, the purpose of the GVCF pipeline is to quickly use that pool in a joint genotyping step with a new sample exome. As it stands, each time I have a new sample exome, I would have to endure another multi-week joint genotyping step.

Can you please advise me as to why my command is taking so long? Any insight is much appreciated. Please find below a copy of my command:

    time java -Djava.io.tmpdir=$temp_directory -Xmx192g -jar /root/Installation/gatk/GenomeAnalysisTK.jar -T GenotypeGVCFs \
    -R /bundles/b37/human_g1k_v37.fasta \
    (list of all training exomes and the single sample exome goes here \
    --disable_auto_index_creation_and_locking_when_reading_rods \
    -o genotyped.g.vcf -nt 60


    # I deactivated the following step since it seems to be unnecessary
    # --sample_ploidy 60 \ #(ploidy is set to number of samples per pool * individual sample ploidy)

↧

SNV gets dbSNP annotation in one sample, doesn't get annotated in another one

August 9, 2017, 7:17 am

Hello everyone,

I recently ran HaplotypeCaller from GATK 3.7 on a series of samples (several GATK runs performed at the same time), using the latest release of dbSNP (150). This was the command line I used in both cases (I omitted the full paths for privacy reasons):

/usr/bin/java -Djava.io.tmpdir=/scratch/javatmp/ngs_pipe \
-Xmx4g -jar /data01/Softwares/GATK/3.7/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R /path/to/hg19 \
-I input_bam \
-o output.vcf \
--dbsnp dbSNP_150_NEW_hg19_chr.vcf

and here's the same variant reported in two different files from the same data set (exomes), prepared using the same kit and sequenced on the same NextSeq run:

sample 1:

chr11 125479363 rs2241502 G A 208.01 . AC=2;AF=1.00;AN=2;DB;DP=9;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=23.11;SOR=0.892 GT:AD:DP:GQ:PL 1/1:0,9:9:27:222,27,0

sample2:

chr11 125479363 . G A 597.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.140;ClippingRankSum=0.000;DP=41;ExcessHet=3.0103;FS=2.820;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=14.58;ReadPosRankSum=-1.110;SOR=0.448 GT:AD:DP:GQ:PL 0/1:16,25:41:99:605,0,343
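
To find every site affected by this, the ID column and the DB flag of matching records can be compared across the two VCFs. The sketch below uses the two records above (INFO trimmed for brevity); the helper names are hypothetical, and the whitespace split is only there because the records as pasted are space-separated.

```python
SAMPLE1 = "chr11 125479363 rs2241502 G A 208.01 . AC=2;AF=1.00;AN=2;DB;DP=9"
SAMPLE2 = "chr11 125479363 . G A 597.60 . AC=1;AF=0.500;AN=2;DP=41"

def site_key(record):
    """(chrom, pos, ref, alt) identifying the variant."""
    f = record.split()
    return (f[0], int(f[1]), f[3], f[4])

def annotation_status(record):
    """The ID column and whether the DB flag is present in INFO."""
    f = record.split()
    return {"id": f[2], "has_db_flag": "DB" in f[7].split(";")}

def inconsistent_annotation(rec_a, rec_b):
    """True when the same variant is annotated differently in two records."""
    return (site_key(rec_a) == site_key(rec_b)
            and annotation_status(rec_a) != annotation_status(rec_b))

print(annotation_status(SAMPLE1), annotation_status(SAMPLE2))
print(inconsistent_annotation(SAMPLE1, SAMPLE2))
```

Running this over all shared sites would show whether the missing rsIDs cluster in particular samples or regions.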

I've seen this error occur systematically at several other positions in the same run, and I fear that it may always have been there without my noticing. I'm wondering whether you know why this occurs, whether it's a known bug, and whether I should re-annotate every VCF I have with VariantAnnotator in order to fix the issue (if VariantAnnotator is immune to this bug).

Thanks a lot for your help and time. I'm happy to provide any clarification.

↧

Phasing

January 12, 2017, 2:42 pm

Hi there
How do you incorporate phasing with your variants if you don't have data from parents?
My haplotype output is 1/1, not 1|1, so how can we say HaplotypeCaller can give phased haplotypes?
Huma
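
In VCF, "/" separates unphased alleles and "|" separates phased ones. To my knowledge, GATK 3.x HaplotypeCaller does not rewrite GT with "|" but reports physical phasing, when available, in the separate PGT and PID FORMAT fields of its reference-model (GVCF) output. A minimal sketch for inspecting one sample column follows; `phasing_info` is a hypothetical helper and the example column is made up.

```python
def phasing_info(format_col, sample_col):
    """Report phasing evidence found in one sample column of a VCF record."""
    sample = dict(zip(format_col.split(":"), sample_col.split(":")))
    return {
        "gt_phased": "|" in sample.get("GT", ""),  # '|' marks a phased GT
        "pgt": sample.get("PGT"),                   # phased genotype, if present
        "pid": sample.get("PID"),                   # phase-set identifier, if present
    }

# Hypothetical sample column, for illustration only.
print(phasing_info("GT:AD:DP:GQ:PGT:PID:PL",
                   "1/1:0,9:9:27:1|1:1000_G_A:222,27,0"))
```

A record with neither a "|" in GT nor PGT/PID values carries no phasing information at all.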

↧

Biallelic variants only with HaplotypeCaller

August 10, 2017, 2:38 am

I want to restrict HaplotypeCaller to calling only biallelic sites. I think this can be accomplished with --max_alternate_alleles, although I'm not 100% sure whether this parameter should be set to 1 or 2 (one alternate allele, or two different alleles?).

--max_genotype_count seems relevant here also, but I don't understand what it does.

Basically, how do I ensure calling only biallelic sites?
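
Independently of how the caller is configured, biallelic sites can be selected afterwards by counting the entries in the ALT column. The sketch below is a plain-Python post-hoc filter, not a statement about what --max_alternate_alleles does; the helper names and toy records are hypothetical.

```python
def alt_alleles(record):
    """ALT alleles of a VCF record, ignoring '.' and the <NON_REF> placeholder."""
    alt = record.split()[4]
    return [a for a in alt.split(",") if a not in (".", "<NON_REF>")]

def is_biallelic(record):
    """True when exactly one real ALT allele is listed."""
    return len(alt_alleles(record)) == 1

# Hypothetical toy records, for illustration only.
records = [
    "chr1 100 . A G 50.0 . DP=10",
    "chr1 200 . C T,G 90.0 . DP=12",
    "chr1 300 . T <NON_REF> . . END=300",
]
print([is_biallelic(r) for r in records])
```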

↧

Multithreaded HaplotypeCaller gives different GVCF files to single-core run

August 11, 2017, 4:03 am

Hi,

I've been recreating a bioinformatics pipeline for paired WGS data which largely follows the GATK best practices, but I'm having issues with HaplotypeCaller to create a GVCF file for one individual. I'm using GATK 3.4 with java 1.8.0_141. The BAM file was assembled with bwa, and I've marked duplicates, locally realigned around indels and recalibrated the base quality scores.

I understand that it's not recommended to use multiple CPU threads with HC, but if there is no crash the results should be the same. I ran HC as follows:

gatk34 -T HaplotypeCaller \
  -R /home/shared/reference/1000Genome/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -I K1561-4464.raw-reads.header.sorted.nodup.rg.realign.bsqr.bam \
  -o K1561-4464.raw-reads.g.vcf \
  -ERC GVCF \
  --annotation BaseQualityRankSumTest --annotation FisherStrand --annotation GCContent \
  --annotation HaplotypeScore --annotation HomopolymerRun --annotation MappingQualityRankSumTest \
  --annotation MappingQualityZero --annotation QualByDepth --annotation ReadPosRankSumTest \
  --annotation RMSMappingQuality --annotation DepthPerAlleleBySample --annotation Coverage \
  --annotation ClippingRankSumTest --annotation DepthPerSampleHC --annotation StrandBiasBySample \
  --dbsnp /home/shared/reference/1000Genome/GRCh38/dbsnp_146.hg38.vcf.gz \
  --excludeAnnotation ChromosomeCounts --excludeAnnotation FisherStrand \
  --excludeAnnotation StrandOddsRatio --excludeAnnotation QualByDepth \
  --GVCFGQBands 10 --GVCFGQBands 20 --GVCFGQBands 30 --GVCFGQBands 40 --GVCFGQBands 60 --GVCFGQBands 80 \
  --standard_min_confidence_threshold_for_calling 0 \
  --interval_set_rule INTERSECTION \
  --read_filter BadCigar --read_filter NotPrimaryAlignment \
  --unsafe LENIENT_VCF_PROCESSING \
  --variant_index_parameter 128000 --variant_index_type LINEAR

I also ran this with -nct 4 and with -nct 16; the three runs generated GVCF1, GVCF2 and GVCF3 respectively. The multi-threaded runs completed without any errors reported. However, the three GVCF files are not all the same. I ran GenotypeConcordance on all pairs of the GVCF files. I can't seem to attach the output files here, but here are the Genotype Concordance Counts for GVCF1 vs GVCF2:

#:GATKTable:38:2:%s:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:;
#:GATKTable:GenotypeConcordance_Counts:Per-sample concordance tables: comparison counts
Sample      NO_CALL_NO_CALL  NO_CALL_HOM_REF  NO_CALL_HET  NO_CALL_HOM_VAR  NO_CALL_UNAVAILABLE  NO_CALL_MIXED  HOM_REF_NO_CALL  HOM_REF_HOM_REF  HOM_REF_HET  HOM_REF_HOM_VAR  HOM_REF_UNAVAILABLE  HOM_REF_MIXED  HET_NO_CALL  HET_HOM_REF  HET_HET  HET_HOM_VAR  HET_UNAVAILABLE  HET_MIXED  HOM_VAR_NO_CALL  HOM_VAR_HOM_REF  HOM_VAR_HET  HOM_VAR_HOM_VAR  HOM_VAR_UNAVAILABLE  HOM_VAR_MIXED  UNAVAILABLE_NO_CALL  UNAVAILABLE_HOM_REF  UNAVAILABLE_HET  UNAVAILABLE_HOM_VAR  UNAVAILABLE_UNAVAILABLE  UNAVAILABLE_MIXED  MIXED_NO_CALL  MIXED_HOM_REF  MIXED_HET  MIXED_HOM_VAR  MIXED_UNAVAILABLE  MIXED_MIXED  Mismatching_Alleles
ALL                       0                0            0                0                    0              0                0        244795804          575               25                 2908              0            0          265  3465716           46              325          0                0                3           54          1777987                   10              0                    0                 2705              302                   21                        0                  0              0              0          0              0                  0            0                  697
K1561-4464                0                0            0                0                    0              0                0        244795804          575               25                 2908              0            0          265  3465716           46              325          0                0                3           54          1777987                   10              0                    0                 2705              302                   21                        0                  0              0              0          0              0                  0            0                  697
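
To put the table in proportion, the called-genotype cells can be summed into concordant and discordant totals. The counts below are copied from the GVCF1 vs GVCF2 row above; cells with a NO_CALL, UNAVAILABLE or MIXED side and the Mismatching_Alleles column are deliberately left out, which is a simplifying assumption of this sketch.

```python
# (GVCF1 genotype, GVCF2 genotype) -> count, copied from the table above.
counts = {
    ("HOM_REF", "HOM_REF"): 244795804,
    ("HOM_REF", "HET"): 575,
    ("HOM_REF", "HOM_VAR"): 25,
    ("HET", "HOM_REF"): 265,
    ("HET", "HET"): 3465716,
    ("HET", "HOM_VAR"): 46,
    ("HOM_VAR", "HOM_REF"): 3,
    ("HOM_VAR", "HET"): 54,
    ("HOM_VAR", "HOM_VAR"): 1777987,
}

def concordance_summary(cells):
    """Return (concordant, discordant, discordant_fraction)."""
    concordant = sum(n for (a, b), n in cells.items() if a == b)
    discordant = sum(n for (a, b), n in cells.items() if a != b)
    return concordant, discordant, discordant / (concordant + discordant)

concordant, discordant, rate = concordance_summary(counts)
print(concordant, discordant, f"{rate:.2e}")
```

That is 968 discordant genotypes against roughly 250 million concordant ones, about 4 per million, before the UNAVAILABLE and Mismatching_Alleles cells are considered.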

The other output files were quite similar, with some variants being called differently between the files. I've run ValidateVariants on the three GVCF files, with no failures reported. I'm getting a PairInfoMap error with ValidateSamFile; I'm not sure whether this is related.

Is there a reason why the output GVCF files are different? I'm unsure which file to use, or if it will make a difference downstream.

↧

SnpEff html and .vcf file result are not matching

August 11, 2017, 5:22 am

Asslamu Alikum

I have successfully managed to run SnpEff on my VCF files. However, the counts of missense variants in the HTML file and in the VCF file generated by SnpEff are different.

Missense in HTML: 20,854

Missense in VCF file: 20,754

Can anyone please explain the criteria used for the missense count in the HTML file, so that I can reconcile it with the VCF file?

The input is a VCF and the SnpEff version is 4.3.

↧

Is Indel realignment necessary when HaplotypeCaller re-assembles all reads in a region?

August 13, 2015, 8:47 am

If HaplotypeCaller re-assembles all reads in a region, why is it recommended to run IndelRealigner first?

↧

Is there a paper describing the Haplotype Caller algorithm?

November 17, 2014, 6:40 pm

Hi,

I'd like to ask you if there is a paper describing the Haplotype Caller algorithm, if you could please send me the reference. I have tried to find it, but I only found the paper on GATK which is great, but it doesn't describe in detail the Haplotype Caller algorithm.

thank you,

↧

Does HaplotypeCaller detect SNPs which are heterozygous (not matching the reference allele)?

August 16, 2017, 12:37 am

Dear GATK team,

I have a simple question regarding the HaplotypeCaller/UnifiedGenotyper modules. There are 3 options for SNPs in the output, depending on the reference:
homozygous for the reference, heterozygous for the reference and an alternate nucleotide, and homozygous for the alternate nucleotide (0/0, 0/1, 1/1). But what if I would like to have the SNPs which are heterozygous and do not match the reference allele? For example, the reference has an A and the mapped reads indicate G/T. Does either of the modules detect that? I have a diploid species and want to get all SNPs. This issue came up after I had a look at the output of both modules.

Thanks a lot.
Julia
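
In VCF terms, the case described above is a site with two ALT alleles and a genotype such as 1/2, where each number indexes into the list REF, ALT1, ALT2. A minimal sketch for recognizing and decoding such genotypes follows; the helper names are hypothetical, and `genotype_bases` assumes a fully called genotype (no "." alleles).

```python
def classify_gt(gt):
    """Classify a diploid GT string by its allele indices."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no_call"
    if len(set(alleles)) == 1:
        return "hom_ref" if alleles[0] == "0" else "hom_var"
    # Two different alleles: het with or without the reference allele.
    return "het" if "0" in alleles else "het_non_ref"

def genotype_bases(ref, alt, gt):
    """Translate allele indices in GT into bases using REF and the ALT list."""
    alleles = [ref] + alt.split(",")
    return [alleles[int(i)] for i in gt.replace("|", "/").split("/")]

# Reference A, reads indicating G and T: ALT is "G,T" and GT is 1/2.
print(classify_gt("1/2"), genotype_bases("A", "G,T", "1/2"))
```

Filtering a call set for `het_non_ref` genotypes lists every site where both alleles differ from the reference.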

↧