Channel: haplotypecaller — GATK-Forum

Can I determine the total read count for a SNP using the HaplotypeCaller tool?


Hello

I am interested in finding the total number of reads supporting each SNP found by the HaplotypeCaller tool. So far I have only found the DepthOfCoverage tool, which has the --countType COUNT_READS option. It is pretty close to what I am looking for, but I need to combine it with the SNPs that were found. Is it possible? Thanks
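Something like the following is what I have in mind (untested; GATK 3.x syntax, hypothetical file names, and assuming a VCF can be passed as the -L intervals):

java -jar GenomeAnalysisTK.jar \
  -T DepthOfCoverage \
  -R ref.fasta \
  -I sample.bam \
  -L haplotypecaller_calls.vcf \
  --countType COUNT_READS \
  -o snp_read_counts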


How does HaplotypeCaller discriminate between heterozygous and homozygous variants?


Dear members of the GATK team

I am using different GATK modules to detect SNPs in my RNA-Seq data set. I did a test run for one individual to get an idea of the output of HaplotypeCaller. I know that I still need to filter my variants, but nevertheless I was wondering how HaplotypeCaller sets a variant to heterozygous or homozygous. There must be another parameter to take into account (other than the AD values). Am I right?

Here is an example:

0|*|TRINITY_DN53108_c0_g1::TRINITY_DN53108_c0_g1_i1::g.132814::m.132814 7333 . G T 42.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.644;ClippingRankSum=0.000;DP=21;ExcessHet=3.0103;FS=3.109;MLEAC=1;MLEAF=0.500;MQ=42.00;MQRankSum=0.000;QD=2.04;ReadPosRankSum=0.629;SOR=0.132 GT:AD:DP:GQ:PL 0/1:18,3:21:71:71,0,703

The genotype is 0/1 (G/T) and the AD is 18 to 3. Actually, I would say that this is homozygous.
Before I mapped the reads to the reference, I filtered the reads with FastQC and did other processing steps like adapter trimming. I also marked and removed duplicate reads from the BAM file. So my reads are processed correctly (I would say) and I can trust the final reads.

Nevertheless, with a ratio of 18:3, I would still suggest a homozygous variant (just based on the AD values). I would change my mind if there is another value which is important for the decision, or if one can say: "If you trust your read files, then this ratio is still a reliable result for a heterozygous variant.".

But still, if I doubt the files:
Is there any possibility to filter the variants based on their AD values? An example would be to filter out all heterozygous variants which are below a ratio of 30% : 70%.
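Just to illustrate what I mean, a rough sketch of such a filter computed outside GATK (untested; assumes bcftools is available, a single-sample VCF, and that AD is REF,ALT as in the record above):

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT\t%AD]\n' sample.vcf \
  | awk -F'\t' '{
      split($6, ad, ",");
      tot = ad[1] + ad[2];
      if ($5 == "0/1" && tot > 0 && ad[2] / tot < 0.30)
        print $0 "\tALT_FRACTION_BELOW_30PCT";
    }'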

Thanks in advance for your reply and I am looking forward to your answers.
Julia

How to make results from GATK4.beta.3 "HaplotypeCaller" comparable to GATK 2.7's "UnifiedGenotyper"?


We have been using "UnifiedGenotyper" from GATK 2.7 for SNV calling, with the "EMIT_ALL_SITES" mode, which has always generated great results. We recently learnt that GATK4 is in development, with UnifiedGenotyper discontinued and HaplotypeCaller recommended. We therefore tested the performance of HaplotypeCaller on our data with "-ERC GVCF" (we didn't include the GenotypeGVCFs step). We found that the number of SNVs identified decreased dramatically, with an ~80%-90% reduction, in comparison with what was found by "UnifiedGenotyper".

Our samples are from single cells and shallowly sequenced. Paired reads are 150 bp each. Reads are expected to align to short regions of 30-200 bp across the human genome, so 99% of the genome won't be covered by reads. We're not interested in arbitrary SNVs and don't have a target region or any window; we only care about mapping SNVs across our samples. Based on quite a few experiments analyzed with UnifiedGenotyper, we found that even with low coverage, the short regions we align to always have highly reproducible base calls, and we can always identify SNVs within these regions. We usually process five single-cell samples each time, so most regions should have identical sequence and thus only 1 or 2 major alleles.

To our understanding, "HaplotypeCaller" calls variants based on de novo assembly of active regions; if there is a large amount of missing data in the regions surrounding our 30-200 bp alignments, will that result in failure of haplotype identification and lead to failure of SNV calling? Does the algorithm require a certain sample size to work well? Are there any "HaplotypeCaller" parameters or a discovery mode we could use to serve SNV calling with our current experiment design, or at least bring the SNV calling rate to a level comparable to what was identified by "UnifiedGenotyper"?
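For reference, this is the two-step workflow we understand we should be comparing against (a minimal, untested sketch using the GATK 4 beta launcher syntax; file names are placeholders), since the gVCF produced by "-ERC GVCF" is only an intermediate until GenotypeGVCFs has been run:

./gatk-launch HaplotypeCaller \
    --reference ref.fasta \
    --input single_cell_sample1.bam \
    -ERC GVCF \
    --output single_cell_sample1.g.vcf.gz

./gatk-launch GenotypeGVCFs \
    --reference ref.fasta \
    --variant single_cell_sample1.g.vcf.gz \
    --output single_cell_sample1.vcf.gz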

Greatly appreciate your advice!

GenotypeGVCFs on pooled data running out of memory despite providing 512GB to Java


Dear GATK staff,

I am doing SNP calling with GATK 3.8 on whole-genome sequences of 12 pools (50 diploid individuals in each pool, genome size ~900 Mbp) of a non-model organism with a scaffolded reference genome (114K fragments now stitched into 94 super-scaffolds).
First, I ran HaplotypeCaller in -ERC GVCF mode for each pool sample and super-scaffold separately (i.e. scatter by scaffold). An example of the command used for "sample1" on "super_scaffold8" is shown below (I chose ploidy 10 as this seems to be the maximum number working for other Pool-seq users, and I also raised the maximum number of PL values to 10000, as in the first runs I got warning messages saying that the default maximum number of PLs of 100 was too low):

java -Djava.io.tmpdir=/path/tmp/ -XX:ParallelGCThreads=1 -Dsamjdk.use_async_io=true -Dsamjdk.buffer_size=4194304 -Xmx8g -jar /path/GATK/3.8.0/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R /path/Stitched_Ref_genome/ref.fasta \
-I /path/BAMs-SG/sample1.PoolSeq.sorted.MarkDup.RG.bam \
-L Super_Scaffold8 \
-ERC GVCF \
-ploidy 10 \
-mbq 20 \
-minPruning 5 \
-maxNumPLValues 10000 \
--read_filter OverclippedRead \
-o /path/GATK_results/sample1.Super_Scaffold8.raw.g.vcf

Second, I obtained a single gVCF file per pool sample by merging the gVCF files of each of the 94 super_scaffolds of a given pool sample.
Third, I ran GenotypeGVCFs on the cohort of gVCFs (12) for each super-scaffold separately (i.e. scatter by scaffold), setting the maximum number of alternative alleles to 3, using the new QUAL model, and setting the maximum number of PL values to 700000 (as this was the maximum number of PLs observed in some test runs where I obtained warning messages that -maxNumPLValues 10000 was too low). An example of the command used for the 12 gVCFs on "super_scaffold8" is shown below (-Xmx18g):

java -Djava.io.tmpdir=/path/tmp/ -XX:ParallelGCThreads=1 -Dsamjdk.use_async_io=true -Dsamjdk.buffer_size=4194304 -Xmx18g -jar /path/GATK/3.8.0/GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R /path/Stitched_Ref_genome/ref.fasta \
-V /path/GVCFs/sample1.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample2.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample3.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample4.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample5.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample6.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample7.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample8.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample9.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample10.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample11.94.Super_Scaffolds.raw.g.vcf \
-V /path/GVCFs/sample12.94.Super_Scaffolds.raw.g.vcf \
-L Super_Scaffold8 \
-maxAltAlleles 3 \
-newQual \
-maxNumPLValues 700000 \
-o /path/GATK_results/12pops/12.pops.Super_Scaffold0.raw.SNPs-indels.vcf

I got this ERROR message:

INFO  17:15:24,838 HelpFormatter - ----------------------------------------------------------------------------------
INFO  17:15:24,842 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO  17:15:24,843 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  17:15:24,843 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  17:15:24,843 HelpFormatter - [Thu Aug 24 17:15:24 EDT 2017] Executing on Linux 2.6.32-642.6.2.el6.x86_64 amd64
INFO  17:15:24,843 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_74-b02
INFO  17:15:24,849 HelpFormatter - Program Args: -T GenotypeGVCFs -R /path/ref.fasta -V /path/GVCFs/sample1.94.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample2.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample3.94.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample4.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample5.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample6.94.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample7.94.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample8.94.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample9.94.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample10.94.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample11.Super_Scaffolds.raw.g.vcf -V /path/GVCFs/sample12.Super_Scaffolds.raw.g.vcf -L Super_Scaffold8 -maxAltAlleles 3 -newQual -maxNumPLValues 10000 -o /path/GATK_results/12pops/12.pops.Super_Scaffold8.raw.SNPs-indels.vcf
INFO  17:15:24,859 HelpFormatter - Executing as xxxxxxx on Linux 2.6.32-642.6.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_74-b02.
INFO  17:15:24,860 HelpFormatter - Date/Time: 2017/08/24 17:15:24
INFO  17:15:24,860 HelpFormatter - ----------------------------------------------------------------------------------
INFO  17:15:24,860 HelpFormatter - ----------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:/cvmfs/path/GATK/3.8.0/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  17:15:25,100 GenomeAnalysisEngine - Deflater: JdkDeflater
INFO  17:15:25,100 GenomeAnalysisEngine - Inflater: JdkInflater
INFO  17:15:25,101 GenomeAnalysisEngine - Strictness is SILENT
INFO  17:15:25,272 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  17:16:18,502 IntervalUtils - Processing 12508256 bp from intervals
INFO  17:16:18,615 GenomeAnalysisEngine - Preparing for traversal
INFO  17:16:18,616 GenomeAnalysisEngine - Done preparing for traversal
INFO  17:16:18,617 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  17:16:18,617 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  17:16:18,618 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
WARN  17:16:20,705 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
WARN  17:16:20,707 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias annotation values. If no sample has the SB genotype annotation, annotation may still fail.
INFO  17:16:20,707 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files
WARN  17:16:21,841 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not GenotypeGVCFs
INFO  17:16:48,638 ProgressMeter - Super_Scaffold8:197901         0.0    30.0 s      49.6 w        1.6%    31.6 m      31.1 m
INFO  17:17:18,640 ProgressMeter - Super_Scaffold8:198001         0.0    60.0 s      99.2 w        1.6%    63.2 m      62.2 m
INFO  17:17:48,642 ProgressMeter - Super_Scaffold8:198001         0.0    90.0 s     148.9 w        1.6%    94.8 m      93.3 m
INFO  17:18:18,643 ProgressMeter - Super_Scaffold8:198001         0.0   120.0 s     198.5 w        1.6%     2.1 h       2.1 h
INFO  17:18:48,644 ProgressMeter - Super_Scaffold8:198001         0.0     2.5 m     248.1 w        1.6%     2.6 h       2.6 h
INFO  17:19:18,646 ProgressMeter - Super_Scaffold8:198001         0.0     3.0 m     297.7 w        1.6%     3.2 h       3.1 h
INFO  17:19:48,647 ProgressMeter - Super_Scaffold8:198001         0.0     3.5 m     347.3 w        1.6%     3.7 h       3.6 h
INFO  17:20:18,649 ProgressMeter - Super_Scaffold8:198001         0.0     4.0 m     396.9 w        1.6%     4.2 h       4.1 h
INFO  17:21:02,832 ProgressMeter - Super_Scaffold8:198001         0.0     4.7 m     469.9 w        1.6%     5.0 h       4.9 h
INFO  17:21:34,999 ProgressMeter - Super_Scaffold8:198001         0.0     5.3 m     523.1 w        1.6%     5.5 h       5.5 h
INFO  17:22:07,210 ProgressMeter - Super_Scaffold8:198001         0.0     5.8 m     576.4 w        1.6%     6.1 h       6.0 h
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.8-0-ge9d806836):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
##### ERROR ------------------------------------------------------------------------------------------

Following the instructions in the ERROR message, I set the -Xmx argument to 32g and ran the program again, then tried 256g and finally 512g, always obtaining the same error message. I understand this problem is not GATK's but Java's; however, I don't have access to memory resources larger than 512GB RAM.

Thus, I was wondering if you could please advise me on how to reduce GATK's memory demand when running GenotypeGVCFs, hopefully without compromising the sensitivity of SNP calling on pooled data (e.g. I would prefer to keep ploidy equal to 10).

Which parameters, and in which step (HaplotypeCaller or GenotypeGVCFs), would you recommend changing? With pooled data I am mostly interested in obtaining read counts per allele, not genotypes (as these are not individuals). I was considering reducing -maxNumPLValues to 10000, but then I would have PLs not being calculated when running GenotypeGVCFs... and I am not sure how this may affect SNP calling.
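For context, here is the rough arithmetic as I understand it (assuming the standard multiset genotype count; please correct me if GATK computes this differently): the number of possible genotypes for ploidy P and A alleles (ref plus alts) is C(P + A - 1, P). With ploidy 10 and 1 ref + 3 alt alleles that gives C(13, 10) = 286 PL values per pool sample, which is manageable; but at sites where the merged gVCFs carry, say, 1 ref + 12 alt alleles it is already C(22, 10) = 646,646, which is on the order of the 700000 maximum I observed and presumably explains the memory blow-up.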

Thanks very much for any help!

Recommendations on using different versions of GATK for variant calling and joint genotyping


Dear GATK Team,

From what I understand, best practices recommend using the same version of GATK for variant calling (with HaplotypeCaller) and joint genotyping (with GenotypeGVCFs). Let's say we found that GATK v3.4 has a better way of calling/reporting multi-allelic variant calls. Would it be OK to have used HaplotypeCaller from v3.3 to call variants and then use GenotypeGVCFs from v3.4 to perform joint genotyping?

Thank You.
Joseph.

I do not get the annotations I specified with -A


The problem

You specified -A <some annotation> in a command line invoking one of the annotation-capable tools (HaplotypeCaller, MuTect2, UnifiedGenotyper and VariantAnnotator), but that annotation did not show up in your output VCF.

Keep in mind that all annotations that are necessary to run our Best Practices are annotated by default, so you should generally not need to request annotations unless you're doing something a bit special.

Why this happens & solutions

There can be several reasons why this happens, depending on the tool, the annotation, and your data. These are the four we see most often; if you encounter another that is not listed here, let us know in the comments.

  1. You requested an annotation that cannot be calculated by the tool

    For example, you're running MuTect2 but requested an annotation that is specific to HaplotypeCaller. There should be an error message to that effect in the output log. It's not possible to override this; but if you believe the annotation should be available to the tool, let us know in the forum and we'll consider putting in a feature request.

  2. You requested an annotation that can only be calculated if an optional input is provided

    For example, you're running HaplotypeCaller and you want InbreedingCoefficient, but you didn't specify a pedigree file. There should be an error message to that effect in the output log. The solution is simply to provide the missing input file. Another example: you're running VariantAnnotator and you want to annotate Coverage, but you didn't specify a BAM file. The tool needs to see the read data in order to calculate the annotation, so again, you simply need to provide the BAM file (see the sketch after this list).

  3. You requested an annotation that has requirements which are not met by some or all sites

    For example, you're looking at RankSumTest annotations, which require heterozygous sites in order to perform the necessary calculations, but you're running on haploid data so you don't have any het sites. There is no workaround; the annotation is not applicable to your data. Another example: you requested InbreedingCoefficient, but your population includes fewer than 10 founder samples, which are required for the annotation calculation. There is no workaround; the annotation is not applicable to your data.

  4. You requested an annotation that is already applied by default by the tool you are running

    For example, you requested Coverage from HaplotypeCaller, which already annotates this by default. There is currently a bug that causes some default annotations to be dropped from the list if specified on the command line. This will be addressed in an upcoming version. For now the workaround is to check what annotations are applied by default and NOT request them with -A.
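For the VariantAnnotator/Coverage case in example 2, a minimal sketch of what "provide the BAM file" looks like (GATK3 syntax; file names are placeholders):

java -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  -R ref.fasta \
  -V input.vcf \
  -I sample.bam \
  -A Coverage \
  -o annotated.vcf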

Haplotype Caller Makes SNPs look like INDELS


I'm using the HaplotypeCaller to look at SNPs related to antimicrobial resistance and am getting a result that looks like this:

NC_011035.1 2049708 .   CCGGCG  C   ...
NC_011035.1 2049714 .   C   CAAGAA  ...

I believe this is an alignment that would look like:
CCGGCGC
CCAAGAA

but instead of giving me 5 individual SNPs, GATK is calling the region as though it were a 5 bp deletion at position 2049708 and a 5 bp insertion at position 2049714.

Is there any way to change the parameters so that the appropriate call is made?

My current command is:

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -nct 12 -R NCC_011035.fasta -I ST547_dedup_reads_group.bam --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -o ST547_raw.vcf

What does the minReadsPerAlignmentStart argument in HaplotypeCaller mean?


The minReadsPerAlignmentStart argument in HaplotypeCaller is described as the minimum number of reads sharing the same alignment start for each location in an active region, and it has a default of 10. For each location there will be lots of different alignment start sites (if this means the most 5' position of a read). So does this mean that in each case there must be at least 10 reads sharing that 5' position? This seems like a lot to me if the depth is about 30, and given that duplicates will be excluded. Can you please explain?


Haplotype Caller: too many alternative alleles found?


Hello gatk team,
I am running HaplotypeCaller on 5 files of genomic alignments together at once.
Despite the fact that I did an indel realignment (GATK IndelRealigner) before running these files through HaplotypeCaller, I get many warnings during the process saying "too many alternative alleles found", sometimes with 10, 12 or 13 alternative alleles found.
Is that normal, or is there a step that I could have done improperly?
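In case it is relevant, the only related parameter I am aware of is the limit on alternate alleles considered during genotyping (I believe GATK3 HaplotypeCaller's --max_alternate_alleles, default 6); a rough, untested sketch of raising it, with placeholder file names:

java -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -R ref.fasta \
  -I sample1.bam -I sample2.bam -I sample3.bam -I sample4.bam -I sample5.bam \
  --max_alternate_alleles 10 \
  -o five_samples.vcf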
Thank you for your help :)
Marvin

ROD files out of FASTA? + other questions


Hey all, newbie here.
tl;dr:
I have a FASTA file containing two sequences of my region of interest (~5.5 kbp) that differ at ~100 SNPs. What is the fastest way to generate a ROD file from these sequences, as an input to BQSR?

So, hey.
I'm trying to determine the frequency of a genetic fragment I introduced into a bacterial strain, in several different samples. As I wrote, my current challenge is to create the aforementioned ROD file; however, my project is a bit different from 'usual' variant calling projects, and any advice regarding processing and analysis would be appreciated.

  1. I have a WT bacterial strain. I introduced a 5.5 kbp genetic fragment into it by electroporation and homologous recombination. It is safe to assume that different parts of the fragment have invaded the host's genome with different efficiencies (so I may have 'hybrid' variants that are half WT and half mutated). The introduced fragment had ~100 SNPs compared to the WT fragment.
  2. I took that sample and grew it under different conditions, in order to determine whether the fragment I introduced is beneficial to the bacteria.
  3. The fragments were PCR-amplified, sheared into smaller DNA fragments (~300-500 bp), and sequenced (150 bp per read, paired-end). I have a coverage of 10^6 reads per base for each sample.
  4. I'd like to determine the frequency of each SNP in each sample and, ideally, the identity and frequency of each variant.

I have:
The sequenced samples (1 sample of the initial pool, 6 samples of biological replicates for one condition, and 3 samples of biological replicates for the second condition), the sequence of the WT's genome, and the sequence of the fragment I introduced.

My questions:
1. How do I turn the FASTA file containing my WT and modified fragments into a ROD file (the type doesn't matter) for the BQSR procedure? I do not need to rely on the sequenced samples to determine the differences between the sequences; I already know them. (See the sketch after this list.)
2. Since all my reads originate from a PCR-amplified fragment, can de-duplication introduce biases / underestimation into my data?
3. I have huge coverage. Does it require any different processing methods?
4. Any other advice?
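For question 1, the approach I am considering (a rough, untested sketch; assumes bwa, samtools and bcftools are available, and uses placeholder file names) is to align the introduced fragment against the WT sequence and call the ~100 known differences into a VCF, which BQSR should then accept as known sites:

bwa index wt.fasta
bwa mem wt.fasta introduced_fragment.fasta | samtools sort -o introduced.bam
samtools index introduced.bam
bcftools mpileup -f wt.fasta introduced.bam | bcftools call -mv -Oz -o known_sites.vcf.gz
bcftools index -t known_sites.vcf.gz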

Thanks,
Omer

Segfault when running GATK 3.6 in a container


I'm using GATK on the DNAnexus platform, which can convert Docker images to the ACI format in order to run them. I have a Docker image that uses GATK 3.6 to call variants; it runs fine under ordinary Docker, but segfaults when run on DNAnexus using this converted container format.

The log for this error is attached. The key information is this:

#  SIGSEGV (0xb) at pc=0x00007f3980e9dce9, pid=17744, tid=0x00007f39a146a700
#
# JRE version: OpenJDK Runtime Environment (8.0_141-b15) (build 1.8.0_141-8u141-b15-1~deb9u1-b15)
# Java VM: OpenJDK 64-Bit Server VM (25.141-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libVectorLoglessPairHMM6177525404161670245.so+0x1bce9]  LoadTimeInitializer::LoadTimeInitializer()+0x1669

So it seems that the compiled C code in GATK is having trouble under this specific environment. How should I attempt to resolve this? Should I upgrade or downgrade Java? GATK? Ubuntu? System libraries? Unfortunately, upgrading GATK would be a bit difficult because our workflow is accredited using GATK 3.6, but this might be possible if it is the only solution.
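In case it helps narrow things down, one experiment I am considering (assuming GATK 3.6 still accepts the -pairHMM argument; file names are placeholders): since the problematic frame is inside the native VectorLoglessPairHMM library, forcing the pure-Java implementation should show whether that library is the trigger:

java -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -R ref.fasta \
  -I sample.bam \
  -pairHMM LOGLESS_CACHING \
  -o sample.vcf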

CRAM support in GATK 3.7 is broken


I have not been able to get GATK 3.7 HaplotypeCaller to work with CRAM files at all (it has a 100% failure rate so far with our whole-genome CRAMs). Based on my analysis of the problem, I don't think GATK 3.7 will work with any CRAM files whose reference contains IUPAC ambiguity codes other than 'N' (which includes GRCh37/hs37d5 and GRCh38/HS38DH).

The error I get is:

ERROR   2017-01-05 02:18:59     Slice   Reference MD5 mismatch for slice 2:60825966-60861215, ATCTTTCATG...CTCTCCCATT
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: SAM/BAM/CRAM file /keep/46909b690725869e1d9bfbc1da4a1398+19932/20657_7.cram is malformed. Please see https://software.broadinstitute.org/gatk/documentation/article?id=1317for more
##### ERROR ------------------------------------------------------------------------------------------

This error occurs for 100% of my CRAM files, which can be read by samtools, scramble, or previous versions of GATK (including 3.6) without any issues, so the error message is incorrect and the CRAM files are not malformed.

The CRAM slice in question is on chromosome 3 of hs37d5 (3:60825966-60861215). We can verify externally that the FASTA reference we are passing into GATK with -R does have the md5 that GATK reports it is expecting:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | md5sum
0e0ff678755616cba9ac362f15b851cc  -

And the sequence starts and ends with the bases that htsjdk reports:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c1-10
ATCTTTCATG
$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | cut -c35241-
CTCTCCCATT

I ended up having to recompile GATK and htsjdk from source and add some print debugging to htsjdk to dump the whole sequence from which the md5 was being calculated. It seems the sequences that cause problems are regions of the reference with IUPAC ambiguity codes other than 'N' (in this case a slice of chromosome 3 that contains an 'M' and two 'R's). In GATK 3.7 (built with htsjdk 2.8.1), the reference used to calculate the md5 for the slice has had all ambiguity codes converted to 'N'. The md5 it calculates for this slice (according to my print debugging) is: 5d820b3624e78202f503796f7330d8d9

I have verified that this is the md5 we would get from converting the IUPAC codes in this slice to N's:

$ samtools faidx /keep/d527a0b11143ebf18be6c52ff6c09552+2163/hs37d5.fa 3:60825966-60861215 | grep -v '^>' | tr -d '\012' | tr RYMKWSBDHV NNNNNNNNNN | md5sum
5d820b3624e78202f503796f7330d8d9  -

I have tried in vain to figure out where in GATK and/or htsjdk the ambiguous reference bases are being converted to 'N's. I initially thought that it was in the CachingIndexedFastaSequenceFile call to BaseUtils.convertIUPACtoN (when preserveIUPAC is false, although I didn't find any code path that could set it to true). However, after recompiling with preserveIUPAC manually set to true, the problem persisted. I guess there must be some other place where the bases are remapped. I'll leave it to you guys to figure out how to get an unmodified view on the reference for htsjdk to use for CRAM decoding.

There is, however, no mystery as to why this problem has suddenly appeared in GATK 3.7. The slice md5 validation code in htsjdk was only added in July 2016 (https://github.com/samtools/htsjdk/commit/a781afa9597dcdbcde0020bfe464abee269b3b2e). The first release version it appears in is version 2.7.0. Prior to that, it seems CRAM slice md5's were not validated in htsjdk, so this error would not have occurred.

Why is HaplotypeCaller spending 8 hrs on "Strictness is SILENT" step?


Hi,

I am running HaplotypeCaller on whole-genome re-sequenced (~10X coverage) African buffalo genomes, using a high-coverage African buffalo genome as the reference (~90X). The genome is about 2.8 Gb. The reference genome currently consists of 442,402 scaffolds and contigs.

HaplotypeCaller works fine and produces the expected output etc., but it is spending a lot of time on the "Strictness is SILENT" step in particular, but also on the "MicroScheduler" and "Preparing for traversal over 1 BAM files" steps (9 and 7 hrs, respectively) and is taking >36 hours per genome.

I know this is not a quick analysis and some steps will take a long time, but all the log files I've seen on the forum and from a colleague (working on smaller fungal genomes) show a "Strictness is SILENT" step of <1 min.

Why is HaplotypeCaller spending so much time on this step? Could it be because of the many scaffolds and contigs in the reference genome? Is there something I can do to speed up this step (and/or the other two steps with long processing times)?

I am running GATK v3.6-0-g89b7209 and Java 1.8.0_73-b02. I've allocated Java 10GB of memory (-Xmx10g), but have 125GB available. Would increasing the allocated RAM (to say about 40GB) help speed up some of these steps?
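One workaround I am considering (a rough sketch only, with placeholder file names): splitting the run across the scaffolds with -L interval files, so each job only prepares traversal over a subset of the 442,402 sequences, and then combining the per-chunk gVCFs afterwards:

# chunk_01.intervals contains one scaffold/contig name per line
java -Xmx10g -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -R buffalo.final.fa \
  -I M_47_14_aln-PE_sorted_dups_marked.bam \
  --emitRefConfidence GVCF \
  -L chunk_01.intervals \
  -o M_47_14.chunk_01.g.vcf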

The log file:

INFO  21:29:46,295 HelpFormatter - ----------------------------------------------------------------------------------
INFO  21:29:46,298 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO  21:29:46,298 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  21:29:46,298 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO  21:29:46,298 HelpFormatter - [Mon Sep 25 21:29:46 SAST 2017] Executing on Linux 3.10.0-514.6.1.el7.x86_64 amd64
INFO  21:29:46,298 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02 JdkDeflater
INFO  21:29:46,301 HelpFormatter - Program Args: -T HaplotypeCaller -R /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/refs/buffalo.final.fa -I /mnt/lustre/users/djager/buf_clean/alignment_files/bam_sorted/gatk/GATK_Deon/M_47_14_aln-PE_sorted_dups_marked.bam --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -hets 0.03 -mbq 20 -stand_emit_conf 20 -stand_call_conf 30 -out_mode EMIT_ALL_CONFIDENT_SITES -nct 24 -ploidy 2 -o M_47_14_aln-PE_sorted_dups_marked_output.raw.snps.indels.g.vcf
INFO  21:29:46,307 HelpFormatter - Executing as djager@cnode0033 on Linux 3.10.0-514.6.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_73-b02.
INFO  21:29:46,307 HelpFormatter - Date/Time: 2017/09/25 21:29:46
INFO  21:29:46,308 HelpFormatter - ----------------------------------------------------------------------------------
INFO  21:29:46,308 HelpFormatter - ----------------------------------------------------------------------------------
WARN  21:29:46,314 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values  for --variant_index_type and --variant_index_parameter
INFO  21:29:46,332 GenomeAnalysisEngine - Strictness is SILENT
INFO  05:46:59,587 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO  05:46:59,737 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  05:47:07,772 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 8.03
INFO  05:47:11,853 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
INFO  05:47:13,107 MicroScheduler - Running the GATK in parallel mode with 24 total threads, 24 CPU thread(s) for each of 1 data thread(s), of 24 processors available on this machine
INFO  14:05:27,691 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  21:57:03,635 GenomeAnalysisEngine - Done preparing for traversal
INFO  21:57:03,635 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]

Thanks and kind regards
Deon

Illegal argument exception when running HaplotypeCaller


Hello,

One of my jobs is consistently failing to run HaplotypeCaller with the following error message:

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:275)
at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.getBuffer(GATKBAMIndex.java:428)
at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.readLongs(GATKBAMIndex.java:360)
at org.broadinstitute.gatk.engine.datasources.reads.GATKBAMIndex.readReferenceSequence(GATKBAMIndex.java:138)
at org.broadinstitute.gatk.engine.datasources.reads.BAMSchedule.<init>(BAMSchedule.java:105)
at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.getNextOverlappingBAMScheduleEntry(BAMScheduler.java:296)
at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.advance(BAMScheduler.java:185)
at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.next(BAMScheduler.java:156)
at org.broadinstitute.gatk.engine.datasources.reads.BAMScheduler.next(BAMScheduler.java:46)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:68)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:54)
at org.broadinstitute.gatk.engine.datasources.reads.IntervalSharder.next(IntervalSharder.java:79)
at org.broadinstitute.gatk.engine.datasources.reads.IntervalSharder.next(IntervalSharder.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:68)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:54)
at org.broadinstitute.gatk.engine.datasources.reads.ActiveRegionShardBalancer.getCombinedFilePointersOnSingleContig(ActiveRegionShardBalancer.java:80)
at org.broadinstitute.gatk.engine.datasources.reads.ActiveRegionShardBalancer.access$000(ActiveRegionShardBalancer.java:40)
at org.broadinstitute.gatk.engine.datasources.reads.ActiveRegionShardBalancer$1.next(ActiveRegionShardBalancer.java:52)
at org.broadinstitute.gatk.engine.datasources.reads.ActiveRegionShardBalancer$1.next(ActiveRegionShardBalancer.java:46)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:90)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Code exception (see stack trace for error itself)
ERROR ------------------------------------------------------------------------------------------

I'm just running the program with the default arguments, a reference genome that has worked fine with other files, and an input BAM that appears to be in good shape (i.e. it can be viewed using samtools with no problem). Any insight into what may be causing this would be greatly appreciated! Thanks!
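One thing I plan to rule out, given that the stack trace is inside GATKBAMIndex (this is a guess on my part, with a placeholder file name): a truncated or stale .bai file. Regenerating the index is cheap:

rm -f input.bam.bai
samtools index input.bam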

Haplotype caller not picking up variants for HiSeq Runs


Hello,
We were sequencing all our data on a HiSeq and have now moved to a NextSeq. We have sequenced the same batch of samples on both sequencers. Both are processed using the same pipeline/parameters.
What I have noticed is that GATK 3.7 HC is not picking up variants, even though the coverage is good and the variants are evidently present in the BAM file.

For example, the screenshot below shows the BAM files for both the NextSeq and HiSeq samples. There are at least 3 variants in the region 22:29885560-29885861 (NEPH, exon 5) that are expected to be picked up for HiSeq.

These variants are picked up for the NextSeq samples (even though the coverage for HiSeq is much better).

The command that I have used for both samples is

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller -R GRCh37.fa --dbsnp GATK_ref/dbsnp_138.b37.vcf -I ${i}.HiSeq_Run31.variant_ready.bam -L NEPH.bed -o ${i}.HiSeq_Run31.NEPH.g.vcf

Any idea why this can happen?
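In case it helps narrow things down, the debugging step I am planning next (a rough sketch; assumes GATK 3.7 accepts -forceActive and -disableOptimizations, and uses a placeholder sample name) is to force reassembly over that region and inspect the reassembled reads with -bamout:

java -Xmx32g -jar GATK_v3_7/GenomeAnalysisTK.jar -T HaplotypeCaller \
  -R GRCh37.fa \
  -I sample.HiSeq_Run31.variant_ready.bam \
  -L 22:29885560-29885861 \
  -forceActive -disableOptimizations \
  -bamout sample.HiSeq_Run31.NEPH.bamout.bam \
  -o sample.HiSeq_Run31.NEPH.debug.vcf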

Many thanks,


Problem with allele specific annotation AS_QualByDepth (AS_QD) during variant calling


Hi GATK team,

First a big thank you for all your hard work in developing the tool and supporting the users!

I am trying out the allele-specific (AS) annotations in version 3.6. While I have gotten a few other AS annotations to show up properly in my VCF, I am having trouble getting AS_QualByDepth in particular.

For example, I tried to call variants on a few samples at a specific locus with a "T" homopolymer run. I first ran HaplotypeCaller in GVCF mode for each sample:

java -jar GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 \
  -R ref_fasta \
  -I sample_$i \
  -L chr1:10348759-10348801 \
  -A AS_StrandOddsRatio -A AS_FisherStrand -A AS_QualByDepth \
  -A AS_BaseQualityRankSumTest -A AS_ReadPosRankSumTest -A AS_MappingQualityRankSumTest \
  -o sample_$i.gvcf

I then did GenotypeGVCFs on all the samples together:

java -jar GenomeAnalysisTK.jar \
  -T GenotypeGVCFs \
  -R ref_fasta \
  -V gvcf_list \
  -L chr1:10348759-10348801 \
  -A AS_StrandOddsRatio -A AS_FisherStrand -A AS_QualByDepth \
  -A AS_BaseQualityRankSumTest -A AS_ReadPosRankSumTest -A AS_MappingQualityRankSumTest \
  -o out.vcf

In the final joint-called VCF header, the following AS annotations all showed up.

##INFO=<ID=AS_BaseQRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt Vs. Ref base qualities">
##INFO=<ID=AS_FS,Number=A,Type=Float,Description="allele specific phred-scaled p-value using Fisher's exact test to detect strand bias of each alt allele">
##INFO=<ID=AS_MQRankSum,Number=A,Type=Float,Description="Allele-specific Mapping Quality Rank Sum">
##INFO=<ID=AS_QD,Number=1,Type=Float,Description="Allele-specific Variant Confidence/Quality by Depth">
##INFO=<ID=AS_RAW_BaseQRankSum,Number=1,Type=String,Description="raw data for allele specific rank sum test of base qualities">
##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specific raw data for Mapping Quality Rank Sum">
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
##INFO=<ID=AS_ReadPosRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests">
##INFO=<ID=AS_SOR,Number=A,Type=Float,Description="Allele specific strand Odds Ratio of 2x|Alts| contingency table to detect allele specific strand bias">

However, in the INFO column, I only got the other AS annotations but not AS_QD.

chr1    10348779        .       AT      A,ATT   981.29  .       AC=4,2;AF=0.333,0.167;AN=12;AS_BaseQRankSum=-1.087,-2.521;AS_FS=3.986,7.378;AS_MQRankSum=-1.130,-2.349;AS_ReadPosRankSum=-1.192,-1.396;AS_SOR=0.415,0.254;BaseQRankSum=-6.350e-01;ClippingRankSum=0.00;DP=627;ExcessHet=14.6052;FS=6.378;MLEAC=4,2;MLEAF=0.333,0.167;MQ=59.95;MQRankSum=0.00;QD=1.94;ReadPosRankSum=-1.050e-01;SOR=0.352        GT:AD:DP:GQ:PL  0/1:44,9,7:63:81:81,0,1033,93,844,1165  0/1:71,11,8:99:47:47,0,1659,110,1414,1803       0/1:54,15,7:81:99:205,0,1239,280,1087,1635      0/1:69,25,12:106:99:311,0,1603,336,1306,2058    0/2:55,11,22:94:99:291,233,1636,0,943,1294      0/2:61,11,14:91:14:92,14,1473,0,1071,1468

I also checked the individual sample gVCFs. Similarly, AS_QD is in the header but not in the INFO column. I am wondering if this might be a bug or whether I am doing something wrong.

Another curious thing I noticed is that in the VCF header the other AS annotations all have "Number=A", but AS_QD has "Number=1". I don't know if this might be causing the problem.

GATK3 HC bug?


Hey GATK Devs!

I'm writing to report some unexpected behavior on the part of GATK 3.8 HC. I'm trying to use Illumina data to call SNPs and indels on a PacBio assembly and identify loci where assembly polishing has failed to correct the assembly. I was looking through the reads of a particular contig and identified a locus (tig00006168:59182) that GATK failed to call. According to the reads, the locus should have been called homozygous for a deletion at 30X depth. Looking at the gVCF, I see it is called homozygous for the reference allele (while reporting 31 reads supporting the alternate allele) and it reports 0 for GQ and all PLs:

tig00006168     59180   .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:31,0:31:15:0,15,225
tig00006168     59181   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:30,0:30:15:0,15,225
tig00006168     59182   .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:1,31:32:0:0,0,0
tig00006168     59183   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:1,31:32:0:0,0,0
tig00006168     59184   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:1,31:32:0:0,0,0
tig00006168     59185   .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:27,2:29:54:0,54,1005
tig00006168     59186   .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:30,1:31:72:0,72,1080

The -bamout output for this run reports no variant-containing reads at this locus. However, if I include -L tig00006168:59172-59195 in the options, GATK calls the indel:

tig00006168     59182   .       CA      C       927.73  .       AC=2;AF=1.00;AN=2;DP=27;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=34.36;SOR=2.753        GT:AD:DP:GQ:PL  1/1:0,27:27:81:965,81,0

The tview on the -bamout for this latter run displays:

59141     59151     59161     59171     59181     59191     59201     59211     592
GGAAATGAAGGAGAAGAAAGTGTTTATCAGCCTCGTGGGCACAAACAGGAATGGGCTGCAGGTTGGTACCCCCAATCTCTNNN
      ..........................................................................
      .................................        .................................
      ....................................*.....................................
      ....................................*.....................................
      ...........................              ,,,,,,,,,,,,,,,,,,,,c,,,,,,,,,,,,
      ..................                       ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ....................................*.....................................
      ....................................*..............................
      ....................................*..............................
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,       ,,g,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,                    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,g,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,   ,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,c,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,t,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,    .....................
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,g,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,c,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
      ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
         ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
            ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                 ,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                 ,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

The mapping quality for all of these reads is over 30 and the base qualities are in the 30+ range. In your experience, what might cause this odd behavior? I've tried GATK versions 3.2-2, 3.6-0, and 3.7, and they exhibit the same behavior. My initial run log:

INFO  07:46:12,698 HelpFormatter - ---------------------------------------------------------------------------------------
INFO  07:46:12,701 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836, Compiled 2017/07/28 21:26:50
INFO  07:46:12,701 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO  07:46:12,701 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
INFO  07:46:12,701 HelpFormatter - [Fri Oct 13 07:46:12 PDT 2017] Executing on Linux 2.6.32-696.3.2.el6.nersc.x86_64 amd64
INFO  07:46:12,701 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13
INFO  07:46:12,704 HelpFormatter - Program Args: -T HaplotypeCaller --standard_min_confidence_threshold_for_calling 0 -rf BadMate -R ./contigs.fasta -L tig00006168 -I 10X.bam -mmq 25 -mbq 30 -o tig00006168.trg.vcf.gz -bamout tig00006168.trg.bam
INFO  07:46:12,714 HelpFormatter - Executing as bredeson@hostname on Linux 2.6.32-696.3.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13.
INFO  07:46:12,714 HelpFormatter - Date/Time: 2017/10/13 07:46:12
INFO  07:46:12,714 HelpFormatter - ---------------------------------------------------------------------------------------
INFO  07:46:12,714 HelpFormatter - ---------------------------------------------------------------------------------------
ERROR StatusLogger Unable to create class org.apache.logging.log4j.core.impl.Log4jContextFactory specified in jar:file:~bredeson/tools/bin/GATK/3.8-0-ge9d80683/GenomeAnalysisTK.jar!/META-INF/log4j-provider.properties
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console...
INFO  07:46:12,851 GenomeAnalysisEngine - Deflater: JdkDeflater
INFO  07:46:12,851 GenomeAnalysisEngine - Inflater: JdkInflater
INFO  07:46:12,852 GenomeAnalysisEngine - Strictness is SILENT
INFO  07:46:15,647 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO  07:46:15,654 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  07:46:15,879 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.22
INFO  07:46:16,007 HCMappingQualityFilter - Filtering out reads with MAPQ < 25
INFO  07:46:18,047 IntervalUtils - Processing 106407 bp from intervals
INFO  07:46:18,157 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO  07:46:18,277 GenomeAnalysisEngine - Done preparing for traversal
INFO  07:46:18,277 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO  07:46:18,278 ProgressMeter -                 |      processed |    time |         per 1M |           |   total | remaining
INFO  07:46:18,278 ProgressMeter -        Location | active regions | elapsed | active regions | completed | runtime |   runtime
INFO  07:46:18,278 HaplotypeCaller - Disabling physical phasing, which is supported only for reference-model confidence output
INFO  07:46:18,325 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
WARN  07:46:18,325 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
INFO  07:46:18,326 StrandBiasTest - SAM/BAM data was found. Attempting to use read data to calculate strand bias annotations values.
INFO  07:46:18,674 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units
INFO  07:46:19,188 VectorLoglessPairHMM - Using OpenMP multi-threaded AVX-accelerated native PairHMM implementation
[INFO] Available threads: 40
[INFO] Requested threads: 1
[INFO] Using 1 threads
WARN  07:46:19,268 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not HaplotypeCaller
INFO  07:46:32,888 VectorLoglessPairHMM - Time spent in setup for JNI call : 0.012103603000000001
INFO  07:46:32,889 PairHMM - Total compute time in PairHMM computeLikelihoods() : 3.824669993
INFO  07:46:32,889 HaplotypeCaller - Ran local assembly on 82 active regions
INFO  07:46:33,175 ProgressMeter -            done         106407.0    14.0 s            2.3 m      100.0%    14.0 s       0.0 s
INFO  07:46:33,175 ProgressMeter - Total runtime 14.90 secs, 0.25 min, 0.00 hours
INFO  07:46:33,175 MicroScheduler - 106744 reads were filtered out during the traversal out of approximately 133359 total reads (80.04%)
INFO  07:46:33,176 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter
INFO  07:46:33,176 MicroScheduler -   -> 6304 reads (4.73% of total) failing BadMateFilter
INFO  07:46:33,176 MicroScheduler -   -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO  07:46:33,176 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO  07:46:33,176 MicroScheduler -   -> 99743 reads (74.79% of total) failing HCMappingQualityFilter
INFO  07:46:33,177 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO  07:46:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter
INFO  07:46:33,188 MicroScheduler -   -> 697 reads (0.52% of total) failing NotPrimaryAlignmentFilter
INFO  07:46:33,188 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter
------------------------------------------------------------------------------------------
Done. There were 2 WARN messages, the first 2 are repeated below.
WARN  07:46:18,325 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples.
WARN  07:46:19,268 HaplotypeScore - Annotation will not be calculated, must be called from UnifiedGenotyper, not HaplotypeCaller
------------------------------------------------------------------------------------------

How to use HaplotypeCallerSpark from GATK 4 (beta 6) with Adam input files


Hi,
It seems that GatkReads can read SAM/BAM/CRAM files and ADAM files.
But when I try to use HaplotypeCallerSpark with ADAM parquet files, it fails because of some dictionary validation?
Here's the command line I use (the same command line that I use with SAM files; I just changed the input path):

./gatk-4.beta.6/gatk-launch HaplotypeCallerSpark \
    --sparkMaster spark://<my spark master> \
    --input "input.adam" \
    --output output.vcf \
    --reference /data/hg19/hg19.2bit \
    -- --sparkRunner SPARK --driver-memory 10G --executor-memory 10G

HaplotypeCaller can't call a 10bp deletion variant



Hi, GATK team.
I use HaplotypeCaller to call variants, but it can't find a 10 bp deletion variant, as you can see in the graph.
I use -L targetInterval,
-bamWriterType ALL_POSSIBLE_HAPLOTYPES, and
-bamout haplotype.bam
to see whether the haplotype is correctly assembled, but the haplotype.bam is empty; it seems the target interval is not an active region.

Then I use -forceActive, and the haplotype.bam is not empty anymore, but the output VCF file still doesn't contain the 10 bp deletion variant, so I'm really confused now. What should I do to call this variant? I use GATK 3.6.

the 10 bp deletion's position is chr17:29541466.
Here is my pipeline:
java -d64 -server -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Djava.io.tmpdir=$tmp_dir -jar $gatk \
-R $reference_file \
-L 17:29,541,000-29,542,000 \
-bamWriterType ALL_POSSIBLE_HAPLOTYPES \
-bamout test.bam \
-T HaplotypeCaller \
-I $in_dir/NA12878MOD_sort_markdup_vardict_sort_realign_recal.bam \
--dbsnp $dbsnp_del100 \
-forceActive \
-o $out_dir/test.raw.snps.indels.vcf

Regarding GenderMap file in genomestrip


Hello,
Can someone tell me what gender I should define for plant samples in the gendermap file? Please explain.

Thank you in Advance
