GATK 3.4 was released on May 15, 2015. Itemized changes are listed below. For more details, see the user-friendly version highlights.
New tool
- ASEReadCounter: A tool to count read depth in a way that is appropriate for allele-specific expression (ASE) analysis. It counts the number of reads that support the REF allele and the ALT allele, filtering low-quality reads and bases and keeping only properly paired reads. See Highlights for more details.
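As a rough illustration of the counting described above, here is a minimal Python sketch. It is not ASEReadCounter's implementation: the `PileupBase` record, the function name, and the default quality thresholds of 20 are all hypothetical choices made for the example.

```python
from collections import namedtuple

# Hypothetical, simplified record: one base observed at the site from one read.
PileupBase = namedtuple("PileupBase", "base base_qual map_qual properly_paired")


def count_ase_reads(pileup, ref, alt, min_base_qual=20, min_map_qual=20):
    """Count reads supporting REF and ALT after quality and pairing filters.

    The thresholds are illustrative defaults, not the tool's actual settings.
    """
    counts = {"ref": 0, "alt": 0, "other": 0, "filtered": 0}
    for obs in pileup:
        # Drop improperly paired reads and low-quality reads or bases.
        if (not obs.properly_paired
                or obs.base_qual < min_base_qual
                or obs.map_qual < min_map_qual):
            counts["filtered"] += 1
        elif obs.base == ref:
            counts["ref"] += 1
        elif obs.base == alt:
            counts["alt"] += 1
        else:
            counts["other"] += 1
    return counts
```

Calling `count_ase_reads(pileup, "A", "G")` on a list of such records returns a dictionary with the REF, ALT, other-base, and filtered counts for one site.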
HaplotypeCaller & GenotypeGVCFs
- Important fix for genotyping positions over spanning deletions. Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B, sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion). Now, sample B is genotyped as having a symbolic DEL allele. See Highlights for more details.
- Deprecated --mergeVariantsViaLD argument in HaplotypeCaller since it didn’t work. To merge complex substitutions, use ReadBackedPhasing as a post-processing step.
- Removed exclusion of MappingQualityZero, SpanningDeletions and TandemRepeatAnnotation from the list of annotators that cannot be annotated by HaplotypeCaller. These annotations are still not recommended for use with HaplotypeCaller, but this is no longer enforced by a hardcoded ban.
- Clamp the HMM window starting coordinate to 1 instead of 0 (contributed by nsubtil).
- Fixed the implementation of allowNonUniqueKmersInRef so that it applies to all kmer sizes. This resolves some assembly issues in low-complexity sequence contexts and improves calling sensitivity in those regions.
- Initialize annotations so that --disableDithering actually works.
- Automatic selection of indexing strategy based on .g.vcf file extension. See Highlights for more details.
- Removed normalization of QD based on length for indels. Length-based normalization is now only applied if the annotation is calculated in UnifiedGenotyper.
- Added the RGQ (Reference Genotype Quality) FORMAT annotation to monomorphic sites in the VCF output of GenotypeGVCFs. Now, instead of stripping out the GQs for monomorphic hom-ref sites, we transfer them to the RGQ. This is extremely useful for people who want to know how confident the hom-ref genotype calls are. See Highlights for more details.
- Removed GenotypeSummaries from default annotations.
- Added -uniquifySamples to GenotypeGVCFs to make it possible to genotype together two different datasets containing the same sample.
- Disallow changing -dcov setting for HaplotypeCaller (pending a fix to the downsampling control system) to prevent buggy behavior. See Highlights for more details.
- Raised per-sample limits on the number of reads in ART and HC. Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
- Removed explicit limitation (20) of the maximum ploidy of the reference-confidence model. Previously there was a fixed-size maximum ploidy indel RCM likelihood cache; this was changed to a dynamically resizable one. There are still some de facto limitations which can be worked around by lowering the max alt alleles parameter.
- Made GQ of Hom-Ref Blocks in GVCF output be consistent with PLs.
- Fixed a bug where HC was not realigning against the reference but against the best haplotype for the read.
- Fixed a bug (in HTSJDK) that was causing GenotypeGVCFs to choke on sites with large numbers of alternate alleles (>140).
- Modified the way GVCFBlock header lines are named because the new HTSJDK version disallows duplicate header keys (aside from special-cased keys such as INFO and FORMAT).
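To make the spanning-deletion case from the first item in this section concrete, the following Python sketch checks whether a deletion record in one sample covers a site where another sample has a SNP. It only illustrates the VCF coordinate convention; it is not the genotyping code. The function name is hypothetical, and the sketch assumes the ALT allele is a left-anchored prefix of REF.

```python
def deletion_spans(del_pos, del_ref, del_alt, site_pos):
    """Return True if a VCF deletion record removes the base at site_pos.

    Example of the convention: POS=100, REF='ACGT', ALT='A' keeps the base at
    100 and deletes the bases at 101..103.
    Assumes ALT is a prefix of REF (left-anchored representation).
    """
    if len(del_alt) >= len(del_ref):
        return False  # not a deletion (SNP, MNP, or insertion)
    first_deleted = del_pos + len(del_alt)
    last_deleted = del_pos + len(del_ref) - 1
    return first_deleted <= site_pos <= last_deleted
```

With the example record above, a SNP in sample A at position 102 is spanned by sample B's deletion, so under the 3.4 behavior sample B would carry the symbolic DEL allele there rather than a hom-ref call.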
CombineGVCFs
- Added option to break blocks at every N sites. Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across genomic positions that are multiples of N. This is especially important in the case of scatter-gather where you don't want your scatter intervals to start in the middle of blocks (because of a limitation in the way -L works in the GATK for VCF records with the END tag). See Highlights for more details.
- Fixed a bug that caused the tool to stop processing after the first contig.
- Fixed a bug where the wrong REF allele was output to the combined gVCF.
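The block-breaking option described above can be pictured with a short Python sketch. This is only an illustration of the idea, not CombineGVCFs code: the function name is hypothetical, and it assumes the convention that a new block starts at every multiple of N (the release notes do not say on which side of the multiple the break falls).

```python
def break_block(start, end, n):
    """Split the inclusive reference block [start, end] at multiples of n.

    Assumed convention: each piece ends at the position just before a
    multiple of n, so a new piece begins at every multiple of n.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    pieces = []
    piece_start = start
    while piece_start <= end:
        # Smallest multiple of n strictly greater than piece_start.
        next_multiple = (piece_start // n + 1) * n
        piece_end = min(end, next_multiple - 1)
        pieces.append((piece_start, piece_end))
        piece_start = piece_end + 1
    return pieces
```

For example, `break_block(95, 250, 100)` splits one block into three pieces so that scatter intervals starting at positions 100 and 200 never begin in the middle of a block.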
VariantRecalibrator
- Switched VQSR tranches plot ordering rule (ordering is now based on tranche sensitivity instead of novel titv).
- VQSR VCF header command line now contains annotations and tranche levels.
SelectVariants
- Added -trim argument to trim (simplify) alleles to a minimal representation.
- Added -trimAlternates argument to remove all unused alternate alleles from variants. Note that this is pretty aggressive for monomorphic sites.
- Changed the default behavior to trim (remove) remaining alleles when samples are subset, and added the -noTrim argument to preserve original alleles.
- Added --keepOriginalDP argument.
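For readers unfamiliar with trimming alleles to a minimal representation, here is a small Python sketch of one common way to do it: remove bases shared at the end of every allele, then bases shared at the start, always leaving at least one base per allele. This is an assumption about what "minimal representation" means here, not a description of SelectVariants' exact behavior, and the function name is hypothetical.

```python
def trim_alleles(pos, alleles):
    """Trim shared trailing, then shared leading, bases from all alleles.

    Every allele keeps at least one base. Returns (new_pos, new_alleles);
    the position advances by one for each leading base removed.
    """
    alleles = list(alleles)
    # Remove bases that all alleles share at their right end.
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Remove bases that all alleles share at their left end.
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles
```

Under this scheme a record at position 100 with alleles ACGT and ACTT reduces to a SNP G/T at position 102, while a deletion written as ATT/AT reduces to AT/A at the same position.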
VariantAnnotator
- Improvements to the allele trimming functionalities.
- Added functionality to support multi-allelic sites when annotating a VCF with annotations from another callset. See Highlights for more details.
CalculateGenotypePosteriors
- Fixed user-reported bug featuring "trio" family with two children, one parent.
- Added error handling for genotypes that are called but have no PLs.
Various tools
- BQSR: Fixed an issue where GATK would skip the entire read if a SNP is entirely contained within a sequencing adapter (contributed by nsubtil); and improved how uncommon platforms (as encoded in RG:PL tag) are handled.
- DepthOfCoverage: Now logs a warning if incompatible arguments are specified.
- SplitSamFile: Fixed a bug that caused a NullPointerException.
- SplitNCigarReads: Fixed issue to make -fixNDN flag fully functional.
- IndelRealigner: Fixed an issue that was due to reads that have an incorrect CIGAR length.
- CombineVCFs: Minor change to an error check that was put into 3.3 so that identical samples don't need -genotypeMergeOption.
- VariantsToBinaryPED: Corrected swap between mother and father in PED file output.
- GenotypeConcordance: Monomorphic sites in the truth set are no longer called "Mismatching Alleles" when the comp genotype has an alternate allele.
- ReadBackedPhasing: Fixed a couple of bugs in MNP merging.
- CatVariants: Now allows different input / output file types, and spaces in directory names.
- VariantsToTable: Fixed a bug that affected the output of the FORMAT record lists when -SMA is specified. Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines. Otherwise, the entire list is output with each field.
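The splitting rule in the VariantsToTable note can be sketched in Python as follows. This is an illustration of the rule as stated, not the tool's code; the function name, the dictionary layout, and the `per_alt_keys` parameter are hypothetical.

```python
def split_multiallelic(alts, fields, per_alt_keys):
    """Produce one output row per ALT allele.

    Fields named in per_alt_keys (count 'A', one value per ALT allele) are
    split so each row gets its own value; every other field is written out
    as the full comma-separated list on every row.
    """
    rows = []
    for i, alt in enumerate(alts):
        row = {"ALT": alt}
        for key, values in fields.items():
            if key in per_alt_keys:
                row[key] = str(values[i])
            else:
                row[key] = ",".join(str(v) for v in values)
        rows.append(row)
    return rows
```

For a site with two ALT alleles, a per-allele field such as an allele frequency would appear as a single value on each of the two rows, while a field with one value per allele including REF would be repeated in full on both.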
Read Filters
- Added erroneous CIGAR length to criteria for BadCigarFilter.
- Corrected logical expression in MateSameStrandFilter (contributed by user seru71).
- Handle X and = CIGAR operators appropriately.
- Added -drf argument to disable default read filters. Limited to specific tools and specific filters (currently only DuplicateReadFilter).
Annotations
- Calculate StrandBiasBySample using all alternate alleles as “REF vs. any ALT”.
- Modified InbreedingCoeff so that it works when genotyping uniquified samples (see GenotypeGVCFs changes).
- Changed GC Content value type from Integer to Float.
- Added StrandAlleleCountsBySample annotation. This annotation outputs the number of reads supporting each allele, stratified by sample and read strand; callable from HaplotypeCaller only.
- Made annotators emit a warning if they can't be applied.
GATK Engine & common features
- Fixed logging of 'out' command line parameter in VCF headers; changed []-type arrays to lists so argument parsing works in VCF header commandline output.
- Modified GATK command line header for unique keys. The GATK command line header keys were being repeated in the VCF and subsequently lost to a single key value by HTSJDK. This resolves the issue by appending the name of the walker after the text "GATKCommandLine" and a number after that if the same walker was used more than once in the form: GATKCommandLine.(walker name) for the first occurrence of the walker, and GATKCommandLine.(walker name).# where # is the number of the occurrence of the walker (e.g. GATKCommandLine.SomeWalker.2 for the second occurrence of SomeWalker).
- Handle X and = CIGAR operators appropriately.
- Added barebones read/write CRAM support (no interval seeking!). See Highlights for more details.
- Cleaned up logging outputs / streams; messages (including HMM log messages) that were going to stdout now go to stderr.
- Improved error messages; when an error is related to a specific file, the engine now includes the file name in the error message.
- Fixed BCF writing when FORMAT annotations contain arrays.
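The naming scheme for the GATK command line header keys described earlier in this section can be sketched in a few lines of Python. The function name is hypothetical and this is not the engine's code; it only reproduces the rule as stated, with the first use of a walker getting no numeric suffix and later uses getting their occurrence number.

```python
from collections import Counter


def command_line_header_keys(walkers):
    """Return one unique header key per walker invocation, in order.

    First occurrence:  GATKCommandLine.<walker>
    Later occurrences: GATKCommandLine.<walker>.<occurrence number>
    """
    seen = Counter()
    keys = []
    for walker in walkers:
        seen[walker] += 1
        base = "GATKCommandLine." + walker
        keys.append(base if seen[walker] == 1 else f"{base}.{seen[walker]}")
    return keys
```

A file processed by HaplotypeCaller and then twice by SelectVariants would, under this rule, carry the keys GATKCommandLine.HaplotypeCaller, GATKCommandLine.SelectVariants, and GATKCommandLine.SelectVariants.2.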
Queue
- Added -qsub-broad argument. When -qsub-broad is specified instead of -qsub, Queue will use the h_vmem parameter instead of h_rss to specify memory limit requests. This was done to accommodate changes to the Broad’s internal job scheduler. Also causes the GridEngine native arguments to be output by default to the logger, instead of only when in debug mode.
- Fixed the scala wrapper for Picard MarkDuplicates (needed because MarkDuplicates was moved to a different package within Picard).
- Added optional element "includeUnmapped" to the PartitionBy annotation. The value of this element (default true) determines whether Queue will explicitly run this walker over unmapped reads. This patch fixes a runtime error when FindCoveredIntervals was used with Queue.
Documentation
- Plentiful enhancements and fixes to various tool docs, especially annotations and read filters.
For developers
- Upgraded SLF4J to allow new convenient logging syntaxes.
- Patched maven pom file for slf4j-log4j12 version (contributed by user Biocyberman).
- Updated HTSJDK version (now pulling it in from Maven Central); various edits made to match.
- Collected VCF IDs and header lines into one place (GATKVCFConstants).
- Made various changes that lead to reduced build times.