Hi.
gatk-4.0.7.0
java openjdk version "1.8.0_181"
I've encountered a problem while running HaplotypeCaller in -ERC BP_RESOLUTION --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES
mode on a single-sample paired-end WGS data. The problem is that HC reassembles reads so that a particular region of interest contains no supporting reads in the corresponding bamout
file while the original BAM file does contain them. The region of interest is the CYP2D6 gene on chr22.
Commands
After trimming the reads I align them to GRCh38 with Bowtie2:
bowtie2 --threads 16 --trim5 0 --trim3 0 --phred33 --no-mixed --no-discordant --no-unal --local -x /hg38/bowtie2/hg38 -q -1 /trimmomatic_output/sample_R1.paired.fastq.gz -2 /trimmomatic_output/sample_R2.paired.fastq.gz -t -S /bowtie2_output/sample_raw.sam --met-file /path_to_met_file
This is followed by sorting the resulting SAM file and its conversion to BAM, running Picard's MarkDuplicates on the resulting BAM and further filtering using samtools with flags -q 10 -f 0x2 -F 0x4 -F 0x100 -F 0x400 -F 0x800
. BQSR is applied next followed by calling on separate chromosomes/ROIs:
/home/ubuntu/Tools/gatk-4.0.7.0/gatk HaplotypeCaller --reference /hg38.fa --input /bqsr_output/sample.bqsr.bam --genotyping-mode DISCOVERY --output-mode EMIT_ALL_CONFIDENT_SITES --read-filter NotSecondaryAlignmentReadFilter --read-filter NotDuplicateReadFilter --read-filter MappingQualityAvailableReadFilter --read-filter NotSupplementaryAlignmentReadFilter --smith-waterman FASTEST_AVAILABLE --min-base-quality-score 10 --annotation-group StandardAnnotation --annotation-group StandardHCAnnotation --output /gatk_gvcf_output/sample.diploid.raw.g.vcf.gz --emit-ref-confidence BP_RESOLUTION --sample-ploidy 2 --native-pair-hmm-threads 16 -L chr22
The resulting files are combined with CombineGVCFs
followed by genotype calling using GenotypeGVCFs
(gatk 3.8-1-0-gf15c1c3ef). This old version is used due to the fact that --includeNonVariantSites
option is not implemented in up-to-date package (as far as I know), while I need to get the reference calls for further analysis.
Problems
The sample output from final VCF file from GenotypeGVCFs
for all positions in a region of interest looks like this (zero AD, DP, which seems reasonable as bamout contains no supporting reads):
chr22 42130000 . G . . . . GT:AD:DP:RGQ ./.:0,0:0:0
The IGV screenshot of the original BAM file after samtools
filtering and bamout BAM file from HC is attached below.
I have digged into some of the old threads for similar problems (link, link, link, link) and found no appropriate solution. I tried to run HC as above but with adding --dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref
parameters. It resulted in complete absence of supporting reads in the bamout (the upper empty panel is for bamout output, the lower is for original BAM) as in the previous case:
Finally I tried running HC with --dont-trim-active-regions True --min-dangling-branch-length 1 --min-pruning 1 --disable-optimizations True --allow-non-unique-kmers-in-ref
and --read-filter AllowAllReadsReadFilter --disable-tool-default-read-filters True
. The IGV screenshot is below: