Quantcast
Channel: haplotypecaller — GATK-Forum
Viewing all articles
Browse latest Browse all 1335

Haplotype caller filter settings - Samtools called variant is filtered out with HaplotypeCaller

$
0
0

Hi I am new to using HC. I am just using the HapMap sample NA12878 to validate my pipeline. I ran the following command on a BAM file that I generated. I noticed that certain SNPs were filtered out most of which I can understand the rationale.

java -Xms454m -Xmx3181m -XX:+UseSerialGC -Djava.io.tmpdir=/home/tx/tmpYdAKAC -jar /usr/local/share/gatk/3.4-46-gbc02625/GenomeAnalysisTK.jar \
-R /usr/local/share/genomes/Hsapiens/GRCh37/seq/GRCh37.fa \
-I /home/INDELanalysis/validationrun/NA12878-sort-11_0_15994663-prep.bam \
--dbsnp /usr/local/share/genomes/Hsapiens/GRCh37/variation/dbsnp_138.vcf.gz \
-L 11:1628982-1651500 \
-T HaplotypeCaller \
-o /home/INDELanalysis/validationrun/NA12878-11_0_15994663-raw.vcf.gz

I ran the same analysis using Samtools and the following SNP was in the final filtered output

Samtools output
11  1651228 rs71454096  C   G   117 PASS    AC=1;AN=2;BQB=0.61207;BaseQRankSum=-0.458;DB;DP=21;DP4=11,0,5,3;FS=7.225;GC=71.29;HOB=0.5;HRun=3;ICB=1;MQ=56.97;MQ0=0;MQ0F=0;MQB=0.395294;MQRankSum=-1.091;MQSB=0.798516;QD=5.57;RPB=0.902014;ReadPosRankSum=-1.162;SGB=-0.651104;VDB=0.0103095;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gCg/gGg|A53G|237|KRTAP5-5|protein_coding|CODING|ENST00000399676|1|G)    GT:DP:PL    0/1:19:150,0,158

However when I run it through HC with the above settings the above variant is filtered out of the analysis. When I ran HC with--emitRefConfidence GVCF I got the following output for this variant:

HC output
11  1651228 rs71454096  C   G,<NON_REF> 0   .   DB;DP=20;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.22 GT:AD:DP:GQ:PGT:PID:PL:SB   0/0:10,0,0:10:34:0|1:1651120_G_T:0,34,442,34,442,442:10,0,0,0

I am a little confused why it got a QUAL score of 0. So looking nearby I also noticed that there were some sizeable INDELS

11 1651198 .   GAGGCTGTGGGGGCTGTGGCTCCGGCTGTGC G,<NON_REF> 0   .   DP=25;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.38    GT:AD:DP:GQ:PL:SB   0/0:14,0,0:14:37:0,37,585,42,589,594:12,2,0,0
11  1651199 rs71025763  A   AGGCTGTGGCTCC,<NON_REF> 0   .   DB;DP=18;MLEAC=0,0;MLEAF=0.00,0.00;MQ=57.74 GT:AD:DP:GQ:PL:SB   0/0:6,0,0:6:28:0,28,332,28,332,332:5,1,0,0

However neither of these INDELS were called either. Can anyone shed some light as to why nothing is called in this region?
If I look at the site
11 1651228
in the bam file I can see that most of the calls were made on the forward strand so there is likely to be some strand bias here and in
actual fact we can see that HC called the genotype 0/0 with only 10 reads contributing to this genotype call. I presume the remaining reads were filtered out. However when I look at the pileup for this position I don't understand why over half the reads were filtered

11 1651228 C CCCCGGGGCCCGGCCCGGCGG BBFIFB<FFIIFFFFFFFBBB

Any advice help would be much appreciated.


Viewing all articles
Browse latest Browse all 1335