Hi,
This may not directly relate to how GATK works, but I am still asking to get your expert input (I do use GATK for this pipeline).
I am trying to test my variant calling pipeline that I have prepared for my target region of 5Mb for which I need a test dataset where the fastq files and the VCF files are provided so that I can run my pipeline on those fastqs and then compare my VCF to those VCFs. So, I used the two whole exome datasets from Genome In a Bottle data with the following steps:
In order to obtain the fastqs that originated from my target region, I used the bam files published by GIAB and extracted the alignments falling in my target region using samtools
Converted the extracted bams in those regions to paired-end fastqs
Aligned those fastqs to the entire genome using BWA
Called the variants in my region of interest using GATK (with -L option, restricting my analysis to my target region for RealignerTargetCreator and HaplotypeCaller steps only). I don't do BQSR and VQSR.
Compared my variants to the variants from GIAB in the target region.
But I only get 50% variant calls correctly. I call only 50% of the total variants in that region. What could be the reason for this low agreement rate?
Interestingly, when I call the variants on the whole exome for which the fastqs were originally created, I can obtain 96% of variants in the whole-exome region published by GIAB. Moreover, from those variants, when I extract the variants that fall in my target regions and compare it to the corresponding GIAB variants, I can go upto 97%. In other words, when I use the entire whole-exome data, I can call 97% of the variants in my target region, but when I start with the alignments falling in my target region, convert to fastq and then call variants in the target region, I only get 50%. I use the exact same settings of GATK in both cases (except the -L region which is different).
Can you please help me figure out what could be going wrong?
Thanks,
~N