Quantcast
Channel: haplotypecaller — GATK-Forum
Viewing all articles
Browse latest Browse all 1335

Unexpected genotypes in GenotypeGVCF output

$
0
0

Hello GATK Team,

I have 21 bam files that I ran through HaplotypeCaller in GVCF mode followed by GenotypeGVCF, using Version=3.4-0-g7e26428. I found a few entries that I am having a difficult time understanding.

Here is one of the entries in question (quick note: each sample is from a pool of 3-5 animals, explaining the high ploidy):

chr1   88988835    rs396411987 C   G   226099.87   .   AC=61;AF=0.455;AN=134;BaseQRankSum=-1.176e+00;ClippingRankSum=0.054;DB;DP=29307;FS=0.000;MLEAC=61;MLEAF=0.455;MQ=59.98;MQRankSum=0.603;QD=11.32;ReadPosRankSum=1.30;SOR=0.696   GT:AD:DP:GQ:PL  0/0/0/0/0/1/1/1/1/1:396,449,0,0:845:28:9249,2087,965,413,120,0,28,219,649,1588,7879 0/1/1/1/1/1/1/1/1/1:275,2150,0,0:2425:99:52058,17674,11483,7903,5424,3571,2143,1051,287,0,3240  0/0/0/1/1/1/1/1:626,1134,0,0:1760:99:24237,5532,2600,1118,316,0,197,1292,11237  0/0/1/1/1/1/1/1:225,824,0,0:1049:99:18893,5121,2835,1577,772,257,0,116,3439 ./././././././././.:0,0 ./././././././././.:0,0 ./././././././.:0,0 0/0/0/0/0/0/0/1:824,105,0,0:929:99:1309,0,239,705,1369,2290,3644,6013,20091 0/0/0/0/0/1/1/1:786,499,0,0:1285:99:9169,1197,249,0,139,633,1609,3600,16598 0/0/1/1/1/1:557,1132,0,2:1691:99:24696,4545,1719,433,0,563,9784 0/0/0/0/0/0/0/1:922,140,0,0:1062:99:1825,0,202,685,1401,2410,3908,6544,22289    0/0/0/0/1/1/1/1/1/1:650,828,0,0:1478:26:16909,4072,1966,903,311,26,0,254,906,2400,12402 ./././././././.:0,0 0/0/0/0/0/0/0/0/1/1:846,179,0,0:1025:95:2546,95,0,178,520,1015,1689,2617,3986,6389,20368    0/0/0/0/1/1/1/1:875,986,0,0:1861:99:20237,3745,1410,380,0,136,885,2819,17626    0/0/0/0/0/0/0/0:570,0:570:23:0,23,50,82,120,170,241,361,1800    0/0/0/0/0/1/1/1:798,604,0,0:1402:20:11395,1675,423,0,20,429,1344,3302,16391 0/0/0/0/1/1/1/1:644,730,0,0:1374:96:14884,2778,1049,285,0,96,643,2062,12779 0/0/0/0/0/1/1/1:613,417,0,0:1030:74:7703,1065,243,0,74,433,1174,2710,12915  0/0/0/1/1/1/1/1:239,513,0,0:752:13:11159,2667,1309,604,198,0,13,378,4191    ./././././././.:0,0

So, for the samples with genotypes present, AD is obviously quite high (in the 1000s, generally). It surprised me, then, that some samples don't have information (represented by ./././././././.). I went back to the original sample gvcf files and pulled the entries from a subset that were represented by ./././././././.:

group14.g.vcf:chr1 88988835    rs396411987 C   G,A,   16897.68    .   BaseQRankSum=-0.709;ClippingRankSum=0.843;DB;DP=1729;MLEAC=5,0,0;MLEAF=0.500,0.00,0.00;MQ=59.96;MQRankSum=0.621;ReadPosRankSum=1.987    GT:AD:DP:GQ:SB  0/0/0/0/0/1/1/1/1/1:876,844,2,0:1722:99:443,433,424,422
group15.g.vcf:chr1 88988835    rs396411987 C   G,T,   690.35  .   BaseQRankSum=-1.311;ClippingRankSum=-0.019;DB;DP=1168;MLEAC=1,0,0;MLEAF=0.125,0.00,0.00;MQ=59.98;MQRankSum=0.603;ReadPosRankSum=3.624   GT:AD:DP:GQ:SB  0/0/0/0/0/0/0/1:1082,78,4,0:1164:99:542,540,44,38

A consistent difference between samples that are included and those that are not is the presence of multiple alternative alleles in the excluded samples (G,A and G,T in the above example). Is this the source of my troubles? Is there a way of forcing GATK to include those samples in the VCF, especially given that the support for the third allele seems pretty weak (2 reads out of ~1722 in sample 14 above)? It seems to me that these samples should reasonably be included in the output of GenotypeGVCF.

Apologies if the answer is somewhere in the forums/guide - I did check, but didn't find anything.

Thanks,
Russ


Viewing all articles
Browse latest Browse all 1335

Trending Articles