Let's say I have a set of mixed-ploidy individuals (with biallelic markers) in my data. Some are tetraploid and some are diploid. I chose to run GATK HaplotypeCaller (to get genotype likelihoods) with -ploidy set to 4 for all individuals, since I know the highest ploidy level in the data is 4.
My idea is to obtain genotype likelihoods at the highest resolution and then downscale those values to a lower ploidy level post hoc.
For instance, given that there are 5 genotype classes/dosage levels for tetraploid organisms (0, 1, 2, 3 or 4 copies of the reference allele), I will get 5 phred-scaled scores for each locus in each individual. Each score is the phred-scaled likelihood of the data given a certain count of the reference allele (0 through 4), normalized so that the most likely class has a score of 0.
Now if I deduce that one of these individuals is a diploid but I've already run the analyses:
- Can I just combine the genotype likelihoods of the 3 heterozygote classes in the tetraploid call (1/3, 2/2, 3/1) to get the genotype likelihood of the one heterozygote class (1/1) in a diploid individual?
- If so, how do I do this quantitatively?
For example, at a locus in an individual that I assumed to be tetraploid during the GATK run, I get these phred-scaled genotype likelihoods:
0/4  1/3  2/2  3/1  4/0
6    67   0    4    60
But I now know that this individual is diploid, so I am now looking for just 3 phred-scaled genotype likelihoods instead of 5:
0/2  1/1  2/0
?    ?    ?
Would I keep the homozygote classes the same (i.e., 6 and 60) and simply average the 3 heterozygote dosage classes to get the diploid heterozygote value? Or would I need a different mathematical operation?
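To make the candidate operations concrete, here is a minimal Python sketch of the arithmetic I have in mind. It assumes the usual phred relationship, PL = -10 * log10(likelihood), and that the result should be renormalized so the best class is 0. The helper names are hypothetical, and none of the three options is a validated method; the third is included only because a balanced 2/2 tetraploid genotype and a diploid heterozygote both imply an expected 50% allele fraction.

```python
import math

def pl_to_likelihood(pl):
    """Convert a phred-scaled value to a relative likelihood."""
    return 10 ** (-pl / 10)

def likelihood_to_pl(likelihood):
    """Convert a relative likelihood back to the phred scale."""
    return -10 * math.log10(likelihood)

def normalize(pls):
    """Shift values so the most likely class is 0, then round to integers."""
    best = min(pls)
    return [round(p - best) for p in pls]

# Tetraploid PLs from the example, in the order 0/4, 1/3, 2/2, 3/1, 4/0.
tetra_pl = [6, 67, 0, 4, 60]
hom_first, het_classes, hom_last = tetra_pl[0], tetra_pl[1:4], tetra_pl[4]

# Option A (assumption): average the three heterozygote PLs on the phred scale.
het_a = sum(het_classes) / len(het_classes)
option_a = normalize([hom_first, het_a, hom_last])

# Option B (assumption): average in likelihood space, then convert back.
mean_lik = sum(pl_to_likelihood(p) for p in het_classes) / len(het_classes)
het_b = likelihood_to_pl(mean_lik)
option_b = normalize([hom_first, het_b, hom_last])

# Option C (assumption): reuse only the balanced 2/2 class as the diploid 1/1.
option_c = normalize([hom_first, tetra_pl[2], hom_last])

for name, pls in [("A", option_a), ("B", option_b), ("C", option_c)]:
    print(f"Option {name}: 0/2, 1/1, 2/0 = {pls}")
```

The three options give noticeably different diploid values for the same locus, which is why I would like to know which operation, if any, is justified.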
Thanks,
Vivaswat