Question: Can CombineVariants (CV) merge records the way bcftools does?
I get this warning when running UnifiedGenotyper (UG) in GENOTYPE_GIVEN_ALLELES (GGA) mode with an -alleles VCF generated by CV:
WARN 10:17:21,394 GenotypingGivenAllelesUtils - Multiple valid VCF records detected in the alleles input file at site 20:106089, only considering the first record
I made this call with HC from 10 samples:
20 106089 . CA C
And this call with UG from 10 other samples:
20 106089 . C A
CV merges like this:
20 106089 . C A
20 106089 . CA C
bcftools merges like this:
20 106089 . CA AA,C
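As far as I can tell, the bcftools record comes from expressing both calls against the longest REF allele at the site. This is only a minimal Python sketch of that idea, not the actual bcftools or CV code: the function name merge_alleles is hypothetical, and it ignores symbolic alleles such as <DEL> and *.

```python
def merge_alleles(records):
    """Merge (ref, alts) pairs that share one start position into a single record.

    Hypothetical helper for illustration; not part of bcftools or GATK.
    """
    # All records start at the same position, so the longest REF spans the others.
    merged_ref = max((ref for ref, _ in records), key=len)
    merged_alts = []
    for ref, alts in records:
        if not merged_ref.startswith(ref):
            raise ValueError("REF %s is inconsistent with %s" % (ref, merged_ref))
        # Reference bases missing from the shorter REF are appended to each ALT.
        suffix = merged_ref[len(ref):]
        for alt in alts:
            padded = alt + suffix
            if padded not in merged_alts:
                merged_alts.append(padded)
    return merged_ref, merged_alts


# UG call (C > A) first, then HC call (CA > C), both at 20:106089.
print(merge_alleles([("C", ["A"]), ("CA", ["C"])]))  # ('CA', ['AA', 'C'])
```

The ALT order here simply follows the input order, so passing the HC record first would give C,AA instead of AA,C.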
The UG recall from the CV-generated -alleles VCF is incomplete:
20 106089 . C A
The UG recall from the bcftools-generated -alleles VCF is complete:
20 106089 . CA AA,C
Is it possible to have CV merge like bcftools does it?
In another thread @Geraldine_VdAuwera said:
I'm really not sure. It's not a use case that UG was designed for (with UG we kept SNPs and indels separate until post-analysis), so I would recommend being cautious with it.
I checked the genotypes and UG seems to handle merged MNPs and indels just fine; see below. But I will do some additional testing. Or I might just take the safe path and do the recalling separately for SNPs and indels, as suggested. The reason I have both UG and HC calls in the first place is that I have low and high coverage data for different cohorts. I want to create a merged dataset.
Although --interval_padding 100 helps HC recall more sites in GGA mode, as previously recommended, some sites still fail to be called with HC in GGA mode. Hence I opted for UG.
UG calls on samples 1-10:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 535 545 546 550 554 564 567 574 575 578
20 106089 . C A 16.19 . AC=2;AF=0.125;AN=16;BaseQRankSum=-0.854;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=1.5282;MLEAC=2;MLEAF=0.125;MQ=58.74;MQ0=0;MQRankSum=-0.560;QD=2.70;ReadPosRankSum=-1.797;SOR=0.935;VariantType=SNP GT:AD:DP:GQ:PL 0/0:3,0:3:6:0,6,76 0/0:4,2:6:9:0,9,115 0/1:3,1:4:24:24,0,80 0/0:6,0:6:12:0,12,130 0/1:1,1:2:29:30,0,29 ./. 0/0:7,0:7:15:0,15,188 0/0:3,1:4:6:0,6,74 ./. 0/0:5,0:5:12:0,12,142
HC calls on samples 11-20:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 585 590 622 625 628 640 655 668 687 693
20 106089 . CA C 47.95 . AC=5;AF=0.250;AN=20;BaseQRankSum=0.925;DP=36;FS=1.850;InbreedingCoeff=0.0646;MLEAC=5;MLEAF=0.250;MQ=59.48;MQ0=0;MQRankSum=0.175;QD=3.00;ReadPosRankSum=-1.725;SOR=0.387 GT:AD:GQ:PL 0/0:2,0:6:0,6,49 0/0:2,0:6:0,6,49 0/0:3,0:12:0,12,130 0/0:5,0:15:0,15,122 0/0:2,0:6:0,6,46 0/1:2,1:14:14,0,39 0/1:2,1:15:15,0,38 0/0:4,0:12:0,12,93 0/1:3,1:12:12,0,46 1/1:0,3:9:67,9,0
UG GGA recalls on samples 1-20:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 535 545 546 550 554 564 567 574 575 578 585 590 622 625 628 640 655 668 687 693
20 106089 . CA AA,C 110.56 . AC=0,8;AF=0.00,0.222;AN=36;DP=81;FS=0.000;InbreedingCoeff=0.5076;MLEAC=0,6;MLEAF=0.00,0.167;MQ=58.56;MQ0=0;QD=3.45;SOR=0.859;VariantType=MULTIALLELIC_MIXED GT:AD:DP:GQ:PL:SB 0/0:0,0,0:3:0:0,0,0,6,6,52:0,0,0,0 0/2:0,0,1:6:0:5,5,5,0,0,109:0,0,1,0 0/2:0,0,1:4:0:12,12,12,0,0,47:0,0,1,0 0/0:0,0,0:6:0:0,0,0,17,17,123:0,0,0,0 0/0:0,0,0:2:0:0,0,0,3,3,10:0,0,0,0 ./. 0/0:0,0,0:7:0:0,0,0,9,9,60:0,0,0,0 0/2:0,0,1:4:0:12,12,12,0,0,61:0,0,0,1 ./. 0/0:0,0,1:5:0:0,0,0,4,4,30:0,0,0,1 0/0:0,0,0:3:0:0,0,0,6,6,49:0,0,0,0 0/0:0,0,0:3:0:0,0,0,9,9,76:0,0,0,0 0/0:0,0,1:4:0:0,0,0,1,1,22:0,0,1,0 0/0:0,0,0:7:0:0,0,0,18,18,149:0,0,0,0 0/0:0,0,0:4:0:0,0,0,11,11,76:0,0,0,0 0/2:0,0,1:5:0:9,9,9,0,0,65:0,0,0,1 0/2:0,0,1:4:0:12,12,12,0,0,60:0,0,0,1 0/0:0,0,0:5:0:0,0,0,15,15,116:0,0,0,0 0/2:0,0,1:6:0:12,12,12,0,0,47:0,0,0,1 2/2:0,0,3:3:9:67,67,67,9,9,0:0,0,3,0
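For anyone who wants to repeat the genotype check, here is a small Python sketch that tallies the GT fields of the GGA record above and compares them with the AC and AN values in INFO. The function name tally_alleles is hypothetical, and the genotype list was copied by hand from the record rather than parsed from the VCF.

```python
from collections import Counter


def tally_alleles(genotypes):
    """Count called alleles by index (0 = REF, 1 = first ALT, ...) in GT strings."""
    counts = Counter()
    for gt in genotypes:
        # Treat phased (|) and unphased (/) genotypes alike; skip no-calls (.).
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[int(allele)] += 1
    return counts


# GT fields of the 20 samples in the UG GGA recall at 20:106089, in column order.
gts = ("0/0 0/2 0/2 0/0 0/0 ./. 0/0 0/2 ./. 0/0 "
       "0/0 0/0 0/0 0/0 0/0 0/2 0/2 0/0 0/2 2/2").split()

counts = tally_alleles(gts)
print(counts[1], counts[2], sum(counts.values()))  # 0 8 36
```

The tallies agree with AC=0,8 and AN=36 in the record: the AA allele is never called and the deletion allele C is called 8 times among 36 alleles.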
This thread is related to the following threads on GGA:
http://gatkforums.broadinstitute.org/discussion/5249/overcalling-deletion-in-unifiedgenotyper-genotype-given-alleles-mode
http://gatkforums.broadinstitute.org/discussion/5018/ug-call-combined-snp-indel-sites-in-gga-mode
http://gatkforums.broadinstitute.org/discussion/4936/not-all-sites-emitted-with-genotype-given-alleles
http://gatkforums.broadinstitute.org/discussion/4024/genotype-and-validate-or-haplotype-caller-gga-what-am-i-doing-wrong
P.S. I might gatecrash your Cambridge party this week despite not being invited. The course was already fully booked when you announced it. I don't have a time machine!