In a discussion about using ERC, you provide some example VCF output lines like:
20 10000442 . T <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2095
20 10000443 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,169,2089
20 10000444 . A <NON_REF> . . . GT:AD:CD:DP:GQ:PL 0/0:56,0:56:56:99:0,168,2093
and say:
- "For each reference base not part of a variant call we emit a VCF record with the special symbolic allele <NON_REF> indicating the call is between the reference base and any possible non-reference allele that might be segregating at this site," and
- "Note that there's no site-level QUAL field value. We discussed this internally and since the QUAL is the probability that the site is polymorphic, all of the QUAL field values should be 0 here, so we decided to drop it."
While the formal Variant Call Format 4.2 Specification says:
- "ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (“<ID>”) or a breakend replacement string as described in the section on breakends," and
- "QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be specified."
The issue is subtle, but it causes problems for downstream processing of HaplotypeCaller-generated VCF files that contain reference calls. The use of the symbol "<NON_REF>" instead of "." for reference calls is a little confusing, though I can see the logic of it.
More seriously: according to the VCFv4.2 specification, QUAL is NOT always a measure of "the probability that the site is polymorphic". That may hold when a variant is called, but not when a site is called as non-variant, so those QUAL field values should NOT all be 0. QUAL is about the quality and correctness of the call itself: it is defined as the Phred score of the probability that the assertion made in ALT is wrong. So if ALT asserts "<NON_REF>" or "." CONFIDENTLY (meaning a low probability that the assertion is wrong), then the QUAL score should be HIGH, not ZERO. A confident call should have a high QUAL score, whether variant or monomorphic. That is the intention of the specification.
Otherwise, filtering low-quality records out of a non-compliant VCF file by removing those with low QUAL scores, for instance, would also remove all the high-quality reference calls.
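
To make the arithmetic concrete, here is a minimal Python sketch of the reading I am arguing for. The function names, the example probabilities, and the QUAL >= 30 cutoff are all hypothetical, chosen only for illustration; this is not how HaplotypeCaller or any other tool computes QUAL.

```python
import math


def phred(p_wrong):
    """Phred-scale the probability that an assertion is wrong."""
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_wrong == 0.0:
        return math.inf  # no finite score for a zero error probability
    return -10.0 * math.log10(p_wrong)


def qual_for_call(is_reference_call, p_variant):
    """QUAL as the Phred score of P(the call made in ALT is wrong).

    Assumption: a reference call (ALT "." or, under this reading,
    <NON_REF>) is wrong when the site is actually variant, and a
    variant call is wrong when the site is actually not variant.
    """
    p_wrong = p_variant if is_reference_call else 1.0 - p_variant
    return phred(p_wrong)


def passes_qual_filter(qual, min_qual=30.0):
    """Hypothetical downstream filter that keeps records with QUAL >= min_qual."""
    return qual >= min_qual


# A confident reference call: only a 1-in-a-million chance the site is variant.
confident_ref_qual = qual_for_call(True, 1e-6)
print(confident_ref_qual)                      # about 60
print(passes_qual_filter(confident_ref_qual))  # kept by the filter
print(passes_qual_filter(0.0))                 # same site with QUAL 0 is discarded
```

Under this reading the confident reference call scores about 60 and survives the filter, while the same site written with QUAL 0 is thrown away together with the genuinely poor records.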
I have previously been in touch with the writers/keepers of the VCF specification (now over at samtools), and they confirm that this interpretation of QUAL scores is the correct one for output compliant with VCFv4.2 (and other versions). The definition is unfortunately couched in mathematical double negatives, but look at it carefully and you will see the correctness of this reading. As one who uses tools from many sources, I hope you will address this discrepancy. It complicates interoperability.
-- Fred P.