After applying the standard RNA-Seq pipeline (with STAR, etc) I called varients with the command:
java -jar GenomeAnalysisTK.jar
-T HaplotypeCaller
-R chromosome.fa
-I ./final.bam
-dontUseSoftClippedBases
--variant_index_type LINEAR
--variant_index_parameter 128000
--emitRefConfidence GVCF -o ./final.gvcf
On the resultant gVCF file, I ran a little python script to see the distribution of calling quality across the different called genotypes:
- x-axis is quality score rounded to the nearest integer
- y-axis is the number of variants at that quality score
``
As you can see, its mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice.
What i didn't expect however is the periodicity. Is that normal?
Now I presumably I need to filter these variants on some number of quality score, and from this I really dont know where. 0? 50? 75?
Code to generate this data:
#!/usr/bin/env python2.7
import collections
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
data = {}
for line in f:
if line[0] == '#': continue
line = line.split('\t')
if line[5] == '.': continue
gt = line[9][:3]
try: data[gt][int(float(line[5]))] += 1
except KeyError: data[gt] = collections.defaultdict(int)
for gt,qualities in data.items():
print '\n',gt
for qual,count in sorted(qualities.items()):
print qual,count