Hi all,
I'm very new to GATK. I'm trying to map an EMS mutation in Arabidopsis. I have fastq files of a wt M3 bulk and a mut M3 bulk (both offspring of the same parent). The strategy is to call SNPs for each bulk and then run GenotypeGVCFs to get a single file. That was done successfully (I think). The next step is to look for SNPs that are homozygous (1/1) in the mut bulk and het (0/1) or hom-ref (0/0) in the wt bulk; I used this command for this:
grep -v '^##' "$line.genotype10.vcf" | awk 'BEGIN{FS=OFS="\t"} $10~/^1\/1/ && ($11~/^0\/1/ || $11~/^0\/0/) {$3=$7=""; print $0}' | tr -s '\t' > file.txt
That also worked pretty well.
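For reference, the same filter can be sketched so that it does not depend on the column order of the two bulks. This is only a minimal sketch: the function name `filter_candidates` and the sample names `mut` and `wt` are hypothetical and have to match whatever is in the `#CHROM` header line of the VCF. It assumes unphased genotypes and relies on GT being the first FORMAT subfield.

```shell
# Hypothetical helper: print sites that are 1/1 in the mutant bulk and
# 0/1 or 0/0 in the wild-type bulk. Reads a VCF on stdin.
# $1 = mutant sample name, $2 = wild-type sample name (as in the #CHROM line).
filter_candidates() {
    awk -v mut="$1" -v wt="$2" '
        BEGIN { FS = OFS = "\t" }
        /^##/ { next }                        # skip meta-information lines
        /^#CHROM/ {                           # find the sample columns by name
            for (i = 10; i <= NF; i++) {
                if ($i == mut) m = i
                if ($i == wt)  w = i
            }
            next
        }
        m && w {
            split($m, gm, ":")                # GT is the first subfield
            split($w, gw, ":")
            if (gm[1] == "1/1" && (gw[1] == "0/1" || gw[1] == "0/0"))
                print $1, $2, $4, $5, gm[1], gw[1]
        }
    '
}
```

Usage would look like `filter_candidates mut wt < "$line.genotype10.vcf" > candidates.tsv`, which writes CHROM, POS, REF, ALT and the two genotypes as tab-separated columns.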
I noticed that I have ~150,000 records (SNPs or indels) from HaplotypeCaller (HC), but after merging the files with GenotypeGVCFs I'm left with only a few thousand records. The same happens if I first use CombineGVCFs (which keeps ~150,000 records) and then run GenotypeGVCFs.
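One check that may help when comparing these numbers is sketched below. It assumes the per-bulk HC files are GVCFs (written with `-ERC GVCF`), in which some records can be reference blocks whose only ALT allele is `<NON_REF>`. The function names are hypothetical. The first counts every data record, the second counts only records that carry a real alternate allele.

```shell
# Count all data records (non-header lines) in a VCF read from stdin.
count_records() {
    awk '!/^#/ { n++ } END { print n + 0 }'
}

# Count only records with a real ALT allele, i.e. ALT is neither "."
# nor the bare <NON_REF> placeholder used by GVCF reference blocks.
count_variant_records() {
    awk -F '\t' '!/^#/ && $5 != "." && $5 != "<NON_REF>" { n++ }
                 END { print n + 0 }'
}
```

Running both on the HC output and on the GenotypeGVCFs output (for example `count_variant_records < "$line.genotype10.vcf"`) shows whether the two totals are counting the same kind of record.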
The problem is that with so few records I can't identify a genomic region that fulfils the requirement of hom in the mut bulk and het/ref in the wt one.
My questions are:
- Why does GenotypeGVCFs reduce the number of records?
- If anyone has other suggestions that would be great.
Thanks a lot,
Guy