Quantcast
Channel: haplotypecaller — GATK-Forum
Viewing all articles
Browse latest Browse all 1335

Best Practices for Batch Calling Malaria

$
0
0

Since there is no gold standard SNP database for Plasmodium vivax, my lab and I have been tinkering with a pipeline for variant calling with GATK. I'm pretty confident with the basic framework we have, but the complication is that we want to identify variants across different populations that were sequenced at different centers.

So far I have as follows:

java -jar -Xmx13g $1 AddOrReplaceReadGroups INPUT=$4 OUTPUT=readgroups.bam RGID=1 RGLB=$base RGPL=illumina RGPU=unit1 RGSM=$base  VALIDATION_STRINGENCY=LENIENT 
java -jar -Xmx13g $1 MarkDuplicates INPUT=readgroups.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt  VALIDATION_STRINGENCY=LENIENT 
rm -rf readgroups.bam
java -jar -Xmx13g $1 BuildBamIndex INPUT=dedup_reads.bam  VALIDATION_STRINGENCY=LENIENT 
java -jar -Xmx13g  $2 -T RealignerTargetCreator -R $ref -I dedup_reads.bam -o realignment_targets.list
java -jar -Xmx13g $2 -T IndelRealigner -R $ref -I dedup_reads.bam -targetIntervals realignment_targets.list -o realigned_reads.bam
rm -rf realignment_targets.list dedup_reads.bam
java -jar -Xmx13g  $2 -T HaplotypeCaller -R $ref -I realigned_reads.bam -o raw_variants.vcf
java -jar -Xmx13g $2 -T SelectVariants -R $ref -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
java -jar -Xmx13g $2 -T SelectVariants -R $ref -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf
rm -rf raw_variants.vcf
java -jar -Xmx13g $2 -T VariantFiltration -R $ref -V raw_snps.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps.vcf
java -jar -Xmx13g $2 -T VariantFiltration -R $ref -V raw_indels.vcf --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' --filterName "basic_indel_filter" -o filtered_indels.vcf
java -jar -Xmx13g $2 -T BaseRecalibrator -R $ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -o recal_data.table
java -jar -Xmx13g $2 -T BaseRecalibrator -R $ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -BQSR recal_data.table -o post_recal_data.table
java -jar -Xmx13g $2 -T AnalyzeCovariates -R $ref -before recal_data.table -after post_recal_data.table -plots ${base}recalibration_plots.pdf
java -jar -Xmx13g $2 -T PrintReads -R $ref -I realigned_reads.bam -BQSR recal_data.table -o recal_reads.bam
java -jar -Xmx13g $2 -T HaplotypeCaller -R $ref -I recal_reads.bam -ERC GVCF -o /data/wraycompute/vdp5/variants_out_gvcf/${base}.g.vcf

This is done for each sample, and then I do batch calling using GenotypeGVCFs to produce a final file. However, I'm concerned that this approach isn't appropriate for two reasons:

  1. Some subsets of the samples have different characteristics, so we could really get a wide diversity of genotypes for each that may be sequencing bias.

  2. I am not performing the AnalyzeCovariates step after I produce my final set of SNPs. Should I be doing so? Do you have any suggestions for how I should modify my pipeline?


Viewing all articles
Browse latest Browse all 1335

Trending Articles