Best Practices for Batch Calling Malaria

Since there is no gold standard SNP database for Plasmodium vivax, my lab and I have been tinkering with a pipeline for variant calling with GATK. I'm pretty confident with the basic framework we have, but the complication is that we want to identify variants across different populations that were sequenced at different centers.

So far I have as follows:

java -jar -Xmx13g $1 AddOrReplaceReadGroups INPUT=$4 OUTPUT=readgroups.bam RGID=1 RGLB=$base RGPL=illumina RGPU=unit1 RGSM=$base  VALIDATION_STRINGENCY=LENIENT 
java -jar -Xmx13g $1 MarkDuplicates INPUT=readgroups.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt  VALIDATION_STRINGENCY=LENIENT 
rm -rf readgroups.bam
java -jar -Xmx13g $1 BuildBamIndex INPUT=dedup_reads.bam  VALIDATION_STRINGENCY=LENIENT 
java -jar -Xmx13g  $2 -T RealignerTargetCreator -R $ref -I dedup_reads.bam -o realignment_targets.list
java -jar -Xmx13g $2 -T IndelRealigner -R $ref -I dedup_reads.bam -targetIntervals realignment_targets.list -o realigned_reads.bam
rm -rf realignment_targets.list dedup_reads.bam
java -jar -Xmx13g  $2 -T HaplotypeCaller -R $ref -I realigned_reads.bam -o raw_variants.vcf
java -jar -Xmx13g $2 -T SelectVariants -R $ref -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
java -jar -Xmx13g $2 -T SelectVariants -R $ref -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf
rm -rf raw_variants.vcf
java -jar -Xmx13g $2 -T VariantFiltration -R $ref -V raw_snps.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps.vcf
java -jar -Xmx13g $2 -T VariantFiltration -R $ref -V raw_indels.vcf --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' --filterName "basic_indel_filter" -o filtered_indels.vcf
java -jar -Xmx13g $2 -T BaseRecalibrator -R $ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -o recal_data.table
java -jar -Xmx13g $2 -T BaseRecalibrator -R $ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -BQSR recal_data.table -o post_recal_data.table
java -jar -Xmx13g $2 -T AnalyzeCovariates -R $ref -before recal_data.table -after post_recal_data.table -plots ${base}recalibration_plots.pdf
java -jar -Xmx13g $2 -T PrintReads -R $ref -I realigned_reads.bam -BQSR recal_data.table -o recal_reads.bam
java -jar -Xmx13g $2 -T HaplotypeCaller -R $ref -I recal_reads.bam -ERC GVCF -o /data/wraycompute/vdp5/variants_out_gvcf/${base}.g.vcf

This is done for each sample, and then I do batch calling using GenotypeGVCFs to produce a final file. However, I'm concerned that this approach isn't appropriate for two reasons:

Some subsets of the samples have different characteristics, so we could really get a wide diversity of genotypes for each that may be sequencing bias.
I am not performing the AnalyzeCovariates step after I produce my final set of SNPs. Should I be doing so? Do you have any suggestions for how I should modify my pipeline?

Best Practices for Batch Calling Malaria

Trending Articles

Scuffham Amps - S-GEAR 2.6.0 VST, AAX, STANDALONE x86 x64 (R2R NO iLok2, +NO...

Practice Sheet of Right form of verbs for HSC Students

VHSE First (1st) Allotment 2025 - vhscap.kerala.gov.in

UNIVERSE LEAGUE – UNIVERSE LEAGUE – WAR (We Are Ready) – EP [iTunes Plus M4A]

City Hunter Teledrama – Episode 18 – 07th May 2016

Comment on Proposed Criteria for Identifying Predatory Conferences by Luke...

Bureau of Internal Revenue: Regional Offices (Directory)

Kendrick Lamar – Not Like Us (2024) [24Bit-88.2kHz] [PMEDIA] ⭐️

Inception 2010 Hindi Dual Audio 650MB BRRip 720p ESubs HEVC

East Hull MD admits sexual assaults after another victim comes forward

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

R. v. Sargeant, 2023 ONSC 6406 (CanLII)

Rajasthan Board 10th Result 2016 Roll No wise & Name Wise

Who’s been sentenced at Northampton Magistrates’ Court

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Family cries out as traditional ruler allegedly abducts brother, extorts N2.5m

Long-Running Conflict In Springfield (MA) Gangland Sphere Has Manzi Family &...

Wondershare Filmora X v10.1.20.16 x64

Man arrested after fracas in flat

Man charged in ongoing Sexual Assault Investigation Derek Nyilas, 46, Faces...