Hi WDL team,
I have whole-genome data of pooled DNA per population (one BAM file per population) for several populations, and access to a local desktop machine with 125 GB of RAM and 32 cores. My goal is to do SNP calling on the cohort of population files using GVCF mode, with WDL/Cromwell running in "server" mode. For this I plan to:
a) Scatter-gather HaplotypeCaller (in -ERC GVCF mode) by scaffold, for each population file separately
b) Once I have all the g.vcfs, run GenotypeGVCFs to do joint SNP calling across all populations at once (rough sketch below)
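For step b), what I have in mind is roughly the task below (only a sketch; the task and variable names are my own, and the exact arguments for pooled data are part of what I am asking about):

task jointGenotyping {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  Array[File] gvcfs          # one merged g.vcf per population
  Array[File] gvcfIndexes
  String cohortName

  command {
    java -Xmx8g -jar ${GATK} \
      -T GenotypeGVCFs \
      -R ${RefFasta} \
      --variant ${sep=" --variant " gvcfs} \
      -o ${cohortName}.raw.vcf.gz
  }
  output {
    File rawVCF = "${cohortName}.raw.vcf.gz"
    File rawVCFIndex = "${cohortName}.raw.vcf.gz.tbi"
  }
}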
Questions:
1. Since I am using pooled DNA data (40 to 50 individuals per population, diploid organism), I will set -ploidy to 100 and --max_alternate_alleles to 6. Should I set these parameters when producing the g.vcfs (the -ERC step) or during SNP calling (GenotypeGVCFs)?
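In case it helps to see what I mean: if they belong in the -ERC step, I would simply add them to the HaplotypeCaller command of my task further below, i.e. something like this (where exactly the flags go is my own guess):

    java -jar ${GATK} \
      -T HaplotypeCaller \
      -R ${RefFasta} \
      -I ${inputBAM} \
      -L ${scaffold} \
      -ERC GVCF \
      -ploidy 100 \
      --max_alternate_alleles 6 \
      -mbq 20 \
      -minPruning 5 \
      -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf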
2. I made a test run of the scatter part of step a) on one file to get a sense of how long it could take locally; the commands I used are below, adapted from the public WG pipeline. I limited Cromwell to 10 concurrent jobs and ran HaplotypeCaller -ERC per scaffold with the -L flag and default threading (1 CPU?). So far the runs are using 99-110 GB of RAM, but two days have passed and they have not finished yet. How can I make this run faster?
workflow ScatterHaplotypeCaller {
  File scaffoldsFile
  Array[String] callingScaffolds = read_lines(scaffoldsFile)

  scatter (subScaffold in callingScaffolds) {
    call haplotypeCaller { input: scaffold=subScaffold }
  }
}

task haplotypeCaller {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  File inputBAM
  File bamIndex
  String sampleName
  String scaffold

  command {
    java -jar ${GATK} \
      -T HaplotypeCaller \
      -R ${RefFasta} \
      -I ${inputBAM} \
      -L ${scaffold} \
      -ERC GVCF \
      -mbq 20 \
      -minPruning 5 \
      -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
  }
  output {
    File GVCF = "${sampleName}.${scaffold}.rawLikelihoods.g.vcf"
  }
}
2.1 I have 14,000 scaffolds in the reference genome; could scattering into this many jobs be what is making the run slow? Should I instead scatter over intervals that each group several scaffolds? If so, how large should they be? It is a 1 Gb genome, so maybe I should aim for about 50 intervals?
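If grouping is the way to go, what I imagine is changing the scatter to run over pre-made interval-list files instead of single scaffold names, something like this (my own sketch; I would pre-split scaffolds.txt into ~50 .intervals files beforehand, e.g. with split -l 280):

workflow ScatterHaplotypeCallerGrouped {
  Array[File] intervalLists   # ~50 files, each listing ~280 scaffold names, one per line

  scatter (intervals in intervalLists) {
    call haplotypeCaller { input: intervalList=intervals }
  }
}

# and in the task, a File instead of a String, passed to -L:
#   File intervalList
#   ...
#   -L ${intervalList} \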
2.2 Could I use multi-threading (-nct 4) safely in the HaplotypeCaller -ERC when scattering with WDL?
I know that several users have reported problems with multi-threading, but the GATK documentation recommends -nct 4 and the Intel white paper shows that 4 threads reduce running time significantly.
2.3 In the public WG pipeline, within the command section of the HaplotypeCaller task definition (see below), the Java heap is capped with
java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m
and the runtime section sets memory: "10 GB". I wonder whether it is necessary to set a memory limit like this when running GATK locally on my desktop (I know each run uses ~10 GB, which is why I capped Cromwell at 10 concurrent jobs). If so, can I use these very same settings?
# Call variants on a single sample with HaplotypeCaller to produce a GVCF
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  File interval_list
  String gvcf_basename
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  Float? contamination
  Int disk_size
  Int preemptible_tries

  # tried to find lowest memory variable where it would still work, might change once tested on JES
  command {
    java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m \
      -jar /usr/gitc/GATK35.jar \
      -T HaplotypeCaller \
      -R ${ref_fasta} \
      -o ${gvcf_basename}.vcf.gz \
      -I ${input_bam} \
      -L ${interval_list} \
      -ERC GVCF \
      --max_alternate_alleles 3 \
      -variant_index_parameter 128000 \
      -variant_index_type LINEAR \
      -contamination ${default=0 contamination} \
      --read_filter OverclippedRead
  }
  runtime {
    docker: "broadinstitute/genomes-in-the-cloud:2.2.3-1469027018"
    memory: "10 GB"
    cpu: "1"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptible_tries
  }
  output {
    File output_gvcf = "${gvcf_basename}.vcf.gz"
    File output_gvcf_index = "${gvcf_basename}.vcf.gz.tbi"
  }
}
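If it is fine to reuse these settings locally, what I would do is keep the Java heap flags in my own task and trim the runtime section to what the local backend actually uses, along these lines (my assumption being that the docker, disks and preemptible attributes can simply be dropped for a local run):

  command {
    java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m -jar ${GATK} \
      -T HaplotypeCaller \
      -R ${RefFasta} \
      -I ${inputBAM} \
      -L ${scaffold} \
      -ERC GVCF \
      -mbq 20 \
      -minPruning 5 \
      -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
  }
  runtime {
    # no docker/disks/preemptible attributes for the local backend
    memory: "10 GB"
    cpu: "1"
  }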
3. When scattering by scaffold for one file with WDL, should the output g.vcf of the haplotypeCaller task include the scaffold name or not? That is,
-o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
or -o ${sampleName}.rawLikelihoods.g.vcf
? I wonder because, if I don't include the scaffold name, the g.vcfs might be overwritten by whichever scaffold ran last, right?
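For context, the gather part of step a) that I have in mind is to merge the per-scaffold g.vcfs of each population before GenotypeGVCFs, roughly like this (a sketch only; I have not decided between CombineGVCFs and CatVariants, and the task/variable names are mine):

task mergeScaffoldGVCFs {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  Array[File] scaffoldGVCFs   # all per-scaffold g.vcfs of one population
  String sampleName

  command {
    java -Xmx8g -jar ${GATK} \
      -T CombineGVCFs \
      -R ${RefFasta} \
      --variant ${sep=" --variant " scaffoldGVCFs} \
      -o ${sampleName}.merged.g.vcf.gz
  }
  output {
    File mergedGVCF = "${sampleName}.merged.g.vcf.gz"
  }
}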
Thanks very much in advance for any help!!