Hi WDL team,
I have whole-genome data of pooled DNA per population (one BAM file per population) for several populations, and access to a local desktop machine with 125 GB of RAM and 32 cores. My goal is to do SNP calling on the cohort of population files using GVCF mode, with WDL/Cromwell running in "server" mode. For this I plan to:
a) Scatter-gather HaplotypeCaller (in -ERC GVCF mode) by scaffold, for each population file separately
b) Once I have all the g.vcfs, run GenotypeGVCFs to do joint SNP calling across all populations at once (rough sketch below)
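For step b), what I have in mind is roughly the task below (only a sketch; the task and variable names are my own, and the exact arguments for pooled data are part of what I am asking about):

task jointGenotyping {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  Array[File] gvcfs          # one merged g.vcf per population
  Array[File] gvcfIndexes
  String cohortName

  command {
    java -Xmx8g -jar ${GATK} \
      -T GenotypeGVCFs \
      -R ${RefFasta} \
      --variant ${sep=" --variant " gvcfs} \
      -o ${cohortName}.raw.vcf.gz
  }
  output {
    File rawVCF = "${cohortName}.raw.vcf.gz"
    File rawVCFIndex = "${cohortName}.raw.vcf.gz.tbi"
  }
}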
Questions:
1. Since I am using pooled DNA data (40 to 50 individuals per population, diploid organism), I will set -ploidy to 100 and --max_alternate_alleles to 6. Should I set these parameters when producing the g.vcfs (the -ERC step) or during SNP calling (GenotypeGVCFs)?
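In case it helps to see what I mean: if they belong in the -ERC step, I would simply add them to the HaplotypeCaller command of my task further below, i.e. something like this (where exactly the flags go is my own guess):

    java -jar ${GATK} \
      -T HaplotypeCaller \
      -R ${RefFasta} \
      -I ${inputBAM} \
      -L ${scaffold} \
      -ERC GVCF \
      -ploidy 100 \
      --max_alternate_alleles 6 \
      -mbq 20 \
      -minPruning 5 \
      -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf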
2. I made a test run of the scatter part of step a) on one file to get a sense of how long it could take locally; the commands I used are below, adapted from the public WG pipeline. I limited Cromwell to 10 concurrent jobs and ran HaplotypeCaller -ERC per scaffold with the -L flag and default threading (1 CPU?). So far the runs are using 99-110 GB of RAM, but two days have passed and they have not finished yet. How can I make this run faster?
workflow ScatterHaplotypeCaller {
  File scaffoldsFile
  Array[String] callingScaffolds = read_lines(scaffoldsFile)

  scatter (subScaffold in callingScaffolds) {
    call haplotypeCaller { input: scaffold=subScaffold }
  }
}

task haplotypeCaller {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  File inputBAM
  File bamIndex
  String sampleName
  String scaffold

  command {
    java -jar ${GATK} \
      -T HaplotypeCaller \
      -R ${RefFasta} \
      -I ${inputBAM} \
      -L ${scaffold} \
      -ERC GVCF \
      -mbq 20 \
      -minPruning 5 \
      -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
  }
  output {
    File GVCF = "${sampleName}.${scaffold}.rawLikelihoods.g.vcf"
  }
}
2.1 I have 14,000 scaffolds in the reference genome; could scattering into this many jobs be what is making the run slow? Should I instead scatter over intervals that each group several scaffolds? If so, how large should they be? It is a 1 Gb genome, so maybe I should aim for about 50 intervals?
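If grouping is the way to go, what I imagine is changing the scatter to run over pre-made interval-list files instead of single scaffold names, something like this (my own sketch; I would pre-split scaffolds.txt into ~50 .intervals files beforehand, e.g. with split -l 280):

workflow ScatterHaplotypeCallerGrouped {
  Array[File] intervalLists   # ~50 files, each listing ~280 scaffold names, one per line

  scatter (intervals in intervalLists) {
    call haplotypeCaller { input: intervalList=intervals }
  }
}

# and in the task, a File instead of a String, passed to -L:
#   File intervalList
#   ...
#   -L ${intervalList} \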
2.2 Could I use multi-threading (-nct 4) safely in the HaplotypeCaller -ERC when scattering with WDL?
I know that several users have reported problems with multi-threading, but the GATK documentation recommends -nct 4 and the Intel white paper shows that 4 threads reduce running time significantly.
2.3 In the public WG pipeline, within the command section of the HaplotypeCaller task definition (see below), the Java heap is capped with
java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m
and the runtime section sets memory: "10 GB". I wonder whether it is necessary to set a memory limit like this when running GATK locally on my desktop (I know each run uses ~10 GB, which is why I capped Cromwell at 10 concurrent jobs). If so, can I use these very same settings?
# Call variants on a single sample with HaplotypeCaller to produce a GVCF
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  File interval_list
  String gvcf_basename
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  Float? contamination
  Int disk_size
  Int preemptible_tries

  # tried to find lowest memory variable where it would still work, might change once tested on JES
  command {
    java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m \
      -jar /usr/gitc/GATK35.jar \
      -T HaplotypeCaller \
      -R ${ref_fasta} \
      -o ${gvcf_basename}.vcf.gz \
      -I ${input_bam} \
      -L ${interval_list} \
      -ERC GVCF \
      --max_alternate_alleles 3 \
      -variant_index_parameter 128000 \
      -variant_index_type LINEAR \
      -contamination ${default=0 contamination} \
      --read_filter OverclippedRead
  }
  runtime {
    docker: "broadinstitute/genomes-in-the-cloud:2.2.3-1469027018"
    memory: "10 GB"
    cpu: "1"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptible_tries
  }
  output {
    File output_gvcf = "${gvcf_basename}.vcf.gz"
    File output_gvcf_index = "${gvcf_basename}.vcf.gz.tbi"
  }
}
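If it is fine to reuse these settings locally, what I would do is keep the Java heap flags in my own task and trim the runtime section to what the local backend actually uses, along these lines (my assumption being that the docker, disks and preemptible attributes can simply be dropped for a local run):

  command {
    java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m -jar ${GATK} \
      -T HaplotypeCaller \
      -R ${RefFasta} \
      -I ${inputBAM} \
      -L ${scaffold} \
      -ERC GVCF \
      -mbq 20 \
      -minPruning 5 \
      -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
  }
  runtime {
    # no docker/disks/preemptible attributes for the local backend
    memory: "10 GB"
    cpu: "1"
  }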
3. When scattering by scaffold for one file with WDL, should the output g.vcf of the haplotypeCaller task include the scaffold name or not? That is,
-o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
or -o ${sampleName}.rawLikelihoods.g.vcf
? I wonder because, if I don't include the scaffold name, the g.vcfs might be overwritten by whichever scaffold ran last, right?
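For context, the gather part of step a) that I have in mind is to merge the per-scaffold g.vcfs of each population before GenotypeGVCFs, roughly like this (a sketch only; I have not decided between CombineGVCFs and CatVariants, and the task/variable names are mine):

task mergeScaffoldGVCFs {
  File GATK
  File RefFasta
  File RefIndex
  File RefDict
  Array[File] scaffoldGVCFs   # all per-scaffold g.vcfs of one population
  String sampleName

  command {
    java -Xmx8g -jar ${GATK} \
      -T CombineGVCFs \
      -R ${RefFasta} \
      --variant ${sep=" --variant " scaffoldGVCFs} \
      -o ${sampleName}.merged.g.vcf.gz
  }
  output {
    File mergedGVCF = "${sampleName}.merged.g.vcf.gz"
  }
}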
Thanks very much in advance for any help!!