Channel: haplotypecaller — GATK-Forum

How to run HaplotypeCaller on pooled DNA data locally using scatter-gather implemented in WDL?


Hello WDL team,

I have whole-genome data from pooled DNA for several populations (one BAM file per population), and access to a local desktop with 125 GB RAM and 32 cores. My goal is to call SNPs in a cohort of population files using GVCF mode and WDL in "server" mode. For this I plan to:

a) Scatter-gather HaplotypeCaller (-ERC GVCF mode) by scaffold, separately for each population file
b) Once I have all the g.vcfs, run joint genotyping (GenotypeGVCFs) to call SNPs for all populations at once

Questions:

1. Since I am using pooled DNA data (40 to 50 individuals per population, diploid organism, so up to 100 chromosome copies per pool), I will set -ploidy to 100 and -max_alternate_alleles to 6. Should I set these parameters when producing the g.vcfs (-ERC) or during SNP calling (GenotypeGVCFs)?

2. I made a test run of the scatter part of step a) on one file to get a sense of how long it could take locally; the commands, adapted from the public WG pipeline, are below. I limited Cromwell to 10 concurrent jobs and ran HaplotypeCaller -ERC per scaffold with the -L flag and default threading (1 CPU?). The run is using 99-110 GB of RAM, but after 2 days it has still not finished. How can I make it run faster?

workflow ScatterHaplotypeCaller {
    File scaffoldsFile
    Array[String] callingScaffolds = read_lines(scaffoldsFile)
    scatter(subScaffold in callingScaffolds) {
        call haplotypeCaller { input: scaffold=subScaffold }
    }
}

task haplotypeCaller {
    File GATK
    File RefFasta
    String sampleName
    File inputBAM
    File RefIndex
    File RefDict
    File bamIndex
    String scaffold
    command {
        java -jar ${GATK} \
            -T HaplotypeCaller \
            -R ${RefFasta} \
            -I ${inputBAM} \
            -L ${scaffold} \
            -ERC GVCF \
            -mbq 20 \
            -minPruning 5 \
            -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf
    }
    output {
        File GVCF = "${sampleName}.${scaffold}.rawLikelihoods.g.vcf"
    }
}
  • 2.1 The reference genome has 14 000 scaffolds; could scattering over that many pieces be what makes the run slow? Should I define intervals that each group several scaffolds instead of scattering per scaffold? If so, how long should they be? It is a 1 Gb genome, so maybe I should aim to split it into 50 intervals?

  • 2.2 Can I safely use multi-threading (-nct 4) in HaplotypeCaller -ERC when scattering with WDL?
    I know that several users have reported problems with multi-threading, but the GATK documentation recommends -nct 4 and the Intel white paper shows that 4 threads significantly reduce running time.

  • 2.4 In the WG public pipeline, the command section of the HaplotypeCaller task definition (see below) sets the Java memory options with java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m, and the runtime section sets memory: "10 GB". Is it necessary to set a memory limit when running GATK locally? On my desktop computer each job uses ~10 GB, which is why I set Cromwell to run at most 10 concurrent jobs. If so, can I use the very same options?

# Call variants on a single sample with HaplotypeCaller to produce a GVCF
task HaplotypeCaller {
  File input_bam
  File input_bam_index
  File interval_list
  String gvcf_basename
  File ref_dict
  File ref_fasta
  File ref_fasta_index
  Float? contamination
  Int disk_size
  Int preemptible_tries

  # tried to find lowest memory variable where it would still work, might change once tested on JES
  command {
    java -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Xmx8000m \
      -jar /usr/gitc/GATK35.jar \
      -T HaplotypeCaller \
      -R ${ref_fasta} \
      -o ${gvcf_basename}.vcf.gz \
      -I ${input_bam} \
      -L ${interval_list} \
      -ERC GVCF \
      --max_alternate_alleles 3 \
      -variant_index_parameter 128000 \
      -variant_index_type LINEAR \
      -contamination ${default=0 contamination} \
      --read_filter OverclippedRead
  }
  runtime {
    docker: "broadinstitute/genomes-in-the-cloud:2.2.3-1469027018"
    memory: "10 GB"
    cpu: "1"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptible_tries
  }
  output {
    File output_gvcf = "${gvcf_basename}.vcf.gz"
    File output_gvcf_index = "${gvcf_basename}.vcf.gz.tbi"
  }
}

3. When scattering by scaffold for one file using WDL, should the output g.vcf file name in the HaplotypeCaller task definition include the scaffold name? That is, -o ${sampleName}.${scaffold}.rawLikelihoods.g.vcf or -o ${sampleName}.rawLikelihoods.g.vcf? I ask because, without the scaffold name, I suspect each g.vcf would be overwritten by the latest scaffold run. Is that right?
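
To make question 2.1 concrete, here is a minimal Python sketch of how the 14 000 scaffolds might be grouped into about 50 intervals of similar total length. It is only an illustration under several assumptions: the function names (read_fai, group_scaffolds, write_interval_lists) and the output prefix are hypothetical, scaffold lengths are taken from the reference .fai index (first two tab-separated columns), and it assumes that -L accepts a plain text file with one scaffold name per line. The grouping is a simple greedy heuristic (longest scaffold first, always into the currently smallest group), not anything taken from the public pipeline.

```python
import heapq
import sys
from typing import List, Tuple


def read_fai(path: str) -> List[Tuple[str, int]]:
    """Return (scaffold_name, length) pairs from a samtools .fai index."""
    scaffolds = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                scaffolds.append((fields[0], int(fields[1])))
    return scaffolds


def group_scaffolds(scaffolds: List[Tuple[str, int]], n_groups: int) -> List[List[str]]:
    """Greedily split scaffolds into at most n_groups groups of similar total length."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if not scaffolds:
        return []
    n_groups = min(n_groups, len(scaffolds))
    # Min-heap of (total_length_so_far, group_index).
    heap = [(0, i) for i in range(n_groups)]
    groups: List[List[str]] = [[] for _ in range(n_groups)]
    # Longest scaffolds first, each one into the currently smallest group.
    for name, length in sorted(scaffolds, key=lambda s: s[1], reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + length, idx))
    return groups


def write_interval_lists(groups: List[List[str]], prefix: str) -> List[str]:
    """Write one file per group, one scaffold name per line; return the paths."""
    paths = []
    for i, names in enumerate(groups):
        path = f"{prefix}.{i:04d}.list"
        with open(path, "w") as out:
            out.write("\n".join(names) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical usage: python group_scaffolds.py ref.fasta.fai 50 scatter
    fai_path, n_groups, prefix = sys.argv[1], int(sys.argv[2]), sys.argv[3]
    for written in write_interval_lists(group_scaffolds(read_fai(fai_path), n_groups), prefix):
        print(written)
```

With something like this, the workflow would scatter over about 50 list files instead of 14 000 scaffold names, and each shard would name its output after the list file rather than after a single scaffold.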

Thanks very much in advance for any help!!

