Channel: haplotypecaller — GATK-Forum
Viewing all 1335 articles

Haplotype phasing somatic mutations from MuTect2 using read-backed phasing and parental data

To whom it may concern,

I have both normal and tumour samples from a patient, and I also have the parental data (both mother and father). I plan to first phase the SNPs and indels called by HaplotypeCaller using PhaseByTransmission. After that, I want to phase the somatic mutations from MuTect2 using read-backed phasing.

I would like to ask whether the read-backed phasing method considers both the SNPs and the indels spanned by a read, and whether it also uses the information from PhaseByTransmission when phasing the somatic mutations.

Regards,
Sangjin Lee


single-sample GVCF calling on DNAseq with allele-specific annotations

Thanks a lot. I read the argument documentation, but I still cannot understand these arguments:
-G Standard -G AS_Standard

Why are they needed, and when should I add them? Thanks a lot.

A chemotherapy site does not appear in the VCF or the --bam-out BAM but appears in sorted.dedup.bam?

Thanks a lot. I have an important question that I would like to confirm with you.
A very important chemotherapy-related site does not appear in the VCF or in the --bam-out BAM, but it does appear in sorted.dedup.bam, as the figure shows.

gayk.bam is the file written by the --bam-out argument of HaplotypeCaller, and sorted.dedup.bam is the BAM from the preceding steps of the usual workflow.

You can see that there are 532 reads here; in sorted.dedup.bam, 228 reads support the indel and 17 support a deletion.

I know that the --bam-out BAM is a reassembled BAM that stores the variants the GATK model considers reliable based on its statistics.

You can see that this is a repetitive region with many TA repeats. Does this also make it harder for the GATK model to reach a decision?

Thanks a lot. I want to know how I should report this site, because 6TA/7TA, 7TA/7TA, and 6TA/6TA correspond to different levels of chemotherapy toxicity.

Which genotype should I report: 6TA/7TA, 7TA/7TA, or 6TA/6TA?

Thanks a lot.

HaplotypeCaller failed to detect variant.

I have run into a variant detection issue in GATK 4.1 and in an older version (v3.6), as shown in the following example.

I have two BAM files: NA24385_partial.bam and NA24143_partial.bam. In the IGV image, both samples have a variant at 134784873G>A. HaplotypeCaller detected the variant in NA24385 but failed to detect it in NA24143.

Detected:

NA24385_bp_region.vcf
chr9    134784873       .       G       A,<NON_REF>     2527.60 .       BaseQRankSum=-9.003;DP=364;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=1310400,364;ReadPosRankSum=-4.180  GT:AD:DP:GQ:PL:SB       0/1:256,103,0:359:99:2535,0,8789,3303,9099,12402:217,39,69,34

Failed:

NA24143_bp_region.vcf
chr9    134784873       .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:320,265:585:0:0,0,3619

At the variant site, NA24143's PL looks really odd: the likelihoods of the 0/0 and 0/1 genotypes are equal. Why did HaplotypeCaller assign such a high likelihood to 0/0 here?
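To make the reading of the two PL fields above concrete, here is a minimal sketch. It assumes the usual VCF convention that PL values are Phred-scaled and normalised so the most likely genotype gets 0, and that GQ is the gap between the two smallest PL values, capped at 99. The function name `summarize_pl` is hypothetical. It only restates what the records say; it does not explain why HaplotypeCaller produced them.

```python
def summarize_pl(pl_field):
    """Return (indices of genotypes tied for most likely, GQ-like gap) for a PL string."""
    pls = [int(x) for x in pl_field.split(",")]
    best = min(pls)
    # Genotype indices follow VCF ordering: 0/0, 0/1, 1/1, 0/2, ...
    tied = [i for i, p in enumerate(pls) if p == best]
    ordered = sorted(pls)
    # Assumption: GQ is the difference between the two smallest PLs, capped at 99.
    gap = min(ordered[1] - ordered[0], 99)
    return tied, gap


# NA24143: ALT is only <NON_REF>, so the three PLs are REF/REF, REF/<NON_REF>, <NON_REF>/<NON_REF>.
print(summarize_pl("0,0,3619"))
# NA24385: six PLs for the alleles G, A, <NON_REF>.
print(summarize_pl("2535,0,8789,3303,9099,12402"))
```

Under these assumptions, the NA24143 record has two genotypes tied at PL 0 and a gap of 0, which matches its GQ of 0: the 0/0 written in GT is not a confident hom-ref call.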

Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.0.0-local.jar --version

Base quality for the variant region looks good enough.

$ samtools mpileup -r chr9:134784873-134784873 NA24143_partial.bam
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr9    134784873       N       515     G$G$AAAGAAGGGGAAGGAAAAGAAGAAGAAGGGAGGGAGGGAGAGAGGGGGGGGGGGGGGAAAGAGA$GAAGGAGG$AGGAGAGGAGAGGGGGGAGGGGGGGGGAAGGGGGAAaAGGGAGGAAGaaGAAAAAGAAAGGAGGAGAGgGAAAGGAGgAAGGGGAgGGAGGaAGGAGAGGAGAAaAGAGGGAGGAAAGAGGGAGaAAGGAAAAGGGGGGGGGGGGGGGGGGGGGGGAAAAAAGGAGGAagAGGGGGGgGGGGAGAAGGAAGAGaaGGAAGaAAGAAAGGGGGGAAGGAAAGAAAGAGAAGAGGGggAGGAAAGAAGAGGAAggaAGAAGAGGAAGAAGAaggAGGGGagGAGAAAAGagAGAAAGGGagagagaagGGGAAAaAAGGAgGGAAGAGAGGGAAGGAGAAGAGAGGAAGGGAAAAAAGGGAAgaggGAGagGAGGAGGGAAGagagaggaGGAAAAGGAGGGGGAGGGGGGaaagGGAGGAAAGGA^]G^]G^]G^]G^]G^]A^]A^]G^]A^]A^]A^]G^]a^]a^]a^]g^]g       AA=;<?==????;?ADBB00E@CgC/ECbFFADE?E8FFADF/F@EFFEAF9FFFFAFF=.DF@F;F0DFF@A2DFF.F<FFDFDFFFFFFDFFDFFFDDD_8DFEEE8DBDFFFDFFCCEAAFDDDb@FD.DFFDFb@FdFDFDD8FFDFD@CFiFFDDFF/FFADFA@FDFFDFDDA8ADFFFDFFD0/FDFFhDFADCEFCDCDFFFFFFFFFFFFFFFFFFFFFFFDD<D/DFF/FFDADDFFFFFFDFFiFEFcEFF7@F.F@@FF7DF?DDFc0@FFFFFFDDFFE?7FE?D@.F1DF?FFFDDDFFCDDEC7EDEFaDDD?DF@DEOFFCCEDCFCAD@7@EFFdC2DF./D/FA@CE7CCEEE@C@DADAACEaEC?D@BCDD7DEEBCD?DB@DDCbEDCE76AACA3B6@BAA???==?AA@??C@@C3=@@CBABBABBBAAB?B?B?B@?:@?@@@@A@@AFFF1FEFFEE:::=@@?@@_??@@?>>???/?>6>>><<<??
$ samtools --version
samtools 1.2
Using htslib 1.2.1
Copyright (C) 2015 Genome Research Ltd.
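For anyone who wants to turn a pileup line like the one above into allele counts, here is a rough sketch. It assumes the standard samtools pileup encoding for the read-bases column (`^` plus one mapping-quality character marks a read start, `$` marks a read end, and `+N`/`-N` introduces an indel of N bases). The function name is hypothetical, and the short input strings are made up for illustration rather than taken from the post.

```python
import re


def count_pileup_bases(bases):
    """Count base calls in the read-bases column of one samtools mpileup line."""
    counts = {}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read-start marker; the next character encodes mapping quality.
            i += 2
        elif c == "$":
            # Read-end marker.
            i += 1
        elif c in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            m = re.match(r"\d+", bases[i + 1:])
            if m is None:
                i += 1
                continue
            i += 1 + len(m.group()) + int(m.group())
        else:
            # Upper and lower case are the same base on opposite strands.
            counts[c.upper()] = counts.get(c.upper(), 0) + 1
            i += 1
    return counts


# Made-up example strings, not the data from the post.
print(count_pileup_bases("G$G$AAAg^]a^]G"))
print(count_pileup_bases("A+2TTG-1cA"))
```

Pasting the fifth column of the mpileup output into this function would give the G and A counts at the site without counting by eye.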

Difference between HC and UG methods

When I compared the results of HaplotypeCaller and UnifiedGenotyper, I found some loci where a variant was called only by UG. However, both the original BAM file and the bamout file produced by reassembly contain this variant (see figure 1). In the raw BAM, 456 reads (87%) support the reference allele and 69 reads (13%) support the alternate allele; in the bamout file the counts are 439 (86%) and 69 (14%). I am confused by this.

Core dump when using GATK 3.7 haplotyper

I'm using GATK 3.7 and Java 1.8 on the human genome. For specific regions I'm using a ploidy greater than 2 because of the specific aim of the analysis.

/usr/bin/java -Xmx10G -jar /mnt/mfs/hgrcgrid/shared/softwares/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller -R CR1_region1.ploidy_6.fa -I washei36472_ploidy6CR1region1.bam -L CR1_region1.expanded.bed --sample_ploidy 6 --genotyping_mode DISCOVERY --emitRefConfidence GVCF --dontUseSoftClippedBases -o washei36472.3CR1region1.g.vcf

Sorry, I can't format the code and error text using GitHub-style markup. The GATK forum didn't allow me to post a link, maybe because my account is new.

I tried with 6G, 3G, 10G, and 8G. The input BAM and BED files are each less than 1 MB, so I don't know what the problem is or how to fix it. The error also persists if I use the -nct flag. I'm working on a cluster node with 15G of memory, so I can easily provide up to 10G when running GATK.

Error log on screen:

Using AVX accelerated implementation of PairHMM
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b17edc60ce9, pid=30033, tid=0x00002b178ef5c700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libVectorLoglessPairHMM8627199239981386149.so+0x1bce9] LoadTimeInitializer::LoadTimeInitializer()+0x1669
#
# Core dump written. Default location: /mnt/mfs/hgrcgrid/shared/GT_ADMIX/INDEL_comparisons/sequencing_projects/darkgenome/internal_pipeline/align_CR1exons/core or core.30033
#
# An error report file with more information is saved as:
# /mnt/mfs/hgrcgrid/shared/GT_ADMIX/INDEL_comparisons/sequencing_projects/darkgenome/internal_pipeline/align_CR1exons/hs_err_pid30033.log
#
# If you would like to submit a bug report, please visit:
#
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

GATK 4.1 HaplotypeCaller detected the same variant in one technical replicate, but failed in another

Hi,

I tried to use GATK 4.1 HaplotypeCaller to detect EGFR T790M variant in a few technical replicates. It turned out that HaplotypeCaller detected the same variant in one technical replicate, but failed in another replicate. The sample is the commercial EGFR T790M positive control.

The following are the IGV screenshots: in the first one HC called T790M, while in the bottom one HC failed, even though we can see T790M in both the pre-processed and the HC bamout BAM files.

I also have a few diluted sample runs with no T790M detection and the bamout bam files have no read at all at that region while the reads are there in the pre-processed bam files.

These are amplicon based data with very high depth ( > 12,000x). I tried to adjust parameters and so far --max-reads-per-alignment-start 0 --disable-read-filter NotDuplicateReadFilter --adaptive-pruning true --kmer-size 10 --kmer-size 15 --kmer-size 20 --kmer-size 25 --kmer-size 30 is the best I can get as I can identify T790M from 4 out of 12 T790M positives.

Any suggestion?

Thanks a lot for the help!

Ying

Why did I get chaotic results when calling SNPs at all sites with gatk4.0.7.0?

The version I used is gatk4.0.7.0.
The shell script I used is:
software/gatk-4.0.7.0/gatk --java-options "-Xmx4g" HaplotypeCaller -R NewChr.fasta -I split.bam -ERC BP_RESOLUTION -O ss.vcf
I got the following results when I used HaplotypeCaller's "-ERC BP_RESOLUTION":
NewChr1 807256 . T <NON_REF> . PASS . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,225
NewChr1 808431 . G <NON_REF> . PASS . GT:AD:DP:GQ:PL 0/0:7,0:7:18:0,18,270
NewChr1 809041 . A <NON_REF> . PASS . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,225
NewChr1 817071 . A <NON_REF> . PASS . GT:AD:DP:GQ:PL 0/0:1,49:50:0:0,0,0
NewChr1 820714 . G A,<NON_REF> 136.77 PASS BaseQRankSum=-0.842;DP=5;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=18000.00 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:1,4,0:5:30:0|1:820688_CA_C:165,0,30,168,42,210:1,0,3,1

I am confused because most genotypes are "0/0" even though their read counts differ. (For example, at "NewChr1 817071" the read ratio is 1:49, but the genotype is "0/0".)
Could you tell me what happened here? Also, how can I get a correct all-sites VCF?
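One way to separate confident hom-ref sites from sites like the 1:49 one is to look at GQ next to GT. The sketch below assumes a single-sample file whose last two columns are FORMAT and the sample, and it treats GQ 0 as "no confidence in the written genotype", which is what a PL of 0,0,0 implies. The function name is hypothetical, and the ALT column in the example lines is written as `<NON_REF>`, which is what HaplotypeCaller emits in reference-confidence modes.

```python
def flag_zero_confidence_homref(vcf_lines):
    """Return (chrom, pos, AD) for records called 0/0 with GQ == 0."""
    flagged = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        # Single-sample assumption: FORMAT is the second-to-last column.
        sample = dict(zip(fields[-2].split(":"), fields[-1].split(":")))
        if sample.get("GT") == "0/0" and sample.get("GQ") == "0":
            flagged.append((fields[0], int(fields[1]), sample.get("AD")))
    return flagged


records = [
    "NewChr1 809041 . A <NON_REF> . PASS . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,225",
    "NewChr1 817071 . A <NON_REF> . PASS . GT:AD:DP:GQ:PL 0/0:1,49:50:0:0,0,0",
]
print(flag_zero_confidence_homref(records))
```

On these two records only the 817071 site is flagged, so a downstream filter could treat such sites as "no call" rather than as reference.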


[GATK4] combining g.vcf files for single sample on different intervals

I'm working with WGS data, using GATK 4.1.0 HaplotypeCaller to generate g.vcf files on many different non-overlapping intervals.
I'd like to have a single final g.vcf file for the sample.
I'm trying to combine the g.vcf files using CombineGVCFs, but this seems very slow (>24 hours for 100 g.vcf files).

gatk CombineGVCFs --output out.g.vcf  --reference hg38.fa -V list_g.vcf_file

I tried concatenating the g.vcf with unix tools.

cat <(grep '^#' {first.g.vcf}) <(cat `cat {list_g.vcf_file}` | grep -v '^#' )  > output.g.vcf

That took about a minute and a half.
Considering I know the order of the intervals and the fact that they are non overlapping, is there a fundamental problem with concatenating g.vcf files?
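The same concatenation can be sketched in Python together with a sanity check on the result. This is only an illustration under the assumptions stated in the post (same sample, non-overlapping intervals, files supplied in genomic order); both function names are hypothetical. It does not rebuild an index or reconcile headers that differ between files.

```python
def concat_gvcf_lines(files):
    """files: list of iterables of lines, one per g.vcf, in genomic interval order."""
    out = []
    for n, lines in enumerate(files):
        for line in lines:
            if line.startswith("#"):
                # Keep the header of the first file only.
                if n == 0:
                    out.append(line)
            else:
                out.append(line)
    return out


def is_coordinate_grouped(lines):
    """True if each contig appears in one run and positions never decrease within it."""
    seen, last_chrom, last_pos = set(), None, 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split()[:2]
        pos = int(pos)
        if chrom != last_chrom:
            if chrom in seen:
                return False  # contig shows up again after another contig
            seen.add(chrom)
            last_chrom, last_pos = chrom, 0
        if pos < last_pos:
            return False
        last_pos = pos
    return True


# Tiny made-up inputs standing in for two interval files.
a = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\n", "chr1\t10\tx\n", "chr1\t20\tx\n"]
b = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\n", "chr1\t30\tx\n", "chr2\t5\tx\n"]
merged = concat_gvcf_lines([a, b])
print(is_coordinate_grouped(merged))
```

Running the check on the concatenated output is a cheap way to catch an interval list that was supplied in the wrong order.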

Joint Genotyping

I ran GATK HaplotypeCaller without GVCF mode and got a correct output VCF file. I want to do joint genotyping for a BSA QTLseqr analysis, but GATK joint genotyping only accepts gVCF input. I cannot run HaplotypeCaller in GVCF mode because it takes too much time, even if I process the aligned file in intervals. My question is: how can I run joint genotyping on the VCF file?

Physical Phasing Information HaplotypeCaller 4.1.0.0

Hi,

I am looking to use HaplotypeCaller to call germline variants, and I am particularly interested in the orientation of these variants relative to one another (cis- or trans-). There seems to be reference to physical phasing in the [HaplotypeCaller documentation](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--do-not-run-physical-phasing), but I cannot find any physical phasing information in my VCF file.

For instance, I would expect the two variants below:

1 1647722 . G T 307.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.861;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=53.28;MQRankSum=-5.260;QD=10.61;ReadPosRankSum=-0.098;SOR=0.155 GT:AD:DP:GQ:PL 0/1:21,8:29:99:315,0,841
1 1647725 . G A 304.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.277;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=52.38;MQRankSum=-5.262;QD=10.50;ReadPosRankSum=-0.448;SOR=0.204 GT:AD:DP:GQ:PL 0/1:20,9:29:99:312,0,883

to be in the cis- orientation because they share nearly identical read counts, but I cannot find a corresponding annotation in the VCF file that says as much.
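A quick way to check a record for phasing annotations is to look at the sample column for a `|` in GT or for PGT and PID keys; a HaplotypeCaller record posted elsewhere in this channel carries `PGT:PID` in its FORMAT column. The sketch below assumes a single-sample file with FORMAT as the second-to-last column. The function name is hypothetical, and the INFO columns in the example lines are shortened for readability.

```python
def phasing_info(vcf_line):
    """Report phasing-related fields found in a single-sample VCF record."""
    fields = vcf_line.split()
    sample = dict(zip(fields[-2].split(":"), fields[-1].split(":")))
    return {
        "phased_gt": "|" in sample.get("GT", ""),
        "PGT": sample.get("PGT"),
        "PID": sample.get("PID"),
    }


# First record from this post (INFO shortened): no phasing fields present.
plain = "1 1647722 . G T 307.60 . AC=1 GT:AD:DP:GQ:PL 0/1:21,8:29:99:315,0,841"
# Record from another post in this channel (INFO shortened): PGT and PID present.
phased = ("NewChr1 820714 . G A,<NON_REF> 136.77 PASS DP=5 "
          "GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:1,4,0:5:30:0|1:820688_CA_C:165,0,30,168,42,210:1,0,3,1")
print(phasing_info(plain))
print(phasing_info(phased))
```

Note that in the second record GT itself is still written with `/`; the phasing is carried by PGT and PID.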

My command to call HaplotypeCaller is as below:

$gatk_launcher --java-options -Xmx${mem}g HaplotypeCaller \
-R $reference \
-I $bam_file \
-O $out_file \
-L $intervals_split &>> $log_file

Thank you for the help!!

Difference in PL, DP values while running GATK 3.7 HaplotypeCaller on the same sample in two runs

We ran GATK 3.7 HaplotypeCaller on a sample a few months ago to produce a gVCF file. Recently we reran the same sample with the same GATK 3.7 HaplotypeCaller parameters and found that the DP and PL values differ for many variants between the two output gVCF files.

The command line parameters used for both the runs:

          java -Xmx32g -Djava.io.tmpdir=Temp/ -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -I sample.bam -nct 24 --dbsnp dbsnp138.vcf --genotyping_mode DISCOVERY --minPruning 2 -newQual -stand_call_conf 30 --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -L chr1 -G none -l INFO -log sample.log -o sample_chr1.g.vcf.gz

A sample of the differences between the two files, extracted with the diff command:

Lines prefixed with F1 are from the gVCF file generated a few months ago.
Lines prefixed with F2 are from the gVCF file generated recently.

Change 1 observed: DP and PL values differ between the two output gVCF files.

       F1 chr1    1510162    .    A    <NON_REF>    .    .    END=1510162    GT:DP:GQ:MIN_DP:PL    0/0:46:12:46:0,12,1425
       F2 chr1    1510162    .    A    <NON_REF>    .    .    END=1510162    GT:DP:GQ:MIN_DP:PL    0/0:45:9:45:0,9,1380


        F1 chr1    6941045    .    C    <NON_REF>    .    .    END=6941080    GT:DP:GQ:MIN_DP:PL    0/0:14:0:7:0,0,139
        F2 chr1    6941045    .    C    <NON_REF>    .    .    END=6941080    GT:DP:GQ:MIN_DP:PL    0/0:15:0:7:0,0,139


        F1 chr1    45683203    rs34100486    CTTTT    C,<NON_REF>    177.60    .    DB;MLEAC=1,0;MLEAF=0.500,0.00    GT:GQ:PL:SB    0/1:22:185,0,22,188,37,225:1,0,3,2
        F2 chr1    45683203    rs34100486    CTTTT    C,<NON_REF>    168.60    .    DB;MLEAC=1,0;MLEAF=0.500,0.00    GT:GQ:PL:SB    0/1:22:176,0,22,179,37,215:1,0,3,2   

Change 2 observed: 29 variants were added in the recent run's gVCF output file that were not present in the previous run's output file.
Below are a few sample variants added in the new run's gVCF output file:

        F2 chr1    15357649    .    G    <NON_REF>    .    .    END=15357649    GT:DP:GQ:MIN_DP:PL    0/0:41:94:41:0,94,1235
        F2 chr1    15357650    .    A    <NON_REF>    .    .    END=15357650    GT:DP:GQ:MIN_DP:PL    0/0:39:99:39:0,102,1284 

Change 3 observed: 10 variants present in the previous run's gVCF output file are not present in the recent run's output file.
Below are a few sample variants present only in the previous run's gVCF output file:

         F1 chr1    9282514    .    C    CTCCCCCTCCTCCTTGTCTCCTCCTCCCTCTCCCCCT,<NON_REF>    274.01    .    MLEAC=2,0;MLEAF=1.00,0.00    GT:GQ:PL:SB    1/1:20:288,20,0,289,21,290:0,0,0,3
         F1 chr1    9282515    .    T    <NON_REF>    .    .    END=9282515    GT:DP:GQ:MIN_DP:PL    0/0:37:0:37:0,0,820
         F1 chr1    27014608    .    T    <NON_REF>    .    .    END=27014608    GT:DP:GQ:MIN_DP:PL    0/0:35:91:35:0,91,1388

Could you please explain why I get different results from two runs of HaplotypeCaller, and what these changes in values between the two output gVCF files mean? Can this affect the variant calling (joint genotyping) that will be done at a later stage with all samples together?
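The three kinds of change listed above can be tabulated with a small script instead of raw diff output. The sketch below keys records by (contig, start position) and compares selected FORMAT fields. It assumes single-sample gVCFs with the F1/F2 prefixes removed, and the function names are hypothetical. Because reference blocks are keyed by their start position, a block whose boundaries shifted between runs will show up as "only in one run" even though the site is still covered.

```python
def parse_record(line):
    """Return ((chrom, pos), {FORMAT key: value}) for a single-sample record."""
    f = line.split()
    sample = dict(zip(f[-2].split(":"), f[-1].split(":")))
    return (f[0], int(f[1])), sample


def compare_runs(old_lines, new_lines, keys=("DP", "PL")):
    """Return (sites only in old, sites only in new, {site: {key: (old, new)}})."""
    def load(lines):
        return dict(parse_record(l) for l in lines
                    if l.strip() and not l.startswith("#"))
    old, new = load(old_lines), load(new_lines)
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    changed = {}
    for site in sorted(set(old) & set(new)):
        diffs = {k: (old[site].get(k), new[site].get(k))
                 for k in keys if old[site].get(k) != new[site].get(k)}
        if diffs:
            changed[site] = diffs
    return only_old, only_new, changed


old_run = [
    "chr1 1510162 . A <NON_REF> . . END=1510162 GT:DP:GQ:MIN_DP:PL 0/0:46:12:46:0,12,1425",
    "chr1 9282515 . T <NON_REF> . . END=9282515 GT:DP:GQ:MIN_DP:PL 0/0:37:0:37:0,0,820",
]
new_run = [
    "chr1 1510162 . A <NON_REF> . . END=1510162 GT:DP:GQ:MIN_DP:PL 0/0:45:9:45:0,9,1380",
    "chr1 15357649 . G <NON_REF> . . END=15357649 GT:DP:GQ:MIN_DP:PL 0/0:41:94:41:0,94,1235",
]
print(compare_runs(old_run, new_run))
```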

Questions about calculating the genotype likelihoods

On this page, https://software.broadinstitute.org/gatk/documentation/article.php?id=4442, you show the formula used to calculate PL.

I understand most of the formulas there, but I cannot follow the step where G=H1H2 is substituted into P(D|G). I have tried many times and cannot complete the derivation on my own. I think it should be possible to derive the formula for P(D|G) by pure mathematical deduction.

Therefore, if convenient, would you please show me the derivation that proves P(D|G)=P(D|H1)/2 + P(D|H2)/2 (given a single read sequence)?
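One way to see where the factor of 1/2 could come from is the law of total probability. This is a sketch under two assumptions that are not stated in the question: the sample is diploid with genotype G made of haplotypes H1 and H2, and a single read D is equally likely to have been sequenced from either haplotype, so P(H1|G) = P(H2|G) = 1/2. It also assumes that once the source haplotype is known, the read likelihood no longer depends on G.

```latex
\begin{aligned}
P(D \mid G)
  &= \sum_{i=1}^{2} P(D, H_i \mid G)
     && \text{(the read came from } H_1 \text{ or } H_2\text{)} \\
  &= \sum_{i=1}^{2} P(D \mid H_i, G)\, P(H_i \mid G)
     && \text{(chain rule)} \\
  &= \tfrac{1}{2}\, P(D \mid H_1) + \tfrac{1}{2}\, P(D \mid H_2)
     && \text{(equal sampling, } P(D \mid H_i, G) = P(D \mid H_i)\text{)}
\end{aligned}
```

If reads are further assumed to be independent given G, the likelihood for all reads is the product of this single-read expression over the reads.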

Thank you!

Two validated variants missed by HaplotypeCaller using MIP data (amplicon like data)

Dear GATK,

We are using MIPs (amplicon like) data to analyze the variants for certain genes. However, in two independent samples two validated variants were missed by the HaplotypeCaller. We were wondering if you have any idea why these variants were not called?

I've used the latest version of GATK (3.6) and the two commands we performed are:
--filter_mismatching_base_and_quals -R hs_ref_GRCh37.p5_all_contigs.fa -I sample1.sorted.bam -T HaplotypeCaller --emitRefConfidence GVCF -L targets.bed --dbsnp dbsnp_137.hg19.vcf -rf BadCigar -stand_call_conf 30.0 -stand_emit_conf 30.0 -nct 1 -o sample1_haplotypecaller.g.vcf
--filter_mismatching_base_and_quals -R hs_ref_GRCh37.p5_all_contigs.fa -I sample2.sorted.bam -T HaplotypeCaller --emitRefConfidence GVCF -L targets.bed --dbsnp dbsnp_137.hg19.vcf -rf BadCigar -stand_call_conf 30.0 -stand_emit_conf 30.0 -nct 1 -o sample2_haplotypecaller.g.vcf

Attached you will find two pictures of the BAM files used. The mapping quality of the variant reads looks similar to that of the reference reads (~60), as does the base Phred quality (~36). I have also tried many other settings, for example lowering the minimum Phred-scaled confidence thresholds at which variants are called and emitted. Nothing worked to call the variants. However, if I use a smaller target region, I am able to call the variant located on chr8.

The output of the GVCF gave:
chr14 31355353 . C <NON_REF> . . END=31355353 GT:DP:GQ:MIN_DP:PL 0/0:987:0:987:0,0,11170
and
chr8 117861187 . G <NON_REF> . . END=117861187 GT:DP:GQ:MIN_DP:PL 0/0:1253:0:1253:0,0,20903

Thank you very much in advance!
Kind regards,

Maartje

Could you please let me know any tool, to concatenate the gvcf files? Or there is any other solution

Could you please let me know of any tool to concatenate gVCF files? Or is there another way to run the intermediate HaplotypeCaller step in GVCF mode on parts of a chromosome to speed up the process, and then combine the results into one gVCF file before joint genotyping?

No variants found by HaplotypeCaller

I have human RNA-seq data: paired-end reads (2x150bp) aligned with STAR. I tried to run HaplotypeCaller with usual parameters:
`gatk HaplotypeCaller -R GRCh38.primary_assembly.genome.fa -I S_1_rg_added.bam -stand-call-conf 20.0 -O $S_1.vcf`
Resulting vcf contains all the usual header lines but nothing else, no actual variants. End of the program's output log looks like this:

00:04:44.732 INFO ProgressMeter - chrM:1501 7.7 10294250 1332911.1
00:04:48.991 INFO HaplotypeCaller - 62395557 read(s) filtered by: ((((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter) AND WellformedReadFilter)
62395557 read(s) filtered by: (((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter) AND GoodCigarReadFilter)
62395557 read(s) filtered by: ((((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter) AND NonZeroReferenceLengthAlignmentReadFilter)
62395557 read(s) filtered by: (((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter) AND PassesVendorQualityCheckReadFilter)
62395557 read(s) filtered by: ((((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter) AND NotDuplicateReadFilter)
62395557 read(s) filtered by: (((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter) AND NotSecondaryAlignmentReadFilter)
62395557 read(s) filtered by: ((MappingQualityReadFilter AND MappingQualityAvailableReadFilter) AND MappedReadFilter)
62395557 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)
10670778 read(s) filtered by: MappingQualityReadFilter
51724779 read(s) filtered by: MappingQualityAvailableReadFilter

00:04:48.992 INFO ProgressMeter - KI270757.1:71101 7.8 10332600 1325689.4
00:04:48.992 INFO ProgressMeter - Traversal complete. Processed 10332600 total regions in 7.8 minutes.
00:04:49.000 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.0
00:04:49.000 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 0.0
00:04:49.000 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 0.00 sec
00:04:49.000 INFO HaplotypeCaller - Shutting down engine
[April 24, 2019 12:04:49 AM EDT] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 7.82 minutes.
Runtime.totalMemory()=3580887040

Any ideas of what could be wrong here?
Thanks!
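To read the filter summary in the log above more easily, the per-filter lines can be pulled out and summed. The sketch below assumes the log format shown in this post, where lines naming a single filter have no parentheses after "filtered by:"; the function name is hypothetical.

```python
import re


def leaf_filter_counts(log_lines):
    """Map each individually named read filter to the number of reads it removed."""
    counts = {}
    for line in log_lines:
        # Composite lines start with "(" after "filtered by:" and do not match.
        m = re.search(r"(\d+) read\(s\) filtered by: (\w+)\s*$", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return counts


log = [
    "62395557 read(s) filtered by: (MappingQualityReadFilter AND MappingQualityAvailableReadFilter)",
    "10670778 read(s) filtered by: MappingQualityReadFilter",
    "51724779 read(s) filtered by: MappingQualityAvailableReadFilter",
]
counts = leaf_filter_counts(log)
print(counts, sum(counts.values()))
```

The two individual counts add up to the 62395557 reads reported for the combined filter, so in this log every filtered read was removed by one of the two mapping-quality filters.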

How to create a Haplotype file from HaplotypeCaller

I am new to GATK4. I would like to generate a haplotype file (a .bam or VCF file with haplotype information). I followed the Best Practices steps up to HaplotypeCaller.

REF="refs/ucsc.hg19.fasta"
name="sample1"

gatk --java-options "-Xmx16g" HaplotypeCaller \
-R $REF \
-I ${name}.addRG.mkdup.recal.bam \
-ERC GVCF \
-O ${name}.g.vcf.gz \
-bamout ${name}.haplotypes.bam \
--bam-writer-type CALLED_HAPLOTYPES \
--do-not-run-physical-phasing false

However, the bamout output file is 0 KB and there are no warning or error messages.
Is this the right way to create a ${name}.haplotypes.bam file with haplotype information?
Or how can I generate a VCF file with haplotype information ("0|1" or with "PS")?

Thanks.

HaplotypeCaller bugs

Hi,

I have been trying to solve several issues that came up while running HaplotypeCaller. For this one, I couldn't find anything on Google; to be honest, when I paste the error, Google doesn't find anything similar.

ERROR MESSAGE: Badly formed genome loc: Contig NC_007605 given as location, but this contig isn't present in the Fasta sequence dictionary

Can anyone please tell me what's the problem here? The fasta file I got was the one downloaded from the bundle: human_g1k_v37.fasta.gz

Any help would be really appreciated. Thank you!!
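A quick check for this kind of error is to list the contigs in the reference's sequence dictionary and see whether the one named in the message is among them. The sketch below assumes the `.dict` file uses SAM-header style `@SQ` lines with an `SN:` tag. The function name and the example dictionary lines are illustrative and are not taken from the actual bundle file.

```python
def contigs_in_dict(dict_lines):
    """Return contig names from the @SQ lines of a sequence dictionary."""
    names = []
    for line in dict_lines:
        if not line.startswith("@SQ"):
            continue
        for tag in line.rstrip("\n").split("\t")[1:]:
            if tag.startswith("SN:"):
                names.append(tag[3:])
    return names


# Illustrative dictionary content (hypothetical, shortened).
example_dict = [
    "@HD\tVN:1.5\n",
    "@SQ\tSN:1\tLN:249250621\n",
    "@SQ\tSN:2\tLN:243199373\n",
]
names = contigs_in_dict(example_dict)
print("NC_007605" in names)
```

If the contig is missing from the dictionary but present in the BAM header or the interval list, the inputs were built against different references.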

Why is a variant site listed in a GVCF run on a single sample, with no reads showing the ALT variant?

I have read a couple of documents on GVCF but still can't understand how it works. Here is one example from the GVCF file I got from HaplotypeCaller, run on a single BAM file with the `-ERC GVCF` option:
```
chr22 10718959 . T <NON_REF> . . END=10718959 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,42
chr22 10718960 . T <NON_REF> . . END=10718997 GT:DP:GQ:MIN_DP:PL 0/0:1:0:1:0,0,0
chr22 10718998 . C <NON_REF> . . END=10719058 GT:DP:GQ:MIN_DP:PL 0/0:2:3:2:0,3,45
```
When I look at the original BAM file around position 10718959, I see that there is indeed 1 read (as indicated in the `DP` field), but its sequence matches the reference, with no variations! Why is it listed as a potential variant site at all?
Another example of the same kind:
```
chr22 12602453 . G <NON_REF> . . END=12602461 GT:DP:GQ:MIN_DP:PL 0/0:33:99:33:0,99,1038
chr22 12602462 . A <NON_REF> . . END=12602462 GT:DP:GQ:MIN_DP:PL 0/0:36:96:36:0,96,1440
chr22 12602463 . G <NON_REF> . . END=12602464 GT:DP:GQ:MIN_DP:PL 0/0:37:99:37:0,99,1485
```
The genotype quality is very high, and in the BAM file I do see 33-37 reads at this position, but again, all of them match the reference.
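A small helper can make the distinction between reference blocks and variant sites explicit. The sketch below assumes full ten-column records in which a reference-confidence block has `<NON_REF>` as its only ALT allele and an `END=` key in INFO giving the last position the block covers. The function name is hypothetical, and the variant record in the example is shortened from another post in this channel.

```python
def describe_gvcf_record(line):
    """Classify a gVCF record and return (kind, chrom, start, end, length)."""
    f = line.split()
    chrom, pos, alt, info = f[0], int(f[1]), f[4], f[7]
    end = pos
    for item in info.split(";"):
        if item.startswith("END="):
            end = int(item[4:])
    # A record whose only ALT is <NON_REF> describes reference confidence, not a variant.
    kind = "reference block" if alt == "<NON_REF>" else "variant site"
    return kind, chrom, pos, end, end - pos + 1


block = "chr22 10718960 . T <NON_REF> . . END=10718997 GT:DP:GQ:MIN_DP:PL 0/0:1:0:1:0,0,0"
variant = "chr9 134784873 . G A,<NON_REF> 2527.60 . DP=364 GT:AD:DP 0/1:256,103,0:359"
print(describe_gvcf_record(block))
print(describe_gvcf_record(variant))
```

Read this way, the records in this post are statements about how confidently each stretch matches the reference (one read gives GQ 3, 33 or more reads give GQ 96-99), rather than candidate variant sites.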

I would be very grateful if you could point me to any reference or resource detailed enough to learn these sorts of details. So far I have read the
[GVCF - Genomic Variant Call Format](software.broadinstitute.org/gatk/documentation/article?id=11004) document, [FAQ on GVCF](software.broadinstitute.org/gatk/documentation/article.php?id=4017), and VCFv4.2 specs.

How do we know the stretch of variants that have been phased using haplotype caller

I am using the following command to run HaplotypeCaller:

/opt/apps/gatk/4.2.1/gatk HaplotypeCaller \
--dbsnp /home/dhwani.dholakia/archive/files_required_for_exome_analysis/dbsnp/GRCH37.p17_refseq.vcf \
-R /home/dhwani.dholakia/archive/files_required_for_exome_analysis/reference/Homo_sapiens.GRCh37.dna.chromosome.6.fa \
-I base_recalib/abc_aligned_sorted_dupmarked_realigned_recalibrated.bam \
-O haplotype_caller/abc_haplotyper.g.vcf \
--emit-ref-confidence GVCF \
-L home/dhwani.dholakia/archive/files_required_for_exome_analysis/coord.bed \
--max-assembly-region-size 1000 \
-mbq 25 \
--native-pair-hmm-use-double-precision true \
--bam-writer-type CALLED_HAPLOTYPES \
-stand-call-conf 40 \
--activity-profile-out dd.txt

1) GATK 3.6 had an option to define active regions. Is there an equivalent option in GATK v4.2.1?
2) How do we know which variants in the VCF file are phased?
As I understand it, the symbol "|" indicates that variants are phased. But which parameter did I miss that would tell me that the variants from one position to another are phased together?
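If the output carries a PID (phase set identifier) in the sample column, as a HaplotypeCaller record posted elsewhere in this channel does, the stretch of variants phased together can be recovered by grouping records on that value. The sketch below assumes a single-sample file with FORMAT as the second-to-last column. The function name is hypothetical, and the first and third example records are made up for illustration.

```python
from collections import defaultdict


def phase_groups(vcf_lines):
    """Map (chrom, PID) to (first position, last position, number of variants)."""
    members = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.split()
        sample = dict(zip(f[-2].split(":"), f[-1].split(":")))
        pid = sample.get("PID")
        if pid and pid != ".":
            members[(f[0], pid)].append(int(f[1]))
    return {key: (min(pos), max(pos), len(pos)) for key, pos in members.items()}


records = [
    # Hypothetical record opening the phase set.
    "NewChr1 820688 . CA C 90.0 . DP=5 GT:AD:DP:GQ:PGT:PID:PL 0/1:1,4:5:30:0|1:820688_CA_C:120,0,30",
    # Record from another post in this channel (INFO shortened, <NON_REF> fields dropped).
    "NewChr1 820714 . G A 136.77 . DP=5 GT:AD:DP:GQ:PGT:PID:PL 0/1:1,4:5:30:0|1:820688_CA_C:165,0,30",
    # Hypothetical unphased record.
    "NewChr1 821000 . T C 50.0 . DP=9 GT:AD:DP:GQ:PL 0/1:5,4:9:60:80,0,60",
]
print(phase_groups(records))
```

Each key in the result is one phased stretch, and the first and last positions give its extent on the contig.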
