I am preparing BAM files from the 1000 Genomes Project to use in my GATK pipeline (along with other, already processed BAMs), and I have the following issues:
- the chromosome names in my BAMs follow the GRCh37 convention, but my pipeline uses hg19, so I would like to rename them (1 -> chr1)
- the mitochondrial chromosome is slightly different in hg19 and GRCh37 (see here), so I want to leave it out
- I would actually like to leave out all alternate contigs as well
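To make the goal concrete, here is a minimal sketch of the header change I am after, written as a stand-alone awk filter on SAM header text. The function name `rename_sq_lines` is made up for illustration, and it assumes the primary contigs are named 1-22, X and Y in the input.

```shell
# Hypothetical helper: reads SAM header lines on stdin.
# Non-@SQ lines pass through unchanged; @SQ lines are kept only for
# 1-22, X and Y, and their names get a "chr" prefix. MT and all other
# contigs are dropped.
rename_sq_lines() {
  awk 'BEGIN { FS = OFS = "\t" }
       $1 != "@SQ" { print; next }
       $2 ~ /^SN:([1-9]|1[0-9]|2[0-2]|X|Y)$/ { sub(/^SN:/, "SN:chr", $2); print }'
}

# Toy header for illustration; a real one would come from: samtools view -H INPUT.bam
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:MT\tLN:16569\n@SQ\tSN:GL000207.1\tLN:4262\n@SQ\tSN:X\tLN:155270560\n' | rename_sq_lines
```

On the toy header this keeps the @HD line and the @SQ lines for 1 and X (renamed to chr1 and chrX), and drops MT and GL000207.1.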
This sounds quite trivial, but I haven't found a clean way to do this yet. I have tried the following:
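i=INPUT.bam
j=OUTPUT.bam
# Keep non-@SQ header lines; keep and rename @SQ lines and reads on 1-22, X, Y; also rename the mate reference (column 7)
samtools view -h "$i" | awk 'BEGIN{FS=OFS="\t"} /^@/ && $1!="@SQ"{print; next} $1=="@SQ"{if($2~/^SN:([0-9]+|X|Y)$/){sub(/^SN:/,"SN:chr",$2); print}; next} $3~/^([0-9]+|X|Y)$/{$3="chr"$3; if($7~/^([0-9]+|X|Y)$/){$7="chr"$7}; print}' | samtools view -bS - > "$j"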
However, when I try running the HaplotypeCaller, I get the following error:
ERROR MESSAGE: BAM file(s) do not have the contig: chrM. You are probably using a different reference than the one this file was aligned with
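As far as I understand the message, GATK is comparing the contig names in the reference with the @SQ lines of the BAM header. A minimal sketch of that comparison on two plain lists of names is below. The function name `contigs_missing_from_bam` and the one-name-per-line file layout are assumptions for illustration; the real lists could be extracted from the reference index and from `samtools view -H`.

```shell
# Hypothetical helper: print reference contigs that are absent from the BAM.
# $1: file with reference contig names, one per line
# $2: file with BAM contig names, one per line
contigs_missing_from_bam() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -vxFf "$2" "$1"
}

# Toy lists for illustration
ref_list=$(mktemp)
bam_list=$(mktemp)
printf 'chr1\nchrX\nchrY\nchrM\n' > "$ref_list"
printf 'chr1\nchrX\nchrY\n' > "$bam_list"
contigs_missing_from_bam "$ref_list" "$bam_list"
rm -f "$ref_list" "$bam_list"
```

With these toy lists the only name reported is chrM, which matches the contig named in the error.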
Could you help me prepare these BAM files for processing? Thanks a lot in advance.