Low-frequency variant calling from high-quality mtDNA sequencing data

Marita A. Isokallio; James Stewart

Apr 19, 2018

Low-frequency variant calling from high-quality mtDNA sequencing data

DOI

dx.doi.org/10.17504/protocols.io.nfkdbkw

Marita A. Isokallio¹,
James B Stewart²

¹Max Planck Instutute for Biology of Ageing;
²Biosciences Institute

Stewart-Lab MPI-AGE

Marita A. Isokallio

FIMM, University of Helsinki

DOI: dx.doi.org/10.17504/protocols.io.nfkdbkw

Protocol Citation: Marita A. Isokallio, James B Stewart 2018. Low-frequency variant calling from high-quality mtDNA sequencing data. protocols.io https://dx.doi.org/10.17504/protocols.io.nfkdbkw

Manuscript citation:

Isokallio, M. (2017) The source and fate of mitochondrial DNA mutations using high-sensitivity next-generation sequencing technologies. PhD thesis, Universität zu Köln.
http://kups.ub.uni-koeln.de/7941/

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: February 25, 2018

Last Modified: April 19, 2018

Protocol Integer ID: 10444

Keywords: mitochondrial DNA, mtDNA, NGS, rare variant detection, low-frequency variant

Abstract

High-quality, highly enriched mitochondrial DNA (mtDNA) sample enables detection of extremely low-frequency mtDNA variants. When mtDNA is carefully enriched, fragmented in the presence of EDTA and sequenced with unique dual indices by Illumina HiSeq in paired-end mode, it is possible to reliably detect variants at allele frequencies <0.05 %. To be able to detect variants at every position of the mtDNA genome, 'dual' alignment and variant calling strategy is required.

Guidelines

The unique dual indices require paired-end sequencing and produce R1 and R2 reads. However, to our experience, R2 reads seem to cause more artifactual variant results (mainly GC>TA indicative of oxidative damage). Thus, only use of R1 reads for low-frequency variant detection is recommended.
 
If 'Before starting' sample quality requirements cannot be met, more stringent variant calling thresholds - namely allele frequency threshold - may be required and extremely low-frequency variant calling is not possible due to the chemical/biological artifacts present in the sample. Furthermore, it is always advisable to use control samples and confirm the accuracy, preferably in multiple experiments (meaning re-preparation of the libraries from the same case and control samples and sequencing the samples in another sequencing run).

Before start

Ensure that your mtDNA sample used for the sequencing
1) is of high quality and highly enriched from nuclear DNA (nDNA) contamination,
2) 1 mM of EDTA is included into the sonication step of the Illumina library preparation,
3) unique dual indexing is used and reads are sequenced in paired-end mode, and
4) reads are de-multiplexed based on both indices.
 
Addition of EDTA (or repair enzymes) has been recommended in order to diminish potential oxidative damage or propagation of other damages during library PCR (Costello et al. 2012, Chen et al. 2017). Whereas dual-indexing significantly reduces the possibility of between-sample cross-contamination (Kircher et al. 2012).
 
 
References
 
Chen, L. et al., 2017. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science, 355, pp.752–756.
 
Costello, M. et al., 2013. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res., 41(6), pp.1–12.
 
Kircher, M., Sawyer, S. & Meyer, M., 2012. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform. Nucleic Acids Res., 40(1), p.e3.

Read trimming

R1 reads are trimmed for minimum length, minimun base quality and for TruSeq adapter sequences.
Command
-n number of cores
-r path to R1 reads
-t prefix for output
-f quality format
-j produce read length file
-z GZ compressed output
-q minimum base quality for 3' end
-m minimun read length (bp)
-a adapter file
-ao minimum adapter sequence overlap
-at allowed mismatches in the adapter sequence per 10 bp
-ae adapter trimming end (Unix)
flexbar -n 16 -r /path/to/reads1.fastq.gz -t output_prefix -f i1.8 -j -z GZ -q 28 -m 50 -a /path/to/adapters.txt -ao 10 -at 1 -ae ANY > file.log.txt
Software
Flexbar
NAME
Unix
OS
Dodt, M., Roehr, J.T., Ahmed, R., Dietrich, C.
DEVELOPER
https://sourceforge.net/projects/flexbar/files/
SOURCE LINK

Read alignment to NORMAL reference genome

Reads are first aligned to normal mouse mtDNA reference genome.
Command
-t threads
-P single-end reads
-T minimum score to output
-B allowed mismatches
-L penalty for 5' and 3' end clipping (normal reference genome)
# Create index for the reference genome (required only once)
bwa index -p reference reference.fa
# bwa mem alignment
bwa mem -t 15 -P -T 19 -B 3 -L 5,4 /path/to/reference/reference
/path/to/reads1_flexbar.fastq.gz > reads1.sam
# Sorting and indexing
samtools view -Sbu reads1.sam | samtools sort - -T reads1.sorted -o
reads1.sorted.bam
samtools index reads1.sorted.bam
Software
BWA
NAME
Unix
OS
Li, H., Durbin, R.
DEVELOPER
http://bio-bwa.sourceforge.net/
SOURCE LINK
Software
SAMtools
NAME
Unix
OS
Li, H. et al.
DEVELOPER
http://samtools.sourceforge.net/
SOURCE LINK
Note
This will allow variant detection on positions 200 to 16099 i.e. on the full genome excluding the circular genome junction region.

Read alignment to a SPLIT reference genome

Generate a split reference genome. 
 
NOTE: this step only needs to be run once and the generated files can be used for the next analysis.
Command
## Create the split reference genome (required only once) ##
# Create variables
split_genome=reference_split.fa
single_line=reference_singleline.fa
split_data=reference_split.data
# Transform the reference fasta file to contain the sequence as a single line
header=$(cat /path/to/reference.fa | grep '>')
cat /path/to/reference.fa | grep -v '>' | tr -d '\n' > $single_line
# Calculate the length of the input reference genome sequence to determine the cutting position
half_split=$(cat $single_line | awk 'BEGIN {junc=0} junc=int(length($0)/2)
{print junc}')
half_split1=$(echo $half_split | awk '{print $0+1}')
full_len=$(cat $single_line | awk 'BEGIN {len=0} len=length($0) {print len}')
paste <(echo $half_split) <(echo $half_split1) <(echo $full_len) >$split_data
# Take the genome halves according to the calculated positions
gen_start=$(cat $single_line | cut -c1-$half_split)
gen_end=$(cat $single_line | cut -c${half_split1}-$full_len)
# Combine the halves in correct order and restore the fasta format
gen_split=$(paste <(echo $gen_end) <(echo $gen_start) | tr -d '\t')
paste <(echo $header) <(echo $gen_split) | tr '\t' '\n' > $split_genome
Note
This will allow variant detection on positions at the circular genome junction region i.e. positions 1-199 and 16100-16299.

Similar to Step 2, reads are now aligned to the split mouse mtDNA reference genome generated in Step 3.
Command
-t threads
-P single-end reads
-T minimum score to output
-B allowed mismatches
-L penalty for 5' and 3' end clipping (split reference genome)
# Create index for the reference genome (required only once)
bwa index -p reference_split reference_split.fa
# bwa mem alignment
bwa mem -t 15 -P -T 19 -B 3 -L 5,4 /path/to/reference/reference_split
/path/to/reads1_flexbar.fastq.gz > reads1_junction.sam
# Sorting and indexing
samtools view -Sbu reads1_junction.sam | samtools sort - -T
reads1_junction.sorted -o reads1_junction.sorted.bam
samtools index reads1_junction.sorted.bam

Read filtering

In order to exclude poorly aligned reads, the aligned reads are filtered for minimum mapping quality.
Command
-b output BAM format
-q minimum mapping quality to keep (Unix)
# Quality filter only uniquely aligned reads for further processing (bwa mem mapping quality 0 indicates multi-mapping reads)
# First the reads aligned to normal reference genome
samtools view -bq 1 reads1.sorted.bam > reads1.accepted.bam
samtools index reads1.accepted.bam
# Similar to reads aligned to the split reference genome
samtools view -bq 1 reads1_junction.sorted.bam > reads1_junction.accepted.bam
samtools index reads1_junction.accepted.bam
Note
NOTE: Different alignment tools use different mapping qualities, see for example a blog post by Keith Bradnam.

Calculate coverage

Calculate coverage for both alignment files, the one from alignment to the normal reference genome and the one from alignment to the split reference genome.
 
Subset the results such that the entire genome is represented with correct coverage i.e. use the alignment to the normal reference genome for the positions 200-16099 (mouse genome positions) and the alignment to split reference genome for the junction region positions between -200 and +200.
Command
# Firts calculate coverage for the normal reference alignment
bedtools genomecov -d -ibam reads1.accepted.bam -g reference.fa > coverage.txt
# Then the same for the split reference alignment
bedtools genomecov -d -ibam reads1_junction.accepted.bam -g reference_split.fa > coverage_junction.txt
## Combine the coverages in order to represent the entire mtDNA genome
# Take the middle and end points of the split junction genome for extracting correct lines
mid_point=$(cat reference_split.data | awk '{print $2}')
end_point=$(cat reference_split.data | awk '{print $3}')
# Middle part of the normal coverage file
cat coverage.txt | awk -v endpoint=
Software
BEDTools
NAME
Unix
OS
Quinlan, A.R., Hall, I.M.
DEVELOPER
https://github.com/arq5x/bedtools2
SOURCE LINK

Variant calling

Variant calling is done for both alignment files, the one that was aligned to the normal reference genome and the one that was aligned to the split reference genome.
 
Variant calling sets minimum base-calling qualities to be considered for the reference and alternative bases and calls indels simultaneosly.
 
Variants are filtered only for strand bias and minimum quality.
 
NOTE: Disabling the default run of lofreq filter allows detection of extremely rare variants from a high-quality, highly enriched mtDNA sample.
Command
--dindel Add Dindel's indel qualities (Illumina specific)
--pp-threads number of threads
-f reference genome .fasta file
-o output file
-N Don't merge mapping quality in LoFreq's model
-B Disable use of base-alignment quality (BAQ)
-q minimum REF base quality
-Q minimum ALT base quality
--call-indels enable indel calls
--no-default-filter Don't run default 'lofreq filter' automatically after calling variants
 (variant calling for normal and split alignments)
# First set the indel qualities for using --call-indels
lofreq indelqual --dindel --ref reference.fa --out reads1.indelqual.bam reads1.accepted.bam
lofreq indelqual --dindel --ref reference_split.fa --out reads1_junction.indelqual.bam reads1_junction.accepted.bam
# Generate the index for both files
samtools index reads1.indelqual.bam
samtools index reads1_junction.indelqual.bam
# Variant calling including indels
lofreq call-parallel --pp-threads 20 -f reference.fa -o reads1.nofilter.vcf -N -B -q 30 -Q 30 --call-indels --no-default-filter reads1.indelqual.bam
lofreq call-parallel --pp-threads 20 -f reference_split.fa -o reads1_junction.nofilter.vcf -N -B -q 30 -Q 30 --call-indels --no-default-filter reads1_junction.indelqual.bam
# Filtering the results if >85% of variant reads are on single strand and for minimum quality
lofreq filter --no-defaults --snvqual-thresh 70 --indelqual-thresh 70 --sb-incl-indels -B 60 -i reads1.nofilter.vcf -o reads1.sbfiltered.vcf
lofreq filter --no-defaults --snvqual-thresh 70 --indelqual-thresh 70 --sb-incl-indels -B 60 -i reads1_junction.nofilter.vcf -o reads1_junction.sbfiltered.vcf
Software
LoFreq
NAME
Unix
OS
Wilm A. et al.
DEVELOPER
https://github.com/CSB5/lofreq
SOURCE LINK
Software
SAMtools
NAME
Unix
OS
Li, H. et al.
DEVELOPER
http://samtools.sourceforge.net/
SOURCE LINK

Combine the called variants from the both VCF files (normal and split reference genomes) and re-coordinate the positions.
 
First, use the reference_split.data file generated in the Step 3 that describes the middle and last genome position numbers.
 
From the VCF file that used the split reference genome, take only the variants identified on the genome junction region i.e. positions between -200 and +200. Replace the position numbers with those corresponding the normal reference genome in order to combine these variants with the ones in the VCF file that used the normal reference genome.
 
From the VCF file that used the normal reference genome, take all variants except those on the above described genome junction region.
 
Combine the variant lists into a single VCF file that now has the positions according to the normal reference genome and variants are correctly detected also around the junction region.
 
Finally, separate SNVs and indels into separate files for downstream processing.
Command
## Combine the original and junction region results for the junction region ##
# Read in the mid and end points of the split genome junction region (generated in Step 3 for the split reference genome)
mid_point=$(cat reference_split.data | awk '{print $2}')
end_point=$(cat reference_split.data | awk '{print $3}')
# Re-coordinate the variants and take only -200 and +200 region
cat reads1_junction.sbfiltered.vcf | awk '/#CHROM/ {flag = 1; next} flag {print}' | \
awk -v midpoint=
Software
LoFreq
NAME
Unix
OS
Wilm A. et al.
DEVELOPER
https://github.com/CSB5/lofreq
SOURCE LINK

Variant filtering

Variants can be further filtered and annotated with snpEff.
 
First, make sure that you have the snpEff config file correctly set up to use the mitochondrial codon table (use the same genome version as you used for the alignment).
 
Then, filter the variants with wanted thresholds. Here, we use the minimum number of total alternative reads (calculated as depth * allele frequency) together with minimum number of alternative reads on each strand (based on DP4 values).
 
In the next step, we apply an additional filter for variants with extreme strand-bias Phred score (SB>1000), which were not filtered out in the earlier step.
 
Simultaneously, we filter the results for the near-homoplasmic variants that has expanded in our inbred mouse strain (this part should be omitted if your mouse strain does not carry these variants).
Software
SnpEff
NAME
Unix
OS
Cingolani P. et al.
DEVELOPER
https://github.com/pcingola/SnpEff
SOURCE LINK
Command
## Make sure the snpEff config file uses mitochondrial codons for annotations (required only once)
cd ~
mkdir snpEff_data
# Modify snpEff.config file
# data.dir = ~/snpEff_data/
# Fix mitochondrial codon table usage in the .config file to the wanted genome version:
## Original rows in snpEff.config
## GRCm38.82.genome : Mus_musculus
## GRCm38.82.reference : ftp://ftp.ensembl.org/pub/release-82/gtf/
# Add MT.codonTable in between:
## GRCm38.82.genome : Mus_musculus
## GRCm38.82.MT.codonTable: Vertebrate_Mitochondrial
## GRCm38.82.reference : ftp://ftp.ensembl.org/pub/release-82/gtf/
###
Command
# Filter variants for minimum number of supporting reads in total and on both strands
cat reads1.snvs.vcf | \
java -jar /software/snpEff/4.2/SnpSift.jar filter \

Public workspaceLow-frequency variant calling from high-quality mtDNA sequencing data

Low-frequency variant calling from high-quality mtDNA sequencing data