License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: September 14, 2023
Last Modified: December 26, 2023
Protocol Integer ID: 87782
Keywords: whole genome sequencing, WGS data processing, HPC, Slurm scheduler, parallel processing, reference genome, alignment, variant calling, variant annotation, quality control, genomics research
Abstract
In the rapidly evolving landscape of genomics research, the advent of whole genome sequencing (WGS) has ushered in an era of unprecedented data volumes, necessitating robust and efficient computational strategies for data processing. This protocol delineates a comprehensive and optimized workflow specifically tailored for the intricate demands of WGS data analysis on High-Performance Computing (HPC) infrastructure, leveraging the Slurm scheduler for efficient resource management.
The protocol begins with meticulous data preprocessing, encompassing quality control, trimming, and adapter removal, while concurrently harnessing the parallel processing capabilities of HPC clusters. The subsequent alignment of preprocessed reads to a reference genome is performed utilizing parallelized tools, ensuring computational efficiency and minimizing analysis time. Variant calling, a pivotal step in WGS analysis, is seamlessly integrated, and the protocol underscores the exploitation of HPC resources for parallelized execution, essential for managing the computational intensity of this step.
Quality control metrics, variant annotation, and prioritization are addressed with a focus on HPC-specific considerations, providing a robust foundation for downstream analysis.
Emphasizing troubleshooting strategies, best practices, and recommendations, this protocol equips researchers with the tools necessary to navigate the complexities of WGS data processing on HPC infrastructure. By providing a structured and adaptable framework, this protocol serves as a valuable resource for genomics researchers aiming to harness the power of high-performance computing in the analysis of whole genome sequencing data.
Troubleshooting
Pre-Demultiplexing sequencing run QC
The very first data obtained after an NGS run are not the well-known FASTQ files, but raw, pixel-based intensity measurements recorded by the instrument as .bcl files. These files are generated during the run by the Real-Time Analysis (RTA) software on Illumina sequencing systems.
* The desired run specifications are available on the Illumina website for each sequencing system
Install Sequencing Analysis Viewer
Note
The input data for SAV are stored in:
- the InterOp directory
- the RunInfo.xml file
- the RunParameters.xml file
We recommend sanity checks on per-channel signal intensity separation, the error rate (calculated as a percentage from PhiX alignment), the Q30 fraction, and the fraction of clusters passing filter to identify potential problems with a suboptimal run.
The analysis tab
Expected result
The summary tab
Expected result
Software stack installation for HPC data processing
To process NGS data from a WGS experiment in an HPC environment, we recommend installing the necessary software through conda or similar means.
* mamba is an optional add-on that makes installation of the downstream software much more efficient. We highly recommend installing mamba.
* We recommend installing mamba in the base environment
Create conda virtual environment through mamba
Command
create virtual env
mamba create -n wgs
Activate conda virtual environment
* When working in scheduler mode (sbatch), remember to activate the virtual environment on the local access node before submitting the job to the Slurm scheduler
* When working in interactive mode (srun), activate the virtual environment after being allocated a computational node (srun --pty /bin/bash)
Command
mamba activate wgs
Install basic software into virtual environment
* remember to conduct software installation while on the local access node, and after activating the selected virtual environment (conda activate virtual_env_name)
* installing software while on a computational node may cause changes to the virtual environment to be lost upon exiting
Illumina data demultiplexing software
Software
bcl2fastq
Command
mamba install -c dranew bcl2fastq
NGS read alignment software
Software
bwa
Command
mamba install -c bioconda bwa
Pre and Post-Alignment QC software
Software
multiqc
Command
mamba install -c bioconda multiqc
Pre-alignment data filtering software
Software
AdapterRemoval
Command
mamba install -c bioconda adapterremoval
Post-Alignment data processing software
Software
picardtools
Command
mamba install -c bioconda picard
Software
samtools
Command
mamba install -c bioconda samtools
Variant Calling software
Software
GATK4
Command
mamba install -c bioconda gatk4
Create scheduling scripts
To fully automate HPC data processing, prepare scheduling scripts.
Demultiplexing scheduling script
#!/bin/bash
# Generate and submit one demultiplexing job per run ID listed in the input file
file=$1
while read -r run; do
echo -n "$run "
cp Demultiplex.sl "Demultiplex_${run}.sl"
sed -i "s/RUNID/${run}/g" "Demultiplex_${run}.sl"
sbatch "Demultiplex_${run}.sl"
done < "$file"
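The copy-and-sed substitution pattern used by these scheduling scripts can be dry-run locally, without sbatch, to confirm that placeholders are replaced as intended. A minimal sketch with hypothetical file names and run IDs (demux_template.sl, runs_test.txt):

```shell
# Toy template containing the RUNID placeholder (content is illustrative)
printf '#!/bin/bash\nbcl2fastq --runfolder-dir /data/RUNID\n' > demux_template.sl

# Two hypothetical run IDs, one per line
printf '230901_A00123\n230915_A00123\n' > runs_test.txt

# Same copy-and-substitute step as the scheduling script, minus sbatch
while read -r run; do
    cp demux_template.sl "demux_${run}.sl"
    sed -i "s/RUNID/${run}/g" "demux_${run}.sl"
done < runs_test.txt

# Inspect a generated script: the placeholder now holds the run ID
grep runfolder demux_230901_A00123.sl
```

Checking the generated .sl files this way catches placeholder typos before any cluster time is spent.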
Alignment scheduling script
#!/bin/bash
# Generate and submit one alignment job per line of a tab-separated input file
# with columns: run ID, folder, sample name
file=$1
while read -r line; do
runid=$(cut -f1 <<< "$line")
folder=$(cut -f2 <<< "$line")
sample=$(cut -f3 <<< "$line")
echo -n "$runid $folder $sample "
cp map.sl "map_${runid}_${sample}.sl"
sed -i "s/RUNID/${runid}/g" "map_${runid}_${sample}.sl"
# '|' as the sed delimiter, since the folder value may contain '/'
sed -i "s|FOLDER|${folder}|g" "map_${runid}_${sample}.sl"
sed -i "s/SAMPLE_NAME/${sample}/g" "map_${runid}_${sample}.sl"
sbatch "map_${runid}_${sample}.sl"
done < "$file"
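The alignment scheduler reads a tab-separated input file with three columns: run ID, folder, and sample name. A hypothetical sheet (the file name samples.tsv and the values are illustrative) and the same cut-based parsing:

```shell
# Hypothetical sample sheet: runid <TAB> folder <TAB> sample
printf '230901_A00123\tProject_X\tsampleA\n' > samples.tsv

# Parse each line the same way the scheduling script does
while read -r line; do
    runid=$(printf '%s' "$line" | cut -f1)
    folder=$(printf '%s' "$line" | cut -f2)
    sample=$(printf '%s' "$line" | cut -f3)
    echo "${runid} ${folder} ${sample}"
done < samples.tsv
```

The columns must be separated by real tab characters; cut's default delimiter is the tab, so space-separated sheets will silently yield empty fields.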
Genotyping scheduling script
#!/bin/bash
# Generate and submit genotyping and variant-filtering jobs per line of a
# tab-separated input file with columns: run ID, sample name
file=$1
while read -r line; do
runid=$(cut -f1 <<< "$line")
sample=$(cut -f2 <<< "$line")
echo -n "$runid $sample "
cp genotype.sl "genotype_${runid}_${sample}.sl"
sed -i "s/RUNID/${runid}/g" "genotype_${runid}_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "genotype_${runid}_${sample}.sl"
sbatch "genotype_${runid}_${sample}.sl"
cp filter_variants.sl "filter_variants_${runid}_${sample}.sl"
sed -i "s/RUNID/${runid}/g" "filter_variants_${runid}_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "filter_variants_${runid}_${sample}.sl"
sbatch "filter_variants_${runid}_${sample}.sl"
done < "$file"
Variant annotation scheduling scripts
#!/bin/bash
# Generate and submit one VEP annotation job per sample listed in the input file
file=$1
while read -r sample; do
echo -n "$sample "
cp vep.sl "vep_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "vep_${sample}.sl"
sbatch "vep_${sample}.sl"
done < "$file"
Variant prioritization scheduling script
#!/bin/bash
# Generate and submit one variant prioritization job per sample listed in the input file
file=$1
while read -r sample; do
echo -n "$sample "
cp vep_filter.sl "vep_filter_${sample}.sl"
sed -i "s/XXXX/${sample}/g" "vep_filter_${sample}.sl"
sbatch "vep_filter_${sample}.sl"
done < "$file"
Create base workflow script
Here we introduce a scheme of a basic Slurm workflow script written in bash. This script automates all of the steps required to conduct the desired analyses.
The first part is the header. It contains the desired resources that will be allocated on a computing node for the scheduled job.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --mem=80gb
#SBATCH --time=168:00:00
The second part consists of all important placeholders and paths to the input and output data.
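A sketch of what this second part might contain; every path and variable name below is illustrative and must be adapted to the local cluster layout:

```shell
# Placeholders replaced by the scheduling script via sed
RUN=RUNID
SAMPLE=SAMPLE_NAME

# Illustrative paths to reference data and input/output locations
DIR_REF=/path/to/reference                 # holds Homo_sapiens_assembly38.fasta
DIR_FASTQ=/path/to/fastq/${RUN}            # demultiplexed reads for this run
DIR_OUT=/path/to/output/${RUN}/${SAMPLE}   # per-sample results
```

Keeping all paths in one place at the top of the script makes the workflow easy to port between clusters.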
QC of the NGS data can be performed with a variety of tools. Here we present a typical output of the interactive QC tool MultiQC.
multiqc fastq_data_dir/
Expected result
Pre-alignment General Statistics
Pre-Alignment data filtering
AdapterRemoval \
--gzip \
--file1 read1.fastq.gz \
--file2 read2.fastq.gz \
--threads 24 \
--trimns \
--trimqualities \
--minquality 30 \
--minlength 25 \
--basename file_name
Alignment
Mapping paired, truncated reads
bwa mem \
-t 25 \
-M \
$DIR_REF/Homo_sapiens_assembly38.fasta \
file_name.pair1.truncated.gz \
file_name.pair2.truncated.gz |
samtools sort -@ 27 -O BAM -o file_name.bam -
Indexing
samtools index file_name.bam
Removing duplicates
java -jar $PICARD MarkDuplicates \
I=file_name.bam \
O=file_name_rmdup.bam \
M=file_name_rmdupt.txt \
REMOVE_DUPLICATES=true \
ASSUME_SORTED=false \
VALIDATION_STRINGENCY=SILENT \
CREATE_INDEX=true
Adding read group information
java -jar $PICARD AddOrReplaceReadGroups \
I=file_name_rmdup.bam \
O=file_name_rmdup_RG.bam \
VALIDATION_STRINGENCY=SILENT \
SORT_ORDER=coordinate \
RGID=file_name \
RGLB=file_name \
RGPL=illumina \
RGPU=file_name \
RGSM=file_name \
CREATE_INDEX=true
Optimizing HPC resources
To optimize resource allocation for human WGS data processing, we first need to establish the minimal hardware requirements. The basic components are the number of CPUs used, RAM usage, and computation time. All three parameters are interrelated: the faster we want to finish the alignment step, the more CPUs we need to allocate, which in turn requires more RAM.
As a rule of thumb, we aim to finish the alignment within 24-36 hours (this mostly depends on the hardware at our disposal). Given that, the resources most optimal for parallelizing the computation, per computational node, are:
- 24 CPUs
- 32 GB RAM
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --mem=32gb
#SBATCH --time=168:00:00
Note
Data storage requirements
Typical data sizes for a human WGS run with 24 multiplexed samples are:
- Pre-demultiplexing run directory - 1.2 TB
- Raw fastq.gz - 50 GB x 2 reads (paired-end run)
- BAM - 80 GB
- VCF - 3 GB
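Summing the sizes listed above gives a rough per-run storage footprint; a quick arithmetic sketch:

```shell
# Approximate sizes in GB, taken from the list above
PREDEMUX=1200        # pre-demultiplexing run directory (1.2 TB)
FASTQ=$((50 * 2))    # paired-end fastq.gz, 50 GB per read file
BAM=80
VCF=3

TOTAL=$((PREDEMUX + FASTQ + BAM + VCF))
echo "${TOTAL} GB"   # roughly 1.4 TB per run before any cleanup
```

The pre-demultiplexing directory dominates the footprint, so deleting or archiving it after successful demultiplexing recovers most of the space.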
Post-Alignment QC
Expected result
Post-Alignment General Statistics
Alignment Scores - percent of aligned reads
Additionally, as a sanity check, we advise calculating the average genome coverage for each sample.
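One common way to compute the average coverage is to pipe per-base depths through awk. The depth lines below are fake stand-ins for real samtools output, so the averaging step itself can be demonstrated:

```shell
# Real input would come from: samtools depth -a file_name_rmdup_RG.bam
# (-a includes zero-coverage positions; requires an indexed BAM).
# Here three fake depth lines (chrom, pos, depth) stand in for that output.
printf 'chr1\t1\t10\nchr1\t2\t20\nchr1\t3\t30\n' \
    | awk '{sum += $3} END { if (NR > 0) printf "%.2f\n", sum / NR }'
```

On real data, replace the printf with the samtools depth call; the awk part averages column 3 (depth) over all reported positions.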
Van der Auwera GA & O'Connor BD. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra (1st Edition). O'Reilly Media.
Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D, Shakir K, Thibault J, Chandran S, Whelan C, Lek M, Gabriel S, Daly MJ, Neale B, MacArthur DG, Banks E. (2017). Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 201178. DOI: 10.1101/201178
Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M. (2013). From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc Bioinformatics, 43:11.10.1-11.10.33. DOI: 10.1002/0471250953.bi1110s43
DePristo M, Banks E, Poplin R, Garimella K, Maguire J, Hartl C, Philippakis A, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell T, Kernytsky A, Sivachenko A, Cibulskis K, Gabriel S, Altshuler D, Daly M. (2011). A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet, 43:491-498. DOI: 10.1038/ng.806
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res, 20:1297-1303. DOI: 10.1101/gr.107524.110
Li H, Durbin R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754-1760. DOI: 10.1093/bioinformatics/btp324
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2):giab008. DOI: 10.1093/gigascience/giab008
Schubert M, Lindgreen S, Orlando L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Res Notes, 9:88. DOI: 10.1186/s13104-016-1900-2