Investigation and identification of somatic and germline variants for colorectal cancer exomes using the NGS pipeline: a computational analysis perspective
Protocol Citation: Chandrashekar K, Anagha S Setlur, Vidya Niranjan 2023. Investigation and identification of somatic and germline variants for colorectal cancer exomes using the NGS pipeline: a computational analysis perspective. protocols.io https://dx.doi.org/10.17504/protocols.io.x54v9dqxqg3e/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: May 23, 2023
Last Modified: May 24, 2023
Protocol Integer ID: 82294
Keywords: Colorectal cancer exomes, quality controls, somatic & germline variants, annotation & post-processing, computational analysis, germline variants for colorectal cancer exome, colorectal cancer exome, cancer exome dataset, germline variant identification, colorectal cancer, identified variant, germline mutant, human reference genome, germline variant, identified mutant, potential mutations such as single nucleotide polymorphism, potential mutation, important gene, using snpeff, variant identification, single nucleotide polymorphism, germline, haplotype caller, early disease detection, process of early disease detection
Disclaimer
This protocol was run with colorectal cancer exome datasets. However, the SRR-IDs of the datasets have not been mentioned to make this protocol universal. The steps may be followed for any cancer exome, with the intent of investigating the somatic and germline mutations.
Abstract
A thorough analysis on colorectal cancer exomes reveals potential mutations such as single nucleotide polymorphisms that can be beneficial for early detection of the disease. Thus, through a comprehensive computational protocol that identifies, investigates and analyzes the identified variants, this process of early disease detection becomes much simpler. In brief, the cancer exome datasets were retrieved from publicly available databases, followed by performing quality control checks. The datasets that qualified the quality control checks were then aligned with the human reference genome. Somatic and germline mutants were then identified and called separately, with specific tools for each case. Haplotype Caller was employed for germline variant identification, and Mutect2 for somatic. The identified mutants were then normalized, annotated and post-processed using snpEFF, wANNOVAR and VEP. This protocol helps in garnering insights on the various alterations that might possibly lead to colorectal cancer and suggests the possibility of utilizing the most important genes identified for wet-lab experimentation.
The commands mentioned in the protocol can be run in the Linux terminal.
Enough storage space is recommended while running this protocol.
Materials
Tools mentioned in the current protocol must initially be installed in the system before running the commands for the NGS pipeline. These include: SRAtoolkit, Fastqc, Multiqc, Cutadapt, Bowtie2, Samtools, picard, GATK, and bcftools. wANNOVAR, VEP are available online and snpEFF is command line based.
Troubleshooting
Safety warnings
Nil
Ethics statement
The datasets used to arrive at this protocol were all from publicly available databases such as NCBI-SRA.
Before start
All the mentioned tools in the materials section must be installed and checked for appropriate installation.
COLORECTAL CANCER EXOME RETRIEVAL
Retrieval of colorectal cancer exomes
This protocol was run with colorectal cancer exome datasets. However, the SRR-IDs of datasets have not been mentioned to make this protocol universal. Therefore, NCBI-SRA database (https://www.ncbi.nlm.nih.gov/sra) was employed to retrieve the colorectal cancer exome datasets. The SRR-IDs were noted down for further use. The exome datasets were selected based on a set of criteria. The strategy used for the datasets must be whole exome sequenced and the exomes should be of genomic source. In the present protocol, all the selected sequences were of paired layout, with Illumina used for sequencing and acquiring the reads. Other criteria as desired by the user depending on the requirement can be employed for retrieving the cancer exomes.
Additionally, the tools mentioned in the current protocol must initially be installed in the system before running the commands for the NGS pipeline. These include: SRAtoolkit, Fastqc, Multiqc, Cutadapt, Bowtie2, Samtools, picard, GATK, and bcftools. wANNOVAR, VEP are available online and snpEFF is command line based.
QUALITY CONTROL
Quality control assessments
Prior to calling variants, the quality of raw data was assessed using the below-mentioned steps. The codes used for running the quality checks are provided in the following sections.
Quality Check using FastQC and MultiQC (for multiple datasets)
FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC (https://multiqc.info/) in the present study were used to effectively assess the sequencing data quality, to identify any possible issues with the raw sequence reads and generate informative reports. Scrutiny of the mean sequence quality per reading and per base, nucleotide content per base position, distribution of GC, etc were analyzed via FastQC. A cumulative report for all FastQC outcomes was obtained from MultiQC.
For multiple datasets, FastQC and MultiQC commands used were:
Command
fastQC
$ ./fastqc
Command
multiQC
$ multiqc .
Expected multiQC heatchart
Adaptor sequence removal via Cutadapt
This tool identifies and eliminates the adaptor sequences from raw reads and improves the quality and accuracy of downstream analysis. All unwanted sequences was thus removed using Cutadapt (https://cutadapt.readthedocs.io/en/stable/) by running:
Command
Running Cutadapt
$ cutadapt -a adapter-sequence* -o SRRID_cut.fastq.gz SRRID.fastq.gz
ALIGNMENT WITH HUMAN REFERENCE GENOME
Retrieval and indexing of human reference genome for analysis
Indexing is a necessary step to pre-process the reference genome so the aligner can seek potential alignment locations for the reads. This is useful during alignment with Bowtie2 (https://bowtie-bio.sourceforge.net/bowtie2/index.shtml), where short read aligners are processed with the reference genome. The following codes were run in the terminal for obtaining Bowtie2 outputs.
Running Bowtie2 with the specified parameters allows alignment of paired-end sequencing reads to a reference genome, enabling downstream analysis such as variant calling, differential gene expression, or identification of genomic features specific to colorectal cancer.
SAM TO BAM CONVERSION
SAM (Sequence Alignment/Map) to BAM (Binary Alignment/Map) conversion
To obtain the alignment outcomes in a readable format and to act as appropriate input for the next steps, SAM to BAM (http://www.htslib.org/) conversion was performed. The code used was:
Command
SAM to BAM file conversion
$ samtools view -bS SRRID.sam > SRRID.bam
Sorting BAM
BAM sorting and merging was performed using command:
Command
BAM sorting
$ samtools sort SRRID.bam -o SRRID.sorted.bam
BAM sorting helps in supporting quick retrieval of alignments and has a compact size for further analysis.
Identifying and marking duplicate reads in BAM file
Duplicate reads may arise at times during sequencing due to several factors, inclusive of technical biases, PCR amplification artifacts and preparation of library protocols. These may skew the downstream analyses and impact the result accuracy. Therefore, marking duplicates is essential to address this issue, by using the “MarkDuplicates” function of Picard (https://broadinstitute.github.io/picard/) tool.
Computing and collecting alignment summary metrics from BAM file
To obtain various statistics and metrics associated with alignment of sequencing reads to a reference genome, the “CollectAlignmentSummaryMetrics” function was used. The command used to run this function was:
ref.fa: fasta sequence file of the human reference genome
Estimating and collecting the insert size distribution of paired-end reads from BAM file
The primary purpose of collecting the insert sizes is to check the distribution and characteristics of the DNA fragment sizes, in a sequencing library. This enables the collection and visualization of the insert size metrics and distribution, that are useful for the quality scrutiny of the library, alignment of the reads, selection of the fragment size and quality control assessments in several genomic analyses. For this purpose, in the present work flow, the “CollectInsertSizeMetrics” function was used. The distribution of the paired end reads from BAM was analyzed using the command:
Calculating coverage depth at each position in a sorted and duplicate-marked BAM file
The “samtools depth” function was used to examine the depth of coverage at every position in the genome of reference depending on the aligned reads. Computation and estimation of depth of coverage important for assessing the sequence data quality, calling of variants, structure and gene expression analysis and overall exploration of data becomes much easier with the samtools depth function. Thus, to better understand and obtain insights into the coverage depth, this function was performed via:
Command
Determining coverage depth of sorted, deleted, & duplicated BAMs
$ samtools depth -a SRRID_sorted_dedup_reads.bam > SRRID_depth_out.txt
Add or modify read group information in BAM file
To assign or update the read group information to the sequence reads in a BAM file, the add or replace read groups function was employed. This was carried out using "picard.jar AddOrReplaceReadGroups". Sample identification, tracking of the data, sequence data differentiation, integrity checks of data and overall sequence reads compatibility was obtained after this step was performed, ensuring that the reads were properly annotated and organized, facilitating meaningful data analyses.
This step was run using the following code in the terminal:
Indexing BAM file- necessary before variant calling
Prior to variant calling, another indexing of the final BAM output file was carried out to allow faster retrieval of specific genomic regions, enhanced visualization, and to facilitate random access.
Command
Indexing BAM file
$ samtools index SRRID_output.bam
VARIANT CALLING - GERMLINE & SOMATIC
Calling of germline variants
The below sections detail preliminary steps followed for calling germline variants.
The purpose of using the "HaplotypeCaller" function is to identify and call genetic variants, including single nucleotide variants (SNVs), insertions, deletions, and complex variants, based on the sequencing data and the provided reference genome. The following command was employed to call the variants:
The "SelectVariants" function with the "-select-type SNP" option was utilized to filter and extract only the SNPs from the original VCF file. This step enabled filtering and data refining to focus on SNPs, simplified further analyses, optimized computational resources, and ensured compatibility with SNP-specific analysis tools or pipelines. For working with larger datasets, this function works best for SNP-centric analyses. Therefore, to select the SNPs, the code run was:
The purpose of using the "SelectVariants" function with the "-select-type INDEL" option is to filter and extract only the INDELs from the original VCF file. Using the GATK "SelectVariants" with the "-select-type INDEL" option allows for the selection and extraction of INDELs from a VCF file. This step enables filtering and refining the data to focus on INDELs. This step was run prior to variant filtration.
To remove potential false positives and retain only the high quality variants, a series of filters were applied to the raw SNPs using the following codes:
To filter out the variants that do not meet the specific quality (as defined by the user), the “--exclude-filtered" option in the "SelectVariants" step was used. This ensured removal of mutants that failed specific criteria and thereby enhanced quality of the data, prioritized the high-confidence variants only, decreased the number of false positives, allowed compatibility access between datasets and also optimized the computational resources. This is a crucial step to be followed in the data refinement process.
To augment the accuracy and reliability of the called variants, a base recalibration was carried out by correcting the systematic errors, addressing biases, and providing more accurate base quality scores. This step is essential in ensuring that the sequencing data quality is maintained appropriately.
To filter and extract only the SNPs from the recalibrated VCF files, functions “SelectVariants” along with “-selectType SNP” were used. This was useful since the data authors were working with were recalibrated variant sets.
A series of filters were then applied to the raw SNPs to remove the potential false positives and retain high-quality variants. The following filters were applied:
The purpose of this step is to identify and call somatic variants, which are genetic variations specific to the tumor sample compared to the matched normal sample. This was first carried out via:
To identify and call somatic variants for more than 2 samples (which are variations specific to the tumor samples compared to the matched normal ones).
Additionally, to call somatic variants while applying filters depending on the population frequencies and a panel of normals, the following codes were run:
Filtering variants are an essential step before the VCF files are prepared, since it ensures that the obtained variants are more likely to be true somatic mutations. In the present work, the mutations were filtered that will further help in understanding the biology of colorectal cancer.
The variant files obtained from both germline and somatic calling were then used for further analysis as described in the following steps.
VCF FILE PREPARATION, ANNOTATION, POST-PROCESSING
VCF file preparation, annotation and post-processing of variants
Once both germline and somatic mutations were identified and called, the VCF files for both these variants were prepared using bcftools for further analysis. Annotation and processing of these variants were then carried out using snpEFF (https://pcingola.github.io/SnpEff/), wANNOVAR (https://wannovar.wglab.org/) and VEF (Variant Effect Predictor) (https://asia.ensembl.org/Tools/VEP).
Additional configurations and the type of transcript database to be used for VEP analysis
Expected result for VEP
CONCLUSIONS AND FUTURE PERSPECTIVES
Colorectal cancers have several diagnostic and treatment options to combat it, however, a delay in disease detection is life-threatening. Therefore, this protocol emphasizes on the identification and analysis of somatic and germline variants for colorectal cancer exome datasets, retrieved from publicly available databases such as NCBI-SRA. These steps describe the various tools required to run this pipeline, with the codes required to run each tool. Moreover, although this protocol is described for colorectal cancer exomes, the same set of codes and steps may be followed for the overall analysis of any cancer exome, when the desired outcome is to obtain and analyze somatic and germline variants separately. This computational pipeline requires further in-vitro and in-vivo studies, however, the outcomes will offer prospects for similar such studies that are crucial for designing a cure, prognosis or a treatment strategy for colorectal cancers, based on the scrutinized variants.
The outcomes from this protocol can be carried forward to building and designing decision support systems using artificial intelligence and machine learning (AI/ML), so that the identified somatic and germline mutants can be used for early colorectal cancer prediction through a prediction model. This will aid clinicians/ researchers/diagnosticians to help take better decisions for treatment strategies.
Protocol references
Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic acids research. 2010 Nov 8;39(suppl_1):D19-21.
Andrews S. FastQC: a quality control tool for high throughput sequence data.
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. journal. 2011 May 2;17(1):10-2.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature methods. 2012 Apr;9(4):357-9.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. bioinformatics. 2009 Aug 15;25(16):2078-9.
Cingolani P, Patel VM, Coon M, Nguyen T, Land SJ, Ruden DM, Lu X. Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Frontiers in genetics. 2012 Mar 15;3:35.
Yang H, Wang K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nature protocols. 2015 Oct;10(10):1556-66.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The ensembl variant effect predictor. Genome biology. 2016 Dec;17(1):1-4.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research. 2010 Sep 1;20(9):1297-303.