Mapping Metagenomic Reads to Reference Sequences (Cyverse)

Benjamin Bolduc

Dec 31, 2016

Version 3

Mapping Metagenomic Reads to Reference Sequences (Cyverse) V.3

DOI

dx.doi.org/10.17504/protocols.io.gv2bw8e

Benjamin Bolduc¹

¹The Ohio State University

VERVE Net
Sullivan Lab

Benjamin Bolduc

The Ohio State University

DOI: dx.doi.org/10.17504/protocols.io.gv2bw8e

Protocol Citation: Benjamin Bolduc 2016. Mapping Metagenomic Reads to Reference Sequences (Cyverse). protocols.io https://dx.doi.org/10.17504/protocols.io.gv2bw8e

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

Created: December 31, 2016

Last Modified: March 27, 2018

Protocol Integer ID: 4762

Abstract

Mapping metagenomic reads from Ocean Sampling Day (OSD) 2014 against NCBI's ViralRefSeq alongside viral sequences identified from the Tara Oceans survey using VirSorter.

Guidelines

One of the most commonly used procedures for analyzing viral metagenomic data is to map their reads (or reads from another dataset) against a set of references, often those from the read assembly. For example, if one wanted to know how well-represented viruses in NCBI's Viral Reference Sequences (ViralRefSeq) were in ocean viromes, they could map reads from lots of ocean viral metagenomes against ViralRefSeq. This is generally done using Bowtie2 or BWA, by selecting a reference set of sequences, and then providing paired or unpaired reads to Bowtie2/BWA. Then the results must be processed/filtered to generate coverage tables. Dealing with setting up multiple reads files (10 paired metagenomes = 10 alignment runs) and the processing those read files can be challenging (not to mention requring computational resources).

In this protocol, we'll be using reads fromÂ Ocean Sampling Day (OSD) 2014 and map them against ViralRefSeq alongside viral sequences identified from the Tara Oceans survey using VirSorter.

Before start

To run this protocol, users must first register for Cyverse account. All data (both inputs and outputs) are available within Cyverse's data store at /iplant/home/shared/iVirus/ExampleData/
 
All source code is available at the Sullivan lab bitbucket repository, MAVERICLab - either *-bowtiebatch or *-read2refmapper. * denotes either docker or stampede-based repos.

BowtieBatch

Open BowtieBatch 
Open BowtieBatch-1.0.0 from 'Apps'

Select Inputs
Select the 'Inputs' tab.

For Reference file with FASTA-formatted sequences: this is a FASTA-formatted file with reference sequences to map reads against
Navigate to Community Data --> iVirus --> ExampleData --> BowtieBatch --> Inputs. Select VirRefSeq_and_Virsorted_Tara_10kb.fasta Alternatively, copy-and-paste the location: /iplant/home/shared/iVirus/ExampleData/BowtieBatch/Inputs into the navigation bar and select the fasta file.
For Directory with reads: this is a directory containing the reads files that are to be mapped
Navigate to Community Data --> iVirus --> ExampleData --> BowtieBatch --> Inputs. Select the Reads folder. Alternatively, copy-and-paste the location: /iplant/home/shared/iVirus/ExampleData/BowtieBatch/Inputs into the navigation bar and select the reads folder.



Note
TIP: Reads directory can contain any number of reads files. They can be paired, unpaired, gzipped or uncompressed. They do have to have some semblence of ordering, either computer or human-natural sorting.

Select Parameters
The default options will be sufficient for this example, however additional options are explain below.
 
File extension: Extension of the reads. GZIP files are automatically recognized.
Interleaved files: Are reads interleaved? i.e. Are forward and reverse reads in the same file? These read files contain both pairs, usually Forward1, Reverse1, Forward2, etc... By using this option users can automatically handle interleaved files w/out worrying about having another app handling this conversion.
Read type: What type are the reads? Paired, unpaired, a mix of both?
Levenshtein distance: "Distance" between the read names. i.e. "CAT" and "BAT" have a distance of 1 because the C/B difference. Generally, read pairs/unpairs differ by 1 and mixed differ by 3. If reads do not group correctly, try changing this value.
Keep SAM files: Bowtie2 generates SAM files as a result of mapping reads. However, these take up a lot of space and aren't kept if they're not needed. Users can select if they'd like to keep these files around in case they have other analyses that use SAM files. For this example, they are unnecessary.
Log file name: Information about the run's processing will be stored with this file name. This file can become very large as it contains trimming information.
Merge results: Bowtie2 can create a single SAM file that combines all the individual paired/unpaired alignments. Because the next tool in this pipeline (Read2RefMapper) requires individually separated files, do not select this.
Merged file name: Name to use if results are combined.

The remaining options are bowtie specific, please consult the Bowtie2 documentation.



Note
While the default options will suffice for this example, play around with a few of the options (excluding file extension).
Note
Reads can be paired, unpaired or a mix of both. In this example there are 2 sets of paired reads. Alternatively, one could set all the reads to be unpaired. 
 

Launch Analysis
Run the job!
 
Depending on the number of files to align and/or the number of sequences in the reference database, the job could take many hours. 

Results
Expected results can be found from the 'output' directory of BowtieBatch. They'll consist of 3 directories:
Reads: containing the original files
bowtie2-db: containing the bowtie2-formatted sequences
reads: any uncompressed version of the reads (those files actually used by bowtie2) and BAM files. If Keep Sam files was selecting, they'll be SAM files there as well.
 
Content of reads directory:
 
Additionally, the read mapping log file contains very useful information, such as the overall read mapping statistics:
 
Notice the grouping of the paired files, under "Sorted files." It shows 2 groups, 0 and 1, that have OSD24 and OSD46 as read groups. 
 
Note
A sample output directory is located at: /iplant/home/shared/iVirus/ExampleData/BowtieBatch/Outputs
 
Note
If the reads were selected to be unpaired, a BAM file would be associated with each input file.
 
And each file would be grouped individually:

Read2RefMapper

Open Read2RefMapper
Open Read2RefMapper-1.1.0 from 'Apps'

Select Inputs
Select the 'Inputs' tab.

For BAM File Directory (optionally including metagenome sizes file): 
Navigate to Community Data --> iVirus --> ExampleData --> Read2ReferenceMapper --> Inputs. Select the BAM_files_and_coverages_file folder. This will contain BAM files and a coverage file. Alternatively, copy-and-paste the location: /iplant/home/shared/iVirus/ExampleData/Read2ReferenceMapper/Inputs into the navigation bar and select the folder.
For Metagenome Sizes File (CSV-formatted): 
Navigate to Community Data --> iVirus --> ExampleData --> Read2ReferenceMapper --> Inputs. Select the BAM_files_and_coverages_file folder. Select metagenome_bp.csv. Alternatively, copy-and-paste the location: /iplant/home/shared/iVirus/ExampleData/Read2ReferenceMapper/Inputs into the navigation bar and select the metagenome_bp.csv file.



Note
NEW: Prior versions of Read2RefMapper required users to type in the filename for the metagenome sizes file and place it within the same directory. Updates to the core code have eliminated this requirement. This example still uses the old directory structure.

Select Parameters and Outputs
For this example, the defaults are sufficient.

% Read Coverage: Percent of the reference sequence that must be covered by reads. For example, if 75% of a reference must be covered by reads to be considered 'present' in that sample.
Percent ID: Minimum identity a read must be against the reference sequence. For example, 0.90 is 90% identity between the read and reference.
Percent Alignment: What percent of the read length must be aligned to be considered matching. For example, 0.90 is 90% of the read length must align to the reference. This allows users to excludes reads where only a small local alignment is present.
Coverage Mode: How coverage should be calculated.
Output resolution for adjusted coverage table heatmap (DPI):If metagenome sizes file provided, will dictate the resolution of the heatmap for the adjusted coverage.
Log file: Filename to store logging information.
Coverage table: Name of the coverage table.
Output format for adjusted coverage table heatmap: If metagenome sizes file provided, will generate a heatmap for the adjusted coverage.

Launch Analysis
Run the job!
 
In terms of run time, Read2RefMapper is considerably faster than BatchBowtie (above). 

Results
Expect results can be found in the associated app's 'output' directory.

There will be several directories, corresponding to each step along the pipeline. Notably are the coverage_table.csv, which will have raw coverages for each contig across all metagenomes, and the adj_coverage_table.csv. In this example, we specified a metagenome bp file, meaning Read2RefMapper will adjust/normalize the coverage table according to the metagenome sizes. If no metagenome bp file is found, then the adjusted/normalized coverage table will not be generated!
 
Note
The non-adjusted coverage table results:
 
Expected result
The adjusted coverage table.
 
The output PNG file:

Public workspaceMapping Metagenomic Reads to Reference Sequences (Cyverse) V.3

Open BowtieBatch

Select Inputs

Select Parameters

Launch Analysis

Results

Open Read2RefMapper

Select Inputs

Select Parameters and Outputs

Launch Analysis

Results

Mapping Metagenomic Reads to Reference Sequences (Cyverse) V.3