Feb 03, 2016

Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses

  • Bonnie Hurwitz1,
  • Ken Youens-Clark1
  • 1University of Arizona
  • VERVE Net
  • Hurwitz Lab
Protocol Citation: Bonnie Hurwitz, Ken Youens-Clark 2016. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. protocols.io https://dx.doi.org/10.17504/protocols.io.efgbbjw
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
Created: January 13, 2016
Last Modified: March 27, 2018
Protocol Integer ID: 2248
Keywords: ecological drivers in marine viral community, marine viral community, questions in marine viral ecology, marine viral ecology, occurring viral diversity, viral metagenome, viral community, pacific ocean virome, comparative metagenomic, using comparative metagenomic, metagenomic, free strategy for comparative metagenomic, viromes from diverse site, visualization of complex sample network, metagenome, ecology, community structure, complex sample network, social network analysis, network analysis, virome, fundamental ecological question, modeling ecological driver, diverse site, ecological factor, driving community structure, pacific ocean
Abstract
Long-standing questions in marine viral ecology center on understanding how viral assemblages change along gradients in space and time. However, investigating these fundamental ecological questions has been challenging because single gene- or morphology-based studies represent naturally occurring viral diversity incompletely, and because up to 90% of reads in viral metagenomes (viromes) cannot be identified. In this protocol, I describe how to use an annotation- and assembly-free strategy for comparative metagenomics that combines shared k-mer and social network analyses (regression modeling). This robust statistical framework enables visualization of complex sample networks and determination of the ecological factors driving community structure. This tutorial describes a protocol to reproduce work from the Pacific Ocean virome dataset, which comprises 32 viromes from diverse sites in the Pacific Ocean.


"Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses" (July 7, 2014, doi: 10.1073/pnas.1319778111, PNAS July 22, 2014 vol. 111 no. 29 10714-10719)

Code is freely available on GitHub.











Troubleshooting
Log in to the iPlant/CyVerse Discovery Environment (http://www.cyverse.org/, http://de.iplantcollaborative.org).
Upload FASTA-formatted sequence files and a tab-delimited file of metadata. Example data can be found in the Data Store at "/iplant/home/shared/imicrobe/fizkin/pov". To view it in the Discovery Environment:
  • Click on the "Data" button in the DE
  • Go to the "Community Data -> imicrobe -> fizkin -> pov -> fasta" directory


A sample metadata file is also included ("meta.tab"). The headers of the metadata file should include the 'name' of the file, plus fields ending in '.d' for 'discrete' values (e.g., 'Male' or 'Female'), '.c' for 'continuous' data (e.g., numbers in a range), or '.ll' for 'latitude/longitude' data. Field names should not include underscores, with the exception of 'lat_long.ll'.
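The header rules above can be checked before uploading. The following is a minimal sketch, not part of Fizkin itself: the function names, the example column names ('depth.c', 'season.d'), and the latitude/longitude value format in the example row are assumptions made for illustration.

```python
import csv
import io

# Suffixes named in the protocol: discrete, continuous, latitude/longitude.
ALLOWED_SUFFIXES = (".d", ".c", ".ll")


def read_headers(text):
    """Return the first row of tab-delimited text as a list of headers."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return next(reader)


def check_metadata_headers(headers):
    """Return a list of problems found in the metadata column headers."""
    problems = []
    if "name" not in headers:
        problems.append("missing required 'name' column")
    for header in headers:
        if header == "name":
            continue
        if not header.endswith(ALLOWED_SUFFIXES):
            problems.append(f"'{header}' does not end in .d, .c, or .ll")
        # 'lat_long.ll' is the one field allowed to contain an underscore.
        if "_" in header and header != "lat_long.ll":
            problems.append(f"'{header}' contains an underscore")
    return problems


# Hypothetical metadata; the column names and values are illustrative only.
example = (
    "name\tdepth.c\tseason.d\tlat_long.ll\n"
    "sample1.fa\t10\tspring\t32.2,-110.9\n"
)
print(check_metadata_headers(read_headers(example)))  # prints []
```

An empty list means the headers follow the naming rules; it says nothing about whether the values themselves are suitable for the analysis.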

Here is an example table:
Select the "Apps" button on the left, then look under "Public Apps -> Experimental -> iMicrobe -> Fizkin". Open the "Required Args" section and select your FASTA directory as the "Input directory". You can leave "Output directory" alone or change it if you wish. Use the file selector to find the "Metadata file" described in step 2.



Optional args:
  • K-mer size: Default is 20.  Values between 16 and 31 are best.
  • Mode minimum: Default is 1.  Increase to require more stringent matching.
  • Max. num. sequences: Default is 300K.  Use a lower value to reduce runtime or a higher value to get deeper coverage.  Samples containing more sequences than this value will be randomly subsampled.
  • Max. num. samples: Default is 15.  Keep in mind that Fizkin runs a pair-wise analysis, so runtime is O(n^2).  If you supply more samples than this value, a random subset of samples will be selected.
  • Files list: The subset of files you wish to run, one file on each line
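To see why the sample limit matters, the sketch below counts the pairs produced by a pair-wise analysis and mimics random selection when there are more samples than the limit. It is an illustration only and does not reproduce Fizkin's own code: the function names and file names are hypothetical, and counting unordered pairs is an assumption (the quadratic growth holds either way).

```python
import random
from itertools import combinations


def select_samples(sample_names, max_samples=15, seed=None):
    """Return all samples, or a random subset if there are too many."""
    names = list(sample_names)
    if len(names) <= max_samples:
        return names
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return sorted(rng.sample(names, max_samples))


def unordered_pairs(names):
    """List every pair of distinct samples once: n * (n - 1) / 2 pairs."""
    return list(combinations(names, 2))


# Hypothetical file names standing in for 32 viromes.
all_samples = [f"sample{i:02d}.fa" for i in range(32)]
chosen = select_samples(all_samples, max_samples=15, seed=0)

print(len(unordered_pairs(all_samples)))  # prints 496
print(len(unordered_pairs(chosen)))       # prints 105
```

Halving the number of samples cuts the number of comparisons to roughly a quarter, which is why the default limit keeps runtime manageable. To control which samples are used instead of relying on random selection, supply the "Files list" argument.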
Press "Launch analysis" and wait for notification of the completion of your job.
Common failures include something like this from R (GBME):

Error in summary(fit1)$cov.unscaled[(2 * n):length(fit1$coef), (2 * n):length(fit1$coef)] :
subscript out of bounds
Calls: gbme -> gbme.glmstart -> as.matrix
Execution halted

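This is usually due to the metadata being too homogeneous or entirely heterogeneous (for example, a field that has the same value for every sample, or a different value for every sample).  Remove any offending metadata fields and try again.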
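One way to look for such fields before re-running the job is sketched below. This is a heuristic under stated assumptions, not a reproduction of the check that GBME performs: it flags any field with a single value across all samples, and any discrete ('.d') field whose values are all distinct. The function name and the example data are hypothetical.

```python
import csv
import io


def flag_uninformative_columns(text):
    """Flag metadata columns that are all identical or (if discrete) all distinct."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    flagged = {}
    if not rows:
        return flagged
    for column in rows[0]:
        if column == "name":
            continue
        values = [row[column] for row in rows]
        distinct = len(set(values))
        if distinct == 1:
            flagged[column] = "same value in every sample"
        # Continuous fields are expected to differ between samples,
        # so only discrete fields are flagged for being all distinct.
        elif column.endswith(".d") and distinct == len(values):
            flagged[column] = "different value in every sample"
    return flagged


# Hypothetical metadata; column names and values are illustrative only.
metadata = (
    "name\tseason.d\tsite.d\tdepth.c\n"
    "a.fa\tspring\tA\t10\n"
    "b.fa\tspring\tB\t25\n"
    "c.fa\tspring\tC\t100\n"
)
for column, reason in flag_uninformative_columns(metadata).items():
    print(f"{column}: {reason}")
```

In this example 'season.d' and 'site.d' are flagged and 'depth.c' is not. A flagged field is only a candidate for removal; whether it actually triggers the error depends on the model fit.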
The ultimate result should be a social network graph showing the grouping of samples similar to this: