Mar 11, 2016

Public workspaceScript R4: Virome Taxonomy

  • HANNIGAN GD1,
  • GRICE EA1,
  • ET AL.1
  • 1DEPARTMENT OF DERMATOLOGY UNIVERSITY OF PENNSYLVANIA
  • VERVE Net
  • Club Grice
Icon indicating open access to content
QR code linking to this content
Protocol CitationHANNIGAN GD, GRICE EA, ET AL. 2016. Script R4: Virome Taxonomy. protocols.io https://dx.doi.org/10.17504/protocols.io.eiabcae
Manuscript citation:
Kindler L, Stoliartchouk A, Teytelman L, Hurwitz BL, Method-centered digital communities on protocols.io for fast-paced scientific innovation. F1000Research doi: 10.12688/f1000research.9453.2
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
Created: February 10, 2016
Last Modified: November 14, 2017
Protocol Integer ID: 2338
Keywords: virome taxonomy, stranded dna virome, virus taxonomy analysis, virome, virome taxonomy this protocol, phage species relative abundance profiles for each patient, bacteriophage, host microbiome, dynamic associations with the host microbiome, phage species, virus, relative abundance of specific species, specific species, relative abundance for each site, relative abundance profile, genetic enrichment, phage, human skin, dna
Abstract
This protocol outlines our bacteriophage/virus taxonomy analyses. This analysis includes profiles for the average order relative abundance for each site, the phage species relative abundance profiles for each patient, and the relative abundance of specific species between sites. Intermediate files are also included with the publication, and the paths are specified below. Based on the methods from the following publication:

Hannigan, Geoffrey D., et al. "The Human Skin Double-Stranded DNA Virome: Topographical and Temporal Diversity, Genetic Enrichment, and Dynamic Associations with the Host Microbiome." mBio 6.5 (2015): e01578-15.


Guidelines
sessionInfo()

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.4 (Yosemite)
## ## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5    formatR_1.2    tools_3.2.0    htmltools_0.2.6
## [5] yaml_2.1.13    stringi_0.4-1    rmarkdown_0.7    knitr_1.10.5
## [9] stringr_1.0.0    digest_0.6.8    evaluate_0.7
Troubleshooting
Load the required R packages.
Command
library(vegan)
packageVersion(
Expected result
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.3-0
Expected result
## [1] '2.3.0'
Expected result
## [1] '1.0.1'
Expected result
## [1] '1.6.0'
Expected result
## [1] '1.8.2'
Expected result
## [1] '1.4.1'
Import the data tables for order relative abundance.

Command
INPUT_ORDER <- read.delim(
Expected result
##V1 V2 V3
## 1No_hit 99626.40000 MG100098
## 2Unclassified_Order 5596.65000 MG100098
## 3Caudovirales 11968.50000 MG100098
## 4Herpesvirales 7.61005 MG100098
## 5Mononegavirales 49.08570 MG100098
Import the data tables for species relative abundance.
Command
INPUT_SPECIES <- read.delim(
Expected result
##V1 V2 V3
## 1Achromobacter_phage 0.00000 MG100098
## 3Human_immunodeficiency_virus 1.08351 MG100098
## 4Silicibacter_phage 0.00000 MG100098
## 5Clavibacter_phage 0.00000 MG100098
Serratia_phage 0.00000 MG100098
Input the mapping file.
Command
INPUT_MAP <- read.delim(
Expected result
##NexteraXT_SampleID NexteraXT_RunName NexteraXT_Virome_SampleID
## 1MG100151 NexteraXT_007 MG100102
## 2MG100150 NexteraXT_007 MG100101
## 3MG100149 NexteraXT_007 <NA>
## 4MG100146 NexteraXT_007 MG100098
## NexteraXT_Virome_RunName SubjectID
## 1NexteraXT_005 1
## 2NexteraXT_005 1
## 3<NA> 1
## 4NexteraXT_005 1
Here we need to reformat the mapping files. This means only looking at the two time points for which we have a complete data set (we have only partial data for time point 1), as well as excluding the sites and subjects for which we only have partial sampling (c(“Ba”,“Ph”,“Vf”,“Neg”)).
Command
SUBSET_MAP <- INPUT_MAP[which(INPUT_MAP$TimePoint %in% c(2,3)), ]
SUBSET_MAP <- SUBSET_MAP[-which(SUBSET_MAP$Site_Symbol %in% c(
Now we can easily plot the viral and phage order relative abundance information by skin site. Get a vector of the sample IDs that we want to pull out for analysis.
Command
KEEP_SAMPLES <- as.vector(SUBSET_MAP$NexteraXT_Virome_SampleID)
INPUT_SUBSET <- INPUT_ORDER[which(INPUT_ORDER$V3 %in% c(KEEP_SAMPLES)), ]
For this analysis we are removing those contigs which did not have a hit. We are only look at those contigs with phage/virus hits.
Command
INPUT_SUBSET <- INPUT_SUBSET[-which(INPUT_SUBSET$V1 %in% c(
These variables are called species but they are only named this. They really contain the order information.
Command
SPECIES_MEAN <- ddply(INPUT_MERGE, c(
Also filter out those taxa that account for less than 0.5% of the mean relative abundance.
Command
MeanForExclusion <- ddply(SPECIES_MEAN, c(
Now only use the filtered taxa.
Command
SpeciesMeanFiltered <- SPECIES_MEAN[c(SPECIES_MEAN$V1 %in% MeanForExclusionCut$V1),]
Take a look at the data frame.
Command
head(SpeciesMeanFiltered)
Expected result
##Site_Symbol V1 mean
## 1Ac Caudovirales 6804.984
## 4Ac Multiple 3488.321
## 6Ac Unclassified_Order 1544.165
## 7Ac Caudovirales 8729.835
## 10Ac Multiple 3893.010
## 12Ac Unclassified_Order 2434.145
Plot the results.
Command
ggplot(SpeciesMeanFiltered, aes(x=Site_Symbol, y=mean, group=V1, fill=V1)) + theme_bw() + geom_bar(stat=
Expected result
Now that we have looked at the information for taxonomic order, we also want to look at the species level. Here we are only going to look at the top 10 taxa, and include the remaining taxa relative abundance information as “Other”.
Command
#Same formatting and parsing as above
KEEP_SAMPLES <- as.vector(SUBSET_MAP$NexteraXT_Virome_SampleID)
INPUT_SUBSET <- INPUT_SPECIES[which(INPUT_SPECIES$V3 %in% c(KEEP_SAMPLES)), ]
INPUT_SUBSET <- INPUT_SUBSET[-which(INPUT_SUBSET$V1 %in% c(
Get the top ten taxa so we can specifically look at them.
Command
TOP_TEN_MEAN <- ddply(SPECIES_MEAN, c(
Get the rest of the relative abundance taxa into the "other" category.
Command
FINAL_OTHER_SUM <- data.frame(tapply(FINAL_OTHER$V2, INDEX=list(FINAL_OTHER$V3), FUN=sum))
FINAL_OTHER_SUM$SampleID <- c(row.names(FINAL_OTHER_SUM))
colnames(FINAL_OTHER_SUM) <- c(
Expected result
##V3 V1 V2 NexteraXT_SampleID
## 21MG100195 Multiple 6.39256e+03 MG100171
## 49MG100195 Mycobacterium_phage 1.31141e+02 MG100171
## 57MG100195 Propionibacterium_phage 1.69708e+00 MG100171
## 81MG100195 Pseudomonas_phage 1.16539e+05 MG100171
## 85MG100195 Streptococcus_phage 1.36360e+02 MG100171
## 87MG100195 Human_papillomavirus 1.89501e+03 MG100171
Order according to relative abundance.
Command
TOP10_WITH_OTHER <- ddply(TOTAL_FINAL, c(
Append other to the vector.
Command
ORDER_MEAN_NAMES_WITH_OTHER <- c(ORDER_MEAN_NAMES_WITH_OTHER, 
Get plotting of species by patient for graphing.
Note
The colors and other minor cosmetics can be fixed in illustrator.
Command
ggplot(TOTAL_FINAL, aes(x=factor(SubjectID), y=V2, fill=V1, order=V1)) + theme_bw() + geom_bar(stat=
Expected result
We are also interested in some specific viral species relative abundances by site, after looking at the general subject profiles above. Here we look at the specific relative abundances of HPV, propionibacterium phages, and staphylococcus phages. First format the data. Get a list of the sample names.
Command
SAMPLE_NAMES <- as.vector(unique(INPUT_SUBSET$V3))
IN_SUBSET_REL_ABUND <- data.frame(lapply(SAMPLE_NAMES, function(i) { SUBSET <- INPUT_SUBSET[c(INPUT_SUBSET$V3==i),] SUM <- sum(SUBSET$V2) SUBSET$Rel_Abund <- 100 * SUBSET$V2 / SUM colnames(SUBSET) <- c(
First we looked at the relative abundances of Staphylococcus phages. This is written as a set of copy/pasted script sections for taxonomic group.
Command
STAPH_REL_ABUND <- REL_ABUND_MELT[c(REL_ABUND_MELT$Taxa==
Plot the data by occlusion status.
Command
ggplot(STAPH_MERGE, aes(x=Occlusion, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 1 rows containing non-finite values (stat_boxplot).



Perform stats using the kruskalmc subroutine.
Command
STAPH_MERGE$Occlusion <- factor(STAPH_MERGE$Occlusion)
kruskalmc(STAPH_MERGE$value, STAPH_MERGE$Occlusion)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

obs.dif critical.dif difference
Exposed-Intermittently_Occluded10.76401 36.09383 FALSE
Exposed-Occluded44.70092 23.71043 TRUE
Intermittently_Occulded-Occluded55.46493 34.98437 TRUE
Plot by site category.
Command
ggplot(STAPH_MERGE, aes(x=Site_Categories, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 1 rows containing non-finite values (stat_boxplot).

Run the stats.
Command
STAPH_MERGE$Site_Categories <- factor(STAPH_MERGE$Site_Categories)
kruskalmc(STAPH_MERGE$value, STAPH_MERGE$Site_Categories)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

obs.dif critical.dif difference
Intermittently_Moist-Moist51.958476 28.54913 TRUE
Intermittently_Moist-Sebaceous53.065524 28.43069 TRUE
Moist-Sebaceous1.107048 25.32023 FALSE
Get the Propionibacerium phage rows.
Command
PROP_REL_ABUND <- REL_ABUND_MELT[c(REL_ABUND_MELT$Taxa==
Plot by occlusion status.
Command
ggplot(PROP_MERGE, aes(x=Occlusion, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 27 rows containing non-finite values (stat_boxplot).



Run the stats.
Command
PROP_MERGE$Occlusion <- factor(PROP_MERGE$Occlusion)
kruskalmc(PROP_MERGE$value, PROP_MERGE$Occlusion)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

obs.dif critical.dif difference
Exposed-Intermittently_Occluded31.54397 36.09383 FALSE
Exposed-Occluded20.64971 23.71043 FALSE
Intermittently_Occluded-Occluded10.89427 34.98437 FALSE
Plot by site category.
Command
ggplot(PROP_MERGE, aes(x=Site_Categories, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 27 rows containing non-finite values (stat_boxplot).

Run the stats.
Command
PROP_MERGE$Site_Categories <- factor(PROP_MERGE$Site_Categories)
kruskalmc(PROP_MERGE$value, PROP_MERGE$Site_Categories)
Expected result
obs.dif critical.dif difference
Intermittently_Moist-Moist10.71380 28.54913 FALSE
Intermittently_Moist-Sebaceous44.36156 28.43069 TRUE
Moist-Sebaceous55.07535 25.32023 TRUE
Finally do the same analysis for HPV.
Command
HPV_REL_ABUND <- REL_ABUND_MELT[c(REL_ABUND_MELT$Taxa==
Plot by occlusion status.
Command
ggplot(HPV_MERGE, aes(x=Occlusion, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 56 rows containing non-finite values (stat_boxplot).

Run the stats.
Command
HPV_MERGE$Occlusion <- factor(HPV_MERGE$Occlusion)
kruskalmc(HPV_MERGE$value, HPV_MERGE$Occlusion)
Expected result
obs.dif critical.dif difference
Exposed-Intermittently_Occluded58.94431 36.09383 TRUE
Exposed-Occluded33.67678 23.71043 TRUE
Intermittently_Occulded-Occluded25.26754 34.98437 FALSE
Plot by site category.
Command
ggplot(HPV_MERGE, aes(x=Site_Categories, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 56 rows containing non-finite values (stat_boxplot).

Run the stats.
Command
HPV_MERGE$Site_Categories <- factor(HPV_MERGE$Site_Categories)
kruskalmc(HPV_MERGE$value, HPV_MERGE$Site_Categories)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

obs.dif critical.dif difference
Intermittently_Moist-Moist22.01579 28.54913 FALSE
Intermittently_Moist-Sebaceous36.03696 28.43069 TRUE
Moist-Sebaceous58.05275 25.32023 TRUE