Script R4: Virome Taxonomy

HANNIGAN GD; GRICE EA; ET AL.

Mar 11, 2016

Script R4: Virome Taxonomy

F1000Research

DOI

https://dx.doi.org/10.17504/protocols.io.eiabcae

HANNIGAN GD¹,
GRICE EA¹,
ET AL.¹

¹DEPARTMENT OF DERMATOLOGY UNIVERSITY OF PENNSYLVANIA

VERVE Net
Club Grice

Geoffrey D Hannigan

Merck Research Labs

DOI: https://dx.doi.org/10.17504/protocols.io.eiabcae

External link: http://mbio.asm.org/content/6/5/e01578-15.full

Protocol Citation: HANNIGAN GD, GRICE EA, ET AL. 2016. Script R4: Virome Taxonomy. protocols.io https://dx.doi.org/10.17504/protocols.io.eiabcae

Manuscript citation:

Kindler L,  Stoliartchouk A,  Teytelman L,  Hurwitz BL, Method-centered digital communities on protocols.io for fast-paced scientific innovation. F1000Research  doi: 10.12688/f1000research.9453.2

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

Created: February 10, 2016

Last Modified: November 14, 2017

Protocol Integer ID: 2338

Keywords: virome taxonomy, stranded dna virome, virus taxonomy analysis, virome, virome taxonomy this protocol, phage species relative abundance profiles for each patient, bacteriophage, host microbiome, dynamic associations with the host microbiome, phage species, virus, relative abundance of specific species, specific species, relative abundance for each site, relative abundance profile, genetic enrichment, phage, human skin, dna

Abstract

This protocol outlines our bacteriophage/virus taxonomy analyses. This analysis includes profiles for the average order relative abundance for each site, the phage species relative abundance profiles for each patient, and the relative abundance of specific species between sites. Intermediate files are also included with the publication, and the paths are specified below. Based on the methods from the following publication:

Hannigan, Geoffrey D., et al. "The Human Skin Double-Stranded DNA Virome: Topographical and Temporal Diversity, Genetic Enrichment, and Dynamic Associations with the Host Microbiome." mBio 6.5 (2015): e01578-15.

Guidelines

sessionInfo()

## R version 3.2.0 (2015-04-16) 
## Platform: x86_64-apple-darwin13.4.0 (64-bit) 
## Running under: OS X 10.10.4 (Yosemite) 
## ## locale: 
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
## 
## attached base packages: 
## [1] stats graphics grDevices utils datasets methods base 
## 
## loaded via a namespace (and not attached): 
## [1] magrittr_1.5    formatR_1.2    tools_3.2.0    htmltools_0.2.6 
## [5] yaml_2.1.13    stringi_0.4-1    rmarkdown_0.7    knitr_1.10.5 
## [9] stringr_1.0.0    digest_0.6.8    evaluate_0.7

Troubleshooting

Before start

Supplemental information available at:

https://figshare.com/articles/The_Human_Skin_dsDNA_Virome_Topographical_and_Temporal_Diversity_Genetic_Enrichment_and_Dynamic_Associations_with_the_Host_Microbiome/1281248

Load the required R packages.  
Command
library(vegan)
packageVersion(
Expected result
## Loading required package: permute
## Loading required package: lattice 
## This is vegan 2.3-0
Expected result
## [1] '2.3.0'
Expected result
## [1] '1.0.1'
Expected result
## [1] '1.6.0'
Expected result
## [1] '1.8.2'
Expected result
## [1] '1.4.1'

Import the data tables for order relative abundance.

Command
INPUT_ORDER <- read.delim(
Expected result
##V1 V2 V3
## 1No_hit 99626.40000 MG100098
## 2Unclassified_Order 5596.65000 MG100098
## 3Caudovirales 11968.50000 MG100098
## 4Herpesvirales 7.61005 MG100098
## 5Mononegavirales 49.08570 MG100098

Import the data tables for species relative abundance.
Command
INPUT_SPECIES <- read.delim(
Expected result
##V1 V2 V3
## 1Achromobacter_phage 0.00000 MG100098
## 3Human_immunodeficiency_virus 1.08351 MG100098
## 4Silicibacter_phage 0.00000 MG100098
## 5Clavibacter_phage 0.00000 MG100098
 Serratia_phage 0.00000 MG100098

Input the mapping file.  
Command
INPUT_MAP <- read.delim(
Expected result
##NexteraXT_SampleID NexteraXT_RunName NexteraXT_Virome_SampleID
## 1MG100151 NexteraXT_007 MG100102
## 2MG100150 NexteraXT_007 MG100101
## 3MG100149 NexteraXT_007 <NA>
## 4MG100146 NexteraXT_007 MG100098
## NexteraXT_Virome_RunName SubjectID  
## 1NexteraXT_005 1  
## 2NexteraXT_005 1  
## 3<NA> 1  
## 4NexteraXT_005 1  

Here we need to reformat the mapping files. This means only looking at the two time points for which we have a complete data set (we have only partial data for time point 1), as well as excluding the sites and subjects for which we only have partial sampling (c(“Ba”,“Ph”,“Vf”,“Neg”)).  
Command
SUBSET_MAP <- INPUT_MAP[which(INPUT_MAP$TimePoint %in% c(2,3)), ]
SUBSET_MAP <- SUBSET_MAP[-which(SUBSET_MAP$Site_Symbol %in% c(

Now we can easily plot the viral and phage order relative abundance information by skin site. Get a vector of the sample IDs that we want to pull out for analysis.  
Command
KEEP_SAMPLES <- as.vector(SUBSET_MAP$NexteraXT_Virome_SampleID)
INPUT_SUBSET <- INPUT_ORDER[which(INPUT_ORDER$V3 %in% c(KEEP_SAMPLES)), ]

For this analysis we are removing those contigs which did not have a hit. We are only look at those contigs with phage/virus hits.  
Command
INPUT_SUBSET <- INPUT_SUBSET[-which(INPUT_SUBSET$V1 %in% c(

These variables are called species but they are only named this. They really contain the order information.  
Command
SPECIES_MEAN <- ddply(INPUT_MERGE, c(

Also filter out those taxa that account for less than 0.5% of the mean relative abundance.  
Command
MeanForExclusion <- ddply(SPECIES_MEAN, c(

Now only use the filtered taxa.  
Command
SpeciesMeanFiltered <- SPECIES_MEAN[c(SPECIES_MEAN$V1 %in% MeanForExclusionCut$V1),]

Take a look at the data frame.  
Command
head(SpeciesMeanFiltered)
Expected result
##Site_Symbol V1 mean
## 1Ac Caudovirales 6804.984
## 4Ac Multiple 3488.321
## 6Ac Unclassified_Order 1544.165
## 7Ac Caudovirales 8729.835
 ## 10Ac Multiple 3893.010
## 12Ac Unclassified_Order 2434.145

Plot the results.  
Command
ggplot(SpeciesMeanFiltered, aes(x=Site_Symbol, y=mean, group=V1, fill=V1)) + theme_bw() + geom_bar(stat=
Expected result

Now that we have looked at the information for taxonomic order, we also want to look at the species level. Here we are only going to look at the top 10 taxa, and include the remaining taxa relative abundance information as “Other”.  
Command
#Same formatting and parsing as above
KEEP_SAMPLES <- as.vector(SUBSET_MAP$NexteraXT_Virome_SampleID)
INPUT_SUBSET <- INPUT_SPECIES[which(INPUT_SPECIES$V3 %in% c(KEEP_SAMPLES)), ]
INPUT_SUBSET <- INPUT_SUBSET[-which(INPUT_SUBSET$V1 %in% c(

Get the top ten taxa so we can specifically look at them.  
Command
TOP_TEN_MEAN <- ddply(SPECIES_MEAN, c(

Get the rest of the relative abundance taxa into the "other" category.  
Command
FINAL_OTHER_SUM <- data.frame(tapply(FINAL_OTHER$V2, INDEX=list(FINAL_OTHER$V3), FUN=sum))
FINAL_OTHER_SUM$SampleID <- c(row.names(FINAL_OTHER_SUM))
colnames(FINAL_OTHER_SUM) <- c(
Expected result
##V3 V1 V2 NexteraXT_SampleID
## 21MG100195 Multiple 6.39256e+03 MG100171
## 49MG100195 Mycobacterium_phage 1.31141e+02 MG100171
## 57MG100195 Propionibacterium_phage 1.69708e+00 MG100171
## 81MG100195 Pseudomonas_phage 1.16539e+05 MG100171
## 85MG100195 Streptococcus_phage 1.36360e+02 MG100171
## 87MG100195 Human_papillomavirus 1.89501e+03 MG100171

Order according to relative abundance.  
Command
TOP10_WITH_OTHER <- ddply(TOTAL_FINAL, c(

Append other to the vector.  
Command
ORDER_MEAN_NAMES_WITH_OTHER <- c(ORDER_MEAN_NAMES_WITH_OTHER, 

Get plotting of species by patient for graphing.  
Note
The colors and other minor cosmetics can be fixed in illustrator.
Command
ggplot(TOTAL_FINAL, aes(x=factor(SubjectID), y=V2, fill=V1, order=V1)) + theme_bw() + geom_bar(stat=
Expected result

We are also interested in some specific viral species relative abundances by site, after looking at the general subject profiles above. Here we look at the specific relative abundances of HPV, propionibacterium phages, and staphylococcus phages. First format the data. Get a list of the sample names.  
Command
SAMPLE_NAMES <- as.vector(unique(INPUT_SUBSET$V3))
IN_SUBSET_REL_ABUND <- data.frame(lapply(SAMPLE_NAMES, function(i) { SUBSET <- INPUT_SUBSET[c(INPUT_SUBSET$V3==i),] SUM <- sum(SUBSET$V2) SUBSET$Rel_Abund <- 100 * SUBSET$V2 / SUM colnames(SUBSET) <- c(

First we looked at the relative abundances of Staphylococcus phages. This is written as a set of copy/pasted script sections for taxonomic group.  
Command
STAPH_REL_ABUND <- REL_ABUND_MELT[c(REL_ABUND_MELT$Taxa==

Plot the data by occlusion status.  
Command
ggplot(STAPH_MERGE, aes(x=Occlusion, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 1 rows containing non-finite values (stat_boxplot).

Perform stats using the kruskalmc subroutine.  
Command
STAPH_MERGE$Occlusion <- factor(STAPH_MERGE$Occlusion)
kruskalmc(STAPH_MERGE$value, STAPH_MERGE$Occlusion)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

 obs.dif critical.dif difference
Exposed-Intermittently_Occluded10.76401 36.09383 FALSE
Exposed-Occluded44.70092 23.71043 TRUE
Intermittently_Occulded-Occluded55.46493 34.98437 TRUE

Plot by site category.  
Command
ggplot(STAPH_MERGE, aes(x=Site_Categories, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 1 rows containing non-finite values (stat_boxplot).

Run the stats.  
Command
STAPH_MERGE$Site_Categories <- factor(STAPH_MERGE$Site_Categories)
kruskalmc(STAPH_MERGE$value, STAPH_MERGE$Site_Categories)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

 obs.dif critical.dif difference
Intermittently_Moist-Moist51.958476 28.54913 TRUE
Intermittently_Moist-Sebaceous53.065524 28.43069 TRUE
Moist-Sebaceous1.107048 25.32023 FALSE

Get the Propionibacerium phage rows.  
Command
PROP_REL_ABUND <- REL_ABUND_MELT[c(REL_ABUND_MELT$Taxa==

Plot by occlusion status.  
Command
ggplot(PROP_MERGE, aes(x=Occlusion, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 27 rows containing non-finite values (stat_boxplot).

Run the stats.  
Command
PROP_MERGE$Occlusion <- factor(PROP_MERGE$Occlusion)
kruskalmc(PROP_MERGE$value, PROP_MERGE$Occlusion)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

 obs.dif critical.dif difference
Exposed-Intermittently_Occluded31.54397 36.09383 FALSE
Exposed-Occluded20.64971 23.71043 FALSE
Intermittently_Occluded-Occluded10.89427 34.98437 FALSE

Plot by site category.  
Command
ggplot(PROP_MERGE, aes(x=Site_Categories, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 27 rows containing non-finite values (stat_boxplot).

Run the stats.  
Command
PROP_MERGE$Site_Categories <- factor(PROP_MERGE$Site_Categories)
kruskalmc(PROP_MERGE$value, PROP_MERGE$Site_Categories)
Expected result
 obs.dif critical.dif difference
Intermittently_Moist-Moist10.71380 28.54913 FALSE
Intermittently_Moist-Sebaceous44.36156 28.43069 TRUE
Moist-Sebaceous55.07535 25.32023 TRUE

Finally do the same analysis for HPV.  
Command
HPV_REL_ABUND <- REL_ABUND_MELT[c(REL_ABUND_MELT$Taxa==

Plot by occlusion status.  
Command
ggplot(HPV_MERGE, aes(x=Occlusion, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 56 rows containing non-finite values (stat_boxplot).

Run the stats.  
Command
HPV_MERGE$Occlusion <- factor(HPV_MERGE$Occlusion)
kruskalmc(HPV_MERGE$value, HPV_MERGE$Occlusion)
Expected result
 obs.dif critical.dif difference
Exposed-Intermittently_Occluded58.94431 36.09383 TRUE
Exposed-Occluded33.67678 23.71043 TRUE
Intermittently_Occulded-Occluded25.26754 34.98437 FALSE

Plot by site category.  
Command
ggplot(HPV_MERGE, aes(x=Site_Categories, y=value)) + theme_bw() + geom_boxplot(notch=TRUE) + scale_y_log10() + ggtitle(
Expected result
## Warning in loop_apply(n, do.ply): Removed 56 rows containing non-finite values (stat_boxplot).

Run the stats.  
Command
HPV_MERGE$Site_Categories <- factor(HPV_MERGE$Site_Categories)
kruskalmc(HPV_MERGE$value, HPV_MERGE$Site_Categories)
Expected result
## Multiple comparison test after Kruskal-Wallis
## p.value: 0.05
## Comparisons

 obs.dif critical.dif difference
Intermittently_Moist-Moist22.01579 28.54913 FALSE
Intermittently_Moist-Sebaceous36.03696 28.43069 TRUE
Moist-Sebaceous58.05275 25.32023 TRUE

##	V1	V2	V3
## 1	No_hit	99626.40000	MG100098
## 2	Unclassified_Order	5596.65000	MG100098
## 3	Caudovirales	11968.50000	MG100098
## 4	Herpesvirales	7.61005	MG100098
## 5	Mononegavirales	49.08570	MG100098

##	V1	V2	V3
## 1	Achromobacter_phage	0.00000	MG100098
## 3	Human_immunodeficiency_virus	1.08351	MG100098
## 4	Silicibacter_phage	0.00000	MG100098
## 5	Clavibacter_phage	0.00000	MG100098
	Serratia_phage	0.00000	MG100098

##	NexteraXT_SampleID	NexteraXT_RunName	NexteraXT_Virome_SampleID
## 1	MG100151	NexteraXT_007	MG100102
## 2	MG100150	NexteraXT_007	MG100101
## 3	MG100149	NexteraXT_007	<NA>
## 4	MG100146	NexteraXT_007	MG100098
##	NexteraXT_Virome_RunName	SubjectID
## 1	NexteraXT_005	1
## 2	NexteraXT_005	1
## 3	<NA>	1
## 4	NexteraXT_005	1

##	Site_Symbol	V1	mean
## 1	Ac	Caudovirales	6804.984
## 4	Ac	Multiple	3488.321
## 6	Ac	Unclassified_Order	1544.165
## 7	Ac	Caudovirales	8729.835
## 10	Ac	Multiple	3893.010
## 12	Ac	Unclassified_Order	2434.145

##	V3	V1	V2	NexteraXT_SampleID
## 21	MG100195	Multiple	6.39256e+03	MG100171
## 49	MG100195	Mycobacterium_phage	1.31141e+02	MG100171
## 57	MG100195	Propionibacterium_phage	1.69708e+00	MG100171
## 81	MG100195	Pseudomonas_phage	1.16539e+05	MG100171
## 85	MG100195	Streptococcus_phage	1.36360e+02	MG100171
## 87	MG100195	Human_papillomavirus	1.89501e+03	MG100171

	obs.dif	critical.dif	difference
Exposed-Intermittently_Occluded	10.76401	36.09383	FALSE
Exposed-Occluded	44.70092	23.71043	TRUE
Intermittently_Occulded-Occluded	55.46493	34.98437	TRUE

Public workspaceScript R4: Virome Taxonomy

Script R4: Virome Taxonomy