Mar 11, 2016

Public workspaceScript R13: OPF and AMG Analysis

  • HANNIGAN GD1,
  • GRICE EA1,
  • ET AL.1
  • 1DEPARTMENT OF DERMATOLOGY UNIVERSITY OF PENNSYLVANIA
  • VERVE Net
  • Club Grice
Icon indicating open access to content
QR code linking to this content
Protocol CitationHANNIGAN GD, GRICE EA, ET AL. 2016. Script R13: OPF and AMG Analysis. protocols.io https://dx.doi.org/10.17504/protocols.io.ejkbckw
Manuscript citation:
Kindler L, Stoliartchouk A, Teytelman L, Hurwitz BL, Method-centered digital communities on protocols.io for fast-paced scientific innovation. F1000Research doi: 10.12688/f1000research.9453.2
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
Created: February 16, 2016
Last Modified: November 09, 2017
Protocol Integer ID: 2380
Abstract
This protocols outlines our definitions of core operational protein families (OPFs), potential auxiliary metabolic genes (AMGs), sharing of genes across anatomical sites, and Bray-Curtis dissimilarity of the OPFs by anatomical site. Based on methods from the following publication:

Hannigan, Geoffrey D., et al. "The Human Skin Double-Stranded DNA Virome: Topographical and Temporal Diversity, Genetic Enrichment, and Dynamic Associations with the Host Microbiome." mBio 6.5 (2015): e01578-15.


Guidelines
sessionInfo()

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.4 (Yosemite)
## ## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] magrittr_1.5    formatR_1.2    tools_3.2.0    htmltools_0.2.6
## [5] yaml_2.1.13    stringi_0.4-1    rmarkdown_0.7    knitr_1.10.5
## [9] stringr_1.0.0    digest_0.6.8    evaluate_0.7
Load the required R packages.
Command
library(ggplot2)
packageVersion(
Expected result

## [1] '1.0.1'


Expected result

## [1] '2.3.0'


Expected result

## [1] '1.0.1'


Expected result

## [1] '0.3.35'

Import the needed data files and quantify the number of core operational protein families. Also write the names of the core OPFs so that I can query them and identify them after subsetting the input files.

Command
# Import the protein cluster relative abundance data frame
INPUT <- read.delim(
Keep only those columns with values matching these found in the mapping file subset. This will get rid of the locations we are not evaluating.
Command
InputSubset <- INPUT[,which(colnames(INPUT) %in% MAP_SUBSET$NexteraXT_Virome_SampleID)]
InputSubsetWithNames <- INPUT[,c(1,which(colnames(INPUT) %in% MAP_SUBSET$NexteraXT_Virome_SampleID))]
rownames(InputSubsetWithNames) <- c(as.character(InputSubsetWithNames[,1]))
InputSubsetWithNames <- InputSubsetWithNames[,-1]
From here you can go to the diversity calculations below.
Command
TotalOrfs <- length(InputSubsetWithNames[,1])
# Remove the rows with any fields having a relative abundance of zero (gene not present)
# First go through and see what rows have zero values; see http://stackoverflow.com/questions/9977686/how-to-remove-rows-with-a-zero-value-in-r
RowSubset <- apply(InputSubsetWithNames, 1, function(row) all(row !=0 ))
CoreAMG <- InputSubsetWithNames[RowSubset,]
TotalCoreOrfs <- length(CoreAMG[,1])
Write the names of the core protein clusters so that I can pull out the fasta sequences and query them against UniProt.
Command
write.table(rownames(CoreAMG), file=
Determine how many genes are core to sites by location type.
Command
UniqSiteTypes <- factor(unique(MAP_SUBSET$Site_Categories))
for(x in UniqSiteTypes)
{ MapSebaceous <- MAP_SUBSET[which(MAP_SUBSET$Site_Categories %in% x), ] InputSebaceous <- InputSubset[,which(colnames(InputSubset) %in% MapSebaceous$NexteraXT_Virome_SampleID)] RowSubset <- apply(InputSebaceous, 1, function(row) all(row !=0 )) SebaceousAMG <- InputSebaceous[RowSubset,] print(x) print(length(SebaceousAMG[,1]))
}
Expected result
## [1] "Sebaceous"
## [1] 25
## [1] "Moist"
## [1] 15
## [1] "Intermittently_Moist"
## [1] 38
See number of core protein clusters by site.
Command
UniqOcc <- factor(unique(MAP_SUBSET$Occlusion))
for(x in UniqOcc)
{ MapSebaceous <- MAP_SUBSET[which(MAP_SUBSET$Occlusion %in% x), ] InputSebaceous <- InputSubset[,which(colnames(InputSubset) %in% MapSebaceous$NexteraXT_Virome_SampleID)] RowSubset <- apply(InputSebaceous, 1, function(row) all(row !=0 )) SebaceousAMG <- InputSebaceous[RowSubset,] print(x) print(length(SebaceousAMG[,1]))
}
Expected result
## [1] "Occluded"
## [1] 15
## [1] "Exposed"
## [1] 24
## [1] "Intermittently_Occluded"
## [1] 174
Try again for all of the different anatomical groups for plotting.
Command
UniqSites <- factor(unique(MAP_SUBSET$Site_Symbol))
anatomicalSiteDF <- c()
for(x in UniqSites) { MapSebaceous <- MAP_SUBSET[which(MAP_SUBSET$Site_Symbol %in% x), ] InputSebaceous <- InputSubset[,which(colnames(InputSubset) %in% MapSebaceous$NexteraXT_Virome_SampleID)] RowSubset <- apply(InputSebaceous, 1, function(row) all(row !=0 )) SebaceousAMG <- InputSebaceous[RowSubset,] anatomicalSiteDF <- rbind(anatomicalSiteDF,c(x,length(SebaceousAMG[,1])))
}
anatomicalSiteDF <- as.data.frame(anatomicalSiteDF)
as.numeric(as.character(anatomicalSiteDF[,2]))
Expected result

## [1] 72 114 64 494 41 15 45 174


Plot the protein cluster count.
Command
AmgSitePlot <- ggplot(anatomicalSiteDF, aes(x=V1, y=as.numeric(as.character(V2)))) + theme_bw() + geom_bar(stat=
Expected result
Finally we calculated the diversity of ORFs by different anatomical sites. Transpose the input file.
Command
InputT <- data.frame(t(InputSubset))
INPUT_SUBSET_DIST_MATRIX <- vegdist(InputT, method = 
Visualize the distance matrix using NMDS.
Command
BRAY_ORD_NMDS <- metaMDS(INPUT_SUBSET_DIST_MATRIX,k=3)
BRAY_ORD_FIT = data.frame(MDS1 = BRAY_ORD_NMDS$points[,1], MDS2 = BRAY_ORD_NMDS$points[,2], MDS3 = BRAY_ORD_NMDS$points[,3])
#Record the stress value
BRAY_ORD_NMDS$stress
BRAY_ORD_FIT$SampleID <- rownames(BRAY_ORD_FIT)
NMDS_AND_MAP <- merge(BRAY_ORD_FIT, MAP_SUBSET, by.x=
Expected result
## Run 0 stress 0.142143
## Run 1 stress 0.1460666
## Run 2 stress 0.1438215
## Run 3 stress 0.1435173
## Run 4 stress 0.1408786
## ... New best solution
## ... procrustes: rmse 0.03486057 max resid 0.3573651
## Run 5 stress 0.1429197
## Run 6 stress 0.1436577
## Run 7 stress 0.1439074
## Run 8 stress 0.1473716
## Run 9 stress 0.1443224
## Run 10 stress 0.1431153
## Run 11 stress 0.147699
## Run 12 stress 0.1447852
## Run 13 stress 0.1427725
## Run 14 stress 0.1455643
## Run 15 stress 0.1470337
## Run 16 stress 0.1446069
## Run 17 stress 0.1431871
## Run 18 stress 0.1442179
## Run 19 stress 0.146792
## Run 20 stress 0.1463412
Expected result

## [1] 0.1408786


Plot the data.
Command
s3d <- scatterplot3d(NMDS_AND_MAP$MDS1,NMDS_AND_MAP$MDS2,NMDS_AND_MAP$MDS3, pch=16, color=as.integer(factor(NMDS_AND_MAP$Site_Categories)), type=
Expected result
Plot by occlusion site status.
Command
s3d <- scatterplot3d(NMDS_AND_MAP$MDS1,NMDS_AND_MAP$MDS2,NMDS_AND_MAP$MDS3, pch=16, color=as.integer(factor(NMDS_AND_MAP$Occlusion)), type=
Expected result