May 14, 2026

Bioinformatics and Ecological Risk Classification of Environmental Water Microbiomes

  • Grace Semabia Kpeli (1),
  • Prince Agyirey-Kwakye (2),
  • Counseller Nutifafa Livingstone (1),
  • Lillian Teye Cudjoe (1),
  • Emmanuel Edem Dotse (1),
  • Ebenezer Nyarko (1),
  • Ebenezer Zar (1),
  • Obed Nyarko-Otu (1),
  • Hubert Kwame Agbogli (1),
  • Godwin Glilekpeh (1),
  • Solomon Korankye (1),
  • Priscilla Essandoh (1),
  • Daniel Elorm Kabotso (3)
  • (1) Department of Biomedical Science, University of Health and Allied Sciences, Sokode Lokoe, Ho, Ghana;
  • (2) AttoDiagnostics Limited, Norwich, UK;
  • (3) Department of Basic Science, University of Health and Allied Sciences, Sokode Lokoe, Ho, Ghana
Protocol Citation: Grace Semabia Kpeli, Prince Agyirey-Kwakye, Counseller Nutifafa Livingstone, Lillian Teye Cudjoe, Emmanuel Edem Dotse, Ebenezer Nyarko, Ebenezer Zar, Obed Nyarko-Otu, Hubert Kwame Agbogli, Godwin Glilekpeh, Solomon Korankye, Priscilla Essandoh, Daniel Elorm Kabotso 2026. Bioinformatics and Ecological Risk Classification of Environmental Water Microbiomes. protocols.io https://dx.doi.org/10.17504/protocols.io.rm7vz44o2lx1/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: May 14, 2026
Last Modified: May 14, 2026
Protocol Integer ID: 317102
Keywords: bioinformatics, 16S rRNA, Greengenes, alpha diversity, ecological classification, risk classification, R, environmental microbiology, ecological risk classification of environmental water microbiome, environmental water microbiome, sequencing bioinformatics pipeline, classification of microbial isolate, bioinformatics pipeline, microbial isolate, ecological risk classification framework, rrna amplicon, ecological risk classification, sequencing data, greengenes database, environmental water sample, data from environmental water sample, taxonomic assignment, alpha diversity analysis in rstudio, water sample
Disclaimer
Creative Commons Attribution License (CC BY 4.0)
Abstract
This protocol describes the post-sequencing bioinformatics pipeline and ecological risk classification framework applied to 16S rRNA amplicon sequencing data from environmental water samples. It covers taxonomic assignment using the Greengenes database, alpha diversity analysis in RStudio, normalization thresholds, and a four-category ecological and risk-based classification of microbial isolates.
Guidelines
- Map sequencing reads against the Greengenes database version 13.5 to assign taxonomy at kingdom, phylum, class, order, family, genus, and species levels.
- Apply minimum read count thresholds: Genus and Family level — minimum 500 reads; Species level — minimum 100 reads.
- Apply normalization thresholds: Genus and Family level — 0.5; Species level — 0.25.
Materials
- Ion Reporter™ Software v5.18 (Thermo Fisher Scientific) — for demultiplexing output
- Greengenes database v13.5 — for taxonomic assignment
- RStudio IDE v2025.05.0+496 "Mariposa Orchid" (Posit Software, PBC)
- R packages: ggplot2, dplyr, tidyr, readr, viridis, pheatmap
- Microsoft Excel LTSC Standard v16.99.1 — for data organization
Part A — Taxonomic Assignment
Import quality-controlled, demultiplexed reads into the analysis pipeline.
Map sequencing reads against the Greengenes database version 13.5 to assign taxonomy at kingdom, phylum, class, order, family, genus, and species levels.
Apply minimum read count thresholds: Genus and Family level — minimum 500 reads; Species level — minimum 100 reads.
Apply normalization thresholds: Genus and Family level — 0.5; Species level — 0.25.
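The threshold step above can be sketched as a small filter. This is only an illustration in Python, not the tooling the protocol uses. The function name `filter_taxa` and the sample data are hypothetical. The protocol does not state the unit of the normalization thresholds, so the sketch assumes they are a percentage of each sample's total reads; adjust that line if your normalization is defined differently.

```python
# Thresholds copied from the protocol text.
MIN_READS = {"family": 500, "genus": 500, "species": 100}
NORM_THRESHOLD = {"family": 0.5, "genus": 0.5, "species": 0.25}


def filter_taxa(counts, rank):
    """Keep the taxa of one sample that pass both thresholds for `rank`.

    counts: dict mapping taxon name -> read count for a single sample.
    ASSUMPTION: the normalization threshold is interpreted as percent of
    the sample's total reads; the protocol does not specify the unit.
    """
    total = sum(counts.values())
    kept = {}
    for taxon, reads in counts.items():
        percent = 100.0 * reads / total if total else 0.0
        if reads >= MIN_READS[rank] and percent >= NORM_THRESHOLD[rank]:
            kept[taxon] = reads
    return kept


# Hypothetical genus-level counts for one sample.
sample = {"Genus_A": 12000, "Genus_B": 480, "Genus_C": 510, "Genus_D": 90000}
print(filter_taxa(sample, "genus"))
# prints {'Genus_A': 12000, 'Genus_D': 90000}
```

In this example Genus_B fails the 500-read minimum, while Genus_C passes it but falls just under 0.5% of the sample total, so both are dropped.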
Part B — Alpha Diversity Analysis
Import taxonomic assignment tables into RStudio IDE. Ensure data is in a tidy format compatible with ggplot2 and pheatmap workflows.
Calculate alpha diversity indices (e.g., Shannon, Simpson, observed species) for each sample using appropriate R functions.
Generate diversity plots using ggplot2 and pheatmap packages. Apply the viridis colour palette for accessible, publication-quality figures.
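For readers who want to see the arithmetic behind the Part B steps, here is a minimal sketch of the tidy reshaping and the three indices. The protocol performs this step in R and does not name the functions it uses, so the Python below is an illustration only, and `to_long` and `alpha_diversity` are hypothetical names. Two assumptions are made: Shannon uses the natural logarithm, and Simpson is reported as 1 minus the sum of squared proportions. Other conventions exist, so state the variant you use.

```python
import math


def to_long(wide):
    """Reshape {sample: {taxon: reads}} into tidy (sample, taxon, reads) rows."""
    return [(s, t, n) for s, taxa in wide.items() for t, n in taxa.items()]


def alpha_diversity(counts):
    """Compute observed richness, Shannon and Simpson for one sample.

    counts: iterable of read counts, one per taxon.
    ASSUMPTIONS: natural-log Shannon; Simpson reported as 1 - sum(p^2).
    """
    present = [n for n in counts if n > 0]
    total = sum(present)
    if total == 0:
        return {"observed": 0, "shannon": 0.0, "simpson": 0.0}
    props = [n / total for n in present]
    shannon = -sum(p * math.log(p) for p in props)
    simpson = 1.0 - sum(p * p for p in props)
    return {"observed": len(present), "shannon": shannon, "simpson": simpson}


# Hypothetical sample with four equally abundant taxa and one absent taxon:
# observed = 4, Shannon = ln(4), Simpson = 0.75.
print(alpha_diversity([10, 10, 10, 10, 0]))
```

The long format returned by `to_long` has one row per sample and taxon, which is the layout ggplot2 expects in R.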
Part C — Ecological and Risk-Based Classification
Categorise all detected microorganisms into one of four ecological and risk-based classes:
- Healthcare-associated: Organisms commonly found in healthcare settings, typically associated with nosocomial infections.
- Environmental/aquatic: Organisms naturally occurring in soil, water bodies, and surrounding ecosystems without direct association with human disease.
- Enteric: Organisms originating from faecal contamination, typically associated with the gastrointestinal tract of humans and animals.
- Opportunistic pathogens: Organisms generally harmless in healthy individuals but capable of causing infection in immunocompromised hosts.
For each organism assigned to the healthcare-associated, enteric, or opportunistic pathogen category, record the supporting reference or database source used for the classification.
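One way to keep the classification table consistent is to check each record against the four categories and flag missing sources. The sketch below is an illustration in Python, not part of the protocol's own tooling. The record layout, the function name `check_record`, and the placeholder taxa and references are all hypothetical; no real classification is implied.

```python
# The four classes defined in Part C.
CATEGORIES = {
    "healthcare-associated",
    "environmental/aquatic",
    "enteric",
    "opportunistic pathogen",
}
# Classes for which the protocol asks for a recorded supporting source.
NEEDS_SOURCE = CATEGORIES - {"environmental/aquatic"}


def check_record(taxon, category, sources):
    """Return a list of problems for one classification record (empty if none)."""
    problems = []
    if category not in CATEGORIES:
        problems.append(f"{taxon}: unknown category '{category}'")
    elif category in NEEDS_SOURCE and not sources:
        problems.append(f"{taxon}: no supporting source recorded")
    return problems


# Placeholder records: (taxon, category, list of supporting sources).
records = [
    ("Taxon_A", "enteric", ["<reference 1>"]),
    ("Taxon_B", "environmental/aquatic", []),
    ("Taxon_C", "opportunistic pathogen", []),
]
for record in records:
    for problem in check_record(*record):
        print(problem)
# prints: Taxon_C: no supporting source recorded
```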
Part D — Data Organization and Visualization
Transfer all taxonomic, diversity, and classification results into Microsoft Excel. Maintain separate sheets for raw counts, normalized data, diversity metrics, and classification outputs.
Produce publication-quality figures in RStudio using ggplot2 (bar charts, diversity plots), pheatmap (heatmaps), and viridis (colour scales).
Quality Control
Apply minimum read count and normalization thresholds consistently across all samples before any comparative analysis.
Cross-reference classification assignments against at least two independent published sources or databases.
Retain raw (pre-normalization) data files alongside processed files for reproducibility.
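One optional way to support the retention of raw files is a checksum manifest, so that a later reader can confirm the raw data are unchanged. The protocol does not prescribe this. The sketch below is an illustration in Python, and the function names and directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(raw_dir, manifest_path):
    """Write one '<sha256>  <relative path>' line per file under raw_dir.

    Returns the number of files listed. Keep manifest_path outside raw_dir
    so the manifest does not list itself on a later run.
    """
    raw_dir = Path(raw_dir)
    lines = []
    for path in sorted(raw_dir.rglob("*")):
        if path.is_file():
            rel = path.relative_to(raw_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {rel}")
    text = "".join(line + "\n" for line in lines)
    Path(manifest_path).write_text(text, encoding="utf-8")
    return len(lines)
```

A call such as `write_manifest("raw_data", "raw_data_manifest.txt")`, with both paths being placeholders, would record the state of the raw files at the time of analysis.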
Protocol references
Greengenes database v13.5 — DeSantis et al., 2006.
R packages: ggplot2 (Wickham, 2016), dplyr, tidyr, readr (Wickham et al.), viridis (Garnier et al.), pheatmap (Kolde, 2019).
RStudio IDE v2025.05.0+496 — Posit Software, PBC.