Dec 23, 2021

Creating Planet Microbe Data Packages

  • 1 University of Arizona
Protocol Citation: Kai Blumberg, Alise J Ponsero, Bonnie Hurwitz 2021. Creating Planet Microbe Data Packages. protocols.io https://dx.doi.org/10.17504/protocols.io.bzsdp6a6
Manuscript citation:
Blumberg KL, Ponsero AJ, Bomhoff M, Wood-Charlson EM, DeLong EF, Hurwitz BL, Ontology-Enriched Specifications Enabling Findable, Accessible, Interoperable, and Reusable Marine Metagenomic Datasets in Cyberinfrastructure Systems. Frontiers in Microbiology doi: 10.3389/fmicb.2021.765268
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
Created: November 04, 2021
Last Modified: December 23, 2021
Protocol Integer ID: 54821
Funders Acknowledgements:
National Science Foundation
Grant ID: OCE-1639614
Abstract
Marine microbial ecology requires the systematic comparison of biogeochemical and sequence data to analyze environmental influences on the distribution and variability of microbial communities. With ever-increasing quantities of metagenomic data, there is a growing need to make datasets Findable, Accessible, Interoperable, and Reusable (FAIR) across diverse ecosystems. FAIR data is essential to developing analytical frameworks that integrate microbiological, genomic, ecological, oceanographic, and computational methods. Although community standards defining the minimal metadata required to accompany sequence data exist, they haven't been consistently used across projects, precluding interoperability. Moreover, these data are not machine-actionable or discoverable by cyberinfrastructure systems. By making ‘omic and physicochemical datasets FAIR to machine systems, we can enable sequence data discovery and reuse based on machine-readable descriptions of environments or physicochemical gradients.

In this work, we developed a novel technical specification for dataset encapsulation for the FAIR reuse of marine metagenomic and physicochemical datasets within cyberinfrastructure systems. This includes using Frictionless Data Packages enriched with terminology from environmental and life-science ontologies to annotate measured variables, their units, and the measurement devices used. This approach was implemented in Planet Microbe, a cyberinfrastructure platform and marine metagenomic web-portal. Here, we discuss the data properties built into the specification to make global ocean datasets FAIR within the Planet Microbe portal. We additionally discuss the selection of, and contributions to marine-science ontologies used within the specification. Finally, we use the system to discover data by which to answer various biological questions about environments, physicochemical gradients, and microbial communities in meta-analyses. This work represents a future direction in marine metagenomic research by proposing a specification for FAIR dataset encapsulation that, if adopted within cyberinfrastructure systems, would automate the discovery, exchange, and re-use of data needed to answer broader reaching questions than originally intended.
Setup
Step one is to download the relevant repositories. Clone both repositories into the same directory.

Open a terminal or shell program and navigate to a base working directory.

git clone git@github.com:hurwitzlab/planet-microbe-datapackages.git
Note: you can also clone over HTTPS instead of SSH, e.g., https://github.com/hurwitzlab/planet-microbe-datapackages.git


git clone git@github.com:hurwitzlab/planet-microbe-scripts.git


Prepare datapackage.json
Prepare data for creation of Planet Microbe Frictionless Datapackage

How to create the datapackage.json wrapper for your tabular dataset(s).

Prepare ONTOLOGY_MAPPING.tsv file(s)

Make sure to do this for every individual TSV file that is part of one datapackage product.

This can be done by transposing the column headers of your tabular dataset so that they become the entries in the first column of a new TSV file. Simply copy the column headers of the original dataset using Excel (or a similar program) and paste them, transposed, into a new sheet.



Next add the following as the new column headers:

parameter | rdf type purl label | rdf type purl | pm:searchable | units label | units purl | measurement source purl label | measurement source purl | pm:measurement source protocol | pm:source url | frictionless type | frictionless format
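As an alternative to copying and pasting in a spreadsheet, the skeleton of the mapping file can be generated programmatically. The following is a minimal sketch, not part of the Planet Microbe scripts: the function name `mapping_skeleton` and the two-column example dataset are hypothetical, and it assumes the dataset is tab-separated with a single header row.

```python
import csv
import io

# Column headers of the ONTOLOGY_MAPPING.tsv file, as listed in this protocol.
MAPPING_COLUMNS = [
    "parameter",
    "rdf type purl label",
    "rdf type purl",
    "pm:searchable",
    "units label",
    "units purl",
    "measurement source purl label",
    "measurement source purl",
    "pm:measurement source protocol",
    "pm:source url",
    "frictionless type",
    "frictionless format",
]


def mapping_skeleton(dataset_tsv_text):
    """Return ONTOLOGY_MAPPING.tsv text with one blank row per dataset column."""
    # Read only the header row of the tab-separated dataset.
    header = next(csv.reader(io.StringIO(dataset_tsv_text), delimiter="\t"))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(MAPPING_COLUMNS)
    for column_name in header:
        # The dataset column name goes in "parameter"; the rest is filled in by hand.
        writer.writerow([column_name] + [""] * (len(MAPPING_COLUMNS) - 1))
    return out.getvalue()


# Hypothetical two-column dataset, used only for illustration.
example = "Depth\tTemperature\n5\t18.2\n"
skeleton = mapping_skeleton(example)
print(skeleton)
```

The resulting text can be saved as ONTOLOGY_MAPPING.tsv and then completed one column at a time.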
Populate the new ONTOLOGY_MAPPING.tsv one column at a time.

The "pm:searchable" column is a boolean flag (TRUE or FALSE) for columns that have a corresponding "rdf type purl" annotation. If a column from the dataset describes a physicochemical variable from pmo_searchable_terms.tsv that you want to make searchable, fill out its row completely and set "pm:searchable" to "TRUE". Otherwise, "pm:searchable" can be set to "FALSE" for that column and everything else left blank.

  1. Fill in the "rdf type purl" column with IRIs of terms from pmo_searchable_terms.tsv.
  2. Fill in the "units purl" column with IRIs of terms from the UO (Units Ontology), which you can view and browse online.
  3. Fill in the "measurement source purl" column with terms from the OBI (Ontology for Biomedical Investigations) device hierarchy, which you can view and browse online.
  4. Fill in the "pm:source url" column with any link relevant to the dataset column's collection or source, e.g., "https://www.ncbi.nlm.nih.gov/bioproject/213098".
  5. Fill in the "frictionless type" column with one of: "string", "number", "date", or "datetime".
  6. If necessary, fill in the "frictionless format" column for date and datetime types with a custom formatting string pattern, e.g., "%Y-%m-%dT%H:%M", or with "default" for ISO standard datetimes.
  7. If uploading INSDC-submitted data (from EBI or NCBI), make sure to annotate one column as a "Biosample identifier assigned by the National Center for Biotechnology Information", i.e., "http://purl.obolibrary.org/obo/PMO_00000122".
  8. If uploading MIxS-compliant data, make sure to annotate the appropriate columns for the ENVO MIxS triad: biome, environmental feature, and environmental material.



Note that the columns "rdf type purl label", "units label", and "measurement source purl label" do not need to be filled in; they are only there for your convenience as you fill out the accompanying "purl" columns.
Critical
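The custom patterns used in the "frictionless format" column follow strptime-style syntax, so a candidate pattern can be tried against a few values from the dataset before it is written into the mapping file. This is a minimal sketch using only the Python standard library; the helper name `matches_format` and the sample values are hypothetical.

```python
from datetime import datetime

# Custom pattern from the "frictionless format" column in this protocol.
PATTERN = "%Y-%m-%dT%H:%M"


def matches_format(value, pattern=PATTERN):
    """Return True if `value` parses with the given strptime-style pattern."""
    try:
        datetime.strptime(value, pattern)
        return True
    except ValueError:
        return False


# Hypothetical values, used only for illustration.
ok = matches_format("2010-03-15T08:30")
bad = matches_format("15/03/2010 08:30")
print(ok, bad)
```

A value that does not match the pattern would need either a different pattern or reformatting in the source table.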
An example of a filled-out ONTOLOGY_MAPPING.tsv file might look like the following:

parameter | rdf type purl label | rdf type purl | pm:searchable | units label | units purl | measurement source purl label | measurement source purl | pm:measurement source protocol | pm:source url | frictionless type | frictionless format
SampleID_Tara | centrally registered identifier | http://purl.obolibrary.org/obo/IAO_0000578 | FALSE | | | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | string |
BioSample | Biosample identifier assigned by the National Center for Biotechnology Information | http://purl.obolibrary.org/obo/PMO_00000122 | FALSE | | | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | string |
Chlorophyll Sensor | concentration of chlorophyll in water | http://purl.obolibrary.org/obo/ENVO_3100036 | TRUE | miligram per cubic meter | http://purl.obolibrary.org/obo/PMO_00000132 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Depth | depth of water | http://purl.obolibrary.org/obo/ENVO_3100031 | TRUE | meter | http://purl.obolibrary.org/obo/UO_0000008 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Description | comment on investigation | http://purl.obolibrary.org/obo/OBI_0001898 | FALSE | | | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | string |
Event Date/Time End | specimen collection time measurement datum stop | http://purl.obolibrary.org/obo/PMO_00000009 | TRUE | time unit | http://purl.obolibrary.org/obo/UO_0000003 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | datetime | %Y-%m-%dT%H:%M
Event Date/Time Start | specimen collection time measurement datum start | http://purl.obolibrary.org/obo/PMO_00000008 | TRUE | time unit | http://purl.obolibrary.org/obo/UO_0000003 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | datetime | %Y-%m-%dT%H:%M
Event Label | centrally registered specimen collection event identifier | http://purl.obolibrary.org/obo/PMO_00000056 | FALSE | | | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | string |
Latitude End | latitude coordinate measurement datum stop | http://purl.obolibrary.org/obo/PMO_00000079 | TRUE | degree | http://purl.obolibrary.org/obo/UO_0000185 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Latitude Start | latitude coordinate measurement datum start | http://purl.obolibrary.org/obo/PMO_00000076 | TRUE | degree | http://purl.obolibrary.org/obo/UO_0000185 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Longitude End | longitude coordinate measurement datum stop | http://purl.obolibrary.org/obo/PMO_00000078 | TRUE | degree | http://purl.obolibrary.org/obo/UO_0000185 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Longitude Start | longitude coordinate measurement datum start | http://purl.obolibrary.org/obo/PMO_00000077 | TRUE | degree | http://purl.obolibrary.org/obo/UO_0000185 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Nitrate Sensor | concentration of nitrate in water | http://purl.obolibrary.org/obo/ENVO_3100022 | TRUE | micromole per litre | http://purl.obolibrary.org/obo/UO_0010003 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Oxygen Sensor | concentration of oxygen in water | http://purl.obolibrary.org/obo/ENVO_09200021 | TRUE | micromole per kilogram | http://purl.obolibrary.org/obo/UO_0010004 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Salinity Sensor | liquid water salinity | http://purl.obolibrary.org/obo/PMO_00000014 | TRUE | practical salinity unit | http://purl.obolibrary.org/obo/PMO_00000037 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Size Fraction Lower Threshold | aquatic sample minimum filter fractionation size threshold | http://purl.obolibrary.org/obo/PMO_00000022 | TRUE | micrometer | http://purl.obolibrary.org/obo/UO_0000017 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Size Fraction Upper Threshold | aquatic sample maximum filter fractionation size threshold | http://purl.obolibrary.org/obo/PMO_00000023 | TRUE | micrometer | http://purl.obolibrary.org/obo/UO_0000017 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
Temperature | temperature of water | http://purl.obolibrary.org/obo/ENVO_09200014 | TRUE | degree Celsius | http://purl.obolibrary.org/obo/UO_0000027 | | | | https://www.ncbi.nlm.nih.gov/bioproject/213098 | number |
purl_biome | biome | http://purl.obolibrary.org/obo/ENVO_00000428 | TRUE | | | | | | | |
purl_feature | environmental feature | http://purl.obolibrary.org/obo/ENVO_00002297 | TRUE | | | | | | | |
purl_material | environmental material | http://purl.obolibrary.org/obo/ENVO_00010483 | TRUE | | | | | | | |
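Before generating JSON from a mapping file, it can help to check it against the rules described in this section. The sketch below is an illustration only and is not one of the Planet Microbe scripts: `check_mapping` is a hypothetical helper that tests three of the rules stated above (the TRUE/FALSE flag, the purl required for searchable columns, and the allowed frictionless types), and the `demo` text is made-up input containing one deliberately faulty row.

```python
import csv
import io

# Values allowed in the "frictionless type" column, per this protocol.
ALLOWED_TYPES = {"string", "number", "date", "datetime"}


def check_mapping(mapping_tsv_text):
    """Return a list of human-readable problems found in a mapping file."""
    problems = []
    reader = csv.DictReader(io.StringIO(mapping_tsv_text), delimiter="\t")
    for row in reader:
        name = row.get("parameter") or "<unnamed>"
        searchable = (row.get("pm:searchable") or "").strip().upper()
        if searchable not in {"TRUE", "FALSE"}:
            problems.append(f"{name}: pm:searchable must be TRUE or FALSE")
        # A searchable column needs an ontology term to be searched by.
        if searchable == "TRUE" and not (row.get("rdf type purl") or "").strip():
            problems.append(f"{name}: searchable column lacks an rdf type purl")
        ftype = (row.get("frictionless type") or "").strip()
        if ftype and ftype not in ALLOWED_TYPES:
            problems.append(f"{name}: unexpected frictionless type {ftype!r}")
    return problems


# Made-up mapping with one valid row ("Depth") and one faulty row ("Notes").
demo = (
    "parameter\trdf type purl\tpm:searchable\tfrictionless type\n"
    "Depth\thttp://purl.obolibrary.org/obo/ENVO_3100031\tTRUE\tnumber\n"
    "Notes\t\tTRUE\ttext\n"
)
problems = check_mapping(demo)
for problem in problems:
    print(problem)
```

An empty result only means these three checks passed; it does not replace the validation steps later in the protocol.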
Creating Data Package Templates
The following example command generates a tabular data package JSON template for the OSD data set:

cd planet-microbe-scripts

cat example_ontology_mappings/OSD.tsv | ./scripts/schema_tsv_to_json.py > example_data_packages/osd/datapackage.json
The JSON was then hand-edited to add missing information and correct names, types, and units.

For more information on FD Table Schemas see http://frictionlessdata.io/specs/table-schema/
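To give a sense of what such a conversion involves, the sketch below turns one mapping row into a Table Schema field descriptor. It is not the code of schema_tsv_to_json.py, and the output of the real script may differ: `row_to_field` is a hypothetical function, the fallback to "string" for a blank type is an assumption, and the "pm:searchable" key name is assumed to mirror the mapping column. Only "name", "type", "format", and "rdfType" are standard Table Schema field properties.

```python
import json


def row_to_field(row):
    """Build a Table Schema field descriptor from one mapping row (a dict)."""
    field = {
        "name": row["parameter"],
        # Assumption: a blank type falls back to "string".
        "type": row.get("frictionless type") or "string",
        "format": row.get("frictionless format") or "default",
    }
    if row.get("rdf type purl"):
        # "rdfType" is a standard Table Schema field property.
        field["rdfType"] = row["rdf type purl"]
    # Assumption: the key name "pm:searchable" mirrors the mapping column;
    # the real script may name or structure this differently.
    field["pm:searchable"] = row.get("pm:searchable", "").upper() == "TRUE"
    return field


# The "Depth" row from the example mapping in this protocol.
depth_row = {
    "parameter": "Depth",
    "rdf type purl": "http://purl.obolibrary.org/obo/ENVO_3100031",
    "pm:searchable": "TRUE",
    "frictionless type": "number",
    "frictionless format": "",
}
depth_field = row_to_field(depth_row)
print(json.dumps(depth_field, indent=2))
```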

This can be replicated for a new dataset by running a similar command with the newly created ONTOLOGY_MAPPING.tsv file(s). Run the command from the new project directory inside the planet-microbe-datapackages repository, replacing PATH_TO_planet-microbe-scripts_REPO with the path to your copy of the planet-microbe-scripts repository (../../planet-microbe-scripts if both repositories were cloned into the same base directory). E.g.,


cd planet-microbe-datapackages/NEW_DATASET/

cat ONTOLOGY_MAPPING.tsv | PATH_TO_planet-microbe-scripts_REPO/scripts/schema_tsv_to_json.py > datapackage_component1.json


Finalize the datapackage_component.json file

Open the file in a text editor such as Atom or Sublime Text and modify the following information, which is specific to the resource in question.

{
"name": "#ADD NAME e.g., sample",
"title": "#ADD TITLE",
"profile": "tabular-data-resource",
"pm:resourceType": "#ADD TYPE",
"path": "#ADD FILEPATH e.g., FILEPATH.tsv",
"dialect": {
"delimiter": "#ADD DELIMITER e.g., \t",
"header": true,
"caseSensitiveHeader": true
},
"format": "csv",
"mediatype": "text/tab-separated-values",
"encoding": "UTF-8",
"hash": "#OPTIONALLY ADD FILE HASH e.g., eac36d6747691e1061718509828598b1",
"schema": {
"fields": [ ... ]
}
}
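The optional "hash" value can be computed rather than copied by hand. The example hash in the template is 32 hexadecimal characters long, which is consistent with MD5 (the default hash for a Frictionless Data Resource), so the sketch below assumes MD5. The function name `md5_of_file` and the temporary demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary TSV file (made-up content).
with tempfile.NamedTemporaryFile(delete=False, suffix=".tsv") as tmp:
    tmp.write(b"Depth\tTemperature\n5\t18.2\n")
demo_hash = md5_of_file(tmp.name)
os.remove(tmp.name)
print(demo_hash)
```

In practice, `md5_of_file` would be called with the same file path that is entered in the "path" field of the resource.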


Note that the following "pm:resourceType" values are accepted:
"pm:resourceType": "niskin",

"pm:resourceType": "campaign",

"pm:resourceType": "sample",

"pm:resourceType": "sampling_event",

"pm:resourceType": "ctd",

The following is a skeleton of the main Planet Microbe Datapackage JSON file for a new project:
{
"@context": {
},
"profile": "tabular-data-package",
"name": "#ADD NAME",
"title": "#ADD TITLE",
"description": "#ADD DESCRIPTION",
"homepage": "#ADD HOMEPAGE",
"keywords": [
"#ADD KEYWORDS",
"#ADD MORE KEYWORDS"
],
"sources": [
{
"title": "#ADD SOURCE(s)",
"path": "ADD URL FOR SOURCE"
}
],
"licenses": [
{
"name": "CC-BY-3.0",
"title": "Creative Commons Attribution 3.0 Unported"
}
],
"resources": [
{#ADD datapackage_component1.json CODE HERE},
{... #Add more datapackage components if needed}
]
}
Save this as a new (main) datapackage.json file for the project.

For each tabular dataset file prepared in the steps above, open the corresponding "datapackage_componentX.json" file and paste the entire contents of that JSON file in place of one of the {} entries inside the "resources" block. This way, each file is annotated in a FAIR way and can be uploaded into the Planet Microbe Database.

Make sure to also fill in the other information about the data package, for example the "name", "description", and "keywords".
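The copy-and-paste assembly can also be scripted. The following is a minimal sketch under stated assumptions, not a Planet Microbe tool: `assemble_package` and `load_json` are hypothetical helpers, and the skeleton and component dictionaries are trimmed-down stand-ins for the real files.

```python
import json


def load_json(path):
    """Read a JSON file such as datapackage_component1.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def assemble_package(skeleton, components):
    """Return a copy of `skeleton` with `components` as its resources list."""
    package = dict(skeleton)
    package["resources"] = list(components)
    return package


# Trimmed-down stand-ins for the skeleton and two component files.
skeleton = {"profile": "tabular-data-package", "name": "example-project", "resources": []}
components = [
    {"name": "sample", "path": "sample.tsv"},
    {"name": "ctd", "path": "ctd.tsv"},
]
package = assemble_package(skeleton, components)
print(json.dumps(package, indent=2))
```

With real files, the components would come from calls such as `load_json("datapackage_component1.json")`, and the result would be written to datapackage.json with `json.dump`.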

Optional step: add constraints for select fields

Optionally, for fields that should follow a numeric constraint, such as Latitude and Longitude, modify the JSON entry for the corresponding column header in each resource, changing it from something like the following:


{
"name": "Latitude",
"type": "number",
"format": "default"
},


into something like this, with a constraint block that specifies numerical limits, e.g., -90 and 90 for Latitude:

{
"name": "Latitude",
"type": "number",
"format": "default",
"constraints": {
"required": false,
"minimum": -90,
"maximum": 90
}
},
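The effect of such a constraint block can be illustrated with a short sketch. It is a simplified illustration, not the validation performed by the Frictionless tools: `add_range_constraint` and `violates` are hypothetical helpers, and the bounds are treated as inclusive, as in the Table Schema specification.

```python
def add_range_constraint(field, minimum, maximum, required=False):
    """Return a copy of a field descriptor with a numeric range constraint."""
    updated = dict(field)
    updated["constraints"] = {
        "required": required,
        "minimum": minimum,
        "maximum": maximum,
    }
    return updated


def violates(field, value):
    """Return True if a numeric value falls outside the field's min/max."""
    limits = field.get("constraints", {})
    low = limits.get("minimum")
    high = limits.get("maximum")
    # Bounds are inclusive: a value equal to a limit is accepted.
    return (low is not None and value < low) or (high is not None and value > high)


latitude = add_range_constraint(
    {"name": "Latitude", "type": "number", "format": "default"}, -90, 90
)
print(violates(latitude, 91.5))   # outside the allowed range
print(violates(latitude, -45.0))  # inside the allowed range
```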

Validate datapackage.json
Validating Data Packages

First, make sure you have a Python 3 virtual environment set up:

virtualenv -p $(which python3) python3
source python3/bin/activate
pip install datapackage

Alternatively, create a conda environment:

conda create --name planet_microbe
conda activate planet_microbe
conda install -c conda-forge datapackage

Run the validation script from within the planet-microbe-scripts repository:

scripts/validate_datapackage.py [-r resource]

Example command to validate a Datapackage:


scripts/validate_datapackage.py ../planet-microbe-datapackages/OSD/datapackage.json
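Some structural mistakes can be caught with a quick pre-check before running the validation script. The sketch below uses only the Python standard library; it is not validate_datapackage.py and does not replace it. `precheck` is a hypothetical helper, the set of required keys is an assumption based on the templates in this protocol, and the accepted "pm:resourceType" values are the ones listed earlier.

```python
# Accepted "pm:resourceType" values, as listed in this protocol.
ACCEPTED_RESOURCE_TYPES = {"niskin", "campaign", "sample", "sampling_event", "ctd"}


def precheck(package):
    """Return a list of structural problems in a data package descriptor."""
    problems = []
    resources = package.get("resources") or []
    if not resources:
        problems.append("package has no resources")
    for index, resource in enumerate(resources):
        label = resource.get("name", f"resource #{index}")
        # Assumption: these keys are the minimum each resource should carry.
        for key in ("name", "path", "schema"):
            if key not in resource:
                problems.append(f"{label}: missing {key!r}")
        rtype = resource.get("pm:resourceType")
        if rtype not in ACCEPTED_RESOURCE_TYPES:
            problems.append(f"{label}: unaccepted pm:resourceType {rtype!r}")
        if not resource.get("schema", {}).get("fields"):
            problems.append(f"{label}: schema has no fields")
    return problems


# Made-up descriptors: one well-formed, one with an unaccepted resource type.
good = {"resources": [{"name": "sample", "path": "sample.tsv",
                       "pm:resourceType": "sample",
                       "schema": {"fields": [{"name": "Depth"}]}}]}
bad = {"resources": [{"name": "bottle", "path": "bottle.tsv",
                      "pm:resourceType": "bottle",
                      "schema": {"fields": [{"name": "Depth"}]}}]}
good_problems = precheck(good)
bad_problems = precheck(bad)
print(good_problems)
print(bad_problems)
```

For a real package, the descriptor would first be loaded from datapackage.json with `json.load`.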

Optional step: run GoodTables validation for constraints

This is based on the original goodtables script available from: https://goodtables.readthedocs.io/en/latest/

First install the goodtables library using pip:
pip install goodtables

Example call using the goodtables command:
goodtables OSD/datapackage.json