Protocol Citation: Maria Balkey, Ruth Timme, Candace Hope Bias, Errol Strain, Tina Pfefer 2024. Guidance for populating and validating GenomeTrakr metadata templates (BioSample and SRA). protocols.io https://dx.doi.org/10.17504/protocols.io.eq2ly3x1pgx9/v11
Version created by Ruth Timme
Manuscript citation:
Timme, R.E., Wolfgang, W.J., Balkey, M. et al. Optimizing open data to support one health: best practices to ensure interoperability of genomic data from bacterial pathogens. One Health Outlook 2, 20 (2020). https://doi.org/10.1186/s42522-020-00026-3
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Please note that this protocol is public domain, which supersedes the CC-BY license default used by protocols.io.
Abstract
PURPOSE: This protocol provides instructions for preparing and filling out the metadata templates necessary for direct submission to the National Center for Biotechnology Information (NCBI). These instructions are relevant for the majority of whole genome sequencing data submissions derived from enteric bacterial pathogens collected for surveillance purposes.
SCOPE: This protocol provides detailed instructions for the following two metadata templates:
1. BioSample metadata: guidelines for obtaining, populating, and validating the BioSample metadata template.
2. SRA metadata: guidelines for populating the sequence-level metadata template.
Version history:
v11: Change of guidance to use the One Health Enteric BioSample package for all submissions.
v10: updates to the GenomeTrakr-extended pathogen biosample template (GT-pathogen package-OHE v0.3.xlsx) and release of newly available One Health Enteric package custom templates.
v9: Bug fix
v8: Updated the picklists in the GenomeTrakr-extended pathogen package, "GT-pathogen package-OHE v0.2.2.xlsx". Also provided a direct link to the newly published One Health Enteric package.
v7: Updated the picklists in the GenomeTrakr-extended pathogen package, "GT-pathogen package-OHE v0.2.2.xlsx" and added an incremental update file for the DRAFT One Health Enteric Package that includes extensive edits compared to v6.
v6: Added the One Health Enteric package presented at IAFP 2021 meeting.
Materials
Gather the following contextual information for each pure culture isolate:
organism name
lab name that collected the sample
collection date
collection source
Geographic location of sample collection
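As a minimal sketch, a short script can confirm that each isolate record carries all five items before sequencing starts. The field names below (organism, collected_by, collection_date, isolation_source, geo_loc_name) are assumed labels chosen for illustration; substitute the column headers used in your own template.

```python
# Hypothetical field names for the five items listed above; adjust to your template.
REQUIRED_FIELDS = [
    "organism",
    "collected_by",
    "collection_date",
    "isolation_source",
    "geo_loc_name",
]


def missing_fields(record):
    """Return the required fields that are absent or blank in one isolate record."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


if __name__ == "__main__":
    isolate = {
        "organism": "Salmonella enterica",
        "collected_by": "",
        "collection_date": "2024-01-15",
        "isolation_source": "ENV swab sponge",
    }
    # Lists the fields that still need to be gathered for this isolate.
    print(missing_fields(isolate))
```

An empty list means the record has a value for every item; the check looks only for presence and does not validate the values themselves.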
Before start
Before collecting sequence data for your isolates, ensure that you can provide the minimum metadata recommended by your coordinating surveillance body.
Overview
This protocol provides instructions on acquiring and completing two distinct metadata templates essential for the submission of enteric bacterial pathogen surveillance data to the National Center for Biotechnology Information (NCBI).
Two metadata templates are required for each NCBI submission:
1. BioSample: metadata describing the isolate, sample collected, and submitting lab information.
2. SRA: metadata describing the sequence data collection
BioSample metadata
Templates for BioSample submission:
Visit the GenomeTrakr Metadata Validation System (GMVS) at https://gmvs.fda.gov/ to download custom, version-controlled BioSample metadata template(s). Current and previous versions of these templates can also be found on the OHE GitHub page.
Our custom templates include extensive guidance and controlled vocabularies for most attributes in the package.
Sub-packages are available for download covering the major One Health sample types (human/animal hosts, food, food facilities, and farm/environment). Users can choose to populate the full package, or one or more of the sub-packages.
When visiting GMVS, click on the ONE HEALTH ENTERIC icon within the NCBI Metadata Validation box.
Follow the GMVS instructions to download the BioSample metadata template (click on the cloud download icon). Choose the most appropriate template for your sample types (the full package or one of the sub-packages).
One Health Enteric Metadata Sheet Upload
Review the -Instructions- sheet within the OHE Excel file.
Instructions Sheet within the OHE Excel file
Proceed to fill out the BioSample metadata template in the -UserEntry- sheet of the Excel file. Where possible, use terms from the dropdown menus for each metadata attribute.
User Entry Sheet within the OHE Excel file
Validate BioSample metadata template
Upload the completed OHE metadata template to GMVS and click on the -VALIDATE- icon.
The GMVS validation system will check each entry and also run LexMapr for auto-assignment of the IFSAC category.
After completion, GMVS will report the results of the validation.
Click -OK-.
No validation errors:
If metadata passes GMVS validation, each record will be displayed with all the metadata and you will have an option to export metadata.
Click on the -EXPORT METADATA- icon.
Review the validated BioSample metadata and the LexMapr output (cleaned-up isolation_source entries and proposed IFSAC_category). Go to step #3 to review the LexMapr output.
Address validation errors:
If there are validation errors, GMVS will generate a log report. If only a few errors are reported, edit the values by clicking the EDIT icon; otherwise, export the reviewed template by clicking -EXPORT ERRORS-.
Make the required changes, click -RE-IMPORT SHEET-, and re-validate the template.
Evaluation of LexMapr Output
LexMapr is a tool that processes free text from isolation_source and generates standard terminology from controlled vocabularies/ontologies, including FoodOn, GenEpiO, UBERON, ENVO, NCBI Taxon, and specific food and environmental categories from the Interagency Food Safety Analytics Collaboration (IFSAC) controlled vocabulary.
Each GMVS record subject to validation is analyzed with LexMapr: the isolation_source attribute receives an ontological descriptor and a category from the IFSAC+ terminology for food safety. After the records are processed with LexMapr, a report is generated.
The LexMapr report generated at GMVS contains the following columns: strain, isolation_source, isolation_source (LexMapr generated), and IFSAC_category.
strain | isolation_source | isolation_source (LexMapr generated) | IFSAC_category
FDA189213897_s001 | ENV swab sponge | environmental swab sponge | environmental-factory/production facility
LexMapr Output generated during validation.
Review the LexMapr-generated recommendations for isolation_source and IFSAC_category. If you agree with the recommendations, copy the contents of these fields into the validated BioSample metadata template, under the isolation_source and IFSAC_category fields, respectively.
If the IFSAC category(s) recommended for the sample type are incorrect or not appropriate, leave that entry blank for the submission and submit a bug report to genometrakr@fda.hhs.gov.
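For larger batches, the copy step can be scripted. The following is a minimal sketch, not part of GMVS: it assumes the LexMapr report and the BioSample template have each been loaded as a list of dictionaries keyed by column header, that rows are matched on strain, and that a rejected recommendation keeps the submitter's original isolation_source while IFSAC_category is left blank. The function name and the rejected_strains parameter are hypothetical.

```python
LEXMAPR_SOURCE = "isolation_source (LexMapr generated)"


def apply_lexmapr(records, report, rejected_strains=()):
    """Copy accepted LexMapr values into copies of the BioSample records.

    records: list of dicts, one per BioSample row.
    report: list of dicts holding the four LexMapr report columns.
    rejected_strains: strains whose recommendation the reviewer disagreed with.
    """
    by_strain = {row["strain"]: row for row in report}
    updated = []
    for record in records:
        new_record = dict(record)  # leave the caller's data untouched
        row = by_strain.get(record["strain"])
        if row is not None:
            if record["strain"] in rejected_strains:
                # Disagreement: leave the IFSAC entry blank for submission.
                new_record["IFSAC_category"] = ""
            else:
                new_record["isolation_source"] = row[LEXMAPR_SOURCE]
                new_record["IFSAC_category"] = row["IFSAC_category"]
        updated.append(new_record)
    return updated


if __name__ == "__main__":
    records = [{"strain": "FDA189213897_s001",
                "isolation_source": "ENV swab sponge",
                "IFSAC_category": ""}]
    report = [{"strain": "FDA189213897_s001",
               "isolation_source": "ENV swab sponge",
               LEXMAPR_SOURCE: "environmental swab sponge",
               "IFSAC_category": "environmental-factory/production facility"}]
    print(apply_lexmapr(records, report))
```

Manual review still decides which strains go into rejected_strains; the script only applies that decision consistently.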
Save the validated BioSample metadata template and proceed with NCBI submissions.
If you have sequences to submit that belong to more than one BioProject, create a separate submission + metadata table for each of your BioProjects.
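One way to prepare the separate tables is to group the rows by BioProject before export. This is a sketch under assumptions: the rows are dictionaries, and the column holding the BioProject is named bioproject_accession here purely for illustration. The accessions in the example are placeholders, not real projects.

```python
from collections import defaultdict


def split_by_bioproject(rows, key="bioproject_accession"):
    """Group metadata rows into one list per BioProject, preserving row order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


if __name__ == "__main__":
    rows = [
        {"sample_name": "UT-12345", "bioproject_accession": "PRJNA000001"},
        {"sample_name": "UT-12346", "bioproject_accession": "PRJNA000002"},
        {"sample_name": "UT-12347", "bioproject_accession": "PRJNA000001"},
    ]
    for accession, group in split_by_bioproject(rows).items():
        # Each group becomes its own submission and metadata table.
        print(accession, [row["sample_name"] for row in group])
```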
Entering fastq filenames in the spreadsheet: On a Mac, you can copy the file names directly from the folder and paste them into a spreadsheet. This is not possible on a PC using copy and paste, but it can be done from the command line.
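A platform-independent alternative is a short Python script that prints the file names so they can be pasted into the spreadsheet. This is a minimal sketch; the set of accepted extensions is an assumption and may need adjusting for your data.

```python
from pathlib import Path


def list_fastq_names(folder):
    """Return sorted names of FASTQ files (plain or gzipped) directly inside folder."""
    suffixes = (".fastq", ".fastq.gz", ".fq", ".fq.gz")
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.name.lower().endswith(suffixes)
    )


if __name__ == "__main__":
    # Prints one file name per line for the current folder; pass another path as needed.
    print("\n".join(list_fastq_names(".")))
```

The function does not descend into subfolders, so run it once per run folder.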
Finally, it is important to develop a QA/QC step to make sure the files are associated with the correct sample name. For example, use the LEFT function in Excel to strip off the text appended to the file name, then use an exact-match comparison to confirm that the result matches the sample name.
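The same check can be sketched in Python for labs that prefer a script over spreadsheet formulas. It assumes, as the Excel approach does, that each file name begins with the sample name followed by appended text (for example UT-12345_R1.fastq); the function names and the sample file names are illustrative.

```python
def filename_matches_sample(sample_name, filename):
    """Mimic LEFT plus an exact match: compare the leading characters to sample_name."""
    return filename[: len(sample_name)] == sample_name


def mismatched_rows(rows, file_columns=("Filename", "Filename2")):
    """Return (sample_name, filename) pairs where the file name does not start with the sample name."""
    problems = []
    for row in rows:
        for column in file_columns:
            name = row.get(column, "")
            if name and not filename_matches_sample(row["sample_name"], name):
                problems.append((row["sample_name"], name))
    return problems


if __name__ == "__main__":
    rows = [
        {"sample_name": "UT-12345", "Filename": "UT-12345_R1.fastq", "Filename2": "UT-12345_R2.fastq"},
        {"sample_name": "UT-12346", "Filename": "UT-12346_R1.fastq", "Filename2": "UT-12347_R2.fastq"},
    ]
    # Reports the second row's Filename2, which belongs to a different sample.
    print(mismatched_rows(rows))
```

A prefix comparison cannot tell UT-12345 from UT-123456, so sample names that are prefixes of one another need a stricter rule, such as also checking the character that follows the prefix.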
Field | Description | Example
sample_name | Include the same ID here as you entered for "sample_name" in the BioSample submission template. | UT-12345
library_ID | The library name should be a unique ID relevant to your workflow. It can be an autogenerated ID from your LIMS system or a modification of your sample_name. | UT-12345.6
Title | Short, free text description that identifies the data on public pages. For example: {methodology} of {organism}: {sample_name} | WGS of Salmonella enterica: UT-12345
library_strategy | Overall sequencing strategy or approach. Choose from NCBI pick list. | WGS
library_source | Molecule type used to make the library. | genomic
library_selection | Library capture method. | random
Library_layout | Choose from NCBI pick list. | paired
platform | Sequencing platform. | Illumina
instrument_model | Name of the sequencing instrument. | MiSeq
Design_description | Free text description of methods. |
Filetype | File format name for the raw sequence data. Choose from NCBI pick list. | Fastq
Filename | Include ALL of the files resulting from this library. **Add additional fields if there are more than two files (e.g. Filename3). | genome_r1.fastq (*must be exact)
Filename2 | | genome_r2.fastq (*must be exact)
Filename3-8 | List other fastq file names (e.g. for NextSeq data). |
SRA metadata data template guidance and examples for WGS submission.
Save the second sheet (SRA_data) as a TSV (tab-delimited file) for upload in the “SRA metadata” tab within the submission portal.
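If the SRA rows are assembled in a script rather than in Excel, the tab-delimited file can be written directly. The sketch below is illustrative only: the column names and order are copied from the guidance table in this protocol and should be confirmed against the headers in your downloaded template, the example values come from the same table, and make_title and write_sra_tsv are hypothetical helper names.

```python
import csv
import sys

# Column names as they appear in the guidance table; confirm against your template.
SRA_COLUMNS = [
    "sample_name", "library_ID", "Title", "library_strategy", "library_source",
    "library_selection", "Library_layout", "platform", "instrument_model",
    "Design_description", "Filetype", "Filename", "Filename2",
]


def make_title(methodology, organism, sample_name):
    """Build the title as {methodology} of {organism}: {sample_name}."""
    return f"{methodology} of {organism}: {sample_name}"


def write_sra_tsv(rows, handle):
    """Write SRA metadata rows as tab-delimited text with a header line.

    Columns missing from a row are written as empty cells; unknown columns raise an error.
    """
    writer = csv.DictWriter(handle, fieldnames=SRA_COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    row = {
        "sample_name": "UT-12345",
        "library_ID": "UT-12345.6",
        "Title": make_title("WGS", "Salmonella enterica", "UT-12345"),
        "library_strategy": "WGS",
        "library_source": "genomic",
        "library_selection": "random",
        "Library_layout": "paired",
        "platform": "Illumina",
        "instrument_model": "MiSeq",
        "Filetype": "Fastq",
        "Filename": "genome_r1.fastq",
        "Filename2": "genome_r2.fastq",
    }
    # Pass an open file instead of sys.stdout to save the table for upload.
    write_sra_tsv([row], sys.stdout)
```

Libraries with more than two files would need the additional Filename3 to Filename8 columns appended to SRA_COLUMNS.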
*NCBI should also accept the original Excel-formatted file.