Jul 09, 2020

SOP for populating EBI submission templates (ENA)

  • 1Quadram Institute Bioscience;
  • 2University of British Columbia;
  • 3US Food and Drug Administration;
  • 4Centers for Disease Control and Prevention
  • Coronavirus Method Development Community
  • PHA4GE
Protocol Citation: Nabil-Fareed Alikhan, Emma Griffiths, Ruth Timme, Duncan MacCannell 2020. SOP for populating EBI submission templates (ENA). protocols.io https://dx.doi.org/10.17504/protocols.io.bh5dj826
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: July 01, 2020
Last Modified: November 10, 2021
Protocol Integer ID: 38789
Keywords: metadata, INSDC, ERC000033, ENA, EBI, SARS-Cov2, COVID-19
Disclaimer
Please note that this protocol is public domain, which supersedes the CC-BY license default used by protocols.io.
Abstract
Guidance on how to populate the extended PHA4GE metadata package for SARS-CoV-2 submissions, maximizing interoperability for COVID-19 surveillance.
Three templates needed for EBI SARS-CoV-2 submission
Guidance for populating the three templates for SARS-CoV-2 submission to EBI.

This protocol describes the fields needed to populate the ENA virus pathogen checklist. However, the primary PHA4GE guidance should be followed to ensure that the controlled vocabularies and ontology terms are used to populate these fields.

Link to PHA4GE SARS-CoV-2 metadata specification: https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification

Link to PHA4GE SARS-CoV-2 EBI submission protocol: ENA, BioSample, and BioProject: https://www.protocols.io/private/BD4C35AC52C942B2D927E662ABC3D195

Link to PHA4GE SARS-CoV2 EBI assembly submission protocol 

ENA virus pathogen reporting
PHA4GE ENA virus pathogen reporting standard checklist for sample metadata:

ENA submission spreadsheets are not simple tables with field names in the first row and the specifics of each sample in subsequent rows. Instead, the spreadsheet requires particular header rows (denoted with "#"), e.g.:

| #checklist_accession | ERC000033 | | | | | |
| #unique_name_prefix | | | | | | |
| sample_name | tax_id | scientific_name | host age | organism | collection date | geographic location (country and/or sea) |
| hCOV-sample-ENG-11 | 2697049 | Severe acute respiratory syndrome coronavirus 2 | 50 | Severe acute respiratory syndrome coronavirus 2 | 05/05/2020 | United Kingdom |
| #template | | | | | | |
| #units | | | years | | | |
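
The "#" layout shown above can also be written out by a script rather than by hand. The following is a minimal sketch in Python, assuming tab-separated output and the row order of the example above. The function name `build_sample_sheet` and its parameters are hypothetical, and the result should be compared against the template downloaded from ENA before submission.

```python
import csv
import io


def build_sample_sheet(checklist, columns, samples, units=None):
    """Return tab-separated text with '#'-prefixed header rows.

    Row order follows the example in this protocol (an assumption);
    check it against the template downloaded from ENA.
    """
    units = units or {}
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["#checklist_accession", checklist])
    writer.writerow(["#unique_name_prefix"])
    writer.writerow(columns)
    # One row per sample; missing values become empty cells.
    for sample in samples:
        writer.writerow([sample.get(col, "") for col in columns])
    writer.writerow(["#template"])
    # The marker occupies the first cell; units sit under their column.
    writer.writerow(["#units"] + [units.get(col, "") for col in columns[1:]])
    return buf.getvalue()


if __name__ == "__main__":
    sheet = build_sample_sheet(
        "ERC000033",
        ["sample_name", "tax_id", "host age"],
        [{"sample_name": "hCOV-sample-ENG-11", "tax_id": "2697049", "host age": "50"}],
        units={"host age": "years"},
    )
    print(sheet)
```

Keeping the column list in one place makes it easier to add the extra PHA4GE fields once and reuse the same layout for later submissions.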


As described in the PHA4GE SARS-CoV-2 EBI submission protocol, there will be an opportunity to download a template spreadsheet from ENA directly. We recommend that you add additional fields as per the PHA4GE guidelines and keep a blank copy of this spreadsheet for subsequent submissions.
Guidelines to populate the sample metadata sheet
These fields are required for all ENA submissions. They do not appear as part of the checklist. Their descriptions and guidance are tabulated below.

Description of required fields

| ENA Required Fields | Definition |
| --- | --- |
| sample_name | The user-provided name of the sample. |
| tax_id | The NCBITaxon identifier for the organism being sequenced. |
| scientific_name | The taxonomic name of the organism. |
| common_name | The common name of the organism. |
| sample_description | Free text description of the sample. |
| instrument_model | Name of the sequencing instrument. |
| library_source | Molecule type used to make the library. |
| library_selection | Library capture method. |
| library_strategy | Overall sequencing strategy or approach. |
| library_layout | Single or paired. |
| file_name | Include ALL of the files resulting from this library. Add additional fields if there are more than two files (e.g. Filename3). |

All sample submissions to ENA require this information. The descriptions and the mapping to the respective PHA4GE fields should help you transfer your metadata from the PHA4GE table into a form acceptable for ENA submission.
Guidance for required fields

| ENA Required Fields | PHA4GE Field | PHA4GE Guidance |
| --- | --- | --- |
| sample_name | specimen collector sample ID | This field can be populated by the PHA4GE field "specimen collector sample ID". |
| tax_id | N/A | Use "2697049" as the tax_id for SARS-CoV-2. |
| scientific_name | organism | This field can be populated by the PHA4GE field "organism". Provide the full name "Severe acute respiratory syndrome coronavirus 2". |
| common_name | N/A | Provide "SARS-CoV-2". |
| sample_description | N/A | |
| instrument_model | N/A | See ENA SRA pick list (e.g. Illumina MiSeq, iSeq 100, GridION, MinION, PacBio Sequel II). |
| library_source | N/A | See ENA SRA pick list (e.g. viral RNA, metagenomic). |
| library_selection | N/A | See ENA SRA pick list (e.g. random, PCR). |
| library_strategy | N/A | See ENA SRA pick list (e.g. WGS, RNA-Seq, Amplicon). |
| library_layout | N/A | See ENA SRA pick list (single, paired). |
| file_name | r1 fastq filename | This field can be populated by the PHA4GE field "r1 fastq filename". |

All sample submissions to ENA require this information. The descriptions and the mapping to the respective PHA4GE fields should help you transfer your metadata from the PHA4GE table into a form acceptable for ENA submission.
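
The mapping of the required fields can be sketched in a few lines of Python. This is a minimal illustration, assuming each PHA4GE record is held as a dictionary keyed by PHA4GE field name. The function name `to_ena_required` and the separate `run_info` dictionary (for the values that have no PHA4GE equivalent and come from the ENA SRA pick lists) are hypothetical.

```python
SARS_COV_2_TAX_ID = "2697049"
SARS_COV_2_NAME = "Severe acute respiratory syndrome coronavirus 2"


def to_ena_required(pha4ge, run_info):
    """Build the ENA required fields from one PHA4GE record.

    pha4ge:   dict keyed by PHA4GE field names.
    run_info: dict with values that have no PHA4GE field; these
              should be terms taken from the ENA SRA pick lists.
    """
    return {
        "sample_name": pha4ge["specimen collector sample ID"],
        "tax_id": SARS_COV_2_TAX_ID,
        # Fall back to the full organism name if the record lacks it.
        "scientific_name": pha4ge.get("organism", SARS_COV_2_NAME),
        "common_name": "SARS-CoV-2",
        "sample_description": run_info.get("sample_description", ""),
        "instrument_model": run_info["instrument_model"],
        "library_source": run_info["library_source"],
        "library_selection": run_info["library_selection"],
        "library_strategy": run_info["library_strategy"],
        "library_layout": run_info["library_layout"],
        "file_name": pha4ge["r1 fastq filename"],
    }


if __name__ == "__main__":
    record = {
        "specimen collector sample ID": "UT-12345",
        "organism": SARS_COV_2_NAME,
        "r1 fastq filename": "genome_r1.fastq",
    }
    run = {
        "instrument_model": "Illumina MiSeq",
        "library_source": "viral RNA",
        "library_selection": "PCR",
        "library_strategy": "Amplicon",
        "library_layout": "paired",
    }
    print(to_ena_required(record, run))
```

Missing pick-list values raise a `KeyError` here on purpose, so that an incomplete row is noticed before submission rather than silently left blank.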

For SARS-CoV-2 submissions, ENA asks that the metadata comply with Checklist ERC000033: https://www.ebi.ac.uk/ena/browser/view/ERC000033. The fields below are drawn from the ENA checklist. Their descriptions and the mapping to the respective PHA4GE fields should help you transfer your metadata from the PHA4GE table into a form acceptable for ENA submission.

| ENA Virus Checklist Field | ENA Definition | ENA Requirement Status |
| --- | --- | --- |
| subject exposure | Exposure of the subject to infected human or animals, such as poultry, wild bird or swine. If multiple exposures are applicable, please state them separated by semicolon. Example: poultry; wild bird | optional |
| subject exposure duration | Duration of the exposure of the subject to an infected human or animal. If multiple exposures are applicable, please state their duration in the same order in which you reported the exposure in the field 'subject exposure'. Example: 1 day; 0.33 days | optional |
| type exposure | Setting within which the subject is exposed to animals, such as farm, slaughterhouse, food preparation. If multiple exposures are applicable, please state their type in the same order in which you reported the exposure in the field 'subject exposure'. Example: backyard flock; confined animal feeding operation | optional |
| personal protective equipment | Use of personal protective equipment, such as gloves, gowns, during any type of exposure. Example: mask | optional |
| hospitalisation | Was the subject confined to a hospital as a result of virus infection or problems occurring secondary to virus infection? | optional |
| illness duration | The number of days the illness lasted. Example: 4 | optional |
| illness symptoms | The symptoms that have been reported in relation to the illness, such as cough, diarrhea, fever, headache, malaise, myalgia, nausea, runny_nose, shortness_of_breath, sore_throat. If multiple exposures are applicabl… | optional |
| collection date | The date of sampling, either as an instance (single point in time) or interval. In case no exact time is available, the date/time can be right truncated i.e. all of these are valid ISO8601 compliant times: 2008-01-23T19:23:10+00:00; 2008-01-23T19:23:10; 2008-01-23; 2008-01; 2008. | recommended |
| geographic location (country and/or sea) | The geographical origin of the sample as defined by the country or sea. Country or sea names should be chosen from the INSDC country list (http://insdc.org/country.html). | mandatory |
| geographic location (latitude) | The geographical origin of the sample as defined by latitude and longitude. The values should be reported in decimal degrees and in WGS84 system | recommended |
| geographic location (longitude) | The geographical origin of the sample as defined by latitude and longitude. The values should be reported in decimal degrees and in WGS84 system | recommended |
| geographic location (region and locality) | The geographical origin of the sample as defined by the specific region name followed by the locality name. | recommended |
| sample capture status | Reason for the sample collection. | recommended |
| host disease outcome | Disease outcome in the host. | recommended |
| host common name | common name of the host, e.g. human | mandatory |
| host subject id | a unique identifier by which each subject can be referred to, de-identified, e.g. #131 | mandatory |
| host age | age of host at the time of sampling; relevant scale depends on species and study, e.g. could be seconds for amoebae or centuries for trees | recommended |
| host health state | Health status of the host at the time of sample collection. | mandatory |
| host sex | Gender or sex of the host. | mandatory |
| host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which sample was obtained. | mandatory |
| virus identifier | Unique laboratory identifier assigned to the virus by the investigator. Strain name is not sufficient since it might not be unique due to various passages of the same virus. Format: up to 50 alphanumeric characters | recommended |
| collector name | Name of the person who collected the specimen. Example: John Smith | mandatory |
| collecting institution | Name of the institution to which the person collecting the specimen belongs. Format: Institute Name, Institute Address | mandatory |
| receipt date | Date on which the sample was received. Format: YYYY-MM-DD. Please provide the highest precision possible. If the sample was received by the institution and not collected, the 'receipt date' must be provided instead. Either the 'collection date' or 'receipt date' must be provided. If available, provide both dates. | recommended |
| sample storage conditions | Conditions at which sample was stored, usually storage temperature, duration and location | optional |
| definition for seropositive sample | The cut off value used by an investigator in determining that a sample was seropositive. | recommended |
| serotype (required for a seropositive sample) | Serological variety of a species characterised by its antigenic properties. For Influenza, HA subtype should be the letter H followed by a number between 1-16 unless novel subtype is identified and the NA subtype should be the letter N followed by a number between 1-9 unless novel subtype is identified. If only one of the subtypes have been tested then use the format H5Nx or HxN1. Example: H1N1 | recommended |
| isolate | individual isolate from which the sample was obtained | mandatory |
| strain | Name of the strain from which the sample was obtained. | optional |
| host habitat | Natural habitat of the avian or mammalian host. | recommended |
| isolation source host-associated | Name of host tissue or organ sampled for analysis. Example: tracheal tissue | recommended |
| host description | Other descriptive information relating to the host. | optional |
| gravidity | Whether or not the subject is gravid. If so, report date due or date post-conception and specify which of these two dates is being reported. | optional |
| host behaviour | Natural behaviour of the host. | recommended |
| isolation source non-host-associated | Describes the physical, environmental and/or local geographical source of the biological sample from which the sample was derived. Example: soil | recommended |

Fields from ENA Checklist ERC000033. The descriptions and the mapping to the respective PHA4GE fields should help you transfer your metadata from the PHA4GE table into a form acceptable for ENA submission.
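
A simple pre-submission check can catch empty mandatory cells and badly formatted collection dates. The sketch below is a minimal example in Python, assuming each sample is a dictionary keyed by ENA checklist field name. The list of mandatory fields is copied from the requirement status column above; the function name `check_sample` and the regular expression are illustrative assumptions, and ENA's own validation remains the authority.

```python
import re

# Fields marked "mandatory" in the checklist table above.
MANDATORY_FIELDS = [
    "geographic location (country and/or sea)",
    "host common name",
    "host subject id",
    "host health state",
    "host sex",
    "host scientific name",
    "collector name",
    "collecting institution",
    "isolate",
]

# Right-truncated ISO 8601 shapes: YYYY, YYYY-MM, YYYY-MM-DD, and an
# optional time with optional UTC offset. Only the shape is checked,
# not whether the month or day is a real calendar value.
ISO_DATE = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}:\d{2}([+-]\d{2}:\d{2})?)?)?)?$"
)


def check_sample(record):
    """Return a list of human-readable problems for one sample."""
    problems = []
    for field in MANDATORY_FIELDS:
        # An empty or whitespace-only cell counts as missing.
        if not str(record.get(field, "")).strip():
            problems.append(f"missing mandatory field: {field}")
    date = record.get("collection date", "")
    if date and not ISO_DATE.match(date):
        problems.append(f"collection date not in ISO 8601 form: {date}")
    return problems


if __name__ == "__main__":
    sample = {field: "example" for field in MANDATORY_FIELDS}
    sample["collection date"] = "05/05/2020"
    for problem in check_sample(sample):
        print(problem)
```

The check only flags empty cells; whether a null value is acceptable for a given mandatory field is a decision for the submitter and the data steward.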

| ENA Virus Checklist Field | PHA4GE Field | PHA4GE Guidance |
| --- | --- | --- |
| subject exposure | exposure event | This field can be populated by the PHA4GE field "exposure event". Caution: this may be sensitive information. Consult the data steward before sharing. If the information is unknown, not applicable, or cannot be shared, leave blank or provide a null value. |
| subject exposure duration | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| type exposure | exposure event | The PHA4GE field "exposure event" describes subject exposures and settings. Exposure information captured in the PHA4GE specification can be provided either in the ENA "subject exposure" or the "type exposure" field. |
| personal protective equipment | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| hospitalisation | specified as value under host health status | This information can be found as values in the "host health status" field in the PHA4GE specification, e.g. Hospitalized, Hospitalized (ICU), Hospitalized (Non-ICU). The ENA "hospitalisation" field requires yes/no values. Provide yes/no values for this field if submitting to ENA. If the information is unknown or cannot be shared, leave blank or provide a null value. |
| illness duration | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| illness symptoms | signs and symptoms | This field can be populated by the PHA4GE field "signs and symptoms". If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| collection date | sample collection date | This field can be populated by the PHA4GE field "sample collection date". Caution: the sample collection date may be considered public health identifiable information. Consult the data steward before sharing. Acceptable formats are YYYY-MM-DD, YYYY-MM, or YYYY. |
| geographic location (country and/or sea) | geo_loc (country) | This field can be populated by the PHA4GE field "geo_loc name (country)". The values in the PHA4GE pick list are derived from the INSDC country list. |
| geographic location (latitude) | geo_loc latitude | This field can be populated by the PHA4GE field "geo_loc latitude". Caution: this is likely sensitive information. Consult the data steward before sharing. Do not provide the latitude of the institution, nor the centre of the city/region where the sample was collected, as this falsely implicates an existing location. If the information is unknown or cannot be shared, leave blank or provide a null value. |
| geographic location (longitude) | geo_loc longitude | This field can be populated by the PHA4GE field "geo_loc longitude". Caution: this is likely sensitive information. Consult the data steward before sharing. Do not provide the longitude of the institution, nor the centre of the city/region where the sample was collected, as this falsely implicates an existing location. If the information is unknown or cannot be shared, leave blank or provide a null value. |
| geographic location (region and locality) | geo_loc (state/province/region) | This field can be populated by the PHA4GE field "geo_loc (state/province/region)". Caution: this may be sensitive information depending on the number of cases in this geographic area. Consult the data steward before sharing. If the information is unknown or cannot be shared, leave blank or provide a null value. |
| sample capture status | purpose of sampling | While the meanings of ENA's "sample capture status" and PHA4GE's "purpose of sampling" fields overlap, ENA provides a specific pick list of terms for populating the field. Use the ENA pick list if submitting to ENA. If the information is unknown or cannot be shared, leave blank or provide a null value. |
| host disease outcome | host disease outcome | While the meanings of ENA's "host disease outcome" and PHA4GE's "host disease outcome" fields overlap, ENA provides a specific pick list of terms for populating the field. Use the ENA pick list if submitting to ENA. If the information is unknown or cannot be shared, leave blank or provide a null value. |
| host common name | host (common name) | This field can be populated by the PHA4GE field "host (common name)". |
| host subject id | host subject ID | This field can be populated by the PHA4GE field "host subject ID". Caution: the host subject ID may be considered public health identifiable information. Consult the data steward before sharing. If unknown or considered identifiable, provide an alternative ID or a null value. |
| host age | host age | This field can be populated by the PHA4GE field "host age". Caution: the host age may be considered public health identifiable information. Consult the data steward before sharing. If the information is unknown or cannot be shared, leave blank or provide a null value. |
| host health state | host health state | While the meanings of the ENA and PHA4GE "host health state" fields overlap, ENA requires certain values in this field. If known, provide "diseased" or "healthy". If the information is unknown or cannot be shared, provide a null value. |
| host sex | host gender | While the meanings of ENA's "host sex" and PHA4GE's "host gender" fields overlap, ENA provides a specific pick list of terms for populating the field. Use the ENA pick list if submitting to ENA. Caution: the host gender may be considered public health identifiable information. Consult your data steward before sharing. If the information is unknown or cannot be shared, provide a null value. |
| host scientific name | host (scientific name) | This field can be populated by the PHA4GE field "host (scientific name)". |
| virus identifier | specimen collector sample ID | This field can be populated by the PHA4GE field "specimen collector sample ID". Caution: the sample ID may be considered sensitive information. Consult the data steward. You may need to provide an alternative ID. |
| collector name | N/A | Provide the name of the person who collected the sample. If the information is unknown or cannot be shared, provide a null value. |
| collecting institution | sample collected by | This field can be populated by the PHA4GE field "sample collected by". Caution: if the name of the lab reveals geographic information, this may be considered public health identifiable information. Consult the data steward before sharing. If the information cannot be shared, provide a null value. |
| receipt date | received date | This field can be populated by the PHA4GE field "received date". If the information is unknown or cannot be shared, leave blank or provide a null value. |
| sample storage conditions | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| definition for seropositive sample | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| serotype (required for a seropositive sample) | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| isolate | isolate | This field can be populated by the PHA4GE field "isolate". |
| strain | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| host habitat | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| isolation source host-associated | anatomical material; anatomical part; body product | This field can be populated by the PHA4GE fields "anatomical material", "anatomical part" and "body product". If the information is unknown, not applicable, or cannot be shared, leave blank or provide a null value. |
| host description | N/A | Can be left blank. |
| gravidity | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| host behaviour | N/A | If the information is unknown, not applicable, or not available, leave blank or provide a null value. |
| isolation source non-host-associated | environmental site; environmental material | This field can be populated by the PHA4GE fields "environmental site" and "environmental material". If the information is unknown, not applicable, or cannot be shared, leave blank or provide a null value. |

Fields from ENA Checklist ERC000033. The descriptions and the mapping to the respective PHA4GE fields should help you transfer your metadata from the PHA4GE table into a form acceptable for ENA submission.
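
The one-to-one rows of the mapping above lend themselves to a lookup table. The following is a minimal sketch in Python, assuming PHA4GE records are dictionaries keyed by PHA4GE field name. The names `DIRECT_MAP` and `to_ena_checklist` are hypothetical. Fields that need an ENA pick-list term (for example "host sex" or "sample capture status") are deliberately left out, because their values cannot be copied across unchanged, and sensitive fields should still be cleared with the data steward before sharing.

```python
# ENA checklist field -> PHA4GE field, for rows that map one-to-one.
DIRECT_MAP = {
    "illness symptoms": "signs and symptoms",
    "collection date": "sample collection date",
    "host common name": "host (common name)",
    "host subject id": "host subject ID",
    "host age": "host age",
    "host scientific name": "host (scientific name)",
    "virus identifier": "specimen collector sample ID",
    "collecting institution": "sample collected by",
    "receipt date": "received date",
    "isolate": "isolate",
}

# PHA4GE "host health status" values that imply hospitalisation.
HOSPITALISED = {"Hospitalized", "Hospitalized (ICU)", "Hospitalized (Non-ICU)"}


def to_ena_checklist(pha4ge):
    """Copy the one-to-one fields and derive 'hospitalisation'."""
    ena = {
        ena_field: pha4ge.get(source, "")
        for ena_field, source in DIRECT_MAP.items()
    }
    status = pha4ge.get("host health status", "")
    # Only "yes" can be derived from the listed values; anything else
    # is left blank for a curator to decide.
    ena["hospitalisation"] = "yes" if status in HOSPITALISED else ""
    return ena


if __name__ == "__main__":
    record = {
        "signs and symptoms": "cough; fever",
        "sample collection date": "2020-05-05",
        "host health status": "Hospitalized (ICU)",
    }
    print(to_ena_checklist(record))
```

Leaving "hospitalisation" blank for other health status values avoids asserting "no" for subjects whose hospitalisation history is simply not recorded.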

Extended guidance is available in the PHA4GE SARS-CoV-2 metadata specification: https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification

ENA run metadata
Populate ENA run metadata table:


PRO TIPS:
  1. If you have sequences to submit that have drastically different metadata, create a separate submission + metadata table for each case.
  2. Entering fastq filenames in the spreadsheet: On a Mac, you can copy the file names directly from the folder into a spreadsheet. This is not possible on a PC using copy and paste, but it can be done with a command-line operation.
  3. Finally, it is important to develop a QA/QC step to make sure the files are associated with the correct sample name. For example, use the LEFT function in Excel to strip off the appended text in the file name, and then use an exact match to confirm that the name matches the sample name.
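
Tips 2 and 3 can also be handled with a short script that works the same way on Mac and PC. The sketch below is a minimal example in Python, assuming that FASTQ files end in `.fastq` or `.fastq.gz` and that each file name starts with the sample name followed by "_" or ".". Both function names and the naming convention are assumptions; adjust them to your own file naming.

```python
from pathlib import Path


def list_fastq_names(folder):
    """Return the sorted FASTQ file names found directly in folder."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.name.endswith((".fastq", ".fastq.gz"))
    )


def match_sample(file_name, sample_names, separators="_."):
    """Return the sample name that file_name belongs to, or None.

    A file matches when it starts with the sample name and the next
    character is a separator, so "UT-1" does not claim "UT-12_R1.fastq".
    """
    # Try longer names first so the most specific sample wins.
    for sample in sorted(sample_names, key=len, reverse=True):
        if file_name.startswith(sample):
            next_char = file_name[len(sample):len(sample) + 1]
            if next_char in separators:
                return sample
    return None


if __name__ == "__main__":
    samples = ["UT-1", "UT-12"]
    for name in ["UT-1_R1.fastq", "UT-12_R1.fastq.gz", "UT-123_R1.fastq"]:
        print(name, "->", match_sample(name, samples))
```

Any file that returns `None` should be checked by hand before its name is entered in the run metadata table.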
As described in the PHA4GE SARS-CoV-2 EBI submission protocol, there will be an opportunity to download a template spreadsheet from ENA directly. These fields are described in the table below.

| Field | Description | Example |
| --- | --- | --- |
| Sample reference | Include the same ID here as you entered for "sample_name" in the BioSample submission template. This field can be populated by the PHA4GE field "specimen collector sample ID". | UT-12345 |
| Library name | The library name should be a unique ID relevant to your workflow. It can be an autogenerated ID from your LIMS system or a modification of your sample_name. This field can be populated by the PHA4GE field "library_id". | UT-12345.6 |
| Title | Short, free text description that identifies the data on public pages. For example: {methodology} of {organism}: {sample_name} | Amplicon-based sequencing of SARS-CoV-2: UT-12345 |
| Library strategy | Overall sequencing strategy or approach. Choose from NCBI pick list. | e.g. WGS, RNA-Seq, Amplicon |
| Library source | Molecule type used to make the library. | e.g. viral RNA, metagenomic |
| Library selection | Library capture method. | e.g. random, PCR |
| instrument model | Name of the sequencing instrument. | e.g. Illumina MiSeq, iSeq 100, GridION, MinION, PacBio Sequel II |
| Design description | Optional field for free text description of methods. | ARTIC PCR-tiling of viral cDNA (V3), sequenced on Illumina MiSeq with DNA Flex library prep-kit. Only reads aligned to SARS-CoV-2 reference (NC_045512.2) retained |
| File name | Includes files resulting from this library. This may be named "First file name" if multiple files need to be submitted. This field can be populated by the PHA4GE field "r1 fastq filename". | genome_r1.fastq (*must be exact) |
| Second file name | This field can be populated by the PHA4GE field "r2 fastq filename". This field will only be shown for certain file types (i.e. paired FASTQ). | genome_r2.fastq (*must be exact) |
| Filename3-8 | List other fastq file names (e.g. for NextSeq data). | |
| (First/Second) MD5 Checksum | MD5 checksum of the file being submitted. | 07182d8b0.... |
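
The MD5 checksum for each file can be computed with Python's standard `hashlib` module. This is a minimal sketch; the function name `md5_checksum` and the chunk size are illustrative choices. Reading in chunks keeps memory use low for large FASTQ files.

```python
import hashlib


def md5_checksum(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python md5_checksum.py genome_r1.fastq genome_r2.fastq
    for file_path in sys.argv[1:]:
        print(file_path, md5_checksum(file_path))
```

Compute the checksum on the exact file that will be uploaded (for example the compressed file, if you upload compressed FASTQ), since any change to the file changes the digest.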