Quality control assessment for microbial genomes: GalaxyTrakr MicroRunQC workflow

Candace Bias; Ruth Timme; Yesha Shrestha; Tina Pfefer; Paul Morin; Maria Balkey; Errol Strain

Jan 09, 2026

Version 2

Version 1 is forked from Quality control assessment for microbial genomes: GalaxyTrakr MicroRunQC workflow

DOI

https://dx.doi.org/10.17504/protocols.io.261ge138dv47/v2

Candace Bias¹,
Ruth Timme¹,
Yesha Shrestha²,
Tina Pfefer³,
Paul Morin⁴,
Maria Balkey³,
Errol Strain³

¹US Food and Drug Administration;
²Center for Veterinary Medicine, US Food and Drug Administration;
³Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, College Park, Maryland, USA;
⁴U.S. Food and Drug Administration, Jamaica, New York, USA

GenomeTrakr
Tech. support email: [email protected]

Maria Balkey

US Food and Drug Administration

Protocol Citation: Candace Bias, Ruth Timme, Yesha Shrestha, Tina Pfefer, Paul Morin, Maria Balkey, Errol Strain 2026. Quality control assessment for microbial genomes: GalaxyTrakr MicroRunQC workflow. protocols.io https://dx.doi.org/10.17504/protocols.io.261ge138dv47/v2

Version created by Maria Balkey

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: January 09, 2026

Last Modified: January 09, 2026

Protocol Integer ID: 238313

Keywords: WGS, Quality Control, GalaxyTrakr, GenomeTrakr, microbial pathogen surveillance, galaxytrakr microrunqc workflow, galaxytrakr microrunqc workflow purpose, quick access to genometrakr sequence quality threshold, wgs sequence quality for bacterial pathogen, quality control assessment for microbial genome, genometrakr sequence quality threshold, check against genometrakr qc, microrunqc workflow, microrunqc, genometrakr qc, microbial genome, checking wgs sequence quality, quality assessments for raw read, genometrakr, microbial pathogen, galaxytrakr, most microbial pathogen, bacterial pathogen, sequence type for each isolate, sequence type definition file, de novo assembly, available in sequence type definition file, end fastq file, pathogen, galaxytrakr account, raw read, cronobacter threshold, account in galaxytrakr, custom galaxy instance, mlst method, quality control assessment, assembly qc, added enterobacter qc, additional mlst data field, galaxytrakr upgrade, nextseq, entire miseq, miseq

Disclaimer

Please note that this protocol is public domain, which supersedes the CC-BY license default used by protocols.io.

Abstract

PURPOSE: Step-by-step instructions for checking WGS sequence quality for bacterial pathogens. The MicroRunQC workflow, implemented in a custom Galaxy instance, produces quality assessments for raw reads (Illumina paired-end fastq files) and draft de novo assemblies, and reports the sequence type for each isolate. The workflow works on most microbial pathogens, so we advise laboratories to run their entire MiSeq/NextSeq run through it.

SCOPE: This protocol covers the following tasks:

1. Quick access to GenomeTrakr sequence quality thresholds by organism
2. Create a GalaxyTrakr account
3. Set up an account in GalaxyTrakr
4. Create a new history/workspace
5. Upload data
6. Execute the MicroRunQC workflow
7. Interpret the results - check against GenomeTrakr QC thresholds

Version updates:
V7: Edits to incorporate GalaxyTrakr upgrades and new interface.  
V6: Minor edits, including section reorganization and addition of clarifying notes
V5: New column in the output table to capture additional mlst data fields when available in Sequence Type definition files (not available for all species)
V4: MicroRunQC updated to V1.1. Includes updates to SKESA and mlst methods, as well as adjusted assembly QC thresholds for E. coli. Added Enterobacter QC thresholds to threshold table.
V3: updated with Cronobacter thresholds

Quick Access to QC Benchmarks

This protocol will walk the user through various aspects of the quality assessment of bacterial genome sequences, from setting up a GalaxyTrakr account to the quality control (QC) benchmarks GenomeTrakr uses for its sequencing efforts. For quick access, GenomeTrakr QC benchmarks are included in the table below. 

These are also relevant for NARMS and VetLIRN contributors. 

*MicroRunQC users should follow QC threshold guidelines established by their respective surveillance coordinating body(s).
 
Quality metric	Salmonella	Listeria	E. coli	Shigella	Campylobacter	Vibrio para.	Cronobacter	Enterococcus faecium	Enterococcus faecalis
Average read quality Q score for R1 and R2	>=30	>=30	>=30	>=30	>=30	>=30	>=30	>=30	>=30
Average coverage	>=30X	>=20X	>=40X	>=40X	>=20X	>=40X	>=20X	>=50X	>=40X
De novo assembly: Seq. length (Mbp)	~4.3-5.2	~2.7-3.2	~4.5-5.9	~4.0-5.0	~1.5-1.9	~4.8-5.5	~4-5	~2.5-3.5	~2.5-3.25
De novo assembly: no. contigs	<=300	<=300	<=400	<=550	<=300	<=300	<=500	<=350	<=200
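
The benchmarks above can also be applied programmatically once the MicroRunQC metrics are in hand. The following is a minimal Python sketch and is not part of the MicroRunQC workflow: the dictionary layout, the function name `check_isolate`, and the decision to treat the approximate ("~") sequence-length ranges as hard limits are assumptions made for illustration. The threshold values are transcribed from the table.

```python
# Thresholds transcribed from the GenomeTrakr table above.
# Tuple layout (an assumption of this sketch):
#   (min mean Q, min coverage, (min Mbp, max Mbp), max contigs)
THRESHOLDS = {
    "Salmonella": (30, 30, (4.3, 5.2), 300),
    "Listeria": (30, 20, (2.7, 3.2), 300),
    "E. coli": (30, 40, (4.5, 5.9), 400),
    "Shigella": (30, 40, (4.0, 5.0), 550),
    "Campylobacter": (30, 20, (1.5, 1.9), 300),
    "Vibrio para.": (30, 40, (4.8, 5.5), 300),
    "Cronobacter": (30, 20, (4.0, 5.0), 500),
    "Enterococcus faecium": (30, 50, (2.5, 3.5), 350),
    "Enterococcus faecalis": (30, 40, (2.5, 3.25), 200),
}


def check_isolate(organism, mean_q_r1, mean_q_r2, coverage, length_bp, contigs):
    """Return a list of QC failures; an empty list means every check passed."""
    min_q, min_cov, (min_mbp, max_mbp), max_contigs = THRESHOLDS[organism]
    failures = []
    if mean_q_r1 < min_q:
        failures.append(f"R1 mean Q {mean_q_r1} < {min_q}")
    if mean_q_r2 < min_q:
        failures.append(f"R2 mean Q {mean_q_r2} < {min_q}")
    if coverage < min_cov:
        failures.append(f"coverage {coverage}X < {min_cov}X")
    # The table gives approximate ranges; this sketch treats them as hard limits.
    length_mbp = length_bp / 1_000_000
    if not (min_mbp <= length_mbp <= max_mbp):
        failures.append(f"length {length_mbp:.2f} Mbp outside {min_mbp}-{max_mbp} Mbp")
    if contigs > max_contigs:
        failures.append(f"{contigs} contigs > {max_contigs}")
    return failures
```

For example, `check_isolate("Salmonella", 36.6, 35.7, 34.4, 4832365, 37)`, using the metrics of the Salmonella isolate in the example report later in this protocol, returns an empty list. Laboratories following a different coordinating body would substitute that body's thresholds.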
 

Account set up

Create a GalaxyTrakr account here: https://account.galaxytrakr.org/Account/Register
     
Note
This is a more detailed form than what is available by clicking "Register here" on the genometrakr.org main page. Please use the form linked here when creating a new account.

Log into your GalaxyTrakr account: https://galaxytrakr.org

Create a new history

Create a new history. 

We recommend creating a new history for each new MiSeq run and including the flow-cell ID and date in the history name.

Save your MicroRunQC output here, along with any other relevant analyses such as serotyping or AMR detection.

After all the analysis output from this run is saved to your internal data network or computer, older histories should be purged/deleted so as not to occupy the limited storage space in your account. In some cases it may be useful to save, for a limited time, multiple histories or to run analyses concurrently in multiple histories. In these cases you need to pay attention to your % usage bar (shows % used of allocated storage space) in the upper right corner of the GalaxyTrakr page. If you need additional space you can contact [email protected] and request additional storage.

Click on the + icon in the upper right History panel

Name your new history by clicking on the "Unnamed history" text, typing in the desired name, and pressing Enter. We recommend including the flow-cell ID and the date the run was started.

Upload data

This section will describe the process for uploading raw fastq files into your active History panel. After the files have been uploaded they will stay in your account until they are deleted. 

Click on the Upload icon in the GalaxyTrakr menu to start an upload process.

Select "Type (set all): auto-detect". Click the "Choose local files" button, navigate to the desired fastq files, then click "Start" to upload them. These files should be paired (two per sample/isolate).

As the file uploads complete, each row will turn green. Samples in yellow are still in process.

You have just uploaded a set of forward and reverse reads. For further analysis, these files need to be paired properly so the platform knows which R1 and R2 files go together. GalaxyTrakr does this by creating a List of Dataset Pairs.

Within your newly created History panel, click the check mark box, then select all the files you just uploaded.

Screenshot of History panel showing recently uploaded files. Note the way the files are named, using R1 and R2 to identify the paired reads. This will be important in the next step. Some naming conventions can be slightly different.

Click "Select All", choose "Advanced Build List", and select "List of Paired Datasets".

A new window will open to help you pair the fastq files properly.  By default _R1 and _R2 are the selected options to pair fastq files. Note how your paired reads are named.

Click on the filter icon if "_R1" and "_R2" need to be replaced by a different suffix.

Click Next and double-check the names of the datasets. If the names are accurate, name the List of Paired Datasets and click Build.

This List of Paired Datasets will be available for analysis in your history panel. You can run multiple analyses on the same dataset rather than upload the same sequence data to a new history to perform additional analyses. This will help you use your allocated storage space efficiently.
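
Before uploading, it can help to confirm locally that every sample has both a forward and a reverse read file, since the pairing step above relies on the "_R1"/"_R2" naming. The sketch below is one possible way to do that in Python. The function name `pair_fastq_names` and the file names in the usage note are hypothetical, and the matching is a simple substring test, so it does not reproduce GalaxyTrakr's own pairing logic.

```python
def pair_fastq_names(filenames, r1_tag="_R1", r2_tag="_R2"):
    """Group file names into (R1, R2) pairs keyed by the text before the tag.

    Returns (paired, unpaired): a dict of sample -> (R1 file, R2 file) and a
    sorted list of files that could not be paired.
    """
    samples = {}
    unpaired = []
    for name in filenames:
        for read, tag in (("R1", r1_tag), ("R2", r2_tag)):
            if tag in name:
                # Everything before the last occurrence of the tag is the sample name.
                sample = name.rpartition(tag)[0]
                samples.setdefault(sample, {})[read] = name
                break
        else:
            # Neither tag was found in this file name.
            unpaired.append(name)

    paired = {}
    for sample, reads in samples.items():
        if "R1" in reads and "R2" in reads:
            paired[sample] = (reads["R1"], reads["R2"])
        else:
            unpaired.extend(reads.values())
    return paired, sorted(unpaired)
```

Given the hypothetical list `["iso1_R1.fastq.gz", "iso1_R2.fastq.gz", "iso2_R1.fastq.gz"]`, the function pairs the two `iso1` files and reports `iso2_R1.fastq.gz` as unpaired. If your files use a different suffix, pass it through `r1_tag` and `r2_tag`, mirroring the filter option in the pairing window.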

Run the MicroRunQC workflow

Add the MicroRunQC workflow to your own "Workflows" panel.  You only have to do this step once for each new workflow you need.

Navigate to the "Workflow" tab on the main menu, click "Public workflows", and search for "MicroRunQC_v1.2".

Click "Import" to select MicroRunQC.

To see the newly imported workflow, click the "Workflow" tab on the main menu, then click "My Workflows".

Click the star to bookmark this workflow. 

Click the play icon to run MicroRunQC_v1.2 and select the dataset collection that was created earlier.

Click Run Workflow. This can take some time depending on the number of samples you are analyzing. If you choose to, you can log out of GalaxyTrakr and log back in later to see whether the job has completed.

Upon completion of the pipeline all tiles in the History pane will be green and each of the steps in the pipeline will show "completed".

In the “MicroRunQC_v1_2_Report” tile, click on the “Eye” icon to view the output table in the GalaxyTrakr window.

Interpret the results

Download and interpret the results:

Click MicroRunQC_v1.2_Report and then the floppy disc icon. The tabular file can be opened in a text reader or converted to a format (.txt) that can be opened in Excel.

 The MicroRunQC output file includes the following columns:

Parameter	Input	Description
Contigs	Assembly	Number of contigs in the de novo SKESA assembly. Contigs smaller than 200 base pairs (bp) are not counted.
Length	Assembly	Total length of all contigs > 200 bp. This should approximate the size of the genome for the target organism.
EstCov	Assembly	Mean coverage for contigs in the SKESA assembly.
N50	Assembly	Sequence length of the shortest contig at 50% of the total genome length
MedianInsert	Read	Distance between forward and reverse reads. Calculated by mapping reads to SKESA assembly using bwa.
MeanLength_R1	Read	Mean length of forward read
MeanLength_R2	Read	Mean length of reverse read
MeanQ_R1	Read	Mean Q-score of forward read
MeanQ_R2	Read	Mean Q-score of reverse read
Scheme	Assembly	PubMLST scheme name (output from the mlst application, which scans contig files against traditional PubMLST typing schemes).
ST	Assembly	Sequence Type
MLST extra	Assembly	e.g. Listeria clonal complex info
Loci	Assembly	gene (allele number) – for example aroC(118)
MicroRunQC output table headers. This table lists the summary metrics for sequence quality, number of contigs, and estimated genome size, along with other common metrics for reads (Median Insert Size and Mean Length) and assemblies (N50). Additionally, if the Multi-Locus Sequence Type (MLST) for the isolate is available from PubMLST, the workflow also reports the Sequence Type (ST) and the associated alleles.
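
To make the Contigs, Length, and N50 definitions concrete, the short Python sketch below computes them from a list of contig lengths. It is an illustration of the definitions in the table, not the code MicroRunQC runs. The function name `assembly_stats` is hypothetical, and keeping contigs of exactly 200 bp is an assumption, since the table describes the cutoff both as "smaller than 200 bp are not counted" and as "> 200 bp".

```python
def assembly_stats(contig_lengths, min_len=200):
    """Compute contig count, total length, and N50 for contigs >= min_len bp."""
    # Discard short contigs, then sort longest first.
    kept = sorted((n for n in contig_lengths if n >= min_len), reverse=True)
    total = sum(kept)

    # N50: walk down from the longest contig until the running sum reaches
    # half of the total length; the contig that gets us there is the N50.
    n50 = 0
    running = 0
    for n in kept:
        running += n
        if running * 2 >= total:
            n50 = n
            break
    return {"Contigs": len(kept), "Length": total, "N50": n50}
```

With the toy input `[500, 400, 300, 200, 100, 50]`, the two shortest contigs are dropped, leaving 4 contigs totaling 1400 bp. The two longest contigs (500 + 400 = 900 bp) already cover at least half of that total, so the N50 is 400.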

*MLST extra: Additional data fields reported when available in Sequence Type definition files (not available for all species):
1. clonal_complex – sequences grouped by similarity to a central allelic profile (e.g., Campylobacter ST-21 complex)
2. CC – abbreviation of clonal_complex used for organisms such as Listeria (ST profiles are maintained by different groups)
3. Lineage – Listeria monocytogenes lineage (I, II, III, or IV); other Listeria species are also reported here (e.g., L. innocua)
4. species – e.g., Vibrio alginolyticus
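
For downstream use, the "MLST extra" and loci cells can be split into structured values. The sketch below assumes the formats seen in the example report in this protocol, that is, comma-separated `key=value` pairs such as `CC=CC5,Lineage=I` and loci written as `gene(allele)`. The function names are hypothetical, and alleles are kept as text because the example output includes annotated values such as `lhkA(~7)`.

```python
import re


def parse_mlst_extra(field):
    """Split 'CC=CC5,Lineage=I' into {'CC': 'CC5', 'Lineage': 'I'}.

    Cells without any key=value pair (empty or '-') give an empty dict.
    """
    result = {}
    for item in field.split(","):
        key, sep, value = item.partition("=")
        if sep:
            result[key.strip()] = value.strip()
    return result


# gene name, then everything inside the parentheses as the allele text
LOCUS_RE = re.compile(r"^(\w+)\((.+)\)$")


def parse_locus(cell):
    """Split 'aroC(14)' into ('aroC', '14'); return None if the cell does not match."""
    match = LOCUS_RE.match(cell.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)
```

For instance, `parse_mlst_extra("CC=CC1489,Lineage=L. innocua")` yields a dictionary with the keys `CC` and `Lineage`, and `parse_locus("lhkA(~7)")` yields the gene name `lhkA` with the allele text `~7` left unchanged.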
 
**This output should be saved either to your LIMS or to a spreadsheet linked to the sequencing run and samples.
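
One way to get the downloaded tabular report into a spreadsheet or LIMS import is to convert it to CSV. The following Python sketch uses only the standard `csv` module. The function name `tsv_to_csv` and the file names in the comment are hypothetical, and the sketch assumes the report is plain tab-separated text as described above.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-separated report text to CSV text.

    Rows shorter than the widest row are padded with empty cells so that
    every line has the same number of columns.
    """
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(row + [""] * (width - len(row)))
    return out.getvalue()


# Example with hypothetical file names:
# with open("MicroRunQC_report.tabular") as src, \
#         open("MicroRunQC_report.csv", "w", newline="") as dst:
#     dst.write(tsv_to_csv(src.read()))
```

Using a CSV writer rather than replacing tabs with commas matters here because the "MLST extra" field can itself contain commas (e.g. `CC=CC5,Lineage=I`); the writer quotes such cells so they stay in a single column.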

Example output for 1 Salmonella and 5 Listeria isolates. 
 
Strain ID	Lab Confirmation
FDA1216271-C001-001	Listeria mono
FDA817806-S073-001	Listeria mono
FDA746634	Listeria mono
FDA1213377-C001-002	Listeria grayi
FDA933376-S060-005	Listeria innocua
FDA1213835-C001-001	Salmonella
Lab confirmed IDs for 6 isolates
 
 
File	Contigs	Length	EstCov	N50	Median Insert	Mean Length_R1	Mean Length_R2	Mean Q_R1	Mean Q_R2	Scheme	ST	MLST extra
FDA1216271-C001-001	16	2911949	36.7	476210	321	148.4	148.4	36.4	34.6	listeria_2	5	CC=CC5,Lineage=I	abcZ(2)	bglA(1)	cat(11)	dapE(3)	dat(3)	ldh(1)	lhkA(7)
FDA817806-S073-001	20	3068354	179.6	525438	329	234.7	235.2	36.7	31.9	listeria_2	321	CC=CC321,Lineage=II	abcZ(5)	bglA(6)	cat(8)	dapE(62)	dat(6)	ldh(7)	lhkA(34)
FDA746634	30	3052888	41.4	293947	320	148.4	148.4	36.5	36	listeria_2	-		abcZ(2)	bglA(1)	cat(11)	dapE(3)	dat(3)	ldh(1)	lhkA(~7)
FDA1213377-C001-002	20	2672180	155.1	473181	270	147.3	147.3	37.2	36.1	-	-
FDA933376-S060-005	9	2881869	213	1498790	303	232.1	232.2	37	36.2	listeria_2	1489	CC=CC1489,Lineage=L. innocua	abcZ(250)	bglA(21)	cat(83)	dapE(298)	dat(20)	ldh(458)	lhkA(216)
FDA1213835-C001-001	37	4832365	34.4	294936	354	149	149	36.6	35.7	senterica_achtman_2	214		aroC(14)	dnaN(72)	hemD(21)	hisD(12)	purE(6)	sucA(19)	thrA(15)
MicroRunQC example report showing mlst ST results for different Listeria species.
  
The mlst Listeria database includes multiple species, including Listeria monocytogenes and L. innocua.  When available, the Listeria clonal complex (CC) or L. monocytogenes lineage is listed alongside the ST.

For quality control threshold guidelines for the GenomeTrakr surveillance network, see the table in the Quick Access to QC Benchmarks section. These are also relevant for NARMS and VetLIRN contributors.

*MicroRunQC users should follow QC threshold guidelines established by their respective surveillance coordinating body(s).

