Exploring isolate data using PubMLST

Anastasia Unitt

May 04, 2026

DOI

https://dx.doi.org/10.17504/protocols.io.14egnyoxpv5d/v1

Anastasia Unitt¹

¹University of Oxford

External link: https://pubmlst.org/

Protocol Citation: Anastasia Unitt 2026. Exploring isolate data using PubMLST. protocols.io https://dx.doi.org/10.17504/protocols.io.14egnyoxpv5d/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it is based on tools currently hosted on PubMLST.org as of May 2026. Please feel free to contact me with any questions.

Created: May 06, 2025

Last Modified: May 04, 2026

Protocol Integer ID: 217775

Keywords: PubMLST, Neisseria, gonorrhoeae, gonorrhoea, genetics, genomics, MLST, analysis tools, plugins, grapetree, genome comparator, exporting data, bioinformatics, cgMLST, LIN code, typing, locus, alleles, neisseria gonorrhoeae isolate, exploring isolate data, isolate data, pubmlst database, public isolate, using pubmlst, pubmlst this protocol, own isolate, database, analysis

Funders Acknowledgements:

Nuffield Department of Population Health

Disclaimer

The advice provided here is purely based on my own experience and is not formally associated with PubMLST. 

Abstract

This protocol describes the basics of using the PubMLST database to examine Neisseria gonorrhoeae isolates. This will allow you to explore the database using a variety of different parameters and search functions, and to access various plugins and tools to analyse any given dataset. You can examine public isolates only, of which there are thousands already in the database, or upload your own isolates privately to analyse independently. Version 1 published May 2026.

Before start

Please read (and cite) the most recent PubMLST article, e.g. DOI: 10.12688/wellcomeopenres.14826.1

To access the full public database you will need to create a free account at https://pubmlst.org/bigsdb?page=registration and then register for the Neisseria isolate and typing database from your account page.

PubMLST basics

Multi-locus Sequence Typing & PubMLST
In multi-locus sequence typing (MLST), alleles are the unit of comparison, as opposed to individual nucleotide (DNA) sequence polymorphisms. Alleles are variants of a gene: each unique DNA sequence of the same gene (or gene fragment) is a different allele. In PubMLST, each allele of a given gene is assigned a number. The combination of alleles across a list of genes (a "typing scheme") is an allelic profile. Each unique allelic profile is called a sequence type (ST), which can be assigned a number, e.g. ST 1179, as shown below.

For this isolate (id 138) gene abcZ has allele number 59, gene adk allele 39, aroE allele 67, and so on. This combination of alleles has been designated the ST 1179. 
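The relationship between sequences, allele numbers, allelic profiles and STs can be sketched in a few lines of Python. This is only a toy illustration of the concept, not how PubMLST assigns numbers internally: the loci, sequences, allele numbers and ST numbers below are all made up, and real schemes number alleles in the order they were submitted and curated.

```python
# Toy illustration of allele numbering and sequence types (STs).
# All loci, sequences, allele numbers and ST numbers here are hypothetical.

def assign_allele_numbers(sequences):
    """Give each unique sequence the next free number, in order of first appearance."""
    numbers = {}
    for seq in sequences:
        if seq not in numbers:
            numbers[seq] = len(numbers) + 1
    return numbers

# Three isolates, two loci; each value is a made-up sequence for that locus.
isolates = {
    "isolate_A": {"locus1": "ATGAAA", "locus2": "ATGCCC"},
    "isolate_B": {"locus1": "ATGAAA", "locus2": "ATGCCC"},
    "isolate_C": {"locus1": "ATGAAG", "locus2": "ATGCCC"},
}
scheme = ["locus1", "locus2"]  # the "typing scheme": an ordered list of loci

# Alleles are numbered separately for each locus.
allele_numbers = {
    locus: assign_allele_numbers(rec[locus] for rec in isolates.values())
    for locus in scheme
}

# An allelic profile is the combination of allele numbers across the scheme.
profiles = {
    name: tuple(allele_numbers[locus][rec[locus]] for locus in scheme)
    for name, rec in isolates.items()
}

# Each unique allelic profile gets its own ST number.
st_numbers = {}
for profile in profiles.values():
    if profile not in st_numbers:
        st_numbers[profile] = len(st_numbers) + 1

sequence_types = {name: st_numbers[profile] for name, profile in profiles.items()}
print(profiles)
print(sequence_types)
```

Isolates A and B share every allele and therefore the same ST, while a single nucleotide difference at one locus gives isolate C a new allele number and a different ST.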

Typing schemes are very variable in size. Conventional MLST for N. gonorrhoeae uses only seven genes in its typing scheme (see above figure), while core genome MLST uses >1000 genes. 
Some sequence typing schemes are not intended to represent inheritance and instead aim to catalogue features such as antimicrobial resistance (AMR) genotypes.
Typing schemes may use whole coding sequences, or only sub-sequences (fragments) of a gene (as is the case with seven-locus MLST, used in the above figure).

For more detail on the concept of MLST please read: Maiden, M.C.J., et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nature Reviews Microbiology 2013;11(10):728-736.

The PubMLST database
All PubMLST species/genus databases have two principal aspects: an isolate database, where sequence data and gene annotations are stored alongside metadata such as year or country of origin, and a typing database, where genes/loci are indexed, along with all the different alleles for each of those genes. 

In other words, the isolate database shows you the data from the isolate perspective, the typing database from the locus perspective. 

The database includes both partial and whole genome sequence (WGS) data. Much of the WGS data is in the form of draft genomes: where the whole genome has been sequenced but not stitched together into one “complete” genome sequence. Instead, the WGS data is stored as a series of contigs (“contiguous” segments).

The genus Neisseria has its own database within PubMLST. This should include only members of this genus, although be warned some contaminated sequences may be included. We will discuss how to exclude those later. 

If you navigate to the Neisseria database via the Organisms tab, you will see this page:

The bottom left box can be clicked to access the typing database, the middle box can be clicked to access the isolate database. The right box describes the number of whole genomes found within the isolate database.

The typing database

The Neisseria typing database allows you to examine data from the locus perspective, e.g. by BLASTing a sequence to find its corresponding locus in PubMLST, or to examine all the alleles at a specific locus using its NEIS number or gene name. Please note these loci and alleles are indexed for all Neisseria species together. Many genes are shared between the species, and in some cases the same alleles may be found across the genus, while other genes may have species-specific alleles. 

Side note: genes in PubMLST
Within PubMLST, each full length coding sequence is assigned a unique number with the prefix NEIS. For example, the gene penA encoding PBP2 is known in PubMLST as NEIS1753. 

Each unique allele of that gene is assigned its own allele number (e.g. there are over 5000 unique alleles for NEIS1753 in the PubMLST Neisseria database). Some alleles do not encode a functional version of the gene because they contain internal stop codons, so keep in mind that an isolate being tagged as having a locus present does not necessarily mean that the gene is functional.
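If you export allele sequences and want a quick check for internal stop codons, something like the following Python sketch could be a starting point. It assumes the sequence is a full coding sequence, in frame from its first base, and uses the three standard stop codons; the two example sequences are hypothetical and far shorter than any real allele.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def internal_stop_positions(cds):
    """Return 0-based codon indices of stop codons found before the final codon.

    Assumes `cds` is in frame from its first base; any trailing partial codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]

intact = "ATGAAACCCGGGTAA"     # hypothetical allele: stop codon only at the end
truncated = "ATGAAATAGGGGTAA"  # hypothetical allele: premature TAG at codon index 2

print(internal_stop_positions(intact))
print(internal_stop_positions(truncated))
```

An empty list means no premature stop was found in that reading frame. It does not prove the gene is functional, since frameshifts and other disruptions are not checked here.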

Many alleles are assigned in an automated manner, but some have complexities that require the involvement of a human - we call the process of manually defining and tidying up alleles curation. 

Be aware that some typing schemes do not use whole genes and instead use a sub-sequence - these gene fragments will not have a NEIS number. Instead they may only have a three letter name, such as pgm, used in 7-locus MLST. 

If you want to find the NEIS number for a certain gene, the most reliable way is to BLAST a known sequence of that gene against the database. Alternatively, you can search for the gene name in the "by locus" aspect of the database.

The most current typing-based method for analysing gonococcal lineages in PubMLST (as of May 2025) is LIN code. The gonococcal LIN code uses an updated typing scheme, Ng cgMLST v2, of 1430 core loci to barcode isolates. This method is more effective than 7-locus MLST in the face of extensive horizontal gene transfer (HGT).

For more detail on the N. gonorrhoeae LIN code please see: Anastasia Unitt, Made A Krisna, Kasia M Parfitt, Keith A Jolley, Martin CJ Maiden, Odile B Harrison (2025). Neisseria gonorrhoeae LIN codes provide a robust, multi-resolution lineage nomenclature. eLife. https://doi.org/10.7554/eLife.107758.3

The isolate database

When you first load the isolate database, you will likely see a dashboard summarising the contents of the database like this. You can switch this to a simple index using the purple tab on the top right of the page. To search, click the + symbol next to the search button on the right hand side of the screen, and select "search database".

As you can see there is a lot of publicly available WGS data in the database. You can upload your own via the submissions button on the right hand side, or via the private data button if you want to keep the data out of public view prior to publication. Remember, the Neisseria isolate database includes multiple different species from within the genus, but is predominantly made up of N. meningitidis and N. gonorrhoeae.

The isolate search page
Initially you may encounter a very limited looking search page. To expand this, click the dark purple "modify form" tab with the spanner icon at the top right of the page. You will also want to ensure the tooltips are enabled by clicking the tooltips tab.

Your expanded search form should look something like this. Now you have far more options to search the database with (please note that keeping all search options selected may make the form slower to load). If you have a known list of isolates you want to search for, use the attribute values box. 
Using the isolate search page you can interrogate the database in any number of ways. However, if you want to analyse exclusively gonococci, your first step will be to select "species" under "isolate provenance/primary metadata fields", and then enter "Neisseria gonorrhoeae", as shown below.

By clicking the + icon next to this field you can add other fields such as year or continent to narrow down your dataset, as shown below.

You can use the same part of the form to search for specific IDs or isolate names. Other parts of the form allow you to search for isolates that belong to a particular publication, by a particular allele at a specific locus, by sequence quality (using "sequence bin") or by "tagged sequence status" i.e. whether a particular locus is tagged or not. You can also search by all kinds of metadata (where it has been made available) for example by MIC, country, year etc.

- An aside -
If you have a pre-defined list of isolates you want to search for, you have a few options. If you know the bioproject accession, you can look for this under the isolate provenance fields. If you know that the isolate names all follow the same convention, e.g. include the characters "T2MSM", you can use the isolate provenance search box, select the field "isolate" and "CONTAINS", and then search by these characters. You should be cautious with these approaches, as isolates may be missing metadata such as bioproject, may have isolate names formatted differently to how you expect, or other isolates may share the same naming convention despite not being from the dataset you are looking for. You can also paste a list of isolate ids or names in the attribute values search box to return multiple at once rather than searching one by one.

You can use the "allele designations/scheme fields" box to search for isolates by sequence type, e.g. MLST, or to search for isolates belonging to the same LIN code lineage. To do that, select LIN code (N. gonorrhoeae cgMLST v2) from the drop down box, change the middle box to "starts with" and then type the start of the relevant LIN code.
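The same kind of prefix matching can be repeated offline on exported LIN codes. The Python sketch below assumes LIN codes are written as integers separated by underscores; the isolate names and codes are hypothetical and shorter than real gonococcal LIN codes. It compares whole fields rather than raw text, so that a prefix ending in 2 does not also pick up a lineage whose next field is 20.

```python
def parse_lin(code):
    """Split a LIN code such as '0_1_2' into a tuple of integers."""
    return tuple(int(part) for part in code.split("_"))

def has_lin_prefix(code, prefix):
    """True if the leading fields of `code` equal the fields of `prefix`."""
    wanted = parse_lin(prefix)
    return parse_lin(code)[:len(wanted)] == wanted

# Hypothetical isolates and LIN codes, for illustration only.
lin_codes = {
    "isolate_A": "0_1_2_0_0",
    "isolate_B": "0_1_20_3_1",
    "isolate_C": "0_1_2_5_0",
    "isolate_D": "3_0_0_0_0",
}

matches = sorted(name for name, code in lin_codes.items()
                 if has_lin_prefix(code, "0_1_2"))
print(matches)
```

Here only isolates A and C share the prefix 0_1_2. A plain text comparison such as `str.startswith` would also return isolate B, which belongs to a different lineage at the third position.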

By using the various search functions in conjunction you can narrow down the wider database to a smaller selection of isolates. For example, you can extract only isolates from 2010 belonging to MLST 1901, or isolates from North America with allele 1302 of NEIS1922, and NEIS1930 tagged. The combinations are almost endless and depend entirely on your research question. You can also use filters, for example by publication.

Analysing an isolate dataset

Select a set of isolates to search. For example, from the publication drop down box select Alfsnes et al. 2020 Microb Genom. Click search, then scroll to the bottom of the page. Here you will find the analysis tools.

You can use the Breakdown "Fields"/"Two Fields"/"Combinations" buttons to generate some quick tables and figures describing your dataset, for example by year, MIC or country. Breakdown "Sequence bin" will allow you to export a contig analysis including numbers of contigs, GC% and other statistics.  
Under the Analysis options, "Genome Comparator" can do lots of things, including generating alignments, distance matrices and other analyses. "rMLST species id" will break down the species assignment of each isolate.

Export "dataset" allows you to extract the data on your isolates, including metadata, allele assignments, STs and LIN codes, to spreadsheets or similar for use in outside analysis software. You can also extract the contigs (through Export "Contigs") or specific sequences (Export "Sequences") to use in your own alignment software.
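Once a dataset has been exported, a quick breakdown similar to the "Fields" and "Two Fields" tools can be reproduced outside PubMLST. The Python sketch below assumes the export was saved as tab-delimited text with a header row. The column names (id, isolate, country, year) and all rows are placeholders, so check them against your own export. An in-memory string stands in for the file so that the example runs on its own.

```python
import csv
import io
from collections import Counter

# Stand-in for a tab-delimited export file; every row here is made up.
exported = io.StringIO(
    "id\tisolate\tcountry\tyear\n"
    "1\texample_1\tNorway\t2016\n"
    "2\texample_2\tNorway\t2017\n"
    "3\texample_3\tUK\t2017\n"
    "4\texample_4\tUK\t\n"
)

rows = list(csv.DictReader(exported, delimiter="\t"))

def breakdown(records, field):
    """Count records per value of one field; empty cells are grouped as 'No value'."""
    return Counter(rec[field] or "No value" for rec in records)

# One-field and two-field breakdowns.
by_year = breakdown(rows, "year")
by_country_year = Counter(
    (rec["country"] or "No value", rec["year"] or "No value") for rec in rows
)

for year, count in sorted(by_year.items()):
    print(year, count)
```

To run this on a real export, replace the `io.StringIO(...)` stand-in with `open("your_export.txt", newline="")`, where the file name is whatever you saved the export as.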

The Third party analysis tools are also very useful. "Grapetree" allows you to generate a minimum spanning tree which you can then annotate with various metadata to quickly visualise the clustering of your isolates based on the alleles at a particular locus (e.g. an AMR determinant) or the core genome (Ng cgMLST v2). "iTOL" will allow you to generate a neighbour joining tree to visualise and edit in iTOL: caution should be used here for gonococcal analyses, as this tree is not corrected for recombination and as such may be misleading. "Microreact" is a flexible tool for exploring a dataset geographically and temporally.

Things to consider when creating a dataset
These points apply, for example, if you are creating a dataset of global gonococci in which to contextualise your own isolates.

Sequence quality. Poor quality sequence data may be contaminated by other species, may include more than one strain of gonococci, or may have many loci interrupted by the ends of contigs or inaccurate base calls. Use rMLST species id https://pubmlst.org/species-id to assess the species of a dataset, and use Sequence bin breakdown to assess contig number, GC% and total length. Generally, a good quality gonococcal genome assembly has fewer than 400 contigs. GC% is on average 52.5; higher or lower values could indicate contamination. Total length is generally 2.1 Mbp; a higher value may indicate poor sequencing quality or a mixed-strain isolate. You can confirm mixed-strain isolates by looking at highly variable loci and seeing if more than one allele has been assigned.
Duplicated isolate records. Isolate names are not always unique, and may not be used consistently. Use a unique identifier such as the run accession to make sure that your dataset does not include duplicate records of the same isolate. You may also want to check for multiple sequencing runs of the same isolate; this can occur where a study has sequenced gonococci from multiple sites sampled within the same patient or in a direct transmission event between two partners. Whether this matters for your analysis depends on your research question.
Lab strains. Lab strains may not represent wild gonococci, particularly strains like FA1090, FA19 and MS11 which were collected some time ago and belong to rare lineages. 
Metadata completeness. If you are interested only in isolates with confirmed resistance to an antibiotic you may want to filter out isolates that don't have MIC data available. 
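The sequence quality and duplicate checks above can be applied to an exported table with a short script. The Python sketch below is one possible approach under several assumptions. The contig, GC% and length reference values come from the guidance above, but the tolerance windows around them are arbitrary choices of mine. The field names (contigs, gc_percent, total_length, run_accession) and the records are hypothetical. Length is flagged in both directions here, although the guidance above only mentions unusually long assemblies.

```python
MAX_CONTIGS = 400            # from the guidance above
EXPECTED_GC = 52.5           # percent, from the guidance above
EXPECTED_LENGTH = 2_100_000  # bp, from the guidance above
GC_TOLERANCE = 1.0           # assumption: tune to your own data
LENGTH_TOLERANCE = 200_000   # assumption: tune to your own data

def quality_flags(rec):
    """Return a list of reasons a record may need a closer look (empty if none)."""
    flags = []
    if rec["contigs"] >= MAX_CONTIGS:
        flags.append("many contigs")
    if abs(rec["gc_percent"] - EXPECTED_GC) > GC_TOLERANCE:
        flags.append("unusual GC%")
    if abs(rec["total_length"] - EXPECTED_LENGTH) > LENGTH_TOLERANCE:
        flags.append("unusual length")
    return flags

def duplicated_accessions(records):
    """Map each run accession that appears more than once to the record ids using it."""
    seen = {}
    for rec in records:
        accession = rec.get("run_accession")
        if accession:  # records without an accession cannot be checked this way
            seen.setdefault(accession, []).append(rec["id"])
    return {acc: ids for acc, ids in seen.items() if len(ids) > 1}

# Hypothetical records standing in for an exported sequence bin breakdown.
records = [
    {"id": 1, "contigs": 85, "gc_percent": 52.4, "total_length": 2_150_000, "run_accession": "RUN0001"},
    {"id": 2, "contigs": 612, "gc_percent": 49.8, "total_length": 2_650_000, "run_accession": "RUN0002"},
    {"id": 3, "contigs": 90, "gc_percent": 52.6, "total_length": 2_120_000, "run_accession": "RUN0001"},
]

flagged = {}
for rec in records:
    flags = quality_flags(rec)
    if flags:
        flagged[rec["id"]] = flags
duplicates = duplicated_accessions(records)

print(flagged)
print(duplicates)
```

A flag is a prompt to inspect the record, for example with rMLST species id, rather than an automatic reason to exclude it.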

Bookmarking or making projects

To save your dataset for quick access in the future you can use bookmarks or add it to a project. However, to access these options you must make a free PubMLST account. Signing up only takes a minute and also allows you to access all public data in PubMLST.

Side note: the curator interface
You may have noticed a red tab labelled "curator interface". Both the isolate and typing databases have a public-facing user interface, and a private curator interface. Here, curators can make changes to the database such as manually adding new alleles, or creating new schemes for sequence typing. I intend to write a curator protocol for these processes. 

Curation access can only be granted after contact with PubMLST administrators. 

Summary

By reading this protocol, and by following along by clicking through the PubMLST database, you should have an improved ability to navigate the website and use its tools. 

Protocol references

Jolley, K. A., et al. (2018). "Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications." Wellcome Open Research 3: 124. DOI: 10.12688/wellcomeopenres.14826.1

Jolley, K. A., https://pubmlst.org/ [accessed MAY 2025]

Jolley, K. A., https://bigsdb.readthedocs.org/ [accessed MAY 2025]

Maiden, M.C.J., et al. MLST revisited: the gene-by-gene approach to bacterial genomics. Nature Reviews Microbiology 2013;11(10):728-736.

Unitt, A., et al. Neisseria gonorrhoeae LIN codes: a Robust, Multi-Resolution Lineage Nomenclature. bioRxiv 2025:2025.03.28.646058.

Acknowledgements

Thanks to Dr Odile Harrison, Prof Martin Maiden, Dr Keith Jolley and Dr James Bray who trained me in the use of PubMLST.