Jun 03, 2025

Bacterial genome annotation script using BLASTN V.1

This protocol is a draft, published without a DOI.
  • Ana Mariya Anhel1,
  • Lorea Alejaldre1,
  • Ángel Goñi-Moreno1,
  • Lewis Grozinger2
  • 1Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)-Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA/CSIC), Madrid, Spain;
  • 2Centro Nacional de Biotecnologia (CNB-CSIC) Madrid, Spain
Protocol Citation: Ana Mariya Anhel, Lorea Alejaldre, Ángel Goñi-Moreno, Lewis Grozinger. 2025. Bacterial genome annotation script using BLASTN. protocols.io https://protocols.io/view/bacterial-genome-annotation-script-using-blastn-gzcvbx2w7
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: May 14, 2025
Last Modified: June 03, 2025
Protocol Integer ID: 218229
Keywords: Genome annotation, Bacterial, P. putida, Transposon, Transposon library, E. coli, bacterial genome annotation script, genome of pseudomonas putida kt2440, transposon integration events in genome, genome alignment, annotations of the transposon sequence, reads to transposon sequence, method of genome alignment, alignment to the genome, annotations from genbank, positions in the genome, other genome, sequencing read, position in the genome, genome, gene locus of transposon insert, genome sequence, multiple genome, genbank files for each read, multiple genome sequence, genbank format, pseudomonas putida kt2440, genbank file, genbank sequence, multiple genbank file, transposon sequence, using blastn, transposon insert, genbank reference, gene, python package tnatlas, sequence trimming, genbank, gene locus, transposon integration event, new tool for metadata annotation, general metadata annotation of the result, reads with corresponding feature, blastn, general metadata annotation, aligning read, python script, annotation
Funders Acknowledgements:
Comunidad de Madrid
Grant ID: Y2020/TCS-6555,2019-T1/BIO-14053
MCIN/AEI
Grant ID: CEX2020-000999-S,PID2020-117205GA-I00
ERC
Grant ID: 101044360
Abstract
This protocol uses the command line tools provided by the Python package TnAtlas to identify and annotate transposon integration events in genomes.

Given a set of sequencing reads, transposon sequences and genomes, the protocol looks for positions in the genomes where the transposon sequences have been integrated, annotates the sequencing reads with corresponding features of the genome, and presents a summary of the results as a spreadsheet. Integrations are identified by aligning reads to transposon sequences using NCBI's BLASTN, and annotations are made from reference genomes in GENBANK format.

We have used this protocol in our group (https://biocomputationlab.com) to identify the location and gene locus of transposon inserts in the genome of Pseudomonas putida KT2440. However, this script can be used for other genomes for which the genome sequence and annotation are available.

This updated version of the protocol is the first to be supported by the Python package TnAtlas, which contains library code that can be used to run the protocol programmatically (in a Python script or notebook), as well as to compute and present other statistics about the results.

The TnAtlas package requires Python version 3.8 or later. The protocol requires blastn version 2.12 or later. Optionally, sickle version 1.33 or later and fastqc (any version) are also required in order to perform sequence trimming and quality control reporting.

This protocol describes the LAP entry LAPu-InsertsGenAnnotation-2.0.0, which is available both in the LAP repository and as the GitHub entry LAPu-InsertsGenAnnotation-2.0.0; the script and usage examples can be downloaded directly from either location.

The major changes from the previous version are:

  1. Now shipped as a Python package: The code and tools are packaged and can be installed with Python PIP.
  2. New tool for metadata annotation: The old --summary-map system has been replaced with a dedicated tool `tnmeta`, for general metadata annotation of the results.
  3. Command line options for saving intermediate files: By default, the tools create far fewer files. The options, --sam, --trim-save, --transposon-save, and --genome-save control the creation of intermediate files.
  4. Method of genome alignment: Reads are now aligned to the genome after the detected transposon sequence has been removed, giving preference to alignments earlier in the read and avoiding position errors of a few base pairs that can affect sequence logo results.
  5. Generates annotated genbank sequences: tnfind generates genbank files for each read that includes annotations of the transposon sequence and of the corresponding features from its position in the genome.
  6. Annotations from genbank: Annotations come from genbank reference genomes instead of separate CSV files.
  7. Multiple genomes: `tnfind` can search through multiple genome sequences at once by passing either multiple records in a single genbank file, or multiple genbank files, to the `-genome` option.
Guidelines
This script needs a minimum of 4 arguments, in the following order:
  1. Directory containing the sequencing reads
  2. Reads file type (the reads should be in FASTA format, although the file extension can be anything)
  3. Genome file used to perform the blastn alignment (FASTA format)
  4. Genome annotation file (.csv)
Materials
Software

  • Docker or Docker Desktop

OR

  • Linux or MacOS
  • Python 3.8 or greater
  • BLAST+ version 2.12 or greater
  • Sickle
  • FastQC
Troubleshooting
Prepare input data
Reference genome

You can download the genome of the organism(s) that you want to compare to your sequencing reads from different sources such as NCBI, GSA or even pages dedicated to the organism (for example pseudomonas.com).

For this protocol, the genome files need to be in GENBANK, FASTA, or FASTQ format.

Note
Annotation will only work if GENBANK files are provided containing sequence features.

Sequencing reads

Place all read files under a single directory. These files can be in either FASTA or FASTQ format, but formats cannot be mixed (i.e. the tool will not process both FASTA and FASTQ files simultaneously).

Note
The file extension is largely irrelevant and can be specified later using the --input-ext option.

Note
If quality control and read trimming are desired, the FASTQ format is required. Sanger, Solexa/Illumina 1.0 and Illumina ≥1.3 quality score encodings are supported for FASTQ files.

Each FASTA or FASTQ file can contain multiple reads, and multiple files can be provided. All reads within all FASTA or FASTQ files in the directory will be processed. It is therefore important that each read has a unique ID (the ID of the sequence follows the first > character in the FASTA format, and the first @ character in the FASTQ format).
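Because duplicate IDs can make reads indistinguishable in the results, it may be worth checking for them before running the tools. A minimal sketch for FASTA input (the directory, file names and IDs here are hypothetical):

```shell
# Demo data: two FASTA files sharing a read ID
dir=$(mktemp -d)
printf '>readA\nACGT\n>readB\nGGCC\n' > "$dir/plate1.fasta"
printf '>readB\nTTAA\n' > "$dir/plate2.fasta"   # readB appears twice

# Print any read IDs that occur more than once across all files
grep -h '^>' "$dir"/*.fasta | cut -d' ' -f1 | sort | uniq -d
```

An empty output means all IDs are unique; here the command prints `>readB`.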


Note
Later, when attaching (optional) metadata to the results, the ID will also be used to identify the plate and well from which the read was sampled. It is recommended that IDs are generated in a systematic way which facilitates this identification.

Transposon sequences

Sequences belonging to the transposon systems used should be provided in either GENBANK, FASTA, or FASTQ format. These sequences will be used to determine from where to start aligning a read to the reference genomes.

Each GENBANK, FASTA, or FASTQ file can contain multiple sequences, and multiple files can be provided.

Note
Shorter sequences tend to give better results and fewer false positives than longer ones, but extremely short sequences should be avoided. Sequences should certainly be longer than the blastn word size (default: 11 base pairs).


Sample plate metadata

Additional metadata can be attached to individual sequencing reads by providing the metadata in a grid, where position in the grid can be mapped onto samples in a well plate.

The grid, or grids, can be provided in excel format (XLSX). The first row of the excel file should contain numbers that index the columns of the plate, and the first column should contain letters that index the rows of the plate. Each cell corresponding to a well in a plate should contain the metadata that you wish to attach to the sequencing reads originating from that well.

The name of the files themselves should follow the format:

<PLATE_IDENTIFIER>_<METADATA_LABEL>.xlsx

where <PLATE_IDENTIFIER> is a unique identifier for the plate to which the metadata should be applied, and <METADATA_LABEL> is the name that you would like for the column containing the metadata that will appear in the output table.

<PLATE_IDENTIFIER> should usually correspond to some unique part of the IDs of the sequencing reads (see the discussion on IDs in the Sequencing reads section above).

An example metadata file is provided here:
Download: 22CCRAA01_identity.xlsx (18 KB)

This will attach a metadata column called "identity" to the reads that come from samples in the "22CCRAA01" plate.
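As an illustration of the grid layout, the sheet inside such a metadata file might look like this (the cell values here are hypothetical):

```
     1         2         3
A    GFP       mCherry   empty
B    empty     GFP       mCherry
```

With a file named 22CCRAA01_identity.xlsx, the read sampled from well A2 of plate 22CCRAA01 would receive the value "mCherry" in a new "identity" column.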

Optional
If you have a Windows system, follow the instructions for "Using Docker on Windows"

If you have a MacOS system you can follow either "MACOS systems" or "Using Docker on MacOS"

Note
If you are comfortable using Homebrew to install software you can pick either set of instructions. If you are not, it may be easier to follow the "Using Docker on MacOS" steps.

If you have a Linux system you can follow either "Linux systems" or "Using Docker on Linux"

Note
If you are comfortable using your package manager to install software you can pick either set of instructions. If you are not, it may be easier to follow the "Using Docker on Linux" steps.

Step case

Linux systems
25 steps

Install Python3

Most Linux distributions have Python3 in their package repositories, and it is convenient to install Python3 using the package manager. This protocol requires a Python version of at least 3.8.

For example:
Command
Python3.8 Installation using apt package manager (Linux Debian-based)
sudo apt-get update
sudo apt-get install python3.8

Note
The latest, up-to-date instructions for installing Python on Linux are here.

It is also important to have `pip` installed. If pip does not come with your Python3 installation, you must install it manually. For example:
Command
Pip installation from Python3.8
python3.8 -m ensurepip --upgrade

Note
The latest, up-to-date instructions for installing pip on Linux are here.


Note
If you already have Python3 installed and setup you may use any Python greater than or equal to version 3.8. In the commands listed here python3.8 will be used, but you can replace that with python3.x or python3 or python, depending on your setup.

Install BLAST+

There are different ways to install the NCBI-BLAST+ suite of tools. Fortunately most Linux distributions provide a package called ncbi-blast+ or similar that can be installed with the package manager. If your distribution has such a package in its repositories, this is the recommended way to install BLAST+.

For example on debian-based distributions:
Command
Install NCBI BLAST+ suite using apt (Debian-based)
sudo apt-get install ncbi-blast+
If no BLAST package is available for your distribution, then the latest versions of the tools can also be downloaded from the NCBI ftp servers here. You can download them from the command line using wget:
Command
Download latest Linux NCBI-BLAST+ tools
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.16.0+-x64-linux.tar.gz

Note
ncbi-blast-2.16.0+ was the latest version as of writing. In the future the URL in the above command will change. You should check here for the latest version and adjust the URL accordingly.

Note
If you decide to use the NCBI download to install BLAST+, then you will need to untar the download into a location on your filesystem. The tools will be under a directory named bin. Either you will need to add this directory to your PATH, so that Linux can find the BLAST commands, or you will need to specify the path to your blastn executable using the -blastn option when running the analysis tool in future steps.
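If you take the download route, the unpacking and PATH setup described in this note might look like the following sketch (the archive and directory names depend on the version you downloaded):

```shell
# Unpack the downloaded archive; the tools end up under ncbi-blast-2.16.0+/bin
tar -xzf ncbi-blast-2.16.0+-x64-linux.tar.gz

# Option 1: add the bin directory to PATH for this shell session
export PATH="$PWD/ncbi-blast-2.16.0+/bin:$PATH"
blastn -version

# Option 2: point tnfind at the executable directly in later steps
# tnfind ... -blastn "$PWD/ncbi-blast-2.16.0+/bin/blastn"
```

To make the PATH change permanent, the export line can be added to your shell startup file (e.g. ~/.bashrc).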

Check BLAST+ installation

You can check that the installation of the BLAST+ tools went well by running:
Command
Check blastn version
blastn -version
which should report the version of BLAST+ you installed (at least 2.12.0+).
Install FastQC

FastQC is a tool used for quality control of high-throughput sequence data. It provides a way to assess the quality of raw sequencing data.

If you will not follow the quality assessment steps of the protocol, there is no need to install this software.

Most package repositories provide fastqc, for example, on Debian-based systems:

Command
Install fastqc using apt (Debian-based)
sudo apt-get install fastqc

Optional
Install Sickle

Sickle is a software tool designed for quality control of high-throughput sequence data, especially for data generated by Next-Generation Sequencing (NGS) platforms. Its primary use case is trimming low-quality bases and adapter sequences from the ends of sequencing reads.

If you will not run the quality assessment steps of the protocol, there is no need to install this software.

Most package repositories provide sickle, for example, on Debian-based systems:
Command
Install sickle using apt (Debian-based)
sudo apt-get install sickle

Optional
The TnAtlas tools

The TnAtlas Python package contains the software tools used in this protocol. TnAtlas can be downloaded using the Python package installer PIP as follows:
Command
Install TnAtlas using PIP
python3.8 -m pip install tnatlas
If the installation is successful then the tnfind and tnmeta commands should be available:
Command
Check tnfind is installed
tnfind

Expected result
usage: tnfind [-h] [-v] [--sequencing-type {sanger,solexa,illumina}]
[--input-type {fasta,fastq}] [--input-ext INPUT_EXT] [-o OUTPUT_FILE]
[-qc [PATH_TO_FASTQC]] [-trim [PATH_TO_SICKLE]] [--trim-quality TRIM_QUALITY]
[--trim-length TRIM_LENGTH] [--trim-save] [-blastn PATH_TO_BLASTN] [-sam]
-transposon TRANSPOSON [TRANSPOSON ...] [--transposon-type TRANSPOSON_TYPE]
[--transposon-save] [--transposon-word-size TRANSPOSON_WORD_SIZE]
[--transposon-evalue TRANSPOSON_EVALUE] -genome GENOME [GENOME ...]
[--genome-type GENOME_TYPE] [--genome-save]
[--genome-word-size GENOME_WORD_SIZE] [--genome-evalue GENOME_EVALUE]
[--genome-prefix GENOME_PREFIX] [-config CONFIG]
input_dir output_dir
tnfind: error: the following arguments are required: input_dir, output_dir, -transposon, -genome


Command
Check tnmeta is installed
tnmeta

Expected result
usage: tnmeta [-h] [-o OUTPUT_FILE]
plate_regex well_regex result_file metadata [metadata ...]
tnmeta: error: the following arguments are required: plate_regex, well_regex, result_file, metadata


Running the tnfind tool
Basic usage

tnfind finds possible transposon integration events in a set of sequencing reads and positions them in reference genomes.

Note
tnfind is a command line tool that should be run through a terminal or terminal emulator such as xterm, gnome terminal, Windows powershell or iterm.

At a minimum, tnfind requires four arguments:

  • input_dir is the directory where the sequencing reads in FASTA or FASTQ can be found.
  • output_dir is the directory into which tnfind will put its results.
  • -transposon is the GENBANK, FASTA or FASTQ file or files, containing the transposon DNA sequences.
  • -genome is the GENBANK, FASTA or FASTQ file or files, containing the genomic DNA sequences.


Note
output_dir needs to already exist for tnfind to work.

Note
Existing files in output_dir are at risk of being overwritten without warning. Be careful when running tnfind over multiple separate datasets using the same output_dir.

A simple example execution of tnfind would look like:
Command
Simple example of running tnfind
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb
where in this case input_dir is ./data and contains FASTQ files with extension ab1, output_dir is ./data/results, the file ./data/transposons.gb contains transposon sequences in GENBANK format, and ./data/genomes.gb contains reference genomes in GENBANK format.
Other input formats, types and file extensions

The following options can be used in combination to process input files with different formats, types and file extensions:

  • --input-type: is the format of the read files and should be either fasta or fastq. The default is fastq.
  • --input-ext: is the file extension of the read files. This can be anything. The default is ab1 if the --input-type is fastq or fasta if the --input-type is fasta.
  • --sequencing-type: the type of sequencing used to obtain the reads. This can be either sanger, solexa or illumina. The default is sanger.
  • --transposon-type: is the format of the file(s) containing the transposon sequences and can be either genbank, fasta or fastq. The default is genbank.
  • --genome-type: is the format of the file(s) containing the genome sequences and can be either genbank, fasta or fastq. The default is genbank.

For example, if you have FASTQ format reads that have the file extension fastq instead of ab1, you could run:
Command
Run tnfind on fastq files
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb --input-ext fastq
Or if you do not have an annotated reference genome, and want to instead use a sequence-only FASTA file:
Command
Running tnfind with FASTA reference genome
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.fasta --genome-type fasta

Optional
Quality control report

Optionally, tnfind can generate a fastqc quality control report for your sequencing reads, that contains information that will be useful if you intend to later trim your reads based on quality.

Passing the option -qc when running tnfind will generate the quality control report. The report will be saved in the output_dir in the file records_fastqc.html; you can open this file with your favourite web browser.
Command
Running fastqc as part of tnfind
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb -qc

Note
By default, tnfind looks for a command fastqc on your computer system. If you have installed fastqc somewhere that is not on your PATH, you may also provide the path/to/fastqc to the -qc argument.

Command
Running tnfind with a custom location for fastqc
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb -qc /path/to/fastqc


Optional
Quality-based trimming

To ensure that sequencing quality does not affect your results, you may wish to trim your input sequences where quality is low. tnfind can do this using the sickle tool.

Passing the option -trim when running tnfind will trim your reads before processing.
Command
Running tnfind with sickle to trim reads
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb -trim
Optionally, the --trim-length and --trim-quality options can be passed to tnfind to control the severity of the trimming.
Command
Running tnfind with sickle and with custom trim length and quality
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb -trim --trim-length 30 --trim-quality 25
If --trim-length is not given explicitly, its default value is 20. If --trim-quality is not given explicitly, its default value is also 20.


Note
By default, tnfind looks for a command sickle on your computer system. If you have installed sickle somewhere that is not on your PATH, you may also provide the path/to/sickle to the -trim argument.

Command
Running tnfind with sickle in a custom location
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb -trim /path/to/your/sickle



Optional
SAM output

Part of tnfind's processing is a multiple sequence alignment between your sequencing reads and the reference genome(s). Optionally, this multiple sequence alignment may be saved in SAM format.

Passing the option -sam to tnfind will save a multiple sequence alignment in SAM format in the output_dir in the file samgenomes.sam.
Command
Running tnfind with SAM output
tnfind ./data ./data/results -transposon ./data/transposons.gb -genome ./data/genomes.gb -sam

Optional
Obtaining help

There are many more optional arguments that control the behaviour of tnfind.

They can be listed, alongside a short description of each, by running:
Command
Display the tnfind help
tnfind --help

Default outputs

  • results.xlsx is an excel spreadsheet containing the summary of the tnfind results.
  • tnfind also produces one GENBANK file per input read, which is annotated with the supposed transposon sequence and genome alignments for the read. The genome alignments are also annotated with corresponding genome features and loci, if those are available in the reference genome files.
results.xlsx

This is an excel spreadsheet which contains at least one row for every read in the input to tnfind.

If a particular read has multiple rows, then it was not possible to uniquely identify a position in the reference genome(s) for the transposon integration event. This can happen, for example, when the integration occurred within a genomic region that is repeated across the genome(s) (rRNAs, genes with multiple copies, or genes shared among genomes).

The columns of the spreadsheet describe various statistics and information about the alignment. They fall into several groups.

Information about the sequencing read
  • The first column is simply an indexing column and can be ignored.
  • The name column is the ID of the read obtained from the input file.
  • The read length column is the length of the read (after trimming, if applicable).
  • The multiple candidates column is true if the read has multiple rows and false otherwise. It can be used to filter uniquely identifiable integration events.
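The multiple candidates column makes it straightforward to keep only uniquely placed integration events when post-processing the spreadsheet, for example with pandas (the column names below are assumed to match the spreadsheet headers described here; in practice you would load the real file with pd.read_excel("results.xlsx")):

```python
import pandas as pd

# Toy stand-in for the tnfind results spreadsheet
df = pd.DataFrame({
    "name": ["readA", "readB", "readB"],
    "multiple candidates": [False, True, True],
})

# Keep only reads whose integration position was uniquely identified
unique = df[~df["multiple candidates"]]
print(unique["name"].tolist())  # ['readA']
```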

Information about the transposon sequence found
  • The transposon column contains the name of the supposed transposon sequence.
  • The transposon evalue column contains the evalue, reported by blastn, of the alignment of the read to the transposon sequence.
  • The transposon bit score column contains the bit score, reported by blastn, of the alignment of the read to the transposon sequence.
  • The transposon identity column contains the identity score, reported by blastn, of the alignment of the read to the transposon sequence. 1 is 100% identity.
  • The transposon length column indicates the length of the aligned part of the transposon sequence.
  • The transposon start column indicates the base pair position in the read where the alignment to the transposon sequence begins.
  • The transposon end column indicates the base pair position in the read where the alignment to the transposon sequence ends.

These columns are empty if no transposon sequence is found in the read.

Information about the position in the genome
  • The genome offset column indicates the distance in base pairs between the end of the transposon alignment and the beginning of the genome alignment.
  • The gap sequence is the read sequence (if any), between the end of the transposon alignment and the beginning of the genome alignment.
  • The insertion locus is the name of the locus (if any) in the genome where the integration event is supposed to have occurred. This name comes from the annotated reference genome.
  • The insertion gene is the name of the gene into which the transposon sequence was integrated. This name comes from the annotated reference genome. If there is no such annotation this column shows "None".
  • The sequence type is the annotated type of the locus into which the integration has occurred. Again, this is taken from annotations on the reference genome.
  • The insertion strand is the strand orientation of the locus into which the integration has occurred.
  • The insertion product is the product annotation of the locus into which the integration has occurred.

These columns are empty if there is no alignment to the genome, or if genome annotations are missing.

Information about the genome alignment
  • The genome column contains the ID of the reference genome to which the read has been aligned.
  • The genome alignment evalue column contains the evalue, reported by blastn, of the alignment of the read to the genome sequence.
  • The genome alignment bit score column contains the bit score, reported by blastn, of the alignment of the read to the genome sequence.
  • The genome alignment identity column contains the identity score, reported by blastn, of the alignment of the read to the genome sequence. 1 is 100% identity.
  • The genome alignment length column contains the length in base pairs of the part of the read that was aligned to the genome.
  • The genome alignment mismatch column contains the number of mismatches, as reported by blastn, in the alignment between the read and the genome.
  • The genome alignment gaps column contains the number of gaps, as reported by blastn, in the alignment between the read and the genome.
  • The genome start column indicates the base pair position in the reference genome of the start of the alignment between the read and the genome.
  • The genome end column indicates the base pair position in the reference genome of the end of the alignment between the read and the genome.
  • The genome strand column indicates the strand orientation on the genome of the alignment between the read and genome.
  • The genome loci column is a list of all the genomic loci contained within the genome alignment (as opposed to the single locus at the integration position reported in the insertion locus column).
  • The genome prefix column displays the sequence corresponding to the first 9 base pairs of genomic DNA contained in the read (the start of the read sequence that aligns with the genome). The size of the sequence displayed can be changed with the --genome-prefix option.
  • The genome alignment column shows the quality of the alignment of the genome prefix sequence with the genome. Exact matches are indicated with "|". Mismatches are indicated with ".". Gaps are shown as "-".

These columns are empty if there is no alignment to the genome.
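As an illustration, a hypothetical genome prefix and its alignment string might read:

```
genome prefix:     ATGCCGTAC
genome alignment:  ||.||-|||
```

indicating a mismatch at the third base and a gap at the sixth, with all other positions matching the genome exactly.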

Annotated reads

tnfind produces an annotated GENBANK file (file names ending in .processed.gb) for each input read that can be used to visually inspect the positions and alignments found to both the transposon sequences and the genomes.

These files can be opened using your favourite sequence viewer to see the arrangement of transposon and genomic sequences that tnfind thinks is most probable for each read. Genomic sequences are also annotated with features from their respective reference genomes.

An example:


An example of the GENBANK output from tnfind. In this case, the read 22CCRRAA000_C05 is annotated with a transposon sequence alignment, plus the IR (Inverted repeat) feature. There is a small gap between the end of the transposon sequence and the genomic sequence, which comes from the genome AE015451.2 and contains the gene hmgA. In this case, the read aligned with the plus strand of AE015451.2, but the hmgA gene is coded for on the minus strand.

Example
12m
Create a new directory structure to work in and cd into it:
Command
Make a directory for the example run of tnfind
mkdir tnfind-example && cd tnfind-example

Download the example sequencing read files, transposon sequences file and reference genome.
Dataset
tnfind example data

You can download the example data from https://saco.csic.es/s/jtwG6EGPCjyJcRk and unzip into a directory called data, or use the command:
Command
Download the example data and unzip
wget https://saco.csic.es/s/jtwG6EGPCjyJcRk/download -O data.zip && unzip data.zip && rm data.zip
This creates a directory called data, under the tnfind-example directory, containing the example data files.

4m
Make a directory to contain the results
Command
Make a results directory
mkdir data/results

Run tnfind with trimming
Command
Run tnfind example
tnfind data data/results -transposon data/transposons.gb -genome data/pputidakt2440.gb -trim

Expected result
96 reads from /home/lewis/tnfind-example/data loaded

SE input file: /tmp/tmpy6xbbxx4/records.fastq

Total FastQ records: 96
FastQ records kept: 94
FastQ records discarded: 2

Loading transposons from /home/lewis/tnfind-example/data/transposons.gb
2 transposons loaded
blastn -query /tmp/tmpqoi9a0mu/query.fasta -subject /tmp/tmpqoi9a0mu/target.fasta -parse_deflines -outfmt 5 -evalue 0.01 -word_size 10
Loading genomes from /home/lewis/tnfind-example/data/pputidakt2440.gb
1 genome loaded
blastn -query /tmp/tmpva5irmgr/query.fasta -subject /tmp/tmpva5irmgr/target.fasta -parse_deflines -outfmt 5 -evalue 0.01 -word_size 10
/home/lewis/.local/lib/python3.13/site-packages/tnatlas/cli.py:94: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
data = pandas.concat([read.dataframe for read in reads.values()])
Saving summary table to /home/lewis/tnfind-example/data/results/results.xlsx

Open the spreadsheet at tnfind-example/data/results/results.xlsx. It should be similar to:

Download: results.xlsx (41 KB)
Running the tnmeta tool
It is sometimes useful to add additional data to the results.xlsx spreadsheet. The tnmeta tool can be used to attach metadata to each read, based on the plate and well from which the read was sampled.
Note
tnmeta is a command line tool that should be run through a terminal or terminal emulator such as xterm, gnome terminal, Windows powershell or iterm.

The tnmeta tool requires 3 arguments.
  • well_plate_regex is a regular expression that will be used to extract the plate and well number from the read IDs.
  • result_file is the file which is to be annotated with the metadata.
  • metadata are the files which contain the maps of metadata.

By default, result_file is overwritten by tnmeta. If you don't want to overwrite result_file you can pass the option -o <name/of/new/file> to save the annotated results to a separate file.

Note
If you already have result_file open in your spreadsheet software, you may encounter a permission denied error when trying to run tnmeta. Close the result_file spreadsheet first and try again.

Naming the metadata file

The name of the metadata file will be used to determine to which plate the metadata corresponds, and the name of the new column that should contain the metadata in the results table.

The name of the file must begin with an alphanumeric plate identifier, followed by an underscore, followed by an alphanumeric column name.

A file named "PLATE_COLUMN.xlsx" will cause tnmeta to look for reads from the plate called PLATE, and to add the metadata in a new column called COLUMN.

You should rename your metadata files accordingly.
Well and plate regular expression

The well_plate_regex argument is used to match read IDs to a specific plate, and to extract the well position from the read IDs. The regular expression should contain the keywords PLATE and WELL, which indicate the positions of the plate identifier and well position in the read ID. All other parts of the regular expression are interpreted using the Python re regular expression syntax.

For example, if we have read IDs of the form "22CCRAA000_A01_premix", where "22CCRAA000" is the plate identifier and "A01" is the well position, we could use the regular expression:

PLATE_WELL_

This means that tnmeta should expect the plate identifier to come first, then an underscore, then the well position, followed by another underscore.

Note
To avoid problems due to shell expansion of regular expression wildcards, it is a good idea to pass the well_plate_regex argument inside quotes, for example, 'PLATE_WELL_'
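TnAtlas's exact handling of the PLATE and WELL keywords is internal to the tool, but the idea can be sketched with Python's re module, assuming the keywords behave like named capture groups (the pattern and read ID below mirror the example above; the translated regular expression is an illustration, not the tool's actual implementation):

```python
import re

# Hypothetical translation of the 'PLATE_WELL_' pattern:
# PLATE and WELL are assumed to become named capture groups.
pattern = re.compile(r"(?P<PLATE>\w+?)_(?P<WELL>[A-H]\d{2})_")

m = pattern.match("22CCRAA000_A01_premix")
print(m.group("PLATE"), m.group("WELL"))  # prints: 22CCRAA000 A01
```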

Example
In this example we use tnmeta to annotate the output of the tnfind example above. If you have not already, complete the steps of that example first.

We will annotate using the file

Download: 22CCRAA000_identity_plate.xlsx (18 KB)

This file will look for plates identified as "22CCRAA000", and will create a new column in the results table called "identity". The file is included in the example data and, if you have followed the tnfind example, should already be in the tnfind-example/data directory.

In the tnfind-example directory, run the command:
Command
Run tnmeta on the example results
tnmeta 'PLATE_WELL_' data/results/results.xlsx data/22CCRAA000_identity_plate.xlsx -o data/results/metadata_results.xlsx

You should see a new file has been created at tnfind-example/data/results/metadata_results.xlsx, that is similar to:
Download: metaresults.xlsx (42 KB)