AgBase Functional Annotation Workflow

Amanda Cooksey; Anna Childers; Monica Poelchau; Surya Saha; Fiona McCarthy

Jun 10, 2026

Version 1

DOI

https://dx.doi.org/10.17504/protocols.io.dm6gp7n6jgzp/v1

Amanda Cooksey¹,
Anna Childers²,
Monica Poelchau³,
Surya Saha⁴,
Fiona McCarthy¹

¹School of Animal and Comparative Biomedical Sciences, University of Arizona;
²Bee Research Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA;
³National Agricultural Library, Agricultural Research Service, USDA;
⁴Boyce Thompson Institute / Velsera

USDA National Agricultural Library

Amanda M Cooksey

University of Arizona

External link: https://agbase.arizona.edu/

Protocol Citation: Amanda Cooksey, Anna Childers, Monica Poelchau, Surya Saha, Fiona McCarthy 2026. AgBase Functional Annotation Workflow. protocols.io https://dx.doi.org/10.17504/protocols.io.dm6gp7n6jgzp/v1

Version created by Amanda M Cooksey

Manuscript citation:

Saha, S.; Cooksey, A.M.; Childers, A.K.; Poelchau, M.F.; McCarthy, F.M. Workflows for Rapid Functional Annotation of Diverse Arthropod Genomes. Insects 2021, 12, 748. https://doi.org/10.3390/insects12080748

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: May 08, 2026

Last Modified: June 10, 2026

Protocol Integer ID: 316612

Keywords: functional annotation, gene ontology, pathways, KEGG, Reactome, Flybase, InterProScan, GO, agbase functional annotation workflow, pathway annotation, annotations from blast match, functional annotation workflow, single gaf of go annotation, functional annotation workflow this protocol, annotation tool, transfers gene ontology, go annotation, agbase, blast search, predictive information about protein, drosophila melanogaster reactome, query gene product, annotation, peptide fasta file, combine gaf, other tools in the workflow, protein, gaf format, flybase, interpro database, proteins with kegg, pathannotator, blast match, interproscan, single gaf, other tool

Funders Acknowledgements:

National Science Foundation

Grant ID: EPS-0903787

US Department of Agriculture Cooperative State Research, Education and Extension Service National Research Initiative

Grant ID: MISV-329140

National Institute of General Medical Sciences of the National Institutes of Health

Grant ID: 07111084

US Department of Agriculture Agricultural Research Service

Grant ID: Cooperative Agreement 6402-21000-033-01S

US Department of Agriculture National Institute of Food and Agriculture

Grant ID: MIS-069270

US Department of Agriculture National Institute of Food and Agriculture

Grant ID: MIS-241080

US Department of Agriculture National Institute of Food and Agriculture

Grant ID: 2011-67015-30332

US Department of Agriculture Agricultural Research Service

Grant ID: Non-Assistance Cooperative Agreement 58-8260-9-002

US Department of Agriculture Agricultural Research Service

Grant ID: Non-Assistance Cooperative Agreement 58-8260-4-003

Mississippi Agricultural and Forestry Experiment Station

Grant ID: NA

Disclaimer

This work was supported in part by the U.S. Department of Agriculture, Agricultural Research Service. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA. USDA is an equal opportunity provider and employer.

Abstract

This protocol is an update to the protocol described in Saha et al.

The AgBase functional annotation workflow employs three annotation tools:
GOanna: performs a BLAST search and transfers gene ontology (GO) annotations from BLAST matches to the query gene products.
InterProScan: employs the InterPro database, which integrates predictive information about protein function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains. InterProScan can also provide GO and pathway annotations.
Pathannotator: annotates proteins with KEGG, FlyBase and Drosophila melanogaster Reactome pathways.

Each of these tools accepts a peptide FASTA file and can be run independently of the other tools in the workflow. As both GOanna and InterProScan provide GO annotations, their outputs are provided in GAF format. The Combine GAFs tool can then be used to make a single GAF of GO annotations, if desired. 

Getting started

The Tribolium castaneum RefSeq (GCF_031307605.1-RS_2024_04) proteins are used as an example throughout this protocol. 
Dataset: Tribolium castaneum RefSeq proteins
https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/Tribolium_castaneum/GCF_031307605.1-RS_2024_04/

Command
Get the Tribolium castaneum data
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/307/605/GCF_031307605.1_icTriCast1.1/GCF_031307605.1_icTriCast1.1_protein.faa.gz

gunzip GCF_031307605.1_icTriCast1.1_protein.faa.gz
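
Before starting the analyses it can be worth confirming that the download unpacked into a usable protein FASTA file. The sketch below counts the sequences in a FASTA file; the function name count_fasta_records is hypothetical and is not part of the AgBase tools.

```shell
# Hypothetical helper (not part of the AgBase tools).
# Every FASTA record starts with a '>' header line, so counting
# those lines gives the number of sequences in the file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example call, using the file name from the download step above:
# count_fasta_records GCF_031307605.1_icTriCast1.1_protein.faa
```

A count of 0, or an error, suggests the download or the gunzip step did not complete.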

The tools in this workflow are provided as containers which can be run using various container technologies. This protocol details the use of Docker and Apptainer. Which technology you need will depend on the compute system you intend to use. If you are running the analyses on a local computer or virtual machine you may want to install Docker. If you are running your analyses on a shared remote computing system (e.g. institutional HPC) you will need to use Apptainer. 
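
If you are unsure which container technology is available on your system, a quick check such as the sketch below may help. The function name pick_container_runtime is hypothetical, and reporting Docker first when both are installed is an assumption of this sketch, not a recommendation of the protocol.

```shell
# Hypothetical helper: report which container runtime is on the PATH.
# Prints "docker", "apptainer" or "none".
pick_container_runtime() {
    if command -v docker >/dev/null 2>&1; then
        echo docker
    elif command -v apptainer >/dev/null 2>&1; then
        echo apptainer
    else
        echo none
    fi
}

# Example call:
# pick_container_runtime
```

Finding the docker executable does not guarantee that you have permission to run it; the permission requirements are described in the Docker section.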

Run with Docker

Docker must be installed on the computer you wish to use for your analysis.
To run Docker you must have ‘root’ permissions.
Docker will run all containers as ‘root’. This makes Docker incompatible with HPC systems.
Docker can be run on your local computer, a server, or a cloud virtual machine.

GOanna

GOanna performs a BLAST search, allows you to filter based on BLAST match parameters and transfers Gene Ontology (GO) functional annotations from the BLAST matches to your input proteins.
GOanna accepts a protein FASTA file as input.
BLAST databases are created by AgBase based upon proteins that have GO available and subsetted by phyla. We recommend selecting the database most closely related to the sequence used as input.
We strongly recommend selecting only GO annotations based on experimental evidence codes. This will ensure the best quality annotations for your data.
The remaining parameters are standard BLAST parameters. More information on determining the best BLAST parameters for your specific data set can be found in the section below.

Get GOanna databases. There are two required directories to download: agbase_database and go_info. These can be obtained using GoCommands, which the user can install as a small binary file (even on an HPC system).

Command
Install GoCommands binary
GOCMD_VER=$(curl -L -s https://raw.githubusercontent.com/cyverse/gocommands/main/VERSION.txt)
curl -L https://github.com/cyverse/gocommands/releases/download/${GOCMD_VER}/gocmd-${GOCMD_VER}-linux-amd64.tar.gz | tar zxvf -

chmod +x gocmd

Command
Initiate GoCommands
./gocmd init
Fill in this data when prompted:
iRODS Host [data.cyverse.org]: data.cyverse.org
iRODS Port [1247]: 1247
iRODS Zone [iplant]: iplant
iRODS Username: anonymous

Command
Pull data with GoCommands
./gocmd get -r --progress /iplant/home/shared/iplantcollaborative/protein_blast_dbs/go_info .
./gocmd get -r --progress /iplant/home/shared/iplantcollaborative/protein_blast_dbs/agbase_database .

Get the GOanna container

Command
Get GOanna container using Docker
docker pull agbase/goanna:2.4

Run GOanna analysis with Docker

Command
GOanna help using Docker
docker run --rm agbase/goanna:2.4 -h

GOanna help and usage:

Options:
-a BLAST database basename (arthropod, bacteria, bird, crustacean, fish, fungi, human, insecta, invertebrates, mammals, nematode, plants, rodents uniprot_sprot, uniprot_trembl, vertebrates or viruses)
-c peptide fasta filename
-o output file basename
[-b transfer GO with experimental evidence only (yes or no). Default = yes.]
[-d database of query ID. If your entry contains spaces either substitute an underscore (_) or, to preserve the space, use double quotes around your entry. Default: user_input_db]
[-e Expect value (E) for saving hits. Default is 10.]
[-f Number of aligned sequences to keep. Default: 3]
[-g BLAST percent identity above which match should be kept. Default: keep all matches.]
[-h help]
[-m BLAST percent positive identity above which match should be kept. Default: keep all matches.]
[-s bitscore above which match should be kept. Default: keep all matches.]
[-k Maximum number of gap openings allowed for match to be kept. Default: 100]
[-l Maximum number of total gaps allowed for match to be kept. Default: 1000]
[-q Minimum query coverage per subject for match to be kept. Default: keep all matches]
[-t Number of threads.  Default: 8]
[-u Assigned by field of your GAF output file. If your entry contains spaces (eg. firstname lastname) either substitute and underscore (_) or, to preserve the space, use quotes around your entry (eg. firstname lastname) Default: user]
[-x Taxon ID of the query species. Default: taxon:0000]
[-p parse_deflines. Parse query and subject bar delimited sequence identifiers]

GOanna has three required parameters:
-a BLAST database basename (acceptable options are listed in the help/usage)
-c peptide FASTA file to BLAST
-o output file basename

Other parameters may be necessary for your particular analysis. The following parameters were evaluated for use with invertebrate whole proteome sets. 


Command
GOanna analysis using Docker
docker run \
--rm \
-v /location/of/agbase_database:/agbase_database \
-v /location/of/go_info:/go_info \
-v $(pwd):/work-dir \
agbase/goanna:2.4 \
-a invertebrates \
-c GCF_031307605.1_icTriCast1.1_protein.faa \
-o GCF_031307605.1 \
-g 70 \
-s 900 \
-r 1.2 \
-d RefSeq \
-u AgBase \
-x 7070 \
-k 9 \
-q 70
Command Explained:
docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v /location/of/agbase_database:/agbase_database: tells docker to mount the 'agbase_database' directory you downloaded to the host machine to the '/agbase_database' directory within the container. The syntax for this is: <absolute path on host>:<absolute path in container>
-v /location/of/go_info:/go_info: mounts 'go_info' directory on host machine into 'go_info' directory inside the container
-v $(pwd):/work-dir: mounts my current working directory on the host machine to '/work-dir' in the container
agbase/goanna:2.4: the name of the Docker image to use

Note
All the options supplied after the image name are GOanna options

-a invertebrates: GOanna BLAST database to use--first of three required options.
-c GCF_031307605.1_icTriCast1.1_protein.faa: input file--second of three required options
-o GCF_031307605.1: output file basename--last of three required options

-g 70: tells GOanna to keep only those matches with at least 70% identity
-s 900: tells GOanna to keep only those matches with a bitscore above 900
-r 1.2: ratio of query length to subject length
-d RefSeq: database of query ID. This will appear in column 1 of the GAF output file.
-u AgBase: name to appear in column 15 of the GAF output file; if this contains a space it should be enclosed in double quotes e.g. "assigned by".
-x 7070: NCBI taxon ID of the input species; this will appear in column 13 of the GAF output file
-k 9: tells GOanna to keep only those matches with at most 9 gap openings
-q 70: tells GOanna to keep only those matches with at least 70% query coverage per subject

Understanding your results
You should get 4 output files:
1. <basename>.asn: This is standard BLAST output format that allows for conversion to other formats. You probably won’t need to look at this output.

2. <basename>.html: This output displays in your web browser so that you can view pairwise alignments to determine BLAST parameters.

3. <basename>.tsv: This is the tab-delimited BLAST output that can be opened and sorted in Excel to determine BLAST parameter values. The file contains the following columns:
query ID
query length
query start
query end
subject ID
subject length
subject start
subject end
e-value
percent ID
query coverage
percent positive ID
gap openings
total gaps
bitscore
raw score

For more information on the BLAST output parameters see the NCBI BLAST documentation.

4. <basename>_goanna_gaf.tsv: This is the standard tab-separated gene association file format that is used by the GO Consortium and by software tools that accept GO annotation files to do GO enrichment.
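
As an alternative to sorting the <basename>.tsv file in Excel, candidate BLAST thresholds can be tried out on the command line. The sketch below prints only the rows that pass a set of cutoffs. It assumes the columns appear in the order listed above (percent ID in column 10, query coverage in column 11, gap openings in column 13, bitscore in column 15) and that any header row is non-numeric in column 10. The function name filter_blast_hits is hypothetical, and the cutoffs are inclusive here, which may differ from how GOanna treats boundary values.

```shell
# Hypothetical helper: print the rows of a GOanna BLAST .tsv that pass the cutoffs.
# $1 = TSV file          $2 = minimum percent ID   $3 = minimum query coverage
# $4 = minimum bitscore  $5 = maximum gap openings
filter_blast_hits() {
    awk -F'\t' -v pid="$2" -v qcov="$3" -v bits="$4" -v gaps="$5" '
    {
        # Skip a header row (or any row without a numeric percent ID).
        if ($10 !~ /^[0-9.]+$/) next
        if ($10 + 0 >= pid + 0 && $11 + 0 >= qcov + 0 &&
            $15 + 0 >= bits + 0 && $13 + 0 <= gaps + 0) print
    }' "$1"
}

# Example call, mirroring the -g 70 -q 70 -s 900 -k 9 settings used above:
# filter_blast_hits GCF_031307605.1.tsv 70 70 900 9 | wc -l
```

Comparing the number of rows retained under different cutoffs gives a rough idea of how strict each setting is before GOanna is re-run.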

InterProScan

InterPro is a database that integrates predictive information about protein function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.

Basic functions of this tool
InterProScan:
removes special characters from FASTA sequences
splits FASTA into groups of 1000 sequences
runs InterProScan with user-specified options on each of the 1000-sequence files in parallel
re-combines output files from all groups of 1000

InterProScan XML parser:
parses the XML output from InterProScan to generate a gene association file (GAF) and count files for domains, pathways and GO annotations
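
To illustrate the splitting step, the sketch below divides a FASTA file into chunks of a fixed number of sequences. This is not the wrapper's own code, which is not shown in this protocol; the function name split_fasta and the output file naming are hypothetical.

```shell
# Hypothetical illustration of splitting a FASTA file into fixed-size chunks.
# $1 = input FASTA   $2 = sequences per chunk   $3 = output file prefix
# Writes <prefix>.1.fasta, <prefix>.2.fasta, ...
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new output file every n header lines.
            if (count % n == 0) {
                if (out != "") close(out)
                chunk++
                out = prefix "." chunk ".fasta"
            }
            count++
        }
        out != "" { print > out }
    ' "$1"
}

# Example call with the chunk size named above:
# split_fasta GCF_031307605.1_icTriCast1.1_protein.faa 1000 chunk
```

You do not need to run this yourself, because the container performs the split and the re-combination automatically.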

Note
InterProScan is very memory intensive and it may not be feasible to run on your local computer. Adjust the number of cpus (-C), the number of analyses (-a) or disable residue level annotation (-e) to improve performance.

Get the InterProScan data

Command
Get InterProScan Data
#Pull the tar files
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.75-106.0/alt/interproscan-data-5.75-106.0.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.75-106.0/alt/interproscan-data-5.75-106.0.tar.gz.md5

#Run checksums
md5sum -c interproscan-data-5.75-106.0.tar.gz.md5

#Extract the tar file
tar -pxvzf interproscan-data-5.75-106.0.tar.gz

Get the InterProScan container

Command
Get InterProScan container using Docker
docker pull agbase/interproscan:5.75-106_2

Run InterProScan analysis with Docker

Command
InterProScan help using Docker
docker run agbase/interproscan:5.75-106_2 -h

InterProScan help and usage:
Options:
-a   Optional, comma separated list of analyses.  If this option is not set, ALL analyses will be run.
-b   Optional, base output filename (relative or absolute path). Note that this option, the output directory (-d) option and the output file name (-o) option are mutually exclusive.  The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.
-d   Optional, output directory. Note that this option, the output file name (-o) option and the output file base (-b) option are mutually exclusive. The output filename(s) are the same as the input filename, with the appropriate file extension(s) for the output format(s) appended automatically .
-c   Optional.  Disables use of the precalculated match lookup service.  All match calculations will be run locally.
-C   Optional. Supply the number of cpus to use.
-e   Optional, excludes sites from the XML, JSON output
-f   Optional, case-insensitive, comma separated list of output formats. Supported formats are TSV, XML, JSON and GFF. Default for protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3 and XML.
-g   Optional, switch on lookup of corresponding Gene Ontology annotation (IMPLIES -l lookup option)
-h   Optional, display help information
-i   Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
-l   Also include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.
-m   Optional, minimum nucleotide size of ORF to report. Will only be considered if n is specified as a sequence type. Please be aware of the fact that if you specify a too short value it might be that the analysis takes a very long time!
-p   Optional, switch on lookup of corresponding Pathway annotation (IMPLIES -l lookup option)
-t   Optional, the type of the input sequences (dna/rna (n) or protein (p)).  The default sequence type is protein.
-T   Optional, specify temporary file directory (relative or absolute path). The default location is temp/.
-v   Optional, display version number
-r   Optional. Mode required ( -r cluster) to run in cluster mode. These options are provided but have not been tested with this wrapper script. For more information on running InterProScan in cluster mode: https://github.com/ebi-pf-team/interproscan/wiki/ClusterMode
-R   Optional. Clusterrunid (crid) required when using cluster mode.-R unique_id

OPTIONS FOR XML PARSER OUTPUTS

-F   This is the output directory from InterProScan.
-D   Supply the database responsible for these annotations.
-x   NCBI taxon ID of the ID being annotated
-y   Transcript or protein
-n   Name of the biocurator who made these annotations
-M   Optional. Mapping file.
-B   Optional. Bad input sequence file.


Available InterProScan analyses:
CDD
COILS
Gene3D
HAMAP
MOBIDB
PANTHER
Pfam
PIRSF
PRINTS
PROSITE (Profiles and Patterns)
SFLD
SMART (unlicensed components only by default - this analysis has simplified post-processing that includes an E-value filter, however you should not expect it to give the same match output as the fully licensed version of SMART)
SUPERFAMILY
NCBIFAM (includes the previous TIGRFAM analysis)


Command
InterProScan analysis using Docker
docker run \
--rm \
-v /your/local/data/directory:/data \
-v /where/you/downloaded/data/interproscan-5.75-106.0/data:/opt/interproscan/data \
-w /data \
agbase/interproscan:5.75-106_2 \
-i /path/to/your/input/file/GCF_031307605.1_icTriCast1.1_protein.faa \
-b GCF_031307605.1 \
-f tsv,xml \
-g \
-c \
-n AgBase \
-x 7070 \
-D RefSeq \
-l
Command explained:
docker run: tells docker to run
--rm: removes container when analysis finishes (image will remain for future analyses)
-v /your/local/data/directory:/data: mount my working directory on the host machine into the /data directory in the container. The syntax for this is <absolute path on host machine>:<absolute path in container>
-v /where/you/downloaded/data/interproscan-5.75-106.0/data:/opt/interproscan/data: mounts the InterProScan partner data (downloaded from FTP) on the host machine into the /opt/interproscan/data directory in the container
-w /data: your working directory
agbase/interproscan:5.75-106_2: the name of the Docker image to use

Note
All the options supplied after the image name are InterProScan options


-i /path/to/your/input/file/GCF_031307605.1_icTriCast1.1_protein.faa: local path to input FASTA file. You can also use the mounted file path, e.g. /data/GCF_031307605.1_icTriCast1.1_protein.faa
-b GCF_031307605.1: output file basename
-f tsv,xml: desired output file formats; xml output is required as input for the XML parser that generates the GAF and count files.
-g: tells the tool to perform GO annotation
-c: tells the tool to run all computations locally rather than connecting to EBI. This adds only a little to the run time but avoids network timeout errors.
-n AgBase: name to include in column 15 of GAF output file; if this contains spaces it should be enclosed in double quotes e.g. "assigned by".
-x 7070: taxon ID of query species to be used in column 13 of GAF output file
-D RefSeq: database of query accession to be used in column 1 of GAF output file
-l: tells the tool to include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.

Understanding Your Results
InterProScan outputs: https://interproscan-docs.readthedocs.io/en/v5/OutputFormats.html

<basename>.gff3
<basename>.tsv
<basename>.xml
<basename>.json

Parser Outputs

<basename>_gaf.txt: This table follows the formatting of a gene association file (GAF) and can be used in GO enrichment analyses.
<basename>_acc_go_counts.txt: This table includes input accessions, the number of GO IDs assigned to each accession and GO ID names. GO IDs are split into BP (Biological Process), MF (Molecular Function) and CC (Cellular Component).
<basename>_go_counts.txt: This table counts the numbers of sequences assigned to each GO ID so that the user can quickly identify all genes assigned to a particular function.
<basename>_acc_interpro_counts.txt: This table includes input accessions, number of InterPro IDs for each accession, InterPro IDs assigned to each sequence and the InterPro ID name.
<basename>_interpro_counts.txt: This table counts the numbers of sequences assigned to each InterPro ID so that the user can quickly identify all genes with a particular motif.
<basename>_acc_pathway_counts.txt: This table includes input accessions, number of pathway IDs for the accession and the pathway names. Multiple values are separated by a semi-colon.
<basename>_pathway_counts.txt: This table counts the numbers of sequences assigned to each Pathway ID so that the user can quickly identify all genes assigned to a pathway.
<basename>.err: This file will list any sequences that were not able to be analyzed by InterProScan. Examples of sequences that will cause an error are sequences with a large run of Xs.

Pathannotator

Pathannotator annotates proteins with KEGG (Kyoto Encyclopedia of Genes and Genomes), Flybase and Reactome Drosophila melanogaster pathways. It does this through the use of KofamScan, KEGG API, OrthoFinder, Flybase and Reactome.
KofamScan is a gene functional annotation tool based on KEGG Orthology and hidden Markov model (HMM). It is provided by the KEGG project. The online version is available here.
This pipeline pulls annotations directly from the KEGG API when possible. When that isn't possible the pipeline uses KofamScan to identify homologous KEGG objects (KO). The pathways annotated to these KEGG objects are then transferred to the corresponding proteins in your species of interest.
If specified, the pipeline will also provide annotations to Flybase pathways and Reactome Drosophila melanogaster pathways. To do this the pipeline uses OrthoFinder to identify homologous Drosophila melanogaster proteins for your input proteins. Flybase metabolic pathway and signaling pathway annotations and Drosophila melanogaster Reactome pathways are then transferred to your input proteins from these homologs.

Get the KOfam databases. The KOfam 'profiles' and 'ko_list' databases are required to run the pipeline. If you don't already have these databases the pipeline will pull them during the first run. If you want to download them beforehand they are available from the KEGG website.

Command
Get the KOfam data
wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz
wget https://www.genome.jp/ftp/db/kofam/ko_list.gz

tar -xvzf profiles.tar.gz
gunzip ko_list.gz

Get the Pathannotator container. The Pathannotator tool is available as a Docker container on Docker Hub: Pathannotator container

Command
Get Pathannotator container using Docker
docker pull agbase/pathannotator:4.1

Run Pathannotator using Docker.

Command
Pathannotator help using Docker
docker run --rm agbase/pathannotator:4.1 -h

Help and Usage:
-h to see help and usage statement
-k KEGG species code (NA or related species code if species not in KEGG) KEGG species codes can be found here: https://www.genome.jp/brite/br08611
-i input file (protein FASTA without header lines)
-d (optional: default is .) output directory (the file path should be relative to, and inside of,  your working directory)
-f (optional: default is NA) FB for Flybase and DME Reactome annotations, NA for none
-c (optional: default is nproc -1) number of cpus to use
-o outbase (file basename to use for output files)


Command
Pathannotator analysis using Docker
docker run \
--rm \
-v /path/to/your/input/files:/workdir \
-v /path/to/kofam/databases/:/data \
agbase/pathannotator:4.1 \
-k tca \
-i GCF_031307605.1_icTriCast1.1_protein.faa \
-d pathannotator \
-f FB \
-o GCF_031307605.1
Command explained:
docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v /path/to/your/input/files:/workdir: mounts the working directory on the host machine to '/workdir' inside the container
-v /path/to/kofam/databases/:/data: mounts the directory with the Kofam database files (or where you want them to be stored) on the host machine to '/data' inside the container
agbase/pathannotator:4.1: the name of the Docker image to use


Note
All the options supplied after the image name are Pathannotator options

-k tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code choose a closely related species or 'NA'.
-i GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines).
-d pathannotator: Directory where you want the pipeline outputs to go. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.
-f FB: FB indicates that we want to get Flybase and Drosophila melanogaster Reactome pathways annotations in addition to KEGG annotations.
-o GCF_031307605.1: this will be the prefix used to name all the output files
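
Because the -d directory must exist before the pipeline runs, it can be created on the host first. The sketch below is a minimal example; the function name prepare_outdir is hypothetical, and it assumes the first argument is the host directory that you mount to /workdir.

```shell
# Hypothetical pre-flight helper: create the Pathannotator output directory.
# $1 = host working directory (the one mounted to /workdir)
# $2 = the value you will pass to -d (relative to the working directory)
prepare_outdir() {
    mkdir -p "$1/$2" && printf '%s\n' "$1/$2"
}

# Example call, matching "-d pathannotator" in the command above:
# prepare_outdir /path/to/your/input/files pathannotator
```

The function prints the path it created, which makes it easy to confirm that the directory sits inside the mounted working directory.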

Understanding Your Results
The output files you can expect will differ depending on the circumstances of your run. If you are using a KEGG species code you will get both KEGG reference and KEGG species pathways. Without a KEGG code (NA) you will only get KEGG reference pathway annotations. Under all circumstances you may specify whether or not you want to receive Flybase and Reactome Drosophila melanogaster pathways annotations as well. Whatever your options, the pathways will all be output into a single GMT formatted file.

 
Expected output files:
There will be 2-5 output files depending on the options specified. All analyses will generate a KEGG reference output and GMT file:

<basename>_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

Input_protein_ID    KEGG_KO    KEGG_ref_pathway    KEGG_ref_pathway_name
XP_015835225.1      K26207     map04024            cAMP signaling pathway
XP_015835225.1      K26207     map04261            Adrenergic signaling in
XP_001813251.1      K01540     map04022            cGMPPKG signaling pathway

<basename>_all_pathways.gmt: This file contains all of the pathways annotations generated by the analysis (KEGG ref, KEGG species, Flybase and Reactome) in GMT format. This is a tab-delimited file with N columns: Pathway ID, Description and Proteins annotated to that pathway (columns 3-N). This file has no header.

FBgg0000882     FlyBase_pathway           XP_018223192.1     XP_018223020.1    XP_018220038.1
R-DME-109606    Reactome_pathway          XP_0182222056.1    XP_018221350.1
tca00130        KEGG_species_pathway      XP_018219344.1
map00254        KEGG_reference_pathway    XP_018220255.1     XP_018229887.1
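
Because each GMT line holds a variable number of protein columns, it can be easier to work with one pathway-protein pair per line. The sketch below performs that conversion, assuming the tab-delimited layout described above (pathway ID in column 1, description in column 2, proteins from column 3 onward). The function name gmt_to_pairs is hypothetical.

```shell
# Hypothetical helper: convert a GMT file to two-column "pathway<TAB>protein" pairs.
# $1 = GMT file (tab-delimited, no header)
gmt_to_pairs() {
    awk -F'\t' '{
        # Columns 3..NF hold the proteins annotated to the pathway in column 1.
        for (i = 3; i <= NF; i++) if ($i != "") print $1 "\t" $i
    }' "$1"
}

# Example call: list every pathway annotated to one protein.
# gmt_to_pairs GCF_031307605.1_all_pathways.gmt | grep 'XP_018220255.1'
```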

If a KEGG species code was specified (rather than 'NA') there will also be a KEGG species output:

<basename>_KEGG_species.tsv: These are annotations to the species-specific KEGG pathways. The pathway identifiers will begin with the KEGG species code.

Input_protein_ID    KEGG_KO    KEGG_tca_pathway    KEGG_tca_pathway_name
XP_001813251.1      K01540     tca04820            Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle)
XP_001812480.1      K02268     tca00190            Oxidative phosphorylation - Tribolium castaneum (red flour beetle)
XP_008195997.1      K04676     tca04350            TGF-beta signaling pathway - Tribolium castaneum (red flour beetle)

If 'FB' was specified there will also be Flybase and Reactome outputs:

<basename>_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

Input_protein_ID    Flybase_protein_ID    Flybase_pathway_ID    Flybase_pathway_name
NP_001034540.1      FBpp0077451           FBgg0001085           BMP Signaling Pathway Core Components
NP_001034503.2      FBpp0084690           FBgg0000904           Insulin-like Receptor Signaling Pathway Core Components
NP_001034492.1      FBpp0078442           FBgg0002045           CHITIN BIOSYNTHESIS

<basename>_reactome.tsv: If you used the 'FB' option you will get this output containing Reactome Drosophila melanogaster pathways annotations.

Input_protein_ID    Uniprot_ID    Reactome_pathway_ID    Reactome_pathway_name
XP_018221664.1      Q9W5E1        R-DME-1234174          Cellular response to hypoxia
XP_018222787.1      Q9W4N8        R-DME-194315           Signaling by Rho GTPases
XP_018222841.1      Q7KVX1        R-DME-162582           Signal transduction

Combine GAFs

This tool can be used to combine the gene association file (GAF) outputs from GOanna and InterProScan.
The tool accepts two input files:
GOanna GAF output
InterProScan GAF output

Get combine_gafs container


Command
Get Combine_GAFS container using Docker
docker pull agbase/combine_gafs:1.1

Combine GAF outputs from GOanna and InterProScan

Combine GAFs has three parameters:

-i InterProScan XML Parser GAF output
-g GOanna GAF output
-o output file basename




Command
Combine_GAFs using Docker
docker run \
--rm \
-v $(pwd):/work-dir \
agbase/combine_gafs:1.1 \
-i /path/to/GCF_031307605.1_gaf.txt \
-g /path/to/GCF_031307605.1_goanna_gaf.tsv \
-o GCF_031307605.1_complete_gaf


Command explained:
docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v $(pwd):/work-dir: mounts my current working directory on the host machine to '/work-dir' in the container
agbase/combine_gafs:1.1: the name of the Docker image to use

Note
All the options supplied after the image name are Combine_GAFs options

-i /path/to/<basename>_gaf.txt: InterProScan GAF output file.
-g /path/to/<basename>_goanna_gaf.tsv: GOanna GAF output file.
-o <basename>_complete_gaf: output file basename--a .tsv extension will be added

Understanding your results

Combine GAFs will create a single GAF output file:

<basename>.tsv: This is a GAF format file that contains all the GO annotations from both GOanna and InterProScan. The original input files will remain as they were.
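
A quick summary of a GAF file can help confirm that the combined file contains what you expect. The sketch below counts annotations per GO aspect and the number of distinct annotated gene products. It assumes the standard GO Consortium GAF layout (DB Object ID in column 2, aspect in column 9, header lines starting with "!"); the function names are hypothetical, and a header row without a leading "!" would be counted as data.

```shell
# Hypothetical helpers for summarizing a GAF file.

# Count annotations per aspect (P = biological process,
# F = molecular function, C = cellular component).
count_gaf_aspects() {
    awk -F'\t' '
        /^!/ { next }                 # skip GAF header/comment lines
        NF >= 9 { n[$9]++ }           # column 9 holds the aspect
        END { for (a in n) print a "\t" n[a] }
    ' "$1" | sort
}

# Count the distinct gene products (column 2) that have any annotation.
count_annotated_ids() {
    awk -F'\t' '
        /^!/ { next }
        NF >= 2 && !seen[$2]++ { n++ }
        END { print n + 0 }
    ' "$1"
}

# Example calls on the combined output from the command above:
# count_gaf_aspects GCF_031307605.1_complete_gaf.tsv
# count_annotated_ids GCF_031307605.1_complete_gaf.tsv
```

Running the same functions on the GOanna and InterProScan GAF files shows how much each tool contributes to the combined file.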

References

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009). BLAST+: architecture and applications. BMC bioinformatics. https://doi.org/10.1186/1471-2105-10-421

Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, Qureshi M, Raj S, Richardson L, Salazar GA, Williams L, Bork P, Bridge A, Gough J, Haft DH, Letunic I, Marchler-Bauer A, Mi H, Natale DA, Necci M, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A, Finn RD (2021). The InterPro protein families and domains database: 20 years on. Nucleic acids research. https://doi.org/10.1093/nar/gkaa977

Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics (Oxford, England). https://doi.org/10.1093/bioinformatics/btu031

Kanehisa M, Furumichi M, Sato Y, Matsuura Y, Ishiguro-Watanabe M (2025). KEGG: biological systems database as a model of the real world. Nucleic acids research. https://doi.org/10.1093/nar/gkae909

Öztürk-Çolak A, Marygold SJ, Antonazzo G, Attrill H, Goutte-Gattat D, Jenkins VK, Matthews BB, Millburn G, Dos Santos G, Tabone CJ, FlyBase Consortium (2024). FlyBase: updates to the Drosophila genes and genomes database. Genetics. https://doi.org/10.1093/genetics/iyad211

Milacic M, Beavers D, Conley P, Gong C, Gillespie M, Griss J, Haw R, Jassal B, Matthews L, May B, Petryszak R, Ragueneau E, Rothfels K, Sevilla C, Shamovsky V, Stephan R, Tiwari K, Varusai T, Weiser J, Wright A, Wu G, Stein L, Hermjakob H, D'Eustachio P (2024). The Reactome Pathway Knowledgebase 2024. Nucleic acids research. https://doi.org/10.1093/nar/gkad1025

Dirk Merkel (2014). Docker: lightweight Linux containers for consistent development and deployment. Linux Journal. https://dl.acm.org/doi/10.5555/2600239.2600241

Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PloS one. https://doi.org/10.1371/journal.pone.0177459

Saha S, Cooksey AM, Childers AK, Poelchau MF, McCarthy FM (2021). Workflows for Rapid Functional Annotation of Diverse Arthropod Genomes. Insects. https://doi.org/10.3390/insects12080748

Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M (2021). KEGG: integrating viruses and cellular organisms. Nucleic acids research. https://doi.org/10.1093/nar/gkaa970

Emms DM, Kelly S (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology. https://doi.org/10.1186/s13059-019-1832-y

Aramaki T, Blanc-Mathieu R, Endo H, Ohkubo K, Kanehisa M, Goto S, Ogata H (2020). KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics (Oxford, England). https://doi.org/10.1093/bioinformatics/btz859

Input_protein_ID	KEGG_KO	KEGG_ref_pathway	KEGG_ref_pathway_name
XP_015835225.1	K26207	map04024	cAMP signaling pathway
XP_015835225.1	K26207	map04261	Adrenergic signaling in
XP_001813251.1	K01540	map04022	cGMP-PKG signaling pathway

FBgg0000882	FlyBase_pathway	XP_018223192.1	XP_018223020.1	XP_018220038.1
R-DME-109606	Reactome_pathway	XP_0182222056.1	XP_018221350.1
tca00130	KEGG_species_pathway	XP_018219344.1
map00254	KEGG_reference_pathway	XP_018220255.1	XP_018229887.1

Input_protein_ID	KEGG_KO	KEGG_tca_pathway	KEGG_tca_pathway_name
XP_001813251.1	K01540	tca04820	Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle)
XP_001812480.1	K02268	tca00190	Oxidative phosphorylation - Tribolium castaneum (red flour beetle)
XP_008195997.1	K04676	tca04350	TGF-beta signaling pathway - Tribolium castaneum (red flour beetle)

Input_protein_ID	Flybase_protein_ID	Flybase_pathway_ID	Flybase_pathway_name
NP_001034540.1	FBpp0077451	FBgg0001085	BMP Signaling Pathway Core Components
NP_001034503.2	FBpp0084690	FBgg0000904	Insulin-like Receptor Signaling Pathway Core Components
NP_001034492.1	FBpp0078442	FBgg0002045	CHITIN BIOSYNTHESIS

Input_protein_ID	Uniprot_ID	Reactome_pathway_ID	Reactome_pathway_name
XP_018221664.1	Q9W5E1	R-DME-1234174	Cellular response to hypoxia
XP_018222787.1	Q9W4N8	R-DME-194315	Signaling by Rho GTPases
XP_018222841.1	Q7KVX1	R-DME-162582	Signal transduction
