Jun 10, 2026

AgBase Functional Annotation Workflow V.1

  • Amanda Cooksey1,
  • Anna Childers2,
  • Monica Poelchau3,
  • Surya Saha4,
  • Fiona McCarthy1
  • 1School of Animal and Comparative Biomedical Sciences, University of Arizona;
  • 2Bee Research Laboratory, Beltsville Agricultural Research Center, Agricultural Research Service, USDA;
  • 3National Agricultural Library, Agricultural Research Service, USDA;
  • 4Boyce Thompson Institute / Velsera
  • USDA National Agricultural Library
Protocol Citation: Amanda Cooksey, Anna Childers, Monica Poelchau, Surya Saha, Fiona McCarthy 2026. AgBase Functional Annotation Workflow. protocols.io https://dx.doi.org/10.17504/protocols.io.dm6gp7n6jgzp/v1
Version created by Amanda M Cooksey
Manuscript citation:
Saha, S.; Cooksey, A.M.; Childers, A.K.; Poelchau, M.F.; McCarthy, F.M. Workflows for Rapid Functional Annotation of Diverse Arthropod Genomes. Insects 2021, 12, 748. https://doi.org/10.3390/insects12080748
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: May 08, 2026
Last Modified: June 10, 2026
Protocol Integer ID: 316612
Keywords: functional annotation, gene ontology, pathways, KEGG, Reactome, Flybase, InterProScan, GO, agbase functional annotation workflow, pathway annotation, annotations from blast match, functional annotation workflow, single gaf of go annotation, functional annotation workflow this protocol, annotation tool, transfers gene ontology, go annotation, agbase, blast search, predictive information about protein, drosophila melanogaster reactome, query gene product, annotation, peptide fasta file, combine gaf, other tools in the workflow, protein, gaf format, flybase, interpro database, proteins with kegg, pathannotator, blast match, interproscan, single gaf, other tool
Funders Acknowledgements:
National Science Foundation
Grant ID: EPS-0903787
US Department of Agriculture Cooperative State Research, Education and Extension Service National Research Initiative
Grant ID: MISV-329140
National Institute of General Medical Sciences of the National Institutes of Health
Grant ID: 07111084
US Department of Agriculture Agricultural Research Service
Grant ID: Cooperative Agreement 6402-21000-033-01S
US Department of Agriculture National Institute of Food and Agriculture
Grant ID: MIS-069270
US Department of Agriculture National Institute of Food and Agriculture
Grant ID: MIS-241080
US Department of Agriculture National Institute of Food and Agriculture
Grant ID: 2011-67015-30332
US Department of Agriculture Agricultural Research Service
Grant ID: Non-Assistance Cooperative Agreement 58-8260-9-002
US Department of Agriculture Agricultural Research Service
Grant ID: Non-Assistance Cooperative Agreement 58-8260-4-003
Mississippi Agricultural and Forestry Experiment Station
Grant ID: NA
Disclaimer
This work was supported in part by the U.S. Department of Agriculture, Agricultural Research Service. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA. USDA is an equal opportunity provider and employer.
Abstract
This protocol is an update to the protocol described in Saha et al. (2021).

The AgBase functional annotation workflow employs three annotation tools:
  1. GOanna: performs a BLAST search and transfers gene ontology (GO) annotations from BLAST matches to the query gene products.
  2. InterProScan: employs the InterPro database, which integrates predictive information about proteins' function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains. InterProScan can also provide GO and pathway annotations.
  3. Pathannotator: annotates proteins with KEGG, FlyBase and Drosophila melanogaster Reactome pathways.

Each of these tools accepts a peptide FASTA file and can be run independently of the other tools in the workflow. As both GOanna and InterProScan provide GO annotations, their outputs are provided in GAF format. The Combine GAFs tool can then be used to make a single GAF of GO annotations, if desired.






Getting started
The Tribolium castaneum RefSeq (GCF_031307605.1-RS_2024_04) proteins are used as an example throughout this protocol.

Command
Get the Tribolium castaneum data
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/031/307/605/GCF_031307605.1_icTriCast1.1/GCF_031307605.1_icTriCast1.1_protein.faa.gz

gunzip GCF_031307605.1_icTriCast1.1_protein.faa.gz
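
Before running any of the tools it can be worth confirming that the download is complete and that the sequence identifiers are unique. The sketch below is a minimal check and is not part of the AgBase workflow: the function names count_fasta_records and fasta_duplicate_ids are made up for this example, and it assumes a standard FASTA file in which every record starts with a '>' header line.

```shell
# Count the sequence records in a FASTA file (one '>' header line per record).
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print any sequence ID (first word of the header) that occurs more than once.
fasta_duplicate_ids() {
    grep '^>' "$1" | awk '{ sub(/^>/, "", $1); print $1 }' | sort | uniq -d
}
```

For example, `count_fasta_records GCF_031307605.1_icTriCast1.1_protein.faa` prints the number of proteins that will be annotated, and `fasta_duplicate_ids` should print nothing for a clean file.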

The tools in this workflow are provided as containers which can be run using various container technologies. This protocol details the use of Docker and Apptainer. Which technology you need will depend on the compute system you intend to use. If you are running the analyses on a local computer or virtual machine you may want to install Docker. If you are running your analyses on a shared remote computing system (e.g. institutional HPC) you will need to use Apptainer.


Run with Docker


  • Docker must be installed on the computer you wish to use for your analysis.
  • To run Docker you must have ‘root’ permissions.
  • Docker will run all containers as ‘root’. This makes Docker incompatible with HPC systems.
  • Docker can be run on your local computer, a server or a cloud virtual machine.

GOanna
  • GOanna performs a BLAST search, allows you to filter based on BLAST match parameters and transfers Gene Ontology (GO) functional annotations from the BLAST matches to your input proteins.
  • GOanna accepts a protein FASTA file as input.
  • AgBase creates the BLAST databases from proteins that have GO annotations available and subsets them by phyla. We recommend selecting the database most closely related to the species of your input sequences.
  • We strongly recommend selecting only GO annotations based on experimental evidence codes. This will ensure the best quality annotations for your data.
  • The remaining parameters are standard BLAST parameters. More information on determining the best BLAST parameters for your specific data set can be found in the section below.
Get the GOanna databases. There are two required directories to download: agbase_database and go_info. These can be obtained using GoCommands, which a user can install as a small binary file (even on an HPC system).

Command
Install GoCommands binary
GOCMD_VER=$(curl -L -s https://raw.githubusercontent.com/cyverse/gocommands/main/VERSION.txt)
curl -L https://github.com/cyverse/gocommands/releases/download/${GOCMD_VER}/gocmd-${GOCMD_VER}-linux-amd64.tar.gz | tar zxvf -

chmod +x gocmd

Command
Initiate GoCommands
./gocmd init
Fill in this data when prompted:
iRODS Host [data.cyverse.org]: data.cyverse.org
iRODS Port [1247]: 1247
iRODS Zone [iplant]: iplant
iRODS Username: anonymous

Command
Pull data with GoCommands
./gocmd get -r --progress /iplant/home/shared/iplantcollaborative/protein_blast_dbs/go_info .
./gocmd get -r --progress /iplant/home/shared/iplantcollaborative/protein_blast_dbs/agbase_database .
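
Because the two downloads are large, it may help to confirm that both directories arrived before starting a long BLAST run. The following is only a sketch: check_goanna_dirs is a hypothetical helper name, and it checks only that the agbase_database and go_info directories exist and are not empty, not that their contents are complete.

```shell
# Verify that the agbase_database and go_info directories exist and are non-empty.
# Usage: check_goanna_dirs [directory containing both downloads, default: .]
check_goanna_dirs() {
    base=${1:-.}
    status=0
    for d in agbase_database go_info; do
        if [ -d "$base/$d" ] && [ -n "$(ls -A "$base/$d")" ]; then
            echo "ok: $base/$d"
        else
            echo "missing or empty: $base/$d" >&2
            status=1
        fi
    done
    return $status
}
```

The function returns a non-zero status if either directory is missing or empty, so it can be used as a guard at the top of a job script.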

Get the GOanna container

Command
Get GOanna container using Docker
docker pull agbase/goanna:2.4

Run GOanna analysis with Docker

Command
GOanna help using Docker
docker run --rm agbase/goanna:2.4 -h

GOanna help and usage:

Options:
-a BLAST database basename (arthropod, bacteria, bird, crustacean, fish, fungi, human, insecta, invertebrates, mammals, nematode, plants, rodents, uniprot_sprot, uniprot_trembl, vertebrates or viruses)
-c peptide fasta filename
-o output file basename
[-b transfer GO with experimental evidence only (yes or no). Default = yes.]
[-d database of query ID. If your entry contains spaces either substitute an underscore (_) or, to preserve the space, use double quotes around your entry. Default: user_input_db]
[-e Expect value (E) for saving hits. Default is 10.]
[-f Number of aligned sequences to keep. Default: 3]
[-g BLAST percent identity above which match should be kept. Default: keep all matches.]
[-h help]
[-m BLAST percent positive identity above which match should be kept. Default: keep all matches.]
[-s bitscore above which match should be kept. Default: keep all matches.]
[-k Maximum number of gap openings allowed for match to be kept. Default: 100]
[-l Maximum number of total gaps allowed for match to be kept. Default: 1000]
[-q Minimum query coverage per subject for match to be kept. Default: keep all matches]
[-t Number of threads. Default: 8]
[-u Assigned by field of your GAF output file. If your entry contains spaces (eg. firstname lastname) either substitute an underscore (_) or, to preserve the space, use quotes around your entry (eg. "firstname lastname") Default: user]
[-x Taxon ID of the query species. Default: taxon:0000]
[-p parse_deflines. Parse query and subject bar delimited sequence identifiers]

GOanna has three required parameters:
-a BLAST database basename (acceptable options are listed in the help/usage)
-c peptide FASTA file to BLAST
-o output file basename

Other parameters may be necessary for your particular analysis. The following parameters were evaluated for use with invertebrate whole proteome sets.


Command
GOanna analysis using Docker
docker run \
--rm \
-v /location/of/agbase_database:/agbase_database \
-v /location/of/go_info:/go_info \
-v $(pwd):/work-dir \
agbase/goanna:2.4 \
-a invertebrates \
-c GCF_031307605.1_icTriCast1.1_protein.faa \
-o GCF_031307605.1 \
-g 70 \
-s 900 \
-r 1.2 \
-d RefSeq \
-u AgBase \
-x 7070 \
-k 9 \
-q 70
Command Explained:
docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v /location/of/agbase_database:/agbase_database: tells docker to mount the 'agbase_database' directory you downloaded to the host machine to the '/agbase_database' directory within the container. The syntax for this is: <absolute path on host>:<absolute path in container>
-v /location/of/go_info:/go_info: mounts 'go_info' directory on host machine into 'go_info' directory inside the container
-v $(pwd):/work-dir: mounts my current working directory on the host machine to '/work-dir' in the container
agbase/goanna:2.4: the name of the Docker image to use

Note
All the options supplied after the image name are GOanna options

-a invertebrates: GOanna BLAST database to use--first of three required options.
-c GCF_031307605.1_icTriCast1.1_protein.faa: input file--second of three required options
-o GCF_031307605.1: output file basename--last of three required options

-g 70: tells GOanna to keep only those matches with at least 70% identity
-s 900: tells GOanna to keep only those matches with a bitscore above 900
-r 1.2: ratio of query length to subject length
-d RefSeq: database of query ID. This will appear in column 1 of the GAF output file.
-u AgBase: name to appear in column 15 of the GAF output file; if this contains a space it should be enclosed in double quotes e.g. "assigned by".
-x 7070: NCBI taxon ID of the input species; this will appear in column 13 of the GAF output file
-k 9: tells GOanna to keep only those matches with a maximum of 9 gap openings
-q 70: tells GOanna to keep only those matches with a query coverage per subject of at least 70%
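
Since the -v options need absolute host paths, the three mounts can be wrapped in a small shell function that resolves the paths for you. This is a sketch only: run_goanna and the DOCKER override variable are hypothetical names introduced here, the image tag and mount points are copied from the command above, and everything after the two directory arguments is passed to GOanna unchanged.

```shell
# Run GOanna with the database directories mounted by absolute path.
# Usage: run_goanna <agbase_database dir> <go_info dir> <GOanna options...>
# Set DOCKER=echo to print the docker arguments instead of running the container.
run_goanna() {
    db_dir=$(cd "$1" && pwd) || return 1   # resolve to an absolute host path
    go_dir=$(cd "$2" && pwd) || return 1
    shift 2
    "${DOCKER:-docker}" run --rm \
        -v "$db_dir":/agbase_database \
        -v "$go_dir":/go_info \
        -v "$(pwd)":/work-dir \
        agbase/goanna:2.4 "$@"
}
```

Run from the directory that holds the input FASTA, a call such as `run_goanna ./agbase_database ./go_info -a invertebrates -c GCF_031307605.1_icTriCast1.1_protein.faa -o GCF_031307605.1` would then be equivalent to the first lines of the command above.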
Understanding your results
You should get 4 output files:
1. <basename>.asn: This is standard BLAST output format that allows for conversion to other formats. You probably won’t need to look at this output.

2. <basename>.html: This output displays in your web browser so that you can view pairwise alignments to determine BLAST parameters.

3. <basename>.tsv: This is the tab-delimited BLAST output that can be opened and sorted in Excel to determine BLAST parameter values. The file contains the following columns:
  • query ID
  • query length
  • query start
  • query end
  • subject ID
  • subject length
  • subject start
  • subject end
  • e-value
  • percent ID
  • query coverage
  • percent positive ID
  • gap openings
  • total gaps
  • bitscore
  • raw score

For more information on the BLAST output parameters see the NCBI BLAST documentation.
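
As an alternative to sorting in Excel, the effect of candidate cut-offs can be explored on the command line. The sketch below assumes the column order listed above (column 10 = percent ID, column 11 = query coverage, column 15 = bitscore) and that query coverage is reported as a percentage. filter_blast_tsv is a hypothetical helper, and its "greater than or equal" comparisons may not match GOanna's own filtering exactly.

```shell
# Print the rows of a GOanna <basename>.tsv that pass all three cut-offs.
# Usage: filter_blast_tsv <file> <min percent ID> <min query coverage> <min bitscore>
# Rows whose column 10 does not start with a digit (e.g. a header) are skipped.
filter_blast_tsv() {
    awk -F'\t' -v pid="${2:-0}" -v qcov="${3:-0}" -v bits="${4:-0}" '
        $10 ~ /^[0-9]/ && $10 + 0 >= pid + 0 && $11 + 0 >= qcov + 0 && $15 + 0 >= bits + 0
    ' "$1"
}
```

For example, `filter_blast_tsv GCF_031307605.1.tsv 70 70 900 | cut -f1 | sort -u | wc -l` would report how many query proteins keep at least one match under cut-offs similar to those used in the example command.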

4. <basename>_goanna_gaf.tsv: This is the standard tab-separated gene association file format that is used by the GO Consortium and by software tools that accept GO annotation files to do GO enrichment.
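
A quick summary of the GAF can show which evidence codes and GO aspects were transferred. This sketch relies on the GAF column layout (column 7 = evidence code, column 9 = aspect: P, F or C) and on comment lines starting with '!'. gaf_column_counts is a hypothetical helper name and is not provided by GOanna.

```shell
# Count annotations per distinct value of one GAF column, skipping '!' comment lines.
# Usage: gaf_column_counts <gaf file> <column number>
gaf_column_counts() {
    awk -F'\t' -v col="$2" '
        !/^!/ && NF >= 15 { count[$col]++ }
        END { for (v in count) print v "\t" count[v] }
    ' "$1" | sort
}
```

For example, `gaf_column_counts GCF_031307605.1_goanna_gaf.tsv 7` lists the evidence codes present, and the same call with column 9 gives the number of annotations per GO aspect.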
InterProScan
InterPro is a database which integrates predictive information about proteins' function from a number of partner resources, giving an overview of the families that a protein belongs to and the domains and sites it contains.

Basic functions of this tool
InterProScan:
  • removes special characters from FASTA sequences
  • splits FASTA into groups of 1000 sequences
  • runs InterProScan with user-specified options on each of the 1000-sequence files in parallel
  • re-combines output files from all groups of 1000
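
The splitting step can be pictured with a few lines of awk. This is an illustration of the idea only, not the code used inside the container: split_fasta and the output naming scheme (<prefix>_0001.fasta, <prefix>_0002.fasta, ...) are assumptions made for this example.

```shell
# Split a FASTA file into chunks of at most <size> records each (default 1000).
# Usage: split_fasta <input fasta> <output prefix> [size]
split_fasta() {
    awk -v size="${3:-1000}" -v prefix="$2" '
        /^>/ {
            # Start a new chunk file every <size> records.
            if (n % size == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%04d.fasta", prefix, ++chunk)
            }
            n++
        }
        out != "" { print > (out) }
    ' "$1"
}
```

Each chunk can then be processed independently and the per-chunk outputs concatenated, which mirrors the split, run in parallel and re-combine steps listed above.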

InterProScan XML parser:
  • parses the XML output from InterProScan to generate a gene association file (GAF) and count files for domains, pathways and GO annotations

Note
InterProScan is very memory intensive and it may not be feasible to run on your local computer. Adjust the number of cpus (-C), the number of analyses (-a) or disable residue level annotation (-e) to improve performance.

Get the InterProScan data

Command
Get InterProScan Data
#Pull the tar files
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.75-106.0/alt/interproscan-data-5.75-106.0.tar.gz
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.75-106.0/alt/interproscan-data-5.75-106.0.tar.gz.md5

#Run checksums
md5sum -c interproscan-data-5.75-106.0.tar.gz.md5

#Extract the tar file
tar -pxvzf interproscan-data-5.75-106.0.tar.gz

Get the InterProScan container

Command
Get InterProScan container using Docker
docker pull agbase/interproscan:5.75-106_2

Run InterProScan analysis with Docker

Command
InterProScan help using Docker
docker run --rm agbase/interproscan:5.75-106_2 -h

InterProScan help and usage:
Options:
-a Optional, comma separated list of analyses. If this option is not set, ALL analyses will be run.
-b Optional, base output filename (relative or absolute path). Note that this option, the output directory (-d) option and the output file name (-o) option are mutually exclusive. The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.
-d Optional, output directory. Note that this option, the output file name (-o) option and the output file base (-b) option are mutually exclusive. The output filename(s) are the same as the input filename, with the appropriate file extension(s) for the output format(s) appended automatically .
-c Optional. Disables use of the precalculated match lookup service. All match calculations will be run locally.
-C Optional. Supply the number of cpus to use.
-e Optional, excludes sites from the XML, JSON output
-f Optional, case-insensitive, comma separated list of output formats. Supported formats are TSV, XML, JSON and GFF. Default for protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3 and XML.
-g Optional, switch on lookup of corresponding Gene Ontology annotation (IMPLIES -l lookup option)
-h Optional, display help information
-i Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
-l Also include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.
-m Optional, minimum nucleotide size of ORF to report. Will only be considered if n is specified as a sequence type. Please be aware of the fact that if you specify a too short value it might be that the analysis takes a very long time!
-p Optional, switch on lookup of corresponding Pathway annotation (IMPLIES -l lookup option)
-t Optional, the type of the input sequences (dna/rna (n) or protein (p)). The default sequence type is protein.
-T Optional, specify temporary file directory (relative or absolute path). The default location is temp/.
-v Optional, display version number
-r Optional. Mode required ( -r cluster) to run in cluster mode. These options are provided but have not been tested with this wrapper script. For more information on running InterProScan in cluster mode: https://github.com/ebi-pf-team/interproscan/wiki/ClusterMode
-R Optional. Clusterrunid (crid) required when using cluster mode. -R unique_id

OPTIONS FOR XML PARSER OUTPUTS

-F This is the output directory from InterProScan.
-D Supply the database responsible for these annotations.
-x NCBI taxon ID of the ID being annotated
-y Transcript or protein
-n Name of the biocurator who made these annotations
-M Optional. Mapping file.
-B Optional. Bad input sequence file.


Available InterProScan analyses:
  • CDD
  • COILS
  • Gene3D
  • HAMAP
  • MOBIDB
  • PANTHER
  • Pfam
  • PIRSF
  • PRINTS
  • PROSITE (Profiles and Patterns)
  • SFLD
  • SMART (unlicensed components only by default - this analysis has simplified post-processing that includes an E-value filter, however you should not expect it to give the same match output as the fully licensed version of SMART)
  • SUPERFAMILY
  • NCBIFAM (includes the previous TIGRFAM analysis)


Command
InterProScan analysis using Docker
docker run \
--rm \
-v /your/local/data/directory:/data \
-v /where/you/downloaded/data/interproscan-5.75-106.0/data:/opt/interproscan/data \
-w /data \
agbase/interproscan:5.75-106_2 \
-i /path/to/your/input/file/GCF_031307605.1_icTriCast1.1_protein.faa \
-b GCF_031307605.1 \
-f tsv,xml \
-g \
-c \
-n AgBase \
-x 7070 \
-D RefSeq \
-l
Command explained:
docker run: tells docker to run
--rm: removes container when analysis finishes (image will remain for future analyses)
-v /your/local/data/directory:/data: mount my working directory on the host machine into the /data directory in the container. The syntax for this is <absolute path on host machine>:<absolute path in container>
-v /where/you/downloaded/data/interproscan-5.75-106.0/data:/opt/interproscan/data: mounts the InterProScan partner data (downloaded from FTP) on the host machine into the /opt/interproscan/data directory in the container
-w /data: your working directory
agbase/interproscan:5.75-106_2: the name of the Docker image to use

Note
All the options supplied after the image name are InterProScan options

-i /path/to/your/input/file/GCF_031307605.1_icTriCast1.1_protein.faa: local path to input FASTA file. You can also use the mounted file path: /data/GCF_031307605.1_icTriCast1.1_protein.faa
-b GCF_031307605.1: output file basename
-f tsv,xml: desired output file formats; xml output is required as input for the XML parser that generates the GAF and count files.
-g: tells the tool to perform GO annotation
-c: tells the tool to run all match calculations locally rather than connecting to the EBI lookup service. This adds only a little to the run time and avoids error messages caused by network timeouts.
-n AgBase: name to include in column 15 of the GAF output file; if this contains spaces it should be enclosed in double quotes e.g. "assigned by".
-x 7070: taxon ID of query species to be used in column 13 of GAF output file
-D RefSeq: database of query accession to be used in column 1 of GAF output file
-l: tells the tool to include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.
Understanding Your Results
<basename>.gff3
<basename>.tsv
<basename>.xml
<basename>.json

Parser Outputs
<basename>_gaf.txt: This table follows the formatting of a gene association file (GAF) and can be used in GO enrichment analyses.
<basename>_acc_go_counts.txt: This table includes input accessions, the number of GO IDs assigned to each accession and GO ID names. GO IDs are split into BP (Biological Process), MF (Molecular Function) and CC (Cellular Component).
<basename>_go_counts.txt: This table counts the numbers of sequences assigned to each GO ID so that the user can quickly identify all genes assigned to a particular function.
<basename>_acc_interpro_counts.txt: This table includes input accessions, number of InterPro IDs for each accession, InterPro IDs assigned to each sequence and the InterPro ID name.
<basename>_interpro_counts.txt: This table counts the numbers of sequences assigned to each InterPro ID so that the user can quickly identify all genes with a particular motif.
<basename>_acc_pathway_counts.txt: This table includes input accessions, number of pathway IDs for the accession and the pathway names. Multiple values are separated by a semi-colon.
<basename>_pathway_counts.txt: This table counts the numbers of sequences assigned to each Pathway ID so that the user can quickly identify all genes assigned to a pathway.
<basename>.err: This file lists any sequences that could not be analyzed by InterProScan. For example, sequences with a large run of Xs will cause an error.
Pathannotator
  • Pathannotator annotates proteins with KEGG (Kyoto Encyclopedia of Genes and Genomes), FlyBase and Reactome Drosophila melanogaster pathways. It does this through the use of KofamScan, the KEGG API, OrthoFinder, FlyBase and Reactome.
  • KofamScan is a gene functional annotation tool based on KEGG Orthology and hidden Markov models (HMMs). It is provided by the KEGG project. An online version, KofamKOALA, is also available.
  • This pipeline pulls annotations directly from the KEGG API when possible. When that isn't possible the pipeline uses KofamScan to identify homologous KEGG objects (KO). The pathways annotated to these KEGG objects are then transferred to the corresponding proteins in your species of interest.
  • If specified, the pipeline will also provide annotations to Flybase pathways and Reactome Drosophila melanogaster pathways. To do this the pipeline uses OrthoFinder to identify homologous Drosophila melanogaster proteins for your input proteins. Flybase metabolic pathway and signaling pathway annotations and Drosophila melanogaster Reactome pathways are then transferred to your input proteins from these homologs.
Get the KOfam databases. The KOfam 'profiles' and 'ko_list' databases are required to run the pipeline. If you don't already have these databases the pipeline will pull them during the first run. If you want to download them beforehand they are available from the KEGG website.

Command
Get the KOfam data
wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz
wget https://www.genome.jp/ftp/db/kofam/ko_list.gz

tar -xvzf profiles.tar.gz
gunzip ko_list.gz

Get the Pathannotator container. The Pathannotator tool is available as a Docker container on Docker Hub.

Command
Get Pathannotator container using Docker
docker pull agbase/pathannotator:4.1

Run Pathannotator using Docker.

Command
Pathannotator help using Docker
docker run --rm agbase/pathannotator:4.1 -h

Help and Usage:
-h to see help and usage statement
-k KEGG species code (NA or related species code if species not in KEGG) KEGG species codes can be found here: https://www.genome.jp/brite/br08611
-i input file (protein FASTA without header lines)
-d (optional: default is .) output directory (the file path should be relative to, and inside of, your working directory)
-f (optional: default is NA) FB for Flybase and DME Reactome annotations, NA for none
-c (optional: default is nproc -1) number of cpus to use
-o outbase (file basename to use for output files)


Command
Pathannotator analysis using Docker
docker run \
--rm \
-v /path/to/your/input/files:/workdir \
-v /path/to/kofam/databases/:/data \
agbase/pathannotator:4.1 \
-k tca \
-i GCF_031307605.1_icTriCast1.1_protein.faa \
-d pathannotator \
-f FB \
-o GCF_031307605.1
Command explained:
docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v /path/to/your/input/files:/workdir: mounts the working directory on the host machine to '/workdir' inside the container
-v /path/to/kofam/databases/:/data: mounts the directory with the Kofam database files (or where you want them to be stored) on the host machine to '/data' inside the container
agbase/pathannotator:4.1: the name of the Docker image to use
Note
All the options supplied after the image name are Pathannotator options

-k tca: KEGG species code for Tribolium castaneum. Can be found here: https://www.genome.jp/brite/br08611 . If your species doesn't have a code choose a closely related species or 'NA'.
-i GCF_031307605.1_icTriCast1.1_protein.faa: input file (protein FASTA, no header lines).
-d pathannotator: Directory where you want the pipeline outputs to go. The directory must exist before you run the pipeline. The file path should be relative to (and inside of) your working directory.
-f FB: FB indicates that we want to get Flybase and Drosophila melanogaster Reactome pathways annotations in addition to KEGG annotations.
-o GCF_031307605.1: this will be the prefix used to name all the output files
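
Because the output directory must already exist and must sit inside the working directory, it can be created just before the run. The helper below is a sketch with a hypothetical name (prepare_outdir); it simply refuses absolute paths and paths containing '..' and then creates the directory under the working directory that is mounted to '/workdir'.

```shell
# Create the Pathannotator output directory inside the working directory.
# Usage: prepare_outdir <working directory on host> <relative output directory>
prepare_outdir() {
    workdir=$1
    outdir=$2
    case "$outdir" in
        ''|/*|..|../*|*/..|*/../*)
            echo "output directory must be a relative path inside $workdir" >&2
            return 1
            ;;
    esac
    mkdir -p "$workdir/$outdir"
}
```

For the example above this would be `prepare_outdir /path/to/your/input/files pathannotator`, run before the docker command.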
Understanding Your Results
The output files you can expect will differ depending on the circumstances of your run. If you are using a KEGG species code you will get both KEGG reference and KEGG species pathways. Without a KEGG code (NA) you will only get KEGG reference pathway annotations. Under all circumstances you may specify whether or not you want to receive Flybase and Reactome Drosophila melanogaster pathways annotations as well. Whatever your options, the pathways will all be output into a single GMT formatted file.

Expected output files:
There will be 2-5 output files depending on the options specified. All analyses will generate a KEGG reference output and a GMT file:

<basename>_KEGG_ref.tsv: These are annotations to the KEGG reference pathways. The pathway identifiers will begin with 'map'.

Input_protein_ID    KEGG_KO    KEGG_ref_pathway    KEGG_ref_pathway_name
XP_015835225.1      K26207     map04024            cAMP signaling pathway
XP_015835225.1      K26207     map04261            Adrenergic signaling in
XP_001813251.1      K01540     map04022            cGMP-PKG signaling pathway
<basename>_all_pathways.gmt: This file contains all of the pathways annotations generated by the analysis (KEGG ref, KEGG species, Flybase and Reactome) in GMT format. This is a tab-delimited file with N columns: Pathway ID, Description and Proteins annotated to that pathway (columns 3-N). This file has no header.

FBgg0000882     FlyBase_pathway           XP_018223192.1     XP_018223020.1    XP_018220038.1
R-DME-109606    Reactome_pathway          XP_0182222056.1    XP_018221350.1
tca00130        KEGG_species_pathway      XP_018219344.1
map00254        KEGG_reference_pathway    XP_018220255.1     XP_018229887.1
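
Since the GMT file has a variable number of columns, it is awkward to inspect in a spreadsheet. A short awk sketch can summarise it instead. It assumes the tab-delimited layout described above (pathway ID, description, then one protein per column); the function names gmt_pathway_sizes and gmt_pathway_members are hypothetical.

```shell
# Print "pathway ID <tab> description <tab> number of proteins" for every GMT row.
gmt_pathway_sizes() {
    awk -F'\t' 'NF >= 3 {
        n = 0
        for (i = 3; i <= NF; i++) if ($i != "") n++
        print $1 "\t" $2 "\t" n
    }' "$1"
}

# Print the proteins annotated to one pathway, one per line.
# Usage: gmt_pathway_members <gmt file> <pathway ID>
gmt_pathway_members() {
    awk -F'\t' -v id="$2" '$1 == id {
        for (i = 3; i <= NF; i++) if ($i != "") print $i
    }' "$1"
}
```

For example, `gmt_pathway_sizes GCF_031307605.1_all_pathways.gmt | sort -t "$(printf '\t')" -k3,3nr | head` would list the pathways with the most annotated proteins.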

If a KEGG species code was specified (rather than 'NA') there will also be a KEGG species output:

<basename>_KEGG_species.tsv: These are annotations to the species-specific KEGG pathways. The pathway identifiers will begin with the KEGG species code.
Input_protein_ID    KEGG_KO    KEGG_tca_pathway    KEGG_tca_pathway_name
XP_001813251.1      K01540     tca04820            Cytoskeleton in muscle cells - Tribolium castaneum (red flour beetle)
XP_001812480.1      K02268     tca00190            Oxidative phosphorylation - Tribolium castaneum (red flour beetle)
XP_008195997.1      K04676     tca04350            TGF-beta signaling pathway - Tribolium castaneum (red flour beetle)

If 'FB' was specified there will also be Flybase and Reactome outputs:

<basename>_flybase.tsv: If you used the 'FB' option for Flybase pathways annotations you will get this output.

Input_protein_ID    Flybase_protein_ID    Flybase_pathway_ID    Flybase_pathway_name
NP_001034540.1      FBpp0077451           FBgg0001085           BMP Signaling Pathway Core Components
NP_001034503.2      FBpp0084690           FBgg0000904           Insulin-like Receptor Signaling Pathway Core Components
NP_001034492.1      FBpp0078442           FBgg0002045           CHITIN BIOSYNTHESIS
<basename>_reactome.tsv: If you used the 'FB' option you will get this output containing Reactome Drosophila melanogaster pathways annotations.

Input_protein_ID    Uniprot_ID    Reactome_pathway_ID    Reactome_pathway_name
XP_018221664.1      Q9W5E1        R-DME-1234174          Cellular response to hypoxia
XP_018222787.1      Q9W4N8        R-DME-194315           Signaling by Rho GTPases
XP_018222841.1      Q7KVX1        R-DME-162582           Signal transduction

Combine GAFs
This tool can be used to combine the gene association file (GAF) outputs from GOanna and InterProScan.
The tool accepts two input files:
  1. GOanna GAF output
  2. InterProScan GAF output
Get combine_gafs container


Command
Get Combine_GAFs container using Docker
docker pull agbase/combine_gafs:1.1

Combine GAF outputs from GOanna and InterProScan

Combine GAFs has three parameters:

-i InterProScan XML Parser GAF output
-g GOanna GAF output
-o output file basename




Command
Combine_GAFs using Docker
docker run \
--rm \
-v $(pwd):/work-dir \
agbase/combine_gafs:1.1 \
-i /path/to/GCF_031307605.1_gaf.txt \
-g /path/to/GCF_031307605.1_goanna_gaf.tsv \
-o GCF_031307605.1_complete_gaf


Command explained:
docker run: tells docker to run
--rm: removes the container when the analysis has finished. The image will remain for future use.
-v $(pwd):/work-dir: mounts my current working directory on the host machine to '/work-dir' in the container
agbase/combine_gafs:1.1: the name of the Docker image to use

Note
All the options supplied after the image name are Combine_GAFs options

-i /path/to/<basename>_gaf.txt: InterProScan GAF output file.
-g /path/to/<basename>_goanna_gaf.tsv: GOanna GAF output file.
-o <basename>_complete_gaf: output file basename--a .tsv extension will be added
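
Once the run has finished, one way to sanity-check the combined file is to confirm that every (accession, GO ID) pair from the two inputs is present in it. This is a sketch under assumptions: gaf_pairs and missing_from_combined are hypothetical helpers, the check uses GAF columns 2 and 5, and it says nothing about how Combine GAFs itself merges or de-duplicates annotations.

```shell
# Print the distinct "accession <tab> GO ID" pairs found in one or more GAF files.
gaf_pairs() {
    awk -F'\t' '!/^!/ && NF >= 15 { print $2 "\t" $5 }' "$@" | LC_ALL=C sort -u
}

# Print pairs that occur in the input GAFs but not in the combined GAF.
# Usage: missing_from_combined <combined gaf> <input gaf> [<input gaf> ...]
missing_from_combined() {
    combined=$1
    shift
    tmp_inputs=$(mktemp)
    tmp_combined=$(mktemp)
    gaf_pairs "$@" > "$tmp_inputs"
    gaf_pairs "$combined" > "$tmp_combined"
    LC_ALL=C comm -23 "$tmp_inputs" "$tmp_combined"
    rm -f "$tmp_inputs" "$tmp_combined"
}
```

With the example files this would be `missing_from_combined GCF_031307605.1_complete_gaf.tsv GCF_031307605.1_gaf.txt GCF_031307605.1_goanna_gaf.tsv`; empty output means no pair was lost.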
Understanding your results

Combine GAFs will create a single GAF output file:

<basename>.tsv: This is a GAF format file that contains all the GO annotations from both GOanna and InterProScan. The original input files will remain as they were.

References

Citation
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009). BLAST : architecture and applications. BMC bioinformatics.
LINK

Citation
Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, Qureshi M, Raj S, Richardson L, Salazar GA, Williams L, Bork P, Bridge A, Gough J, Haft DH, Letunic I, Marchler-Bauer A, Mi H, Natale DA, Necci M, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A, Finn RD (2021). The InterPro protein families and domains database: 20 years on. Nucleic acids research.
LINK

Citation
Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics (Oxford, England).
LINK

Citation
Kanehisa M, Furumichi M, Sato Y, Matsuura Y, Ishiguro-Watanabe M (2025). KEGG: biological systems database as a model of the real world. Nucleic acids research.
LINK

Citation
Öztürk-Çolak A, Marygold SJ, Antonazzo G, Attrill H, Goutte-Gattat D, Jenkins VK, Matthews BB, Millburn G, Dos Santos G, Tabone CJ, FlyBase Consortium (2024). FlyBase: updates to the Drosophila genes and genomes database. Genetics.
LINK

Citation
Milacic M, Beavers D, Conley P, Gong C, Gillespie M, Griss J, Haw R, Jassal B, Matthews L, May B, Petryszak R, Ragueneau E, Rothfels K, Sevilla C, Shamovsky V, Stephan R, Tiwari K, Varusai T, Weiser J, Wright A, Wu G, Stein L, Hermjakob H, D'Eustachio P (2024). The Reactome Pathway Knowledgebase 2024. Nucleic acids research.
LINK

Citation
Dirk Merkel (2014). Docker: lightweight Linux containers for consistent development and deployment. Linux Journal.
LINK

Citation
Kurtzer GM, Sochat V, Bauer MW (2017). Singularity: Scientific containers for mobility of compute. PloS one.
LINK

Citation
Saha S, Cooksey AM, Childers AK, Poelchau MF, McCarthy FM (2021). Workflows for Rapid Functional Annotation of Diverse Arthropod Genomes. Insects.
LINK

Citation
Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M (2021). KEGG: integrating viruses and cellular organisms. Nucleic acids research.

Citation
Emms DM, Kelly S (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology.

Citation
Aramaki T, Blanc-Mathieu R, Endo H, Ohkubo K, Kanehisa M, Goto S, Ogata H (2020). KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics (Oxford, England).

Citations
Step  22
Kanehisa M, Furumichi M, Sato Y, Matsuura Y, Ishiguro-Watanabe M. KEGG: biological systems database as a model of the real world.
https://doi.org/10.1093/nar/gkae909
Step  22
Milacic M, Beavers D, Conley P, Gong C, Gillespie M, Griss J, Haw R, Jassal B, Matthews L, May B, Petryszak R, Ragueneau E, Rothfels K, Sevilla C, Shamovsky V, Stephan R, Tiwari K, Varusai T, Weiser J, Wright A, Wu G, Stein L, Hermjakob H, D'Eustachio P. The Reactome Pathway Knowledgebase 2024.
https://doi.org/10.1093/nar/gkad1025
Step  22
Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute.
https://doi.org/10.1371/journal.pone.0177459
Step  22
Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, Tanabe M. KEGG: integrating viruses and cellular organisms.
https://doi.org/10.1093/nar/gkaa970
Step  22
Aramaki T, Blanc-Mathieu R, Endo H, Ohkubo K, Kanehisa M, Goto S, Ogata H. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold.
https://doi.org/10.1093/bioinformatics/btz859
Step  22
Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S. InterProScan 5: genome-scale protein function classification.
https://doi.org/10.1093/bioinformatics/btu031
Step  22
Öztürk-Çolak A, Marygold SJ, Antonazzo G, Attrill H, Goutte-Gattat D, Jenkins VK, Matthews BB, Millburn G, Dos Santos G, Tabone CJ, FlyBase Consortium. FlyBase: updates to the Drosophila genes and genomes database.
https://doi.org/10.1093/genetics/iyad211
Step  22
Merkel D. Docker: lightweight Linux containers for consistent development and deployment.
https://dl.acm.org/doi/10.5555/2600239.2600241
Step  22
Saha S, Cooksey AM, Childers AK, Poelchau MF, McCarthy FM. Workflows for Rapid Functional Annotation of Diverse Arthropod Genomes.
https://doi.org/10.3390/insects12080748
Step  22
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications.
https://doi.org/10.1186/1471-2105-10-421
Step  22
Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, Qureshi M, Raj S, Richardson L, Salazar GA, Williams L, Bork P, Bridge A, Gough J, Haft DH, Letunic I, Marchler-Bauer A, Mi H, Natale DA, Necci M, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A, Finn RD. The InterPro protein families and domains database: 20 years on.
https://doi.org/10.1093/nar/gkaa977
Step  22
Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics.
https://doi.org/10.1186/s13059-019-1832-y