Yang T, Certano A, Lakshmanan V, Bassi S, Purushotham N, Nock T, Chaudhury A, Ray A. Genome resource announcement of a Leptodophora sp. fungus isolated from roots of broadleaf plants in Wisconsin, USA. Microbiology Resource Announcements 14(8). doi: 10.1128/mra.00936-24
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol to annotate a fungal genome from long and short reads
Setup
Install Docker
If you don't already have Docker, install it. There are two versions: Docker Engine (also known as Docker CE) and Docker Desktop. Docker Desktop is more user friendly, but because it may require a commercial license in large enterprises, this tutorial is based on Docker Engine. Both versions work with this protocol. Linux users can install either Docker CE or Docker Desktop, while macOS and Windows users should install Docker Desktop.
You will need fastq data (long reads), short reads, and the assembly data. In the following code, the assembly data file is called assembly.fasta. The long reads file is called ID.fastq. The short reads should be two files (ID_R1.fastq.gz and ID_R2.fastq.gz).
If you have more than two short-read files, you can concatenate them so that you end up with two files. For example, if you have ID_L001_R1.fastq.gz, ID_L002_R1.fastq.gz, ID_L001_R2.fastq.gz, and ID_L002_R2.fastq.gz, concatenate the R1 files into ID_R1.fastq.gz and the R2 files into ID_R2.fastq.gz, keeping the lane order the same for both.
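Because gzip streams can be joined byte for byte, plain cat is enough for this step. A minimal sketch, assuming the lane files follow the ID_L00N_R1.fastq.gz naming shown above; merge_lanes is a hypothetical helper name, not part of the protocol:

```shell
# merge_lanes SAMPLE: concatenate per-lane files into one file per read direction.
# Assumes lane files are named SAMPLE_L<lane>_R1.fastq.gz and SAMPLE_L<lane>_R2.fastq.gz
# and live in the current directory. The glob expands in sorted order, so the
# R1 and R2 lanes are joined in the same order.
merge_lanes() {
    sample="$1"
    for read in R1 R2; do
        cat "${sample}"_L*_"${read}".fastq.gz > "${sample}_${read}.fastq.gz" || return 1
    done
}
```

Calling `merge_lanes ID` in the directory that holds the lane files would produce ID_R1.fastq.gz and ID_R2.fastq.gz.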
Download FamDB HDF5 database, InterProScan database and GeneMark license
FamDB HDF5 database
The FamDB HDF5 database is needed for the RepeatMasker step. The database is partitioned by taxonomic group; the partition needed for Fungi is partition 0. For more information about partitions, read the README.txt file.
If you don't have a GeneMark license, get it from this page. The license key file should be named gm_key and located in /your_dir. This license is needed to run the Funannotate Predict step.
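Before starting the pipeline, it can help to confirm that the files named so far are in place. A minimal sketch, assuming the file names used in this protocol (assembly.fasta, ID.fastq, ID_R1.fastq.gz, ID_R2.fastq.gz, gm_key); check_inputs is a hypothetical helper name:

```shell
# check_inputs DIR: report every expected input file that is missing or empty in DIR.
# Returns 0 when all files are present, 1 otherwise.
check_inputs() {
    dir="$1"
    missing=0
    for f in assembly.fasta ID.fastq ID_R1.fastq.gz ID_R2.fastq.gz gm_key; do
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_inputs /your_dir && echo "inputs OK"` prints the confirmation only when nothing is missing.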
Run hifiasm
Run the following command (replace /your_dir with the base directory where you have your data).
hifiasm takes a collection of individual DNA sequencing reads (the long-read FASTQ file) and reconstructs the DNA sequence of the organism (the assembled contigs/scaffolds in the output directory).
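A hedged sketch of a hifiasm call that is consistent with the path used in the next step (/ftmp/output/ID-longread.bp.p_ctg.gfa). The image name is deliberately left as a variable, HIFIASM_IMAGE, because it is an assumption; so are the helper name hifiasm_cmd, the presence of hifiasm on the image's PATH, and the choice to print the command for review instead of running it:

```shell
# hifiasm_cmd BASE_DIR CPU: print a docker command that assembles ID.fastq with hifiasm.
# HIFIASM_IMAGE is an assumption: set it to whichever image provides hifiasm on its PATH.
# -o sets the output prefix and -t the thread count; hifiasm then writes
# <prefix>.bp.p_ctg.gfa, the primary contig graph.
hifiasm_cmd() {
    base_dir="$1"
    cpu="$2"
    image="${HIFIASM_IMAGE:?set HIFIASM_IMAGE to an image that provides hifiasm}"
    echo "docker run -it -v ${base_dir}:/ftmp ${image} sh -c" \
         "\"mkdir -p /ftmp/output && hifiasm -o /ftmp/output/ID-longread -t ${cpu} /ftmp/ID.fastq\""
}
```

Running `HIFIASM_IMAGE=<your image> hifiasm_cmd /your_dir 8` prints the command so it can be inspected before it is pasted into a terminal.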
Run gfatools gfa2fa
Run the following command (replace /your_dir with the base directory where you have your data).
docker run -it -v /your_dir:/ftmp dnalinux/gfatools sh -c "gfatools-final-gt/gfatools gfa2fa /ftmp/output/ID-longread.bp.p_ctg.gfa > /ftmp/ID-longread.bp.p_ctg.fasta"
gfa2fa takes the detailed graphical representation of the assembled genome (the .gfa file) and simplifies it into a more standard, sequence-only format (the .fasta file).
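To illustrate what this conversion amounts to, the sketch below extracts the segment lines of a GFA file with awk. It is an illustration only, not a replacement for gfatools: it does not wrap sequence lines, and gfa_to_fasta is a hypothetical helper name.

```shell
# gfa_to_fasta FILE.gfa: print each segment (S-line) as a FASTA record.
# In GFA, an S-line holds the record type, the segment name, and the sequence
# in tab-separated columns 1 to 3; header, link, and other line types are skipped.
gfa_to_fasta() {
    awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}
```

For example, `gfa_to_fasta ID-longread.bp.p_ctg.gfa | grep -c '^>'` would count the contigs in the graph.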
Run sspace_longread
Run the following command (replace /your_dir with the base directory where you have your data).
SSPACE-LongRead takes a fragmented genome assembly (contigs) and uses the long, accurate HiFi reads to piece these fragments together into larger, more contiguous representations of the chromosomes (scaffolds), reducing the number of gaps and improving the overall completeness of the genome assembly.
Run LR_Gapcloser
Run the following command (replace /your_dir with the base directory where you have your data).
LR_Gapcloser takes a scaffolded genome with remaining gaps (represented by 'N's) and uses the long, accurate HiFi reads to fill in these gaps, resulting in a more complete and contiguous genome assembly.
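One simple way to see the effect of gap closing is to count the N bases before and after the step. A minimal sketch; count_n_bases is a hypothetical helper name:

```shell
# count_n_bases FILE.fasta: print the number of N or n characters in the
# sequence lines of a FASTA file (header lines starting with ">" are ignored).
count_n_bases() {
    grep -v '^>' "$1" | tr -cd 'Nn' | wc -c | tr -d ' '
}
```

Running it on the scaffolded assembly and then on the gap-closed output (for example /your_dir/ID_lr-gapcloser/iteration-1/gapclosed.fasta) should show a lower count for the second file if gaps were filled.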
Run BWA Index
Run the following command (replace /your_dir with the base directory where you have your data).
docker run -it -v /your_dir:/ftmp dnalinux/bwa:0.7.17-3-deb bwa index /ftmp/ID_lr-gapcloser/iteration-1/gapclosed.fasta
bwa index takes a linear DNA sequence (the FASTA file) and transforms it into a highly efficient and searchable data structure (the index files). This allows subsequent bwa commands (e.g., bwa mem for alignment) to quickly map millions or billions of short sequencing reads back to this genome assembly.
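bwa index writes its index files next to the FASTA file, using the suffixes .amb, .ann, .bwt, .pac, and .sa. A minimal sketch of a check that all five exist before moving on; bwa_index_complete is a hypothetical helper name:

```shell
# bwa_index_complete REF.fasta: succeed only if all five BWA index files
# (REF.fasta.amb, .ann, .bwt, .pac, .sa) exist next to the reference.
bwa_index_complete() {
    ref="$1"
    for ext in amb ann bwt pac sa; do
        [ -f "${ref}.${ext}" ] || { echo "missing: ${ref}.${ext}" >&2; return 1; }
    done
}
```

For example, `bwa_index_complete /your_dir/ID_lr-gapcloser/iteration-1/gapclosed.fasta && echo "index OK"`.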
Run fastp
Run the following command (replace /your_dir with the base directory where you have your data).
fastp is a very fast all-in-one preprocessor for FASTQ files. It's designed to perform quality control, filtering, and trimming of raw sequencing reads. Raw reads from sequencing machines often contain errors, adapter sequences, and low-quality bases at their ends. It takes raw, potentially problematic paired-end sequencing reads and transforms them into high-quality, "cleaned" paired-end reads by removing errors, adapters, and low-quality sequences.
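A hedged sketch of a paired-end fastp call that produces the trimmed file names used in the BWA mem step (ID_R1_trim.fastq.gz and ID_R2_trim.fastq.gz). The image name is left as a variable, FASTP_IMAGE, because it is an assumption, as are the helper name fastp_cmd and the choice to print the command instead of running it. Only the basic input and output options (-i, -I, -o, -O) are used; fastp's default filtering settings apply.

```shell
# fastp_cmd BASE_DIR: print a docker command that trims the paired short reads.
# FASTP_IMAGE is an assumption: set it to whichever image provides fastp on its PATH.
# -i/-I are the R1/R2 inputs and -o/-O the matching trimmed outputs.
fastp_cmd() {
    base_dir="$1"
    image="${FASTP_IMAGE:?set FASTP_IMAGE to an image that provides fastp}"
    echo "docker run -it -v ${base_dir}:/ftmp ${image} fastp" \
         "-i /ftmp/ID_R1.fastq.gz -I /ftmp/ID_R2.fastq.gz" \
         "-o /ftmp/ID_R1_trim.fastq.gz -O /ftmp/ID_R2_trim.fastq.gz"
}
```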
Run BWA mem
Run the following command (replace /your_dir with the base directory where you have your data). Replace CPU with your CPU count.
docker run -it -v /your_dir:/ftmp dnalinux/bwa:0.7.17-3-deb /bin/bash -c "bwa mem -t CPU /ftmp/ID_lr-gapcloser/iteration-1/gapclosed.fasta /ftmp/ID_R1_trim.fastq.gz /ftmp/ID_R2_trim.fastq.gz > /ftmp/ID_aligned_reads.sam"
bwa mem aligns the cleaned sequencing reads to the pre-indexed reference genome (the gapclosed.fasta file) and writes the alignments to a .sam file.
Run SAMTOOLS
Sort and Index
Run the following command (replace /your_dir with the base directory where you have your data).
docker run -it -v /your_dir:/ftmp dnalinux/samtools:1.20-3-deb samtools index /ftmp/ID_sorted_aligned_reads.bam
The samtools view command takes the human-readable (but large) text-based SAM alignment file and converts it into BAM, a smaller, more efficient binary format suitable for further computational analysis.
samtools sort orders the alignments in the BAM file by reference coordinate, which is required for indexing and for subsequent steps.
samtools index creates a lookup table that enables fast random access to alignment data within the sorted BAM file, significantly speeding up downstream analyses.
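A hedged sketch of the view and sort commands that would precede the index command above, using the same image. The intermediate file name ID_aligned_reads.bam and the helper name samtools_prep_cmds are assumptions; the helper prints the commands for review instead of running them.

```shell
# samtools_prep_cmds BASE_DIR: print the view and sort commands that turn
# ID_aligned_reads.sam into ID_sorted_aligned_reads.bam.
# ID_aligned_reads.bam is an assumed name for the intermediate file.
samtools_prep_cmds() {
    base_dir="$1"
    image="dnalinux/samtools:1.20-3-deb"   # same image as the index command
    # Convert SAM (text) to BAM (binary): -b selects BAM output, -o the output file.
    echo "docker run -it -v ${base_dir}:/ftmp ${image} samtools view -b -o /ftmp/ID_aligned_reads.bam /ftmp/ID_aligned_reads.sam"
    # Sort by reference coordinate, which is the default sort order.
    echo "docker run -it -v ${base_dir}:/ftmp ${image} samtools sort -o /ftmp/ID_sorted_aligned_reads.bam /ftmp/ID_aligned_reads.bam"
}
```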
Pilon
Run the following command (replace /your_dir with the base directory where you have your data).
pilon takes a nearly complete genome assembly and refines it using high-accuracy, short-read sequencing data. It effectively "cleans up" the assembly, reducing the number of single-base errors and small indels, resulting in a more accurate and higher-quality final genome sequence. This is often one of the final steps in producing a finished de novo genome assembly.
Funannotate
Funannotate Clean and Sort
Run the following command (replace /your_dir with the base directory where you have your data).
funannotate clean takes a high-quality genome assembly and performs a set of standardized quality control and formatting steps to make it optimal for gene prediction and annotation.
funannotate sort ensures that the genome assembly is organized and that only sufficiently long sequences, which are most amenable to accurate gene prediction, are carried forward for the intensive annotation process.
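A hedged sketch of the two calls, using the Funannotate image that appears in the later steps and only the basic -i and -o options. The input name ID_polished.fasta is an assumption about the Pilon output; the output name matches the file masked and used by Funannotate Predict (funannotate_prep/ID_polished_clean_sort.fasta). The helper name funannotate_prep_cmds is hypothetical, and it prints the commands instead of running them.

```shell
# funannotate_prep_cmds BASE_DIR: print the clean and sort commands.
# Assumptions: the polished assembly is /ftmp/ID_polished.fasta and the
# directory /ftmp/funannotate_prep already exists.
funannotate_prep_cmds() {
    base_dir="$1"
    image="dnalinux/funannotate-gmes-dikarya"
    prep="/ftmp/funannotate_prep"
    # Clean the assembly: -i is the input FASTA, -o the cleaned output.
    echo "docker run -it -v ${base_dir}:/ftmp ${image} funannotate clean -i /ftmp/ID_polished.fasta -o ${prep}/ID_polished_clean.fasta"
    # Sort and rename the sequences of the cleaned assembly.
    echo "docker run -it -v ${base_dir}:/ftmp ${image} funannotate sort -i ${prep}/ID_polished_clean.fasta -o ${prep}/ID_polished_clean_sort.fasta"
}
```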
RepeatMasker
Run the following command (replace /your_dir with the base directory where you have your data). Remember that this step requires the dfam38_full.0.h5 database to be in the directory that is mounted as /ftmp in the Docker container. If only dfam38-1_full.0.h5 is available, create a link (ln) named dfam38_full.0.h5 that points to it (check step 3.1 for details).
RepeatMasker takes a raw genome assembly and transforms it by marking all known repetitive DNA elements, typically by converting their nucleotides to lowercase. This "masked" genome is then much more suitable for accurate gene prediction and other sequence analyses that are sensitive to repetitive regions.
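The link between the two Dfam file names mentioned in this step can be created as sketched below. This assumes a symbolic link is acceptable; link_dfam is a hypothetical helper name. The link target is relative, so it stays valid when the directory is mounted inside the container.

```shell
# link_dfam DIR: make dfam38_full.0.h5 point at dfam38-1_full.0.h5 when only
# the latter is present in DIR. Does nothing if dfam38_full.0.h5 already exists.
link_dfam() {
    dir="$1"
    if [ ! -e "${dir}/dfam38_full.0.h5" ] && [ -f "${dir}/dfam38-1_full.0.h5" ]; then
        # Relative target: resolved inside DIR, wherever DIR is mounted.
        ln -s dfam38-1_full.0.h5 "${dir}/dfam38_full.0.h5"
    fi
}
```

For example, `link_dfam /your_dir` before running RepeatMasker.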
Funannotate Predict
Run the following command (replace /your_dir with the base directory where you have your data). Replace CPU with your CPU count.
docker run -it -v /your_dir:/ftmp dnalinux/funannotate-gmes-dikarya:latest funannotate predict -i /ftmp/funannotate_prep/ID_polished_clean_sort.fasta.masked -s ID -o /ftmp/funannotate --cpus CPU
funannotate predict takes a masked genome assembly and, by integrating various computational tools and biological evidence, transforms it into a set of predicted gene models (including their genomic locations, exon-intron structures, and corresponding protein/CDS sequences) represented in standard bioinformatics formats like GFF3.
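Since GFF3 is a tab-separated format with the feature type in column 3, a quick summary of the prediction can be obtained with awk. A minimal sketch; count_gff3_features is a hypothetical helper name, and the file name in the usage example is an assumption based on the ID.proteins.fa naming used in the next step:

```shell
# count_gff3_features FILE.gff3 TYPE: count rows whose feature type (column 3)
# equals TYPE, ignoring comment and directive lines that start with "#".
count_gff3_features() {
    awk -F'\t' -v type="$2" '!/^#/ && $3 == type { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_gff3_features /your_dir/funannotate/predict_results/ID.gff3 gene` would report the number of predicted genes.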
InterProScan
Run the following command (replace /your_dir with the base directory where you have your data). Replace CPU with your CPU count.
docker run -it -v /your_dir:/ftmp -v /your_dir/interproscan-5.69-101.0/data:/opt/interproscan/data -v /tmp:/temp dnalinux/interproscan:5.69-101.0 --input /ftmp/funannotate/predict_results/ID.proteins.fa --disable-precalc --output-dir /ftmp/funannotate/ipsout --cpu CPU
InterProScan takes a set of predicted protein sequences and transforms them into a rich functional annotation by identifying known protein signatures, assigning higher-level InterPro classifications, and linking them to Gene Ontology terms.
Funannotate Annotate
Run the following command (replace /your_dir with the base directory where you have your data).
docker run -it -v /your_dir:/ftmp dnalinux/funannotate-gmes-dikarya funannotate annotate -i /ftmp/funannotate --fasta /ftmp/funannotate/predict_results/ID.proteins.fa --species ID --out /ftmp/FA_results --iprscan /ftmp/funannotate/ipsout/ID.proteins.fa.xml
funannotate annotate takes the predicted gene structures and their corresponding protein sequences, enriches them with detailed functional information derived from sources such as the InterProScan results, and compiles all this into a complete and standardized set of genome annotation files. This is the culmination of the entire annotation pipeline, providing the biological insights into the sequenced organism.