May 29, 2026

Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence

  • Qiuyang Wei1
  • 1Chongqing Academy of Chinese Materia Medica
  • wqycsic
Icon indicating open access to content
QR code linking to this content
Protocol CitationQiuyang Wei 2026. Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence. protocols.io https://dx.doi.org/10.17504/protocols.io.j8nlk7bdwg5r/v1
License: This is an open access  protocol  distributed under the terms of the  Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: May 29, 2026
Last Modified: May 29, 2026
Protocol  Integer ID: 318173
Keywords: Ophiocordycipitaceae, nad4, mitochondrial genome, PAML, positive selection analysis of the mitochondrial nad4 gene, positive selection signatures in the mitochondrial nad4 gene, mitochondrial nad4 gene, taxa of the ascomycete family, ophiocordycipitaceae, ancestral sequence reconstruction with site, ascomycete family, physicochemical property divergence analysis of amino acid substitution, validating positive selection signature, confidence in positive selection inference, ancestral sequence reconstruction, positive selection inference, likelihood ratio test, amino acid substitution, positive selection analysis
Funders Acknowledgements:
The Science and Technology Research Program of Chongqing Municipal Education Commission
Grant ID: KJQN202515101
The Fundamental Scientific Research Funds Project of Chongqing Municipal Science and Technology Bureau
Grant ID: CSTB2025jxjl-jbkyX0017
Abstract
This protocol describes a comprehensive multi-method analytical workflow for detecting and validating positive selection signatures in the mitochondrial nad4 gene across 16 taxa of the ascomycete family Ophiocordycipitaceae. The protocol integrates four complementary analytical approaches: (i) ancestral sequence reconstruction with site-specific dN/dS estimation using the Nei-Gojobori method, (ii) PAML codeml site models (M0 one-ratio, M7 beta, M8 beta+ω) with likelihood ratio tests and Bayes Empirical Bayes (BEB) posterior probability-based site identification, (iii) Datamonkey web server analysis using FEL, MEME, and FUBAR algorithms, and (iv) physicochemical property divergence analysis of amino acid substitutions at candidate positively selected sites. Convergent evidence across independent methods strengthens confidence in positive selection inference. All input files, output files, control files, and analysis scripts are deposited at Zenodo (DOI: 10.5281/zenodo.20438256) for full reproducibility.
Guidelines
Ophiocordycipitaceae
(Hypocreales, Sordariomycetes) is a family of entomopathogenic fungi that includes the medicinally and economically significant Chinese cordyceps fungus Ophiocordyceps sinensis. Mitochondrial protein-coding genes evolve under predominantly purifying selection; however, lineage-specific positive selection on specific genes may reflect adaptive evolution associated with ecological niche differentiation, host switching, or metabolic specialization.
The mitochondrial nad4 gene encodes subunit 4 of NADH dehydrogenase (Complex I), a core component of the oxidative phosphorylation pathway. Amino acid substitutions in functionally critical domains — particularly those involved in proton translocation and ubiquinone binding — may affect enzymatic efficiency and organismal fitness under varying environmental conditions. Detecting positive selection on nad4 requires rigorous statistical approaches that distinguish genuine adaptive signals from stochastic noise.
This protocol integrates four analytical methods, each with distinct statistical frameworks and assumptions, to provide convergent evidence for positive selection. The combination of phylogenetic codon models (PAML), maximum likelihood approaches (Datamonkey), non-model-based pairwise comparisons (Nei-Gojobori), and structural/functional validation (physicochemical divergence) yields robust and reproducible results suitable for publication in peer-reviewed journals.Ophiocordycipitaceae (Hypocreales, Sordariomycetes) is a family of entomopathogenic fungi that includes the medicinally and economically significant Chinese cordyceps fungus Ophiocordyceps sinensis. Mitochondrial protein-coding genes evolve under predominantly purifying selection; however, lineage-specific positive selection on specific genes may reflect adaptive evolution associated with ecological niche differentiation, host switching, or metabolic specialization.
The mitochondrial nad4 gene encodes subunit 4 of NADH dehydrogenase (Complex I), a core component of the oxidative phosphorylation pathway. Amino acid substitutions in functionally critical domains — particularly those involved in proton translocation and ubiquinone binding — may affect enzymatic efficiency and organismal fitness under varying environmental conditions. Detecting positive selection on nad4 requires rigorous statistical approaches that distinguish genuine adaptive signals from stochastic noise.
This protocol integrates four analytical methods, each with distinct statistical frameworks and assumptions, to provide convergent evidence for positive selection. The combination of phylogenetic codon models (PAML), maximum likelihood approaches (Datamonkey), non-model-based pairwise comparisons (Nei-Gojobori), and structural/functional validation (physicochemical divergence) yields robust and reproducible results suitable for publication in peer-reviewed journals.
Before start
1. Verify that all input files from the Zenodo dataset are correctly downloaded and accessible.
2. Ensure PAML (codeml) is installed and available in the system PATH, or note its installation directory.
3. For Windows users without native compilation capability: PAML control files can be submitted to an AI-assisted analysis platform or a Linux server. The protocol describes both native execution and alternative execution paths.
4. The mold/protozoan mitochondrial genetic code (Translation Table 4) is used for all codon-based analyses. Verify that PAML control files specify icode = 3 (mold mitochondrial code).
5. Sequence names in alignment files and tree files must match exactly. Datamonkey additionally requires sequence names ≤ 30 characters without special characters.
Software Version Purpose URL
MAFFT ≥ v7.0 Multiple sequence alignment with codon-aware mode https://mafft.cbrc.jp/alignment/software/
MEGA X or 11 Alignment visualization, NJ tree construction https://www.megasoftware.net/
IQ-TREE ≥ v1.6 (optional) Maximum likelihood tree with ModelFinder http://www.iqtree.org/
PAML (codeml) ≥ v4.9 Codon substitution site models (M0, M7, M8) http://abacus.gene.ucl.ac.uk/software/paml.html
Python ≥ 3.8 Alignment processing, ancestral reconstruction, dN/dS calculation https://www.python.org/
Biopython ≥ 1.79 Sequence I/O, pairwise2 alignment, codonalign module https://biopython.org/
Datamonkey Web server FEL, MEME, FUBAR selection analyses https://www.datamonkey.org/
Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence
Step 1: Data collection and genome annotation extraction
1.1 Retrieve complete mitochondrial genome sequences from NCBI GenBank. The dataset includes 16 species representing major clades within Ophiocordycipitaceae plus two outgroup species from Sordariomycetes (Podospora anserina, Neurospora crassa). Species selection criteria: (a) availability of complete or near-complete mitochondrial genome sequences, (b) representation of major phylogenetic clades, (c) inclusion of outgroup taxa for improved phylogenetic rooting. 1.2 Extract protein-coding gene (PCG) annotations from GenBank feature tables. Prioritize CDS features; use gene features as fallback where CDS annotations are absent. 1.3 Apply a length-filtering strategy to exclude intron-containing CDS annotations. Retain only CDS features with lengths within the expected range for each gene — for nad4, the expected range is 500–2,500 bp. Exclude annotations with internal stop codons or frameshift mutations. 1.4 Calculate nucleotide composition statistics for each genome: A%, T%, G%, C%, AT%, GC%, AT-skew = (A−T)/(A+T), GC-skew = (G−C)/(G+C). These provide baseline characterization of mitochondrial nucleotide composition, which may influence codon usage patterns and substitution models. 1.5 Identify 13 core PCGs (atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad5, nad6) present across species for downstream analyses. Expected output: Annotated nad4 coding sequences for all 16 taxa; nucleotide composition summary table.
Step 2: Codon-aware multiple sequence alignment
2.1 Translate nad4 nucleotide sequences to amino acid sequences using the mold/protozoan mitochondrial genetic code (Translation Table 4). Verify that no internal stop codons are present. 2.2 Perform codon-aware multiple sequence alignment using MAFFT with the --auto algorithm selection mode, or MUSCLE v5 with default parameters. For scripted pipelines, an alternative approach is the codon-aware star alignment strategy: align at the amino acid level using the Needleman-Wunsch algorithm (via Bio.pairwise2 in Biopython) and back-translate to codon triplets to preserve reading frame integrity. 2.3 Manually inspect the alignment in MEGA or AliView to verify: (a) codon boundaries are maintained without frameshifts, (b) no internal stop codons are present in any sequence, (c) alignment of conserved functional domains is visually coherent. 2.4 Export the alignment in three formats for different downstream tools: • PHYLIP sequential format (.phy) — required by PAML codeml • FASTA format (.fasta) — required by Datamonkey and gene-wide analyses • NEXUS format (.nex) — for phylogenetic software compatibility 2.5 Prepare an additional gap-free version of the alignment by removing all columns containing gaps. This gap-free alignment is used for Datamonkey server submission and pairwise sliding-window analyses. Final alignment dimensions: 1,614 bp; 538 codons; no gaps retained (gap-free version); 16 taxa. Expected output: 01_nad4_alignment.phy, 2_nad4_alignment_clean.fasta, 03_nad4_alignment_nogap.fasta, 12_nad4_alignment.nex

Step 3: Phylogenetic tree reconstruction

3.1 Construct the starting phylogenetic tree using the Neighbor-Joining (NJ) method (Saitou & Nei, 1987) implemented in MEGA X:

Parameter Setting
Substitution model p-distance (or Kimura 2-parameter)
Bootstrap replicates 100 (codon-aware)
Gaps/Missing data Pairwise deletion
Root Podospora anserina and Neurospora crassa (outgroup)
3.2 (Optional) For maximum likelihood validation, construct an ML tree using IQ-TREE with ModelFinder for automatic substitution model selection. This step may not be feasible in all computational environments. 3.3 Export the tree in Newick format (.nwk). Ensure that tip labels match exactly with sequence names in all alignment files. 3.4 Visually verify the tree topology against published Ophiocordycipitaceae phylogenies to check for major topological anomalies. Expected output: 05_nad4_nj_tree.nwk
Step 4: Ancestral sequence reconstruction and site-specific dN/dS estimation
4.1 Reconstruct ancestral sequences along the species tree using maximum parsimony (Fitch algorithm). Count synonymous and nonsynonymous substitutions along each branch. 4.2 Compute site-specific dN (nonsynonymous) and dS (synonymous) substitution rates using the Nei-Gojobori method with Jukes-Cantor correction for multiple hits. Apply the mold/protozoan mitochondrial genetic code. 4.3 Calculate the selection pressure parameter: ω = dN/dS at each codon position. Interpretation: • ω < 1: purifying selection (conservation) • ω ≈ 1: neutral evolution • ω > 1: positive selection (adaptive evolution) 4.4 Identify positively selected sites (PSS) using Fisher's exact test with significance threshold P < 0.05. 4.5 Perform a complementary pairwise dN/dS analysis using the Nei-Gojobori method on all 120 pairwise comparisons among the 16-species nad4 alignment (537 codons after removal of terminal stop codons). 4.6 Characterize regional variation with a sliding window analysis: window size = 30 codons, step size = 10 codons. Identify codon regions with locally elevated ω values. Expected output: Site-specific dN, dS, and ω values; list of candidate positively selected codons (P < 0.05); sliding window profiles.
Step 5: PAML codeml analysis — M0 (one-ratio) model
5.1 Prepare the M0 control file (codeml_M0.ctl) with the following parameters 5.2 Execute codeml 5.3 Record key outputs from M0_output.out: Log-likelihood (lnL_M0), tree length, global ω estimate (dN/dS across all branches), and verify model convergence (check for 'Time used' line in output). 5.4 The M0-estimated branch lengths will be fixed in subsequent site model analyses (Steps 6–7) to reduce computational burden and improve convergence stability. Expected output: 06_PAML_M0_output.out; lnL_M0 and global ω values.
Step 6: PAML codeml analysis — M7 (beta) null model
6.1 Prepare the M7 control file (codeml_M7.ctl) 6.2 Execute codeml: codeml codeml_M7.ctl 6.3 Record key outputs: Log-likelihood (lnL_M7), parameters of the beta distribution (p, q), and verify model convergence. Expected output: 07_PAML_M7_output.out; lnL_M7 and beta distribution parameters.
Step 7: PAML codeml analysis — M8 (beta+ω) alternative model
7.1 Prepare the M8 control file (codeml_M8.ctl) 7.2 Execute codeml: codeml codeml_M8.ctl 7.3 Record key outputs: Log-likelihood (lnL_M8), proportion of sites in the ω > 1 class (p₁), estimated ω for the positively selected class (ω₁), and verify model convergence. Expected output: 08_PAML_M8_output.out

Step 8: Likelihood ratio test (LRT) between M7 and M8

8.1 Compute the likelihood ratio test statistic: LRT = 2 × (lnL_M8 −lnL_M7) 8.2 Determine degrees of freedom (df): M8 has 2 additional parameters compared to M7: (a) the proportion of sites in the ω > 1 class (p₁), and (b) the ω value for that class. Therefore, df = 2. 8.3 Obtain the P-value from a chi-squared distribution with df = 2. In Python: p_value = 1 - chi2.cdf(LRT, df=2) 8.4 Interpret the result:

P-value Interpretation
P < 0.05 Significant evidence for positive selection. Reject M7 in favor of M8.
P < 0.01 Strong evidence for positive selection.
P ≥ 0.05 No significant evidence. M7 cannot be rejected.
Expected output: LRT statistic, df, P-value.

Step 9: Bayes Empirical Bayes (BEB) site identification

9.1 Locate the BEB analysis section in the M8 output file. Search for the line containing 'Bayes Empirical Bayes (BEB) analysis'. 9.2 Extract codons with posterior probability of belonging to the ω >1 class:

Posterior probability Evidence strength
> 0.95 Strong evidence for positive selection
0.50–0.94 Weak/suggestive evidence
9.3 For each identified site, record: codon position in the alignment, amino acid at that position, posterior probability, and estimated ω. 9.4 Map the alignment codon position to amino acid positions in the nad4 protein, referencing the NADH dehydrogenase subunit 4 structure (Pfam: PF00361, PF01059). Expected output: Table of BEB-identified positively selected sites (position, amino acid, posterior probability, functional domain).
Step 10: Datamonkey server analysis
10.1 Navigate to the Datamonkey web server at https://www.datamonkey.org/ 10.2 Upload the gap-free FASTA alignment 03_nad4_alignment_nogap.fasta) and the NJ tree (05_nad4_nj_tree.nwk). Note: Datamonkey requires unique sequence names ≤ 30 characters without special characters. 10.3 Under the 'Selection' analysis menu, run three complementary methods: FEL (Fixed Effects Likelihood): Estimates dN and dS at each site independently using maximum likelihood. Parameters: default settings; significance threshold: P < 0.1. MEME (Mixed Effects Model of Evolution): Allows ω to vary across branches at each site, detecting episodic/diversifying selection. Parameters: default settings; significance threshold: P < 0.1. FUBAR (Fast Unconstrained Bayesian Approximation): Uses a Bayesian approach with Markov chain Monte Carlo (MCMC) to estimate posterior probabilities of dN > dS at each site. Parameters: default settings (5 grid points, 5 MCMC chains); significance threshold: posterior probability ≥ 0.9. 10.4 Download results (CSV format) for each method. Expected output: 11_datamonkey_results.txt; lists of significant sites from FEL, MEME, and FUBAR.
Step 11: Cross-validation across methods
11.1 Compile all positively selected sites identified by each method into a comparative summary table. 11.2 Classify confidence levels: • High confidence: Sites identified by ≥ 3 independent methods • Moderate confidence: Sites identified by 2 independent methods • Low confidence: Sites identified by a single method only Expected output: Cross-validation table; final list of high-confidence positively selected sites.

Step 12: Physicochemical property divergence analysis

12.1 For each candidate positively selected site, extract the amino acid in the reference sequence and the amino acid in the positively selected lineage. 12.2 Assess changes in the following physicochemical properties:

Property Scale/Method Interpretation
Hydrophobicity Kyte-Doolittle hydropathy index Change in membrane environment preference
Charge pKa of side chain at physiological pH Change from acidic (−) to basic (+) or vice versa
Molecular volume Amino acid side chain volume (ų) Steric effects on protein structure
Polarity Polar vs. nonpolar classification Change in solvent interaction
12.3 Classify each substitution as radical (≥ 2 properties change) or conservative (≤ 1 property change). Expected output: Physicochemical change summary table for each candidate site.
Step 13: Data deposition and reproducibility
13.1 Assemble all input, output, and intermediate files: alignment files, tree file, PAML control files, PAML output files, Datamonkey results, Python analysis scripts, codon usage and nucleotide composition data.
Protocol references
1. Yang, Z. (2007). PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution, 24(8), 1586–1591.
2. Yang, Z., Wong, W.S.W., & Nielsen, R. (2005). Bayes empirical Bayes inference of amino acid sites under positive selection. Molecular Biology and Evolution, 22(4), 1107–1118.
3. Nei, M., & Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution, 3(5), 418–426.
4. Pond, S.L.K., & Frost, S.D.W. (2005). Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics, 21(10), 2531–2533.
5. Murrell, B., et al. (2012). FUBAR: a fast, unconstrained Bayesian approximation for inferring selection. Molecular Biology and Evolution, 29(10), 3097–3108.
6. Murrell, B., et al. (2012). Detecting individual sites subject to episodic diversifying selection. PLoS Genetics, 8(7), e1002764.
7. Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406–425.
8. Cock, P.J.A., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423.
9. Katoh, K., & Standley, D.M. (2013). MAFFT multiple sequence alignment software version 7. Molecular Biology and Evolution, 30(4), 772–780.
10. Sharp, P.M., & Li, W.H. (1987). The codon Adaptation Index — a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15(3), 1281–1295.
Acknowledgements
Associated Resources: Zenodo dataset: 10.5281/zenodo.20438256 (https://doi.org/10.5281/zenodo.20438256)