Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence

Qiuyang Wei

May 29, 2026

Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence

DOI

https://dx.doi.org/10.17504/protocols.io.j8nlk7bdwg5r/v1

Qiuyang Wei¹

¹Chongqing Academy of Chinese Materia Medica

wqycsic

Qiuyang Wei

Chongqing Academy of Chinese Materia Medica

DOI: https://dx.doi.org/10.17504/protocols.io.j8nlk7bdwg5r/v1

Protocol Citation: Qiuyang Wei 2026. Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence. protocols.io https://dx.doi.org/10.17504/protocols.io.j8nlk7bdwg5r/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: May 29, 2026

Last Modified: May 29, 2026

Protocol Integer ID: 318173

Keywords: Ophiocordycipitaceae, nad4, mitochondrial genome, PAML, positive selection analysis of the mitochondrial nad4 gene, positive selection signatures in the mitochondrial nad4 gene, mitochondrial nad4 gene, taxa of the ascomycete family, ophiocordycipitaceae, ancestral sequence reconstruction with site, ascomycete family, physicochemical property divergence analysis of amino acid substitution, validating positive selection signature, confidence in positive selection inference, ancestral sequence reconstruction, positive selection inference, likelihood ratio test, amino acid substitution, positive selection analysis

Funders Acknowledgements:

The Science and Technology Research Program of Chongqing Municipal Education Commission

Grant ID: KJQN202515101

The Fundamental Scientific Research Funds Project of Chongqing Municipal Science and Technology Bureau

Grant ID: CSTB2025jxjl-jbkyX0017

Abstract

This protocol describes a comprehensive multi-method analytical workflow for detecting and validating positive selection signatures in the mitochondrial nad4 gene across 16 taxa of the ascomycete family Ophiocordycipitaceae. The protocol integrates four complementary analytical approaches: (i) ancestral sequence reconstruction with site-specific dN/dS estimation using the Nei-Gojobori method, (ii) PAML codeml site models (M0 one-ratio, M7 beta, M8 beta+ω) with likelihood ratio tests and Bayes Empirical Bayes (BEB) posterior probability-based site identification, (iii) Datamonkey web server analysis using FEL, MEME, and FUBAR algorithms, and (iv) physicochemical property divergence analysis of amino acid substitutions at candidate positively selected sites. Convergent evidence across independent methods strengthens confidence in positive selection inference. All input files, output files, control files, and analysis scripts are deposited at Zenodo (DOI: 10.5281/zenodo.20438256) for full reproducibility.

Guidelines

Ophiocordycipitaceae
(Hypocreales, Sordariomycetes) is a family of entomopathogenic fungi that includes the medicinally and economically significant Chinese cordyceps fungus Ophiocordyceps sinensis. Mitochondrial protein-coding genes evolve under predominantly purifying selection; however, lineage-specific positive selection on specific genes may reflect adaptive evolution associated with ecological niche differentiation, host switching, or metabolic specialization.
The mitochondrial nad4 gene encodes subunit 4 of NADH dehydrogenase (Complex I), a core component of the oxidative phosphorylation pathway. Amino acid substitutions in functionally critical domains — particularly those involved in proton translocation and ubiquinone binding — may affect enzymatic efficiency and organismal fitness under varying environmental conditions. Detecting positive selection on nad4 requires rigorous statistical approaches that distinguish genuine adaptive signals from stochastic noise.
This protocol integrates four analytical methods, each with distinct statistical frameworks and assumptions, to provide convergent evidence for positive selection. The combination of phylogenetic codon models (PAML), maximum likelihood approaches (Datamonkey), non-model-based pairwise comparisons (Nei-Gojobori), and structural/functional validation (physicochemical divergence) yields robust and reproducible results suitable for publication in peer-reviewed journals.Ophiocordycipitaceae (Hypocreales, Sordariomycetes) is a family of entomopathogenic fungi that includes the medicinally and economically significant Chinese cordyceps fungus Ophiocordyceps sinensis. Mitochondrial protein-coding genes evolve under predominantly purifying selection; however, lineage-specific positive selection on specific genes may reflect adaptive evolution associated with ecological niche differentiation, host switching, or metabolic specialization.
The mitochondrial nad4 gene encodes subunit 4 of NADH dehydrogenase (Complex I), a core component of the oxidative phosphorylation pathway. Amino acid substitutions in functionally critical domains — particularly those involved in proton translocation and ubiquinone binding — may affect enzymatic efficiency and organismal fitness under varying environmental conditions. Detecting positive selection on nad4 requires rigorous statistical approaches that distinguish genuine adaptive signals from stochastic noise.
This protocol integrates four analytical methods, each with distinct statistical frameworks and assumptions, to provide convergent evidence for positive selection. The combination of phylogenetic codon models (PAML), maximum likelihood approaches (Datamonkey), non-model-based pairwise comparisons (Nei-Gojobori), and structural/functional validation (physicochemical divergence) yields robust and reproducible results suitable for publication in peer-reviewed journals.

Before start

1. Verify that all input files from the Zenodo dataset are correctly downloaded and accessible.
2. Ensure PAML (codeml) is installed and available in the system PATH, or note its installation directory.
3. For Windows users without native compilation capability: PAML control files can be submitted to an AI-assisted analysis platform or a Linux server. The protocol describes both native execution and alternative execution paths.
4. The mold/protozoan mitochondrial genetic code (Translation Table 4) is used for all codon-based analyses. Verify that PAML control files specify icode = 3 (mold mitochondrial code).
5. Sequence names in alignment files and tree files must match exactly. Datamonkey additionally requires sequence names ≤ 30 characters without special characters.
Software	Version	Purpose	URL
MAFFT	≥ v7.0	Multiple sequence alignment with codon-aware mode	https://mafft.cbrc.jp/alignment/software/
MEGA	X or 11	Alignment visualization, NJ tree construction	https://www.megasoftware.net/
IQ-TREE	≥ v1.6 (optional)	Maximum likelihood tree with ModelFinder	http://www.iqtree.org/
PAML (codeml)	≥ v4.9	Codon substitution site models (M0, M7, M8)	http://abacus.gene.ucl.ac.uk/software/paml.html
Python	≥ 3.8	Alignment processing, ancestral reconstruction, dN/dS calculation	https://www.python.org/
Biopython	≥ 1.79	Sequence I/O, pairwise2 alignment, codonalign module	https://biopython.org/
Datamonkey	Web server	FEL, MEME, FUBAR selection analyses	https://www.datamonkey.org/

Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence

Step 1: Data collection and genome annotation extraction
1.1 Retrieve complete mitochondrial genome sequences from NCBI GenBank.
The dataset includes 16 species representing major clades within Ophiocordycipitaceae plus two outgroup species from Sordariomycetes (Podospora anserina, Neurospora crassa). Species selection criteria: (a) availability of complete or near-complete mitochondrial genome sequences, (b) representation of major phylogenetic clades, (c) inclusion of outgroup taxa for improved phylogenetic rooting.
1.2 Extract protein-coding gene (PCG) annotations from GenBank feature tables. Prioritize CDS features; use gene features as fallback where CDS annotations are absent.
1.3 Apply a length-filtering strategy to exclude intron-containing CDS annotations. Retain only CDS features with lengths within the expected range for each gene — for nad4, the expected range is 500–2,500 bp. Exclude annotations with internal stop codons or frameshift mutations.
1.4 Calculate nucleotide composition statistics for each genome: A%, T%, G%, C%, AT%, GC%, AT-skew = (A−T)/(A+T), GC-skew = (G−C)/(G+C). These provide baseline characterization of mitochondrial nucleotide composition, which may influence codon usage patterns and substitution models.
1.5 Identify 13 core PCGs (atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad5, nad6) present across species for downstream analyses.
Expected output: Annotated nad4 coding sequences for all 16 taxa; nucleotide composition summary table.

Step 2: Codon-aware multiple sequence alignment
2.1 Translate nad4 nucleotide sequences to amino acid sequences using the mold/protozoan mitochondrial genetic code (Translation Table 4). Verify that no internal stop codons are present.
2.2 Perform codon-aware multiple sequence alignment using MAFFT with the --auto algorithm selection mode, or MUSCLE v5 with default parameters. For scripted pipelines, an alternative approach is the codon-aware star alignment strategy: align at the amino acid level using the Needleman-Wunsch algorithm
(via Bio.pairwise2 in Biopython) and back-translate to codon triplets to preserve reading frame integrity.
2.3 Manually inspect the alignment in MEGA or AliView to verify: (a) codon boundaries are maintained without frameshifts, (b) no internal stop codons are present in any sequence, (c) alignment of conserved functional domains is visually coherent.
2.4 Export the alignment in three formats for different downstream tools:
• PHYLIP sequential format (.phy) — required by PAML codeml
• FASTA format (.fasta) — required by Datamonkey and gene-wide analyses
• NEXUS format (.nex) — for phylogenetic software compatibility
2.5 Prepare an additional gap-free version of the alignment by removing all columns containing gaps. This gap-free alignment is used for Datamonkey server submission and pairwise sliding-window analyses.
Final alignment dimensions: 1,614 bp; 538 codons; no gaps retained (gap-free version); 16 taxa.
Expected output: 01_nad4_alignment.phy,  2_nad4_alignment_clean.fasta, 03_nad4_alignment_nogap.fasta, 12_nad4_alignment.nex

Step 3: Phylogenetic tree reconstruction
3.1 Construct the starting phylogenetic tree using the Neighbor-Joining (NJ) method (Saitou & Nei, 1987) implemented in MEGA X:

  Parameter
    Setting
  
  Substitution model
    p-distance (or Kimura
  2-parameter)
  
  Bootstrap replicates
    100 (codon-aware)
  
  Gaps/Missing data
    Pairwise deletion
  
  Root
    Podospora anserina and Neurospora
  crassa (outgroup)
  
3.2 (Optional) For maximum likelihood validation, construct an ML tree using IQ-TREE with ModelFinder for automatic substitution model selection. This step may not be feasible in all computational environments.
3.3 Export the tree in Newick format (.nwk). Ensure that tip labels match exactly with sequence names in all alignment files.
3.4 Visually verify the tree topology against published Ophiocordycipitaceae phylogenies to check for major topological anomalies.
Expected output: 05_nad4_nj_tree.nwk

Step 4: Ancestral sequence reconstruction and
site-specific dN/dS estimation
4.1 Reconstruct ancestral sequences along the species tree using maximum parsimony (Fitch algorithm). Count synonymous and nonsynonymous substitutions along each branch.
4.2 Compute site-specific dN (nonsynonymous) and dS (synonymous) substitution rates using the Nei-Gojobori method with Jukes-Cantor correction for multiple hits. Apply the mold/protozoan mitochondrial genetic code.
4.3 Calculate the selection pressure parameter: ω = dN/dS at each codon position. Interpretation:
• ω < 1: purifying selection (conservation)
• ω ≈ 1: neutral evolution
• ω > 1: positive selection (adaptive evolution)
4.4 Identify positively selected sites (PSS) using Fisher's exact test with significance threshold P < 0.05.
4.5 Perform a complementary pairwise dN/dS analysis using the
Nei-Gojobori method on all 120 pairwise comparisons among the 16-species nad4 alignment (537 codons after removal of terminal stop codons).
4.6 Characterize regional variation with a sliding window analysis: window size = 30 codons, step size = 10 codons. Identify codon regions with locally elevated ω values.
Expected output: Site-specific dN, dS, and ω values; list of candidate positively selected codons (P < 0.05); sliding window profiles.

Step 5: PAML codeml analysis — M0 (one-ratio) model
5.1 Prepare the M0 control file (codeml_M0.ctl) with the following
parameters
5.2 Execute codeml
5.3 Record key outputs from M0_output.out: Log-likelihood (lnL_M0), tree length, global ω estimate (dN/dS across all branches), and verify model convergence (check for 'Time used' line in output).
5.4 The M0-estimated branch lengths will be fixed in subsequent site model analyses (Steps 6–7) to reduce computational burden and improve convergence stability.
Expected output: 06_PAML_M0_output.out; lnL_M0 and global ω values.

Step 6: PAML codeml analysis — M7 (beta) null model
6.1 Prepare the M7 control file (codeml_M7.ctl)
6.2 Execute codeml: codeml codeml_M7.ctl
6.3 Record key outputs: Log-likelihood (lnL_M7), parameters of the beta distribution (p, q), and verify model convergence.
Expected output: 07_PAML_M7_output.out; lnL_M7 and beta distribution parameters.

Step 7: PAML codeml analysis — M8 (beta+ω) alternative
model
7.1 Prepare the M8 control file (codeml_M8.ctl)
7.2 Execute codeml: codeml codeml_M8.ctl
7.3 Record key outputs: Log-likelihood (lnL_M8), proportion of sites in the ω > 1 class (p₁), estimated ω for the positively selected class (ω₁), and verify model convergence.
Expected output: 08_PAML_M8_output.out

Step 8: Likelihood ratio test (LRT) between M7 and M8
8.1 Compute the likelihood ratio test statistic: LRT = 2 × (lnL_M8 −lnL_M7)
8.2 Determine degrees of freedom (df): M8 has 2 additional parameters compared to M7: (a) the proportion of sites in the ω > 1 class (p₁), and (b) the ω value for that class. Therefore, df = 2.
8.3 Obtain the P-value from a chi-squared distribution with df = 2. In Python: p_value = 1 - chi2.cdf(LRT, df=2)
8.4 Interpret the result:

  P-value
    Interpretation
  
  P < 0.05
    Significant evidence for positive
  selection. Reject M7 in favor of M8.
  
  P < 0.01
    Strong evidence for positive
  selection.
  
  P ≥ 0.05
    No significant evidence. M7
  cannot be rejected.
  
Expected output: LRT statistic, df, P-value.

Step 9: Bayes Empirical Bayes (BEB) site identification
9.1 Locate the BEB analysis section in the M8 output file. Search for the line containing 'Bayes Empirical Bayes (BEB) analysis'.
9.2 Extract codons with posterior probability of belonging to the ω >1 class:

  Posterior probability
    Evidence strength
  
  > 0.95
    Strong evidence for positive
  selection
  
  0.50–0.94
    Weak/suggestive evidence
  
9.3 For each identified site, record: codon position in the alignment, amino acid at that position, posterior probability, and estimated ω.
9.4 Map the alignment codon position to amino acid positions in the nad4 protein, referencing the NADH dehydrogenase subunit 4 structure (Pfam: PF00361, PF01059).
Expected output: Table of BEB-identified positively selected sites (position, amino acid, posterior probability, functional domain).

Step 10: Datamonkey server analysis
10.1 Navigate to the Datamonkey web server at https://www.datamonkey.org/
10.2 Upload the gap-free FASTA alignment  03_nad4_alignment_nogap.fasta)
and the NJ tree (05_nad4_nj_tree.nwk). Note: Datamonkey requires unique sequence names ≤ 30 characters without special characters.
10.3 Under the 'Selection' analysis menu, run three complementary methods:
FEL (Fixed Effects Likelihood): Estimates dN and dS at each site
independently using maximum likelihood. Parameters: default settings; significance threshold: P < 0.1.
MEME (Mixed Effects Model of Evolution): Allows ω to vary across branches at each site, detecting episodic/diversifying selection. Parameters: default settings; significance threshold: P < 0.1.
FUBAR (Fast Unconstrained Bayesian Approximation): Uses a Bayesian approach with Markov chain Monte Carlo (MCMC) to estimate posterior probabilities of dN > dS at each site. Parameters: default settings (5 grid points, 5 MCMC chains); significance threshold: posterior probability ≥ 0.9.
10.4 Download results (CSV format) for each method.
Expected output: 11_datamonkey_results.txt; lists of significant sites from FEL, MEME, and FUBAR.

Step 11: Cross-validation across methods
11.1 Compile all positively selected sites identified by each method into a comparative summary table.
11.2 Classify confidence levels:
• High confidence: Sites identified by ≥ 3 independent methods
• Moderate confidence: Sites identified by 2 independent methods
• Low confidence: Sites identified by a single method only
Expected output: Cross-validation table; final list of high-confidence positively selected sites.

Step 12: Physicochemical property divergence analysis
12.1 For each candidate positively selected site, extract the amino acid in the reference sequence and the amino acid in the positively selected lineage.
12.2 Assess changes in the following physicochemical properties:

  Property
    Scale/Method
    Interpretation
  
  Hydrophobicity
    Kyte-Doolittle hydropathy index
    Change in membrane environment
  preference
  
  Charge
    pKa of side chain at
  physiological pH
    Change from acidic (−) to basic
  (+) or vice versa
  
  Molecular volume
    Amino acid side chain volume (Å³)
    Steric effects on protein
  structure
  
  Polarity
    Polar vs. nonpolar classification
    Change in solvent interaction
  
12.3 Classify each substitution as radical (≥ 2 properties change) or conservative (≤ 1 property change).
Expected output: Physicochemical change summary table for each candidate site.

Step 13: Data deposition and reproducibility
13.1 Assemble all input, output, and intermediate files: alignment
files, tree file, PAML control files, PAML output files, Datamonkey results, Python analysis scripts, codon usage and nucleotide composition data.

Protocol references

Yang, Z. (2007). PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution, 24(8), 1586–1591.
Yang, Z., Wong, W.S.W., & Nielsen, R. (2005). Bayes empirical Bayes inference of amino acid sites under positive selection. Molecular Biology and Evolution, 22(4), 1107–1118.
Nei, M., & Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution, 3(5), 418–426.
Pond, S.L.K., & Frost, S.D.W. (2005). Datamonkey: rapid detection of selective pressure on individual sites of codon alignments. Bioinformatics, 21(10), 2531–2533.
Murrell, B., et al. (2012). FUBAR: a fast, unconstrained Bayesian approximation for inferring selection. Molecular Biology and Evolution, 29(10), 3097–3108.
Murrell, B., et al. (2012). Detecting individual sites subject to episodic diversifying selection. PLoS Genetics, 8(7), e1002764.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406–425.
Cock, P.J.A., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422–1423.
Katoh, K., & Standley, D.M. (2013). MAFFT multiple sequence alignment software version 7. Molecular Biology and Evolution, 30(4), 772–780.
Sharp, P.M., & Li, W.H. (1987). The codon Adaptation Index — a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15(3), 1281–1295.

Acknowledgements

Associated Resources: Zenodo dataset: 10.5281/zenodo.20438256 (https://doi.org/10.5281/zenodo.20438256)


	Parameter	Setting
	Substitution model	p-distance (or Kimura 2-parameter)
	Bootstrap replicates	100 (codon-aware)
	Gaps/Missing data	Pairwise deletion
	Root	Podospora anserina and Neurospora crassa (outgroup)


	P-value	Interpretation
	P < 0.05	Significant evidence for positive selection. Reject M7 in favor of M8.
	P < 0.01	Strong evidence for positive selection.
	P ≥ 0.05	No significant evidence. M7 cannot be rejected.


	Posterior probability	Evidence strength
	> 0.95	Strong evidence for positive selection
	0.50–0.94	Weak/suggestive evidence


Property	Scale/Method	Interpretation
Hydrophobicity	Kyte-Doolittle hydropathy index	Change in membrane environment preference
Charge	pKa of side chain at physiological pH	Change from acidic (−) to basic (+) or vice versa
Molecular volume	Amino acid side chain volume (Å³)	Steric effects on protein structure
Polarity	Polar vs. nonpolar classification	Change in solvent interaction

Protocol: Positive Selection Analysis of the Mitochondrial nad4 Gene in Ophiocordycipitaceae Using Multi-Method Convergent Evidence

Step 3: Phylogenetic tree reconstruction

3.1 Construct the starting phylogenetic tree using the Neighbor-Joining (NJ) method (Saitou & Nei, 1987) implemented in MEGA X:

Step 8: Likelihood ratio test (LRT) between M7 and M8

Step 9: Bayes Empirical Bayes (BEB) site identification

9.1 Locate the BEB analysis section in the M8 output file. Search for the line containing 'Bayes Empirical Bayes (BEB) analysis'. 9.2 Extract codons with posterior probability of belonging to the ω >1 class:

Step 12: Physicochemical property divergence analysis

12.1 For each candidate positively selected site, extract the amino acid in the reference sequence and the amino acid in the positively selected lineage. 12.2 Assess changes in the following physicochemical properties: