May 27, 2026

5-Fold Spatial Block Cross-Validation Enhanced MaxEnt Species Distribution Modeling with MESS Extrapolation Risk Assessment for Ophiocordyceps sinensis on the Tibetan Plateau

  • Qiuyang Wei1
  • 1Chongqing Academy of Chinese Materia Medica
  • wqycsic
Icon indicating open access to content
QR code linking to this content
Protocol CitationQiuyang Wei 2026. 5-Fold Spatial Block Cross-Validation Enhanced MaxEnt Species Distribution Modeling with MESS Extrapolation Risk Assessment for Ophiocordyceps sinensis on the Tibetan Plateau. protocols.io https://dx.doi.org/10.17504/protocols.io.3byl4me2rlo5/v1
License: This is an open access  protocol  distributed under the terms of the  Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: May 27, 2026
Last Modified: May 27, 2026
Protocol  Integer ID: 317989
Keywords: 5-Fold Spatial Block Cross-Validation, MaxEnt, Distribution Modeling , MESS Extrapolation Risk Assessment, validation enhanced maxent species distribution modeling, future climatic suitability of ophiocordyceps sinensi, ophiocordyceps sinensis on the tibetan plateau, species occurrence data, filtered species occurrence data, final habitat suitability map, mess extrapolation risk assessment, ophiocordyceps sinensi, future climatic suitability, tibetan plateau, multivariate environmental similarity surface, analysis for extrapolation risk quantification, threshold sensitivity analysis, extrapolation risk quantification, reproducible from raw occurrence data, selected climate change scenario
Funders Acknowledgements:
The Science and Technology Research Program of Chongqing Municipal Education Commission
Grant ID: KJQN202515101
The Fundamental Scientific Research Funds Project of Chongqing Municipal Science and Technology Bureau
Grant ID: CSTB2025jxjl-jbkyX0017
Disclaimer
Ophiocordyceps sinensis is a critically important and economically valuable entomopathogenic fungus endemic to high-altitude meadows (3,000–5,000 m) of the Tibetan Plateau. It is IUCN-assessed as Vulnerable (VU). Understanding how climate change will alter its climatically suitable habitat is essential for evidence-based conservation planning. This protocol employs MaxEnt species distribution modeling (SDM) enhanced by rigorous spatial validation to minimize overfitting due to spatial autocorrelation, a common source of inflated model performance in traditional random-split validation approaches.
Abstract
This protocol describes a reproducible workflow for modeling the potential distribution and future climatic suitability of Ophiocordyceps sinensis (Berk.) G.H. Sung et al. across the Tibetan Plateau, China, under a selected climate change scenario (BCC-CSM2-MR, SSP3-7.0). The protocol integrates: (1) spatially filtered species occurrence data, (2) 5-fold spatial block cross-validation (SBCV) implemented in ENMTools v.1.1.2, (3) MaxEnt v.3.4.1 modeling with optimized hyperparameters, (4) Multivariate Environmental Similarity Surface (MESS) analysis for extrapolation risk quantification, and (5) threshold sensitivity analysis with three alternative thresholding methods. All steps are designed to be fully reproducible from raw occurrence data through to final habitat suitability maps.
Safety warnings
Important Limitations and Caveats
1. Single GCM and scenario: All future projections are strictly conditional on BCC-CSM2-MR and SSP3-7.0. Results may differ substantially under other GCMs or emission scenarios.
2. Modelled climatic suitability only: The model predicts climatically suitable habitat based on climatic, topographic, and edaphic predictors. Actual O. sinensis population persistence additionally depends on: (a) harvesting pressure and overexploitation; (b) host population dynamics (Thitarodes spp.); (c) land use change; (d) governance and protected area coverage; (e) microbial community interactions.
3. MESS interpretation: MESS values reported are for occurrence localities only, not for all projected grid cells. Projected cells with low/negative MESS values represent environmental novelty and require cautious interpretation.
4. Soil and terrain stasis: Soil and terrain variables are held constant across time periods, which may introduce uncertainty if land degradation occurs.
5. Sample size and range: While 201 points exceeds the ~50-point MaxEnt performance stabilization threshold, the species' distribution is still undersampled in some border regions (Nepal, Bhutan, India).
Before start
Materials and Equipment
Software:
• ArcGIS 10.8 (ESRI, Redlands, CA) — for spatial data processing, visualization, and map production
• SDM Toolbox v2.5 (Brown 2014; Brown et al. 2017) — for centroid shift analysis and areal change quantification
• Python 3.9 with: NumPy ≥1.23, SciPy ≥1.9, Scikit-learn 1.2, Matplotlib 3.7, Pandas ≥1.5 — for MESS analysis
• Java Runtime Environment (JRE) ≥8 — required for MaxEnt
Input Data:
• Species occurrence records: 207 raw records → 201 spatially filtered records (10 km thinning) (Supplementary Table S1 in the associated Zenodo repository)
• WorldClim v2.1 bioclimatic variables (19 variables, 30-arcsecond resolution, ~1 km²) for 1970–2000 historical baseline: https://www.worldclim.org/data/bioclim.html
• BCC-CSM2-MR CMIP6 climate projections under SSP3-7.0 for 2021–2040 and 2041–2060 (30-arcsecond resolution): https://www.worldclim.org/data/cmip6/cmip6_clim30s.html
• HWSD v2.0 soil variables (16 pedological variables, 0–30 cm depth): https://data.tpdc.ac.cn/zh-hans/
• Digital Elevation Model (DEM) at 30-arcsecond resolution — for deriving slope and aspect
• China boundary mask (GS (2022)1873) for spatial harmonization
Step 1: Occurrence Data Collection and Spatial Filtering
1.1 Data sources: Compile occurrence records from: • GBIF (https://www.gbif.org) — search "Ophiocordyceps sinensis" • Discover Life (https://www.discoverlife.org) • ESPECIES platform (http://www.especies.cn)   • Published ecological survey literature (manual extraction)
1.2 Data cleaning: • Remove duplicate records (identical lat/lon) • Remove records outside China boundary (apply China mask GS (2022)1873) • Remove records with coordinate uncertainty >10 km • Visual inspection for obvious geocoding errors (e.g., points in ocean)
1.3 Spatial thinning (ENMTools v1.1.2):  • Apply 10 km minimum nearest-neighbor distance filter • Command: Use ENMTools thin.algorithm = "random" with dist = 10 km • Expected output: 201 spatially independent occurrence points • Save as: occurrences_filtered_201pts.csv (columns: species, longitude, latitude)
CAUTION: Spatial thinning reduces sampling bias but may remove genuine range-edge records. Document the number of records removed at each step.
 
Environmental Variable Preparation
2.1 Download and prepare environmental layers: • Download WorldClim v2.1 bioclimatic variables at 30-arcsecond resolution (bio_1 to bio_19) • Download HWSD v2.0 soil variables (select 16 pedological variables for 0–30 cm topsoil layer) • Derive slope and aspect from DEM using ArcGIS Spatial Analyst: Surface > Slope / Aspect • Download BCC-CSM2-MR SSP3-7.0 projections for 2021–2040 and 2041–2060
2.2 Spatial harmonization (ArcGIS 10.8): • Reproject all layers to WGS84 geographic coordinate system • Resample all layers to 30-arcsecond (~1 km²) resolution using bilinear interpolation • Clip all layers to China boundary mask (GS (2022)1873) • Verify spatial alignment: all layers must have identical extent, resolution, and CRS
2.3 Initial variable screening (MaxEnt v3.4.1, first run): • Run MaxEnt with all 38 variables, 10,000 background points, 500 iterations, logistic output • Exclude variables with percent contribution <0.1% • Note remaining ecologically relevant variables from published O. sinensis literature
2.4 Multicollinearity filtering (ENMTools v1.1.2): • Compute Pearson correlation matrix for all candidate variables across occurrence locations • Remove one variable from each pair with |r| ≥ 0.8, retaining the variable with higher ecological relevance and contribution • Final optimized predictor set: 16 variables (alt, bio_1, bio_3, bio_4, bio_5, bio_17, bio_18, aspect, slope, t_caco3, t_esp, t_gravel, t_oc, t_ph_h2o, t_silt, t_usda_tex)
5-Fold Spatial Block Cross-Validation Setup
3.1 Spatial block generation (ENMTools v1.1.2):
• Divide study area into non-overlapping 1°×1° geographic grid blocks covering the full O. sinensis distribution range
• Total: 24 geographic blocks
• Block size rationale: matches the spatial autocorrelation scale of O. sinensis on the Tibetan Plateau
3.2 Fold assignment:
• For each of 5 folds: randomly assign 1 geographic block as the independent test set
• All remaining blocks form the training set
• Ensure no geographic overlap between training and test sets in each fold
• Fold assignments must be non-overlapping across the 5 folds
3.3 Background point selection:
• Generate 10,000 random background points from the full study area (China boundary mask)
• Stratify background points spatially to ensure coverage of full environmental space
• Exclude background points within 10 km of known occurrence localities
MaxEnt Model Tuning and Running
4.1 Hyperparameter tuning:
• Test regularization multipliers (rm): 0.5, 1.0, 1.5, 2.0, 2.5, 3.0
• Test feature class combinations: L (linear), LQ (linear+quadratic), LQH (linear+quadratic+hinge), LQHP, LQHPT
• For each rm × feature class combination, run MaxEnt with 5-fold SBCV
• Select the combination maximizing mean test AUC across 5 folds
• Final optimal settings: feature classes = LQH; regularization multiplier = 1.0
4.2 MaxEnt v3.4.1 settings for final model:
• Maximum iterations: 500
• Background points: 10,000 (stratified random)
• Output format: logistic
• Replicates: 11 bootstrap replicates per SBCV fold (= 55 MaxEnt runs total)
• Random seed: use fixed seed for reproducibility (e.g., seed = 42)
• Jackknife: enabled (for variable importance analysis)
• Do NOT use random test percentage (SBCV handles this externally)
4.3 Performance evaluation:
• Primary metric: mean test AUC across all SBCV folds and bootstrap replicates
• Secondary metrics: omission rate at FCV1 threshold, True Skill Statistic (TSS)
• TSS = Sensitivity + Specificity – 1; values >0.5 indicate reliable performance
• Expected results: mean training AUC = 0.963 ± 0.001; mean test AUC = 0.953 ± 0.002; TSS ≈ 0.85
4.4 Final model:
• Train on all 201 occurrence points with optimal hyperparameters
• Run 11 bootstrap replicates; average outputs for final suitability map
MESS Extrapolation Risk Analysis
5.1 Data preparation: • Extract environmental values at all 201 occurrence points for all 16 optimized predictors • Partition into training (n=180, 90%) and test (n=21, 10%) sets • Standardize all variables: Z-score based on training set statistics (mean, SD) • Apply same transformation to test set using training set parameters (prevent data leakage)
5.2 MESS computation (Python 3.9): • For each variable i and each occurrence point j: - Calculate Si,j based on the min-max range of the training dataset - If p(i,j) < min(training): Si,j = ((p(i,j) - min(training)) / range(training)) × 100 - If p(i,j) > max(training): Si,j = ((max(training) - p(i,j)) / range(training)) × 100 - If min ≤ p(i,j) ≤ 50th percentile: Si,j = (p(i,j) - min) / (50th -min) × 100 - If 50th percentile < p(i,j) ≤ max: Si,j = (max - p(i,j)) / (max -50th) × 100 • Weight each variable by its percent contribution to MaxEnt model • Total MESS score = Σ(wi × Si,j) for all 16 predictors • MESS range: −∞ (extreme novelty) to 1.0 (complete similarity)
5.3 Risk stratification: • Low Risk: Si ≥ 0.8 (highly matched environment) • Moderate Risk: 0.5 ≤ Si < 0.8 (moderately matched) • High Risk: 0.2 ≤ Si < 0.5 (poorly matched) • Very High Risk: 0 ≤ Si < 0.2 (minimally matched) • Extreme Risk: Si < 0 (environment entirely out of training range)
5.4 Interpretation note: • IMPORTANT: MESS values reported here characterize the 201 occurrence localities only. • They do not characterize MESS distribution across all grid cells in the projected landscape.
• Projected grid cells with low MESS scores should be flagged and interpreted with caution. • Expected results: mean MESS = 0.734 ± 0.135; no extreme risk samples (Si < 0) in occurrence dataset
Habitat Suitability Classification and Temporal Projections
6.1 Suitability thresholding (ArcGIS 10.8): • Primary threshold classification: - Very Unsuitable: <0.05 - Unsuitable: 0.05–0.30 - Low Suitable: 0.30–0.50 - Moderately Suitable: 0.50–0.70 - Highly Suitable: >0.70 • Lower bound justification: 0.30 = minimum occurrence probability from field validation
6.2 Alternative threshold validation: • MaxSS threshold: threshold maximizing sum of training sensitivity + specificity • 10% OMT threshold: threshold at which 10% of training localities are omitted • ±0.05 interval shifts: 0.25 and 0.35 as lower bounds • Compare area estimates and spatial patterns across all thresholds
6.3 Temporal projections: • Apply final MaxEnt model to: (a) Historical baseline: 1970–2000 WorldClim layers (b) Near-term: 2021–2040 BCC-CSM2-MR SSP3-7.0 layers (c) Mid-century: 2041–2060 BCC-CSM2-MR SSP3-7.0 layers • NOTE: All future projections are conditional on the BCC-CSM2-MR model and SSP3-7.0 scenario. • Soil and terrain variables held constant across time periods.
6.4 Habitat change quantification (SDM Toolbox v2.5): • Calculate net area change between periods (denominator = 1970–2000 baseline suitable area = 396,000 km²) • Quantify range expansion and contraction zones • Calculate distribution centroid shifts (km and direction) • All areal percentage changes must clearly state denominator
Statistical Analysis and Reporting
7.1 AUC and TSS: Report mean ± SD across all folds and replicates
7.2 Omission rates: Report at FCV1 threshold (mean ± SD)
7.3 Variable importance: Report percent contribution and permutation importance from MaxEnt
7.4 Spatial block statistics: Report mean prediction probability and training-test differences per block
7.5 MESS statistics: Report mean ± SD, range, and risk category distribution for occurrence points
7.6 Habitat area estimates: Express as both km² and % of China land area; always specify denominator for percentage changes
7.7 Threshold sensitivity: Report range of area estimates across all tested thresholds
Protocol references
1. Phillips SJ, Dudík M (2008) Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31:161–175.
2. Merow C, Smith MJ, Silander JA (2013) A practical guide to MaxEnt for modeling species' distributions. Ecography 36:1058–1069.
3. Brown JL (2014) SDMtoolbox: a Python-based GIS toolkit for landscape genetic, biogeographic and species distribution model analyses. Methods Ecol Evol 5:694–700.
4. Brown JL, Bennett JR, French CM (2017) SDMtoolbox 2.0. PeerJ 5:e4095.
5. Warren DL, Matzke NJ, Cardillo M et al. (2021) ENMTools 1.0: an R package for comparative studies of species' niches and distributions. Ecography 44:504–511.
6. Elith J, Kearney M, Phillips SJ (2010) The art of modelling range-shifting species. Methods Ecol Evol 1:330–342.
7. Norberg A et al. (2019) A comprehensive evaluation of predictive performance of 33 species distribution models. Ecol Monogr 89:e01370.