Jul 03, 2026

Full in silico workflow for network toxicology: compound target mining, multi-cohort GEO transcriptome analysis, WGCNA, machine learning diagnostic modeling, SHAP interpretation, molecular docking and molecular dynamics simulation

  • Shenghao Li1,
  • Qing Peng2,3,
  • Liyuan Hao2,3,4,
  • Jingyu Mao1,
  • bingjie huo1
  • 1The Fourth Affiliated Hospital of Hebei Medical University;
  • 2Chengdu University of Traditional Chinese Medicine;
  • 3Hospital of Chengdu University of Traditional Chinese Medicine;
  • 4Hebei University of Chinese Medicine
Icon indicating open access to content
QR code linking to this content
Protocol CitationShenghao Li, Qing Peng, Liyuan Hao, Jingyu Mao, bingjie huo 2026. Full in silico workflow for network toxicology: compound target mining, multi-cohort GEO transcriptome analysis, WGCNA, machine learning diagnostic modeling, SHAP interpretation, molecular docking and molecular dynamics simulation. protocols.io https://dx.doi.org/10.17504/protocols.io.81wgbmdmyvpk/v1
License: This is an open access  protocol  distributed under the terms of the  Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
This protocol contains bioinformatic codes, S1-S19 raw tables and molecular simulation files for manuscript under peer review at PLOS Computational Biology.
Created: June 04, 2026
Last Modified: July 03, 2026
Protocol  Integer ID: 318509
Funders Acknowledgements:
Hebei Natural Science Foundation
Grant ID: H2022206592
Hebei Natural Science Foundation
Grant ID: H2023206920
the Central Guidance for Local Science and Technology Development Fund
Grant ID: 236Z7726G
The Fourth Hospital of Hebei Medical University Research Innovation Team Support Program
Grant ID: 2023B10
the Science and Technology Program of Hebei
Grant ID: 252W7717D
Abstract
This protocol describes a workflow exploring DEHP-induced gastric cancer mechanism submitted to PLOS Computational Biology, covering DEHP target prediction from multi-public databases, multi-GEO cohort transcriptome integration, differential gene screening, WGCNA co-expression network mining, core target PPI and functional enrichment, multi-algorithm machine learning biomarker construction, SHAP interpretative analysis, as well as follow-up molecular docking and 200 ns molecular dynamics verification. All original analytical R codes and raw result datasets S1–S19 are attached as supplementary compressed files. This protocol is kept private during peer review and will switch to public access after formal manuscript acceptance following PLOS open science policy.
Attachments
Guidelines
The bioinformatics analysis was carried out strictly in accordance with the standard R programming rules and the published network toxicology workflow specifications; all parameter settings were fully recorded in the accompanying source code and supplementary tables to ensure reproducibility.
Safety warnings
Pay attention to version consistency of R packages to avoid code runtime error; improper grid box parameter setting will lead to invalid molecular docking results; inappropriate solvation and equilibrium parameters may cause unstable MD simulation trajectory.
Before start
Before running the full pipeline, ensure R ≥4.5.0 and corresponding bioinformatics packages (limma, WGCNA, clusterProfiler, xgboost, shap) are pre-installed. All raw transcriptome datasets are downloaded from public GEO database with verified sample grouping information.
Section 1: Compound potential target identification (Table results S1)
The structural data of DEHP were obtained from the PubChem database. Potential target proteins associated with DEHP were collected from four publicly available databases: ChEMBL, PharmMapper and Swiss Target Prediction databases. All candidate targets from three databases were merged, duplicate genes were removed, and the final consolidated compound-target gene list was exported as supplementary table S1 Compound gene.
Section 2: GEO dataset collection and transcriptome preprocessing (Table results S2, S10,S11)
Six gastric cancer microarray datasets (GSE66229, GSE65801, GSE54129, GSE51575, GSE19826, GSE13911) were downloaded from the NCBI GEO database. GSE66229, GSE65801 and GSE54129 were defined as the training cohort, while GSE51575, GSE19826 and GSE13911 were used as the independent validation cohort. All probe-level expression data were converted to gene-level expression values based on corresponding GPL platform annotation files. Multiple probes matching the same gene were averaged, and unmapped probes were discarded. Raw expression values were log2-transformed, and cross-sample quantile normalization was performed using the limma package. Systematic batch effects among different datasets and platforms were corrected using SVA and ComBat algorithms. After integration of common genes across all cohorts, the final unified normalized expression matrix was generated and saved as S2 Normalized_Expression_Matrix. Detailed clinical and grouping annotation information for the training cohort and validation cohort were organized and summarized intoS10 experimental cohorts and S11 validation cohorts, respectively.
Section 3: Differential expression analysis (Table results S3, S14)
Based on the batch-corrected unified expression matrix, differentially expressed genes (DEGs) between gastric cancer and normal tissues were identified using the limma R package. The DEG screening thresholds were set as adjusted P-value < 0.05 and |log2FC| > 0.585 (1.5-fold change). The false discovery rate (FDR) method was applied for multiple testing correction. All statistically significant DEGs across the full integrated dataset were exported as S3 Differentially_Expressed_Genes_DEGs_Results. The subset of DEGs participating in subsequent machine learning modeling was further extracted and saved as S14 DE_significant_genes.
Section 4: WGCNA co-expression network analysis and key gene screening (Table results S4, S5)
Weighted gene co-expression network analysis (WGCNA) was performed to identify gastric cancer-related core gene modules. The optimal soft‑thresholding power was determined to construct a scale‑free network. Low-variation genes and outlier samples were removed before network construction. The adjacency matrix was transformed into a topological overlap matrix (TOM), and genes were clustered and partitioned into distinct modules using dynamic tree cutting. Module-trait correlation analysis was performed to screen disease-related key modules. Genes satisfying both high module membership (|MM| ≥ 0.8) and high gene significance (|GS| ≥ 0.2) were defined as hub module genes, which were exported as S4 WGCNA_Green_Module_Gene_List. Intersection analysis was performed between WGCNA hub module genes and disease-related gene sets, and overlapping key genes were retained and saved as S5 WGCNA_Disease_Gene_Set_Intersection.
Section 5: Core target screening, PPI network construction and GO/KEGG enrichment (Table results S6/S7/S8/S9)
The core DEHP-related gastric cancer targets were obtained by intersecting compound target genes (S1) and disease key genes (S5), and the final core target set was saved asS6 Intersection_Compound_Disease_Genes. Core overlapping genes were imported into the STRING database (species: Homo sapiens) to construct a protein-protein interaction network with a minimum interaction score of 0.15, and all interaction relationships were exported as S7 PPI. For functional enrichment analysis, Gene Ontology (GO) enrichment was performed using the DAVID database, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment was implemented via the clusterProfiler package. Significant GO terms and KEGG pathways were filtered and summarized into S9 GO and S8 KEGG, respectively.
Section 6: Machine learning model construction and feature evaluation (Table results S12/S13/S15/S16)
Multiple machine learning algorithms including glmBoost, Elastic Net, Random Forest and XGBoost were applied to establish diagnostic models based on the core candidate genes. All models were trained and optimized using k‑fold cross‑validation to avoid overfitting. The optimal model with the highest AUC performance in the validation cohort was selected. The final modeling feature genes were summarized in S12 model.genes. Model regression coefficients and corresponding confidence intervals were recorded in S13 Model_Coefficients_With_CI. Feature permutation importance analysis was performed to evaluate the predictive contribution of each gene, and the results were saved asS15 feature_permutation_importance. ROC analysis was conducted in both training and validation cohorts, and all AUC values and model performance results were summarized into S16 model_auc_summary.
Section 7: SHAP model interpretability analysis (Table results S17/S18/S19)
To interpret the predictive mechanism of the optimal machine learning model, SHAP analysis was performed to quantify the contribution of each core gene. Individual feature importance, grouped feature importance, and top-ranked key biomarker genes were calculated systematically. The quantitative SHAP results were exported asS17 shap_feature_importance, S18 shap_groupwise_importance, and S19 shap_top_genes, respectively.
Section 8: Molecular docking and molecular dynamics simulation (Molecular docking and molecular dynamics simulation)
For molecular docking verification, target protein crystal structures were downloaded from the PDB database and pretreated to remove water molecules, heteroatoms and redundant ligands, followed by hydrogen addition and charge assignment. Small-molecule ligand structures were obtained from PubChem and optimized for energy minimization. AutoDock Vina was used to perform molecular docking with a unified grid-box setting and standard docking parameters. The optimal binding conformation with the lowest binding energy was selected for subsequent analysis. All docking parameter files and binding energy results are provided in the attached simulation folder. For molecular dynamics simulation, the optimal docking complex was used as the initial structure for 200 ns all-atom MD simulation via GROMACS. The CHARMM36 force field and TIP3P water model were applied. The system was solvated, neutralized, energy-minimized, and equilibrated through NVT and NPT ensemble simulations. Structural stability indicators including RMSD, RMSF, Rg, SASA and hydrogen bond occupancy were calculated. Principal component analysis and free energy surface analysis were performed to evaluate global protein conformational changes. MM-PBSA binding free energy calculation was conducted to quantify the binding affinity. All simulation parameter files, topology files, MD configuration files and analytical result data are provided in the uploaded simulation attachment folder.