Full in silico workflow for network toxicology: compound target mining, multi-cohort GEO transcriptome analysis, WGCNA, machine learning diagnostic modeling, SHAP interpretation, molecular docking and  molecular dynamics simulation

Shenghao Li; Qing Peng; Liyuan Hao; Jingyu Mao; bingjie huo

Jul 03, 2026

Full in silico workflow for network toxicology: compound target mining, multi-cohort GEO transcriptome analysis, WGCNA, machine learning diagnostic modeling, SHAP interpretation, molecular docking and molecular dynamics simulation

DOI

https://dx.doi.org/10.17504/protocols.io.81wgbmdmyvpk/v1

Shenghao Li¹,
Qing Peng^2,3,
Liyuan Hao^2,3,4,
Jingyu Mao¹,
bingjie huo¹

¹The Fourth Affiliated Hospital of Hebei Medical University;
²Chengdu University of Traditional Chinese Medicine;
³Hospital of Chengdu University of Traditional Chinese Medicine;
⁴Hebei University of Chinese Medicine

bingjie huo

The Fourth Affiliated Hospital of Hebei Medical University

DOI: https://dx.doi.org/10.17504/protocols.io.81wgbmdmyvpk/v1

Protocol Citation: Shenghao Li, Qing Peng, Liyuan Hao, Jingyu Mao, bingjie huo 2026. Full in silico workflow for network toxicology: compound target mining, multi-cohort GEO transcriptome analysis, WGCNA, machine learning diagnostic modeling, SHAP interpretation, molecular docking and molecular dynamics simulation. protocols.io https://dx.doi.org/10.17504/protocols.io.81wgbmdmyvpk/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

This protocol contains bioinformatic codes, S1-S19 raw tables and molecular simulation files for manuscript under peer review at PLOS Computational Biology.

Created: June 04, 2026

Last Modified: July 03, 2026

Protocol Integer ID: 318509

Funders Acknowledgements:

Hebei Natural Science Foundation

Grant ID: H2022206592

Hebei Natural Science Foundation

Grant ID: H2023206920

the Central Guidance for Local Science and Technology Development Fund

Grant ID: 236Z7726G

The Fourth Hospital of Hebei Medical University Research Innovation Team Support Program

Grant ID: 2023B10

the Science and Technology Program of Hebei

Grant ID: 252W7717D

Abstract

This protocol describes a  workflow exploring DEHP-induced gastric cancer mechanism submitted to PLOS Computational Biology, covering DEHP target prediction from multi-public databases, multi-GEO cohort transcriptome integration, differential gene screening, WGCNA co-expression network mining, core target PPI and functional enrichment, multi-algorithm machine learning biomarker construction, SHAP interpretative analysis, as well as follow-up molecular docking and 200 ns molecular dynamics verification. All original analytical R codes and raw result datasets S1–S19 are attached as supplementary compressed files. This protocol is kept private during peer review and will switch to public access after formal manuscript acceptance following PLOS open science policy.

Attachments

Full_Source_Data.zip

90MB

Guidelines

The bioinformatics analysis was carried out strictly in accordance with the standard R programming rules and the published network toxicology workflow specifications; all parameter settings were fully recorded in the accompanying source code and supplementary tables to ensure reproducibility.

Safety warnings

Pay attention to version consistency of R packages to avoid code runtime error; improper grid box parameter setting will lead to invalid molecular docking results; inappropriate solvation and equilibrium parameters may cause unstable MD simulation trajectory.

Before start

Before running the full pipeline, ensure R ≥4.5.0 and corresponding bioinformatics packages (limma, WGCNA, clusterProfiler, xgboost, shap) are pre-installed. All raw transcriptome datasets are downloaded from public GEO database with verified sample grouping information.

Section 1: Compound potential target identification (Table results S1)

The structural data of DEHP were obtained from the PubChem database. Potential target proteins associated with DEHP were collected from four publicly available databases: ChEMBL, PharmMapper and Swiss Target Prediction databases. All candidate targets from three databases were merged, duplicate genes were removed, and the final consolidated compound-target gene list was exported as supplementary table S1 Compound gene.

Section 2: GEO dataset collection and transcriptome preprocessing (Table results S2, S10,S11)

Six gastric cancer microarray datasets (GSE66229, GSE65801, GSE54129, GSE51575, GSE19826, GSE13911) were downloaded from the NCBI GEO database. GSE66229, GSE65801 and GSE54129 were defined as the training cohort, while GSE51575, GSE19826 and GSE13911 were used as the independent validation cohort. All probe-level expression data were converted to gene-level expression values based on corresponding GPL platform annotation files. Multiple probes matching the same gene were averaged, and unmapped probes were discarded. Raw expression values were log2-transformed, and cross-sample quantile normalization was performed using the limma package. Systematic batch effects among different datasets and platforms were corrected using SVA and ComBat algorithms. After integration of common genes across all cohorts, the final unified normalized expression matrix was generated and saved as S2 Normalized_Expression_Matrix. Detailed clinical and grouping annotation information for the training cohort and validation cohort were organized and summarized intoS10 experimental cohorts and S11 validation cohorts, respectively.

Section 3: Differential expression analysis (Table results S3, S14)

Based on the batch-corrected unified expression matrix, differentially expressed genes (DEGs) between gastric cancer and normal tissues were identified using the limma R package. The DEG screening thresholds were set as adjusted P-value < 0.05 and |log2FC| > 0.585 (1.5-fold change). The false discovery rate (FDR) method was applied for multiple testing correction. All statistically significant DEGs across the full integrated dataset were exported as S3 Differentially_Expressed_Genes_DEGs_Results. The subset of DEGs participating in subsequent machine learning modeling was further extracted and saved as S14 DE_significant_genes.

Section 4: WGCNA co-expression network analysis and key gene screening (Table results S4, S5)

Weighted gene co-expression network analysis (WGCNA) was performed to identify gastric cancer-related core gene modules. The optimal soft‑thresholding power was determined to construct a scale‑free network. Low-variation genes and outlier samples were removed before network construction. The adjacency matrix was transformed into a topological overlap matrix (TOM), and genes were clustered and partitioned into distinct modules using dynamic tree cutting. Module-trait correlation analysis was performed to screen disease-related key modules. Genes satisfying both high module membership (|MM| ≥ 0.8) and high gene significance (|GS| ≥ 0.2) were defined as hub module genes, which were exported as S4 WGCNA_Green_Module_Gene_List. Intersection analysis was performed between WGCNA hub module genes and disease-related gene sets, and overlapping key genes were retained and saved as S5 WGCNA_Disease_Gene_Set_Intersection.

Section 5: Core target screening, PPI network construction and GO/KEGG enrichment (Table results S6/S7/S8/S9)

The core DEHP-related gastric cancer targets were obtained by intersecting compound target genes (S1) and disease key genes (S5), and the final core target set was saved asS6 Intersection_Compound_Disease_Genes. Core overlapping genes were imported into the STRING database (species: Homo sapiens) to construct a protein-protein interaction network with a minimum interaction score of 0.15, and all interaction relationships were exported as S7 PPI. For functional enrichment analysis, Gene Ontology (GO) enrichment was performed using the DAVID database, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment was implemented via the clusterProfiler package. Significant GO terms and KEGG pathways were filtered and summarized into S9 GO and S8 KEGG, respectively.

Section 6: Machine learning model construction and feature evaluation (Table results S12/S13/S15/S16)

Multiple machine learning algorithms including glmBoost, Elastic Net, Random Forest and XGBoost were applied to establish diagnostic models based on the core candidate genes. All models were trained and optimized using k‑fold cross‑validation to avoid overfitting. The optimal model with the highest AUC performance in the validation cohort was selected. The final modeling feature genes were summarized in S12 model.genes. Model regression coefficients and corresponding confidence intervals were recorded in S13 Model_Coefficients_With_CI. Feature permutation importance analysis was performed to evaluate the predictive contribution of each gene, and the results were saved asS15 feature_permutation_importance. ROC analysis was conducted in both training and validation cohorts, and all AUC values and model performance results were summarized into S16 model_auc_summary.

Section 7: SHAP model interpretability analysis (Table results S17/S18/S19)

To interpret the predictive mechanism of the optimal machine learning model, SHAP analysis was performed to quantify the contribution of each core gene. Individual feature importance, grouped feature importance, and top-ranked key biomarker genes were calculated systematically. The quantitative SHAP results were exported asS17 shap_feature_importance, S18 shap_groupwise_importance, and S19 shap_top_genes, respectively.

Section 8: Molecular docking and molecular dynamics simulation (Molecular docking and molecular dynamics simulation)

For molecular docking verification, target protein crystal structures were downloaded from the PDB database and pretreated to remove water molecules, heteroatoms and redundant ligands, followed by hydrogen addition and charge assignment. Small-molecule ligand structures were obtained from PubChem and optimized for energy minimization. AutoDock Vina was used to perform molecular docking with a unified grid-box setting and standard docking parameters. The optimal binding conformation with the lowest binding energy was selected for subsequent analysis. All docking parameter files and binding energy results are provided in the attached simulation folder. For molecular dynamics simulation, the optimal docking complex was used as the initial structure for 200 ns all-atom MD simulation via GROMACS. The CHARMM36 force field and TIP3P water model were applied. The system was solvated, neutralized, energy-minimized, and equilibrated through NVT and NPT ensemble simulations. Structural stability indicators including RMSD, RMSF, Rg, SASA and hydrogen bond occupancy were calculated. Principal component analysis and free energy surface analysis were performed to evaluate global protein conformational changes. MM-PBSA binding free energy calculation was conducted to quantify the binding affinity. All simulation parameter files, topology files, MD configuration files and analytical result data are provided in the uploaded simulation attachment folder.