Mar 19, 2026

Integrative bioinformatics and Mendelian randomization analysis to identify phospholipid metabolism-related biomarkers in type 1 diabetes mellitus

  • Songbai Liu1
  • 1Suzhou Vocational Health College
Icon indicating open access to content
QR code linking to this content
Protocol Citation Songbai Liu 2026. Integrative bioinformatics and Mendelian randomization analysis to identify phospholipid metabolism-related biomarkers in type 1 diabetes mellitus. protocols.io https://dx.doi.org/10.17504/protocols.io.n92ld465ol5b/v1
License: This is an open access  protocol  distributed under the terms of the  Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: March 19, 2026
Last Modified: March 19, 2026
Protocol  Integer ID: 313583
Keywords: diabetes mellitus, biomarkers in type, integrative bioinformatics, diagnostic biomarker, related biomarker, regulatory network prediction, mendelian randomization analysis, immune infiltration profiling, differential expression analysis, phospholipid metabolism in type, phospholipid metabolism, mendelian randomization, machine learning
Funders Acknowledgements:
the Jiangsu Provincial Health Commission Medical Research Project
Grant ID: M2022109
Abstract
This protocol describes a step-by-step computational workflow to identify and validate diagnostic biomarkers associated with phospholipid metabolism in type 1 diabetes mellitus (T1D). The workflow integrates differential expression analysis, machine learning, Mendelian randomization (MR), functional enrichment, immune infiltration profiling, regulatory network prediction, and molecular docking. All steps are designed to ensure reproducibility and transparency.
Guidelines
**Step-by-Step Protocol**

1. Data Acquisition**
1. Download transcriptomic datasets from the Gene Expression Omnibus (GEO;https://www.ncbi.nlm.nih.gov/gds/):
1. Training set: GSE156035 (platform GPL20844). Contains 20 T1D peripheral blood mononuclear cell (PBMC) samples and 20 healthy control PBMC samples.
2. Validation set: GSE72377 (platform GPL10558). Contains 15 T1D PBMC samples and 20 healthy control PBMC samples.
3. Access date: October 18, 2025.
2. Retrieve phospholipid metabolism-related genes (PMRGs):**
Download a set of 212 PMRGs from the Molecular Signatures Database (MSigDB;https://www.gsea-msigdb.org/gsea/msigdb/) using the same access date.

2. Identification of Candidate Genes (CGs)**
1. Perform differential expression analysis on the training set using the limma package in R:
1. Input: Normalized expression matrix of GSE156035.
2. Thresholds: adjusted P 3c 0.05, |log~~2 fold change| 3e 0.5.
3. Output: List of differentially expressed genes (DEGs).
2. Identify overlapping genes between DEGs and PMRGs using the ggvenn package:
1. Command: ggvenn(list(DEGs, PMRGs))
2. Output: Candidate genes (CGs).

3. Enrichment Analysis and PPI Network Construction**
1. Perform Gene Ontology (GO) and KEGG pathway enrichment on CGs using clusterProfiler:
1. Parameters: p-value cutoff = 0.05, q-value cutoff = 0.2.
2. Visualize top enriched terms.
2. Construct a protein-protein interaction (PPI) network:**
1. Upload CG list to the STRING database (https://string-db.org/).
2. Parameters: confidence score 3e 0.15.
3. Download interaction data and visualize using Cytoscape (v3.10.2).

4. Machine Learning-Based Biomarker Screening**
1. Apply Least Absolute Shrinkage and Selection Operator (LASSO) regression using glmnet:
1. Method: 5-fold cross-validation.
2. Parameter: optimal λ (lambda.min) selected.
3. Output: LASSO candidate biomarkers.
2. Apply Support Vector Machine-Recursive Feature Elimination (SVM-RFE) using e1071:
1. Method: 5-fold cross-validation, recursive feature elimination.
2. Output: SVMRFE candidate biomarkers.
3. Apply Boruta algorithm using Boruta package:
1. Parameters: 100 iterations, P 3c 0.01.
2. Output: Boruta candidate biomarkers.
4. Take the intersection of LASSO candidate, SVMRFE candidate, and Boruta candidate biomarkers using ggvenn to obtain final candidate biomarkers.

5. Expression Validation and ROC Analysis**
1. Validate expression levels of candidate biomarkers in both training and validation sets using Wilcoxon signed-rank test (P 3c 0.05).
2. Perform receiver operating characteristic (ROC) curve analysis using pROC:
1. Calculate area under the curve (AUC) for each biomarker.
2. Threshold: AUC 3e 0.7 indicates good diagnostic performance.
3. Select final biomarkers based on consistent expression trends and high AUC in both cohorts.

6. Nomogram Construction and Evaluation**
1. Build a nomogram using rms package based on final biomarker expression in the training set.
2. Assess calibration with Hosmer-Lemeshow test (P 3e 0.05) and calibration curve.
3. Plot ROC curve for the nomogram and calculate AUC.
4. Perform decision curve analysis (DCA) using ggDCA to evaluate clinical utility.

7. Gene Set Enrichment Analysis (GSEA)**
1. Perform Spearman correlation between each final biomarker and all other genes in the training set using base R cor() function.
2. Rank genes by correlation coefficient.
3. Run GSEA using clusterProfiler with the MSigDB KEGG gene set "c2.cp.v2024.1.Hs.symbols.gmt".
Parameters: |normalized enrichment score (NES)| 3e 1, adjusted P 3c 0.05.
4. Visualize top enriched pathways using enrichplot.

8. Immune Infiltration Analysis**
1. Calculate enrichment scores for 28 immune cell types using single-sample GSEA (ssGSEA) implemented in GSVA:
Input: Normalized expression matrix of training set.
Method: ssGSEA.
2. Identify differentially infiltrated immune cells (DIICs) between T1D and control groups using Wilcoxon rank-sum test (P 3c 0.05).
3. Perform Spearman correlation analysis using psych:
Between DIICs and between biomarkers and DIICs.
Thresholds: |correlation coefficient| 3e 0.3, P 3c 0.05.

9. Construction of miRNA-mRNA Regulatory Network**
1. Predict miRNAs targeting each final biomarker using:
miRWalk database (http://mirwalk.umm.uni-heidelberg.de/)
TarBase database (https://dianalab.e-ce.uth.gr/tarbasev9/)
2. Identify overlapping miRNAs from both databases.
3. Build regulatory network using Cytoscape.

10. Chromosomal and Subcellular Localization**
1. Map chromosomal locations of final biomarkers using RCircos.
2. Retrieve subcellular localization from GeneCards database (https://www.genecards.org/).

11. Drug Prediction and Molecular Docking**
1. Predict potential drugs targeting final biomarkers using DSigDB (Drug Signatures Database;https://dsigdb.tanlab.org/DSigDBv1.0/).
2. Select core drugs that target both biomarkers with highest combined scores.
3. Retrieve 3D structures:**
1. Protein structures from Protein Data Bank (PDB;https://www.rcsb.org/) or UniProt (https://www.uniprot.org/).
2. Drug structures from PubChem (https://pubchem.ncbi.nlm.nih.gov/).
4. Perform molecular docking using PyRx software (v0.8.0):
Binding energy threshold:c -5 kcal/mol indicates favorable binding.
5. Visualize docking poses with PyMOL (v2.5; Schrödinger, LLC).

12. Statistical Analysis**
1. All statistical tests were performed in R (v4.3.3).**
2. Two-group comparisons: Wilcoxon rank-sum test.
3. Correlation analyses: Spearman’s test.
4. Statistical significance: two-tailed P c 0.05.
Materials
**Hardware: Standard computer (Windows, macOS, or Linux) with internet access.

**Software:**
1. R (version 4.3.3; The R Foundation for Statistical Computing, Vienna, Austria; https://www.r-project.org/)
2. RStudio (optional, but recommended)
3. Required R packages with versions and sources:
1. limma(v3.56.2; Bioconductor)
2. ggvenn(v0.1.10; CRAN)
3. clusterProfiler(v4.2.2; Bioconductor)
4. WGCNA(v1.71; CRAN)
5. glmnet(v4.1.8; CRAN)
6. e1071(v1.7.16; CRAN)
7. Boruta(v8.0.0; CRAN)
8. pROC(v1.18.5; CRAN)
9. rms(v6.8.1; CRAN)
10. ggDCA(v1.1.0; CRAN)
11. GSVA(v1.53.28; Bioconductor)
12. psych(v2.2.9; CRAN)
13. RCircos(v1.1.2; CRAN)
14. enrichplot(v1.18.4; Bioconductor)
15. ggplot2(v3.4.4; CRAN)
16. ComplexHeatmap(v2.18.0; Bioconductor)
17. TwoSampleMR(v0.5.6; MRC IEU)
4. PyRx(v0.8.0;https://pyrx.sourceforge.io/) for molecular docking.
5. Web browsers: Chrome, Firefox, or Safari (for accessing online databases).
Data Acquisition
Download transcriptomic datasets from the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/gds/):
Training set: GSE156035 (platform GPL20844). Contains 20 T1D peripheral blood mononuclear cell (PBMC) samples and 20 healthy control PBMC samples.
Validation set: GSE72377 (platform GPL10558). Contains 15 T1D PBMC samples and 20 healthy control PBMC samples.
Access date: October 18, 2025.
Retrieve phospholipid metabolism-related genes (PMRGs):
Download a set of 212 PMRGs from the Molecular Signatures Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/) using the same access date.
Identification of Candidate Genes (CGs)
Perform differential expression analysis on the training set using the limma package in R:
Input: Normalized expression matrix of GSE156035.
Thresholds: adjusted P c 0.05, |log~~2 fold change| e 0.5.
Output: List of differentially expressed genes (DEGs).
Identify overlapping genes between DEGs and PMRGs using the ggvenn package:
Command: ggvenn(list(DEGs, PMRGs))
Output: Candidate genes (CGs).
Enrichment Analysis and PPI Network Construction
Perform Gene Ontology (GO) and KEGG pathway enrichment on CGs using clusterProfiler:
Parameters: p-value cutoff = 0.05, q-value cutoff = 0.2.
Visualize top enriched terms.
Construct a protein-protein interaction (PPI) network:
Upload CG list to the STRING database (https://string-db.org/).
Parameters: confidence score e 0.15.
Download interaction data and visualize using Cytoscape (v3.10.2).
Machine Learning-Based Biomarker Screening
Apply Least Absolute Shrinkage and Selection Operator (LASSO) regression using glmnet:
Method: 5-fold cross-validation.
Parameter: optimal λ (lambda.min) selected.
Output: LASSO candidate biomarkers.
Apply Support Vector Machine-Recursive Feature Elimination (SVM-RFE) using e1071:
Method: 5-fold cross-validation, recursive feature elimination.
Output: SVMRFE candidate biomarkers.
Apply Boruta algorithm using Boruta package:
Parameters: 100 iterations, P c 0.01.
Output: Boruta candidate biomarkers.
Take the intersection of LASSO candidate, SVMRFE candidate, and Boruta candidate biomarkers using ggvenn to obtain final candidate biomarkers.
Expression Validation and ROC Analysis
Validate expression levels of candidate biomarkers in both training and validation sets using Wilcoxon signed-rank test (P c 0.05).
Perform receiver operating characteristic (ROC) curve analysis using pROC:
Calculate area under the curve (AUC) for each biomarker.
Threshold: AUC e 0.7 indicates good diagnostic performance.
Select final biomarkers based on consistent expression trends and high AUC in both cohorts.
Nomogram Construction and Evaluation
Build a nomogram using rms package based on final biomarker expression in the training set.
Assess calibration with Hosmer-Lemeshow test (P e 0.05) and calibration curve.
Plot ROC curve for the nomogram and calculate AUC.
Perform decision curve analysis (DCA) using ggDCA to evaluate clinical utility.
Gene Set Enrichment Analysis (GSEA)
Perform Spearman correlation between each final biomarker and all other genes in the training set using base R cor() function.
Rank genes by correlation coefficient.
Run GSEA using clusterProfiler with the MSigDB KEGG gene set "c2.cp.v2024.1.Hs.symbols.gmt".
Parameters: |normalized enrichment score (NES)| e 1, adjusted P c 0.05.
Visualize top enriched pathways using enrichplot.
Immune Infiltration Analysis
Calculate enrichment scores for 28 immune cell types using single-sample GSEA (ssGSEA) implemented in GSVA:
Input: Normalized expression matrix of training set.
Method: ssGSEA.
Identify differentially infiltrated immune cells (DIICs) between T1D and control groups using Wilcoxon rank-sum test (P c 0.05).
Perform Spearman correlation analysis using psych:
Between DIICs and between biomarkers and DIICs.
Thresholds: |correlation coefficient| e 0.3, P c 0.05.
Construction of miRNA-mRNA Regulatory Network
Predict miRNAs targeting each final biomarker using:
miRWalk database (http://mirwalk.umm.uni-heidelberg.de/)
TarBase database (https://dianalab.e-ce.uth.gr/tarbasev9/)
Identify overlapping miRNAs from both databases.
Build regulatory network using Cytoscape.
Chromosomal and Subcellular Localization
Map chromosomal locations of final biomarkers using RCircos.
Retrieve subcellular localization from GeneCards database (https://www.genecards.org/).
Drug Prediction and Molecular Docking
Predict potential drugs targeting final biomarkers using DSigDB (Drug Signatures Database; https://dsigdb.tanlab.org/DSigDBv1.0/).
Select core drugs that target both biomarkers with highest combined scores.
Retrieve 3D structures:
Protein structures from Protein Data Bank (PDB; https://www.rcsb.org/) or UniProt (https://www.uniprot.org/).
Drug structures from PubChem (https://pubchem.ncbi.nlm.nih.gov/).
Perform molecular docking using PyRx software (v0.8.0):
Binding energy threshold: c -5 kcal/mol indicates favorable binding.
Visualize docking poses with PyMOL (v2.5; Schrödinger, LLC).
Statistical Analysis
All statistical tests were performed in R (v4.3.3).
Two-group comparisons: Wilcoxon rank-sum test.
Correlation analyses: Spearman’s test.
Statistical significance: two-tailed P c 0.05.
Expected Outcomes
A list of DEGs, CGs, and final biomarkers.
LASSO, SVM-RFE, and Boruta feature selections.
Validated biomarkers (ETNK1 and CEPT1) with expression plots and ROC curves.
A nomogram with calibration, ROC, and DCA plots.
GSEA enrichment results and immune infiltration profiles.
miRNA-mRNA regulatory network.
Drug prediction and molecular docking results with binding energies.