Integrative bioinformatics and Mendelian randomization analysis to identify phospholipid metabolism-related biomarkers in type 1 diabetes mellitus

Songbai Liu

Mar 19, 2026

Integrative bioinformatics and Mendelian randomization analysis to identify phospholipid metabolism-related biomarkers in type 1 diabetes mellitus

DOI

https://dx.doi.org/10.17504/protocols.io.n92ld465ol5b/v1

Songbai Liu¹

¹Suzhou Vocational Health College

Anonymous

DOI: https://dx.doi.org/10.17504/protocols.io.n92ld465ol5b/v1

Protocol Citation: Songbai Liu 2026. Integrative bioinformatics and Mendelian randomization analysis to identify phospholipid metabolism-related biomarkers in type 1 diabetes mellitus. protocols.io https://dx.doi.org/10.17504/protocols.io.n92ld465ol5b/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: March 19, 2026

Last Modified: March 19, 2026

Protocol Integer ID: 313583

Keywords: diabetes mellitus, biomarkers in type, integrative bioinformatics, diagnostic biomarker, related biomarker, regulatory network prediction, mendelian randomization analysis, immune infiltration profiling, differential expression analysis, phospholipid metabolism in type, phospholipid metabolism, mendelian randomization, machine learning

Funders Acknowledgements:

the Jiangsu Provincial Health Commission Medical Research Project

Grant ID: M2022109

Abstract

This protocol describes a step-by-step computational workflow to identify and validate diagnostic biomarkers associated with phospholipid metabolism in type 1 diabetes mellitus (T1D). The workflow integrates differential expression analysis, machine learning, Mendelian randomization (MR), functional enrichment, immune infiltration profiling, regulatory network prediction, and molecular docking. All steps are designed to ensure reproducibility and transparency.

Guidelines

**Step-by-Step Protocol**

Data Acquisition**
Download transcriptomic datasets from the Gene Expression Omnibus (GEO;https://www.ncbi.nlm.nih.gov/gds/):
Training set: GSE156035 (platform GPL20844). Contains 20 T1D peripheral blood mononuclear cell (PBMC) samples and 20 healthy control PBMC samples.
Validation set: GSE72377 (platform GPL10558). Contains 15 T1D PBMC samples and 20 healthy control PBMC samples.
Access date: October 18, 2025.
Retrieve phospholipid metabolism-related genes (PMRGs):**
      Download a set of 212 PMRGs from the Molecular Signatures Database (MSigDB;https://www.gsea-msigdb.org/gsea/msigdb/) using the same access date.

Identification of Candidate Genes (CGs)**
Perform differential expression analysis on the training set using the limma package in R:
Input: Normalized expression matrix of GSE156035.
Thresholds: adjusted P 3c 0.05, |log~~2 fold change| 3e 0.5.
Output: List of differentially expressed genes (DEGs).
Identify overlapping genes between DEGs and PMRGs using the ggvenn package:
Command: ggvenn(list(DEGs, PMRGs))
Output: Candidate genes (CGs).

Enrichment Analysis and PPI Network Construction**
Perform Gene Ontology (GO) and KEGG pathway enrichment on CGs using clusterProfiler:
Parameters: p-value cutoff = 0.05, q-value cutoff = 0.2.
Visualize top enriched terms.
Construct a protein-protein interaction (PPI) network:**
Upload CG list to the STRING database (https://string-db.org/).
Parameters: confidence score 3e 0.15.
Download interaction data and visualize using Cytoscape (v3.10.2).

Machine Learning-Based Biomarker Screening**
Apply Least Absolute Shrinkage and Selection Operator (LASSO) regression using glmnet:
Method: 5-fold cross-validation.
Parameter: optimal λ (lambda.min) selected.
Output: LASSO candidate biomarkers.
Apply Support Vector Machine-Recursive Feature Elimination (SVM-RFE) using e1071:
Method: 5-fold cross-validation, recursive feature elimination.
Output: SVMRFE candidate biomarkers.
Apply Boruta algorithm using Boruta package:
Parameters: 100 iterations, P 3c 0.01.
Output: Boruta candidate biomarkers.
Take the intersection of LASSO candidate, SVMRFE candidate, and Boruta candidate biomarkers using ggvenn to obtain final candidate biomarkers.

Expression Validation and ROC Analysis**
Validate expression levels of candidate biomarkers in both training and validation sets using Wilcoxon signed-rank test (P 3c 0.05).
Perform receiver operating characteristic (ROC) curve analysis using pROC:
Calculate area under the curve (AUC) for each biomarker.
Threshold: AUC 3e 0.7 indicates good diagnostic performance.
Select final biomarkers based on consistent expression trends and high AUC in both cohorts.

Nomogram Construction and Evaluation**
Build a nomogram using rms package based on final biomarker expression in the training set.
Assess calibration with Hosmer-Lemeshow test (P 3e 0.05) and calibration curve.
Plot ROC curve for the nomogram and calculate AUC.
Perform decision curve analysis (DCA) using ggDCA to evaluate clinical utility.

Gene Set Enrichment Analysis (GSEA)**
Perform Spearman correlation between each final biomarker and all other genes in the training set using base R cor() function.
Rank genes by correlation coefficient.
Run GSEA using clusterProfiler with the MSigDB KEGG gene set "c2.cp.v2024.1.Hs.symbols.gmt".
      
Parameters: |normalized enrichment score (NES)| 3e 1, adjusted P 3c 0.05.
Visualize top enriched pathways using enrichplot.

Immune Infiltration Analysis**
Calculate enrichment scores for 28 immune cell types using single-sample GSEA (ssGSEA) implemented in GSVA:
      
Input: Normalized expression matrix of training set.
      
Method: ssGSEA.
Identify differentially infiltrated immune cells (DIICs) between T1D and control groups using Wilcoxon rank-sum test (P 3c 0.05).
Perform Spearman correlation analysis using psych:
      
      Between DIICs and between biomarkers and DIICs.
      
Thresholds: |correlation coefficient| 3e 0.3, P 3c 0.05.

Construction of miRNA-mRNA Regulatory Network**
Predict miRNAs targeting each final biomarker using:
      
      miRWalk database (http://mirwalk.umm.uni-heidelberg.de/)
      
      TarBase database (https://dianalab.e-ce.uth.gr/tarbasev9/)
Identify overlapping miRNAs from both databases.
Build regulatory network using Cytoscape.

Chromosomal and Subcellular Localization**
Map chromosomal locations of final biomarkers using RCircos.
Retrieve subcellular localization from GeneCards database (https://www.genecards.org/).

Drug Prediction and Molecular Docking**
Predict potential drugs targeting final biomarkers using DSigDB (Drug Signatures Database;https://dsigdb.tanlab.org/DSigDBv1.0/).
Select core drugs that target both biomarkers with highest combined scores.
Retrieve 3D structures:**
Protein structures from Protein Data Bank (PDB;https://www.rcsb.org/) or UniProt (https://www.uniprot.org/).
Drug structures from PubChem (https://pubchem.ncbi.nlm.nih.gov/).
Perform molecular docking using PyRx software (v0.8.0):
      
      Binding energy threshold:c -5 kcal/mol indicates favorable binding.
Visualize docking poses with PyMOL (v2.5; Schrödinger, LLC).

Statistical Analysis**
All statistical tests were performed in R (v4.3.3).**
Two-group comparisons: Wilcoxon rank-sum test.
Correlation analyses: Spearman’s test.
Statistical significance: two-tailed P c 0.05.

Materials

**Hardware: Standard computer (Windows, macOS, or Linux) with internet access.

**Software:**
R (version 4.3.3; The R Foundation for Statistical Computing, Vienna, Austria; https://www.r-project.org/)
RStudio (optional, but recommended)
Required R packages with versions and sources:
limma(v3.56.2; Bioconductor)
ggvenn(v0.1.10; CRAN)
clusterProfiler(v4.2.2; Bioconductor)
WGCNA(v1.71; CRAN)
glmnet(v4.1.8; CRAN)
e1071(v1.7.16; CRAN)
Boruta(v8.0.0; CRAN)
pROC(v1.18.5; CRAN)
rms(v6.8.1; CRAN)
ggDCA(v1.1.0; CRAN)
GSVA(v1.53.28; Bioconductor)
psych(v2.2.9; CRAN)
RCircos(v1.1.2; CRAN)
enrichplot(v1.18.4; Bioconductor)
ggplot2(v3.4.4; CRAN)
ComplexHeatmap(v2.18.0; Bioconductor)
TwoSampleMR(v0.5.6; MRC IEU)
PyRx(v0.8.0;https://pyrx.sourceforge.io/) for molecular docking.
Web browsers: Chrome, Firefox, or Safari (for accessing online databases).

Data Acquisition

Download transcriptomic datasets from the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/gds/):
Training set: GSE156035 (platform GPL20844). Contains 20 T1D peripheral blood mononuclear cell (PBMC) samples and 20 healthy control PBMC samples.
Validation set: GSE72377 (platform GPL10558). Contains 15 T1D PBMC samples and 20 healthy control PBMC samples.
Access date: October 18, 2025.

Retrieve phospholipid metabolism-related genes (PMRGs):
Download a set of 212 PMRGs from the Molecular Signatures Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/) using the same access date.

Identification of Candidate Genes (CGs)

Perform differential expression analysis on the training set using the limma package in R:
Input: Normalized expression matrix of GSE156035.
Thresholds: adjusted P c 0.05, |log~~2 fold change| e 0.5.
Output: List of differentially expressed genes (DEGs).

Identify overlapping genes between DEGs and PMRGs using the ggvenn package:
Command: ggvenn(list(DEGs, PMRGs))
Output: Candidate genes (CGs).

Enrichment Analysis and PPI Network Construction

Perform Gene Ontology (GO) and KEGG pathway enrichment on CGs using clusterProfiler:
Parameters: p-value cutoff = 0.05, q-value cutoff = 0.2.
Visualize top enriched terms.

Construct a protein-protein interaction (PPI) network:
Upload CG list to the STRING database (https://string-db.org/).
Parameters: confidence score e 0.15.
Download interaction data and visualize using Cytoscape (v3.10.2).

Machine Learning-Based Biomarker Screening

Apply Least Absolute Shrinkage and Selection Operator (LASSO) regression using glmnet:
Method: 5-fold cross-validation.
Parameter: optimal λ (lambda.min) selected.
Output: LASSO candidate biomarkers.

Apply Support Vector Machine-Recursive Feature Elimination (SVM-RFE) using e1071:
Method: 5-fold cross-validation, recursive feature elimination.
Output: SVMRFE candidate biomarkers.

Apply Boruta algorithm using Boruta package:
Parameters: 100 iterations, P c 0.01.
Output: Boruta candidate biomarkers.

Take the intersection of LASSO candidate, SVMRFE candidate, and Boruta candidate biomarkers using ggvenn to obtain final candidate biomarkers.

Expression Validation and ROC Analysis

Validate expression levels of candidate biomarkers in both training and validation sets using Wilcoxon signed-rank test (P c 0.05).

Perform receiver operating characteristic (ROC) curve analysis using pROC:
Calculate area under the curve (AUC) for each biomarker.
Threshold: AUC e 0.7 indicates good diagnostic performance.

Select final biomarkers based on consistent expression trends and high AUC in both cohorts.

Nomogram Construction and Evaluation

Build a nomogram using rms package based on final biomarker expression in the training set.

Assess calibration with Hosmer-Lemeshow test (P e 0.05) and calibration curve.

Plot ROC curve for the nomogram and calculate AUC.

Perform decision curve analysis (DCA) using ggDCA to evaluate clinical utility.

Gene Set Enrichment Analysis (GSEA)

Perform Spearman correlation between each final biomarker and all other genes in the training set using base R cor() function.

Rank genes by correlation coefficient.

Run GSEA using clusterProfiler with the MSigDB KEGG gene set "c2.cp.v2024.1.Hs.symbols.gmt".
Parameters: |normalized enrichment score (NES)| e 1, adjusted P c 0.05.

Visualize top enriched pathways using enrichplot.

Immune Infiltration Analysis

Calculate enrichment scores for 28 immune cell types using single-sample GSEA (ssGSEA) implemented in GSVA:
Input: Normalized expression matrix of training set.
Method: ssGSEA.

Identify differentially infiltrated immune cells (DIICs) between T1D and control groups using Wilcoxon rank-sum test (P c 0.05).

Perform Spearman correlation analysis using psych:
Between DIICs and between biomarkers and DIICs.
Thresholds: |correlation coefficient| e 0.3, P c 0.05.

Construction of miRNA-mRNA Regulatory Network

Predict miRNAs targeting each final biomarker using:
miRWalk database (http://mirwalk.umm.uni-heidelberg.de/)
TarBase database (https://dianalab.e-ce.uth.gr/tarbasev9/)

Identify overlapping miRNAs from both databases.

Build regulatory network using Cytoscape.

Chromosomal and Subcellular Localization

Map chromosomal locations of final biomarkers using RCircos.

Retrieve subcellular localization from GeneCards database (https://www.genecards.org/).

Drug Prediction and Molecular Docking

Predict potential drugs targeting final biomarkers using DSigDB (Drug Signatures Database; https://dsigdb.tanlab.org/DSigDBv1.0/).

Select core drugs that target both biomarkers with highest combined scores.

Retrieve 3D structures:
Protein structures from Protein Data Bank (PDB; https://www.rcsb.org/) or UniProt (https://www.uniprot.org/).
Drug structures from PubChem (https://pubchem.ncbi.nlm.nih.gov/).

Perform molecular docking using PyRx software (v0.8.0):
Binding energy threshold: c -5 kcal/mol indicates favorable binding.

Visualize docking poses with PyMOL (v2.5; Schrödinger, LLC).

Statistical Analysis

All statistical tests were performed in R (v4.3.3).

Two-group comparisons: Wilcoxon rank-sum test.

Correlation analyses: Spearman’s test.

Statistical significance: two-tailed P c 0.05.

Expected Outcomes

A list of DEGs, CGs, and final biomarkers.

LASSO, SVM-RFE, and Boruta feature selections.

Validated biomarkers (ETNK1 and CEPT1) with expression plots and ROC curves.

A nomogram with calibration, ROC, and DCA plots.

GSEA enrichment results and immune infiltration profiles.

miRNA-mRNA regulatory network.

Drug prediction and molecular docking results with binding energies.