Jun 15, 2026
  • Zhou Gong¹,
  • Beirong Zhang²,
  • Xiaofang He¹,
  • Bowen Zhong²,
  • Yueling Zhu¹,
  • Lichun He¹,
  • Kangning Tan³,
  • Zhu Liu³,
  • Jing Chen²,
  • Zhen Liang²,
  • Xu Zhang¹,
  • Yukui Zhang²,
  • Lihua Zhang²,
  • Maili Liu¹,
  • Qun Zhao²
  • ¹Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences;
  • ²Dalian Institute of Chemical Physics, Chinese Academy of Sciences;
  • ³Huazhong Agricultural University
  • X-PIE
Protocol Citation: Zhou Gong, Beirong Zhang, Xiaofang He, Bowen Zhong, Yueling Zhu, Lichun He, Kangning Tan, Zhu Liu, Jing Chen, Zhen Liang, Xu Zhang, Yukui Zhang, Lihua Zhang, Maili Liu, Qun Zhao 2026. X-PIE curation. protocols.io https://dx.doi.org/10.17504/protocols.io.n2bvj5r6bgk5/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: June 06, 2026
Last Modified: June 15, 2026
Protocol Integer ID: 318639
Keywords: crosslinking mass spectrometry (XL-MS), protein-protein interaction, STRING, PDB database, data filtering, protein crosslink psm, crosslinking mass spectrometry, ambiguous protein, protein interaction, confidence protein, protein, experimental rcsb pdb entries for homologous structural support, mass spectrometry, annotating plink, crosslinked site pair, experimental rcsb pdb entry, homologous structural support
Abstract
X-PIE curation is a computational pipeline for filtering, validating, and annotating pLink crosslinking mass spectrometry (XL-MS) results to generate high-confidence protein-protein interaction (PPI) datasets. The workflow filters target-target inter-protein crosslink PSMs, resolves ambiguous protein-pair assignments by dataset-level frequency voting, summarizes unique crosslinked site pairs, and separates STRING-reported PPIs from candidate novel PPIs. For STRING-unreported PPIs, the workflow further searches experimental RCSB PDB entries for homologous structural support and retains strict hits only when both proteins exceed a user-defined local sequence identity threshold. This protocol describes the complete workflow from input data preparation to final output generation.
Guidelines
Overview & Scope
X-PIE curation is a computational module designed to process raw pLink cross-linking mass spectrometry (XL-MS) results for protein-protein interaction (PPI) discovery and validation. The workflow filters inter-protein cross-linked peptide-spectrum matches (PSMs), resolves cross-link site pairs, queries the STRING database for functional association evidence, and searches the RCSB PDB for homologous experimental complex structures. This protocol is intended for users who wish to systematically prioritize high-confidence PPIs from XL-MS datasets and obtain structural templates for downstream integrative modeling (e.g., the X-PIE modeling pipeline).

Prerequisites & Skills
  • Operating system: Linux/Unix environment with Python 3.13 installed.
  • Programming skills: Basic command-line proficiency; ability to edit configuration paths and run Python scripts.
  • Domain knowledge: Familiarity with pLink output formats, XL-MS FDR control, and UniProt accession conventions.

Network Access Requirements
This workflow is network-dependent. It requires active internet connections to:
  • STRING (functional association scores)
  • RCSB PDB (homologous structure retrieval)
If the network is unavailable, the workflow can either:
  • Exit immediately (recommended for automated pipelines), or
  • Run in local-only mode, skipping STRING and PDB annotation.
Materials
  • CPU: Single-core execution is sufficient; no GPU required.
  • Memory: < 4 GB RAM for typical datasets (≤10⁵ PSMs).
  • Disk space: Output is minimal (~MB scale); log files are timestamped.
  • Runtime: Seconds to minutes, depending on API response times from STRING and RCSB PDB.
Troubleshooting
Problem
Missing columns in the input CSV
Solution
Check whether the pLink export contains all required columns:
- `Peptide_Type`
- `Protein_Type`
- `Score`
- `Target_Decoy`
- `Q-value`
- `Proteins`
Problem
No rows remain after filtering
Solution
Possible reasons include:
- very few target-target inter-protein crosslinks in the source data
- an overly strict FDR threshold
Suggested actions:
- inspect `Target_Decoy`
- confirm `Peptide_Type` and `Protein_Type`
- relax `--fdr` only if scientifically justified
Problem
The `Proteins` field cannot be parsed
Solution
Check whether the annotation format matches one of the supported accession-site patterns.
Problem
Many proteins fail STRING mapping
Solution
Check whether:
- the input contains mixed organisms
- the UniProt accessions are valid
- isoform-heavy accessions require manual review
Problem
The PDB searching process is slow
Solution
The PDB stage depends on external UniProt and RCSB services and may require substantially more time for large sets of STRING-unreported PPIs.
Safety warnings
The database retrieval step may be time-consuming, depending on the size of the dataset to be analyzed and the network access speed. Please monitor network connectivity and response status periodically.
Before start
The following software or components must be installed and confirmed to be functioning correctly prior to running X-PIE curation.
The following component versions have been tested and confirmed to operate normally. Other versions may work but have not been validated.

Python 3.13.2
Biopython 1.85
NumPy 1.26.4
pandas 2.3.1
requests 2.32.4


Internet access is required for the UniProt, STRING, and RCSB PDB web services used during annotation and homologous structure search.

Prepare Input Files
Place one or more cross-linking result files (CSV format) exported from pLink in the working directory (default: ./input).

Each input CSV file should contain the following columns:
- Peptide_Type
- Protein_Type
- Score
- Target_Decoy
- Q-value
- Proteins
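Before launching the workflow, the header of a pLink export can be checked against the required columns listed above. The following is a minimal sketch using only the Python standard library; the function name `missing_columns` and the two sample header strings are illustrative assumptions, not part of X-PIE curation.

```python
import csv
import io

# Required columns as listed in this protocol.
REQUIRED_COLUMNS = ["Peptide_Type", "Protein_Type", "Score",
                    "Target_Decoy", "Q-value", "Proteins"]

def missing_columns(csv_text):
    """Return the required column names absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [name.strip() for name in next(reader, [])]
    return [name for name in REQUIRED_COLUMNS if name not in header]

# Illustrative header rows (hypothetical, shortened for readability).
header_ok = "Order,Peptide_Type,Score,Target_Decoy,Q-value,Proteins,Protein_Type\n"
header_bad = "Order,Peptide_Type,Score,Proteins\n"

print(missing_columns(header_ok))
print(missing_columns(header_bad))
```

For a file on disk, the same function can be applied to the text returned by `open(path, newline="").read()`. An empty list means every required column is present.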

The `Proteins` column must contain protein-site annotations in one of the following supported formats:

1) sp|P12345|PROT1(100)-sp|P67890|PROT2(200)
2) P12345(100)-P67890(200)
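To make the two supported formats concrete, the sketch below extracts the accession and site from each side of an annotation. It is an illustration only: the regular expression and the function name `parse_pair` are assumptions, and the parser inside X-PIE curation may differ.

```python
import re

# One protein-site token in either supported form:
#   sp|P12345|PROT1(100)   or   P12345(100)
# Group 1 is the accession, group 2 is the site number.
SITE_TOKEN = re.compile(
    r"(?:[a-z]{2}\|)?([A-Z0-9][A-Za-z0-9-]*)(?:\|[^()|]*)?\((\d+)\)"
)

def parse_pair(annotation):
    """Return ((accession1, site1), (accession2, site2)) for one annotation."""
    tokens = SITE_TOKEN.findall(annotation)
    if len(tokens) != 2:
        raise ValueError(f"unsupported annotation: {annotation!r}")
    (acc1, site1), (acc2, site2) = tokens
    return (acc1, int(site1)), (acc2, int(site2))

print(parse_pair("sp|P12345|PROT1(100)-sp|P67890|PROT2(200)"))
print(parse_pair("P12345(100)-P67890(200)"))
```

Both calls yield the same accession and site values, which is why the two formats can be mixed within a dataset in this sketch.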
If multiple candidate protein pairs are reported in a single row and separated by /, the workflow resolves them by frequency voting across the complete dataset.
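The protocol does not spell out the voting rule, so the following sketch shows one plausible reading of dataset-level frequency voting: each candidate pair receives one vote per row in which it appears, and every ambiguous row is then assigned its most frequently seen candidate. The function name, the tie-breaking rule (alphabetical order), and the accessions in the example are all assumptions made for illustration.

```python
from collections import Counter

def resolve_by_voting(rows):
    """Pick one protein pair per row by dataset-wide candidate frequency.

    rows: list of rows, each a list of candidate (accession, accession) pairs.
    """
    # Treat (A, B) and (B, A) as the same pair, and count each pair once per row.
    normalized_rows = [sorted({tuple(sorted(pair)) for pair in candidates})
                       for candidates in rows]
    votes = Counter(pair for candidates in normalized_rows for pair in candidates)
    # max() keeps the first maximum, so ties fall back to alphabetical order.
    return [max(candidates, key=lambda pair: votes[pair])
            for candidates in normalized_rows]

# Hypothetical accessions: rows 2 and 3 are ambiguous.
rows = [
    [("P1", "P2")],
    [("P1", "P2"), ("P1", "P3")],
    [("P1", "P3"), ("P2", "P1")],
]
print(resolve_by_voting(rows))
```

Here the pair P1/P2 appears in three rows and P1/P3 in two, so every row resolves to P1/P2.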

A complete example of the CSV content is shown below; the first line is the header row.

Order,Title,Charge,Precursor_MH,Peptide_Type,Peptide,Peptide_MH,Modifications,Refined_Score,SVM_Score,Re-score_CSM,Score,E-value,Precursor_Mass_Error(Da),Precursor_Mass_Error(ppm),Target_Decoy,Q-value,Q-value_CSM,isRetainHighLevel,Proteins,Protein_Type,FileID,isComplexSatisfied,isFilterIn
1,Hela-TDS-G2-90min-FAIMS-40CV-1.19104.19104.4.0.dta,4,1588.88,0,THLNLVVLGHVDSGK,1588.88,null,84.1477,13.0125,13.0125,2.232246e-006,1.000000e+000,-0.001322,-0.832032,2,0.000000,0.000000,1,sp|P68104|EF1A1_HUMAN /sp|Q05639|EF1A2_HUMAN /,0,1,1,1
2,Hela-TDS-G2-90min-FAIMS-40CV-1.15490.15490.4.0.dta,4,2406.187283,0,HGEVCPAGWKPGSDTLKPDVQK,2406.186944,Carbamidomethyl[C](5),98.382610,12.903400,12.903400,2.489565e-006,1.000000e+000,0.000339,0.140887,2,0.000000,0.000000,1,sp|Q06830|PRDX1_HUMAN /,0,1,1,1
3,Hela-TDS-G2-90min-FAIMS-40CV-1.19063.19063.4.0.dta,4,1588.879355,0,THLNLVVLGHVDSGK,1588.880489,null,83.276366,12.843100,12.843100,2.644304e-006,1.000000e+000,-0.001134,-0.713710,2,0.000000,0.000000,1,sp|P68104|EF1A1_HUMAN /sp|Q05639|EF1A2_HUMAN /,0,1,1,1
4,Hela-TDS-G2-90min-FAIMS-40CV-1.28169.28169.4.0.dta,4,2477.368347,0,KGVNLPGAAVDLPAVSEKDLQDLK,2477.360859,null,95.635368,12.778600,12.778600,2.820482e-006,1.000000e+000,0.007488,3.022571,2,0.000000,0.000000,1,sp|P14618|KPYM_HUMAN /,0,1,1,1
5,Hela-TDS-G2-90min-FAIMS-40CV-1.16954.16954.4.0.dta,4,1562.894183,0,GFLYVHKPPVHLR,1562.895356,null,81.691361,12.753200,12.753200,2.893039e-006,1.000000e+000,-0.001173,-0.750530,2,0.000000,0.000000,1,sp|Q08945|SSRP1_HUMAN /,0,1,1,1


Run X-PIE curation
Execute Command

python x-pie-curation.py

This command runs the program in interactive mode. At each step, the program will prompt you to enter relevant information while simultaneously displaying the default values for these parameters.

The program will then proceed to Step 0: Checking dependencies.
It automatically verifies whether the aforementioned software has been properly installed. If any dependency is missing, a warning message will be displayed and the program will exit.
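A comparable dependency check can be run by hand before starting the program. The sketch below compares installed package versions with the tested versions listed in this protocol; the function name `check_dependencies` and the report layout are assumptions, and the program's own Step 0 may work differently.

```python
from importlib import metadata

# Versions reported as tested in this protocol (distribution names on PyPI).
TESTED_VERSIONS = {
    "biopython": "1.85",
    "numpy": "1.26.4",
    "pandas": "2.3.1",
    "requests": "2.32.4",
}

def check_dependencies(tested=TESTED_VERSIONS):
    """Map each package to (installed_version_or_None, tested_version)."""
    report = {}
    for name, tested_version in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, tested_version)
    return report

for name, (installed, tested_version) in check_dependencies().items():
    status = "MISSING" if installed is None else installed
    print(f"{name}: installed={status}, tested={tested_version}")
```

A version that differs from the tested one is not necessarily a problem, since other versions may work but have not been validated.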

Note
At startup, the workflow also checks Internet access to UniProt, STRING, and RCSB PDB.
- If all three services are reachable, the workflow continues normally.
- If internet access is unavailable, the program prints: `Internet connection is required for STRING and PDB annotation`
- In interactive mode, you can then choose to:
  - exit immediately
  - continue with local XL-MS filtering only and skip STRING/PDB annotation
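Connectivity to the three services can also be tested independently of the program. The sketch below is an assumption-laden illustration: the URLs are the public home pages of the services, not necessarily the endpoints that X-PIE curation contacts, and the function names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumed probe targets; the workflow may use different endpoints.
SERVICES = {
    "UniProt": "https://rest.uniprot.org",
    "STRING": "https://string-db.org",
    "RCSB PDB": "https://www.rcsb.org",
}

def is_reachable(url, timeout=10):
    """Return True if the server answers at all within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, even if with an error status
    except (urllib.error.URLError, OSError):
        return False

def unreachable_services(services=SERVICES, probe=is_reachable):
    """List the names of services for which the probe fails."""
    return [name for name, url in services.items() if not probe(url)]

# Offline demonstration with a stand-in probe that pretends STRING is down;
# omit the `probe` argument to test real connectivity.
down = unreachable_services(probe=lambda url: "string-db" not in url)
print(down)
```

An empty list from `unreachable_services()` suggests that the annotation steps can proceed; any listed name points to the service to investigate first.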

If all dependencies are correctly installed, the program will prompt you to enter the directory containing the input pLink CSV files, and the output directory:

Enter the input CSV file or folder containing pLink CSV files [input]:
Enter the output directory [output]:

The program will then prompt you to enter the relevant parameter thresholds sequentially.


Enter the cross-link PSM FDR threshold (%) [1.0]:
This controls the maximum q-value allowed for retaining a crosslink PSM (peptide-spectrum match).
- The value is entered as a percentage, for example `1` for 1% FDR.
- Lower values are more stringent and retain fewer but more confident PSMs.
- Higher values retain more PSMs but may introduce lower-confidence identifications.
- This parameter directly affects how many inter-protein crosslink PSMs survive the quality-control step.
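The quality-control step can be pictured as a row filter: keep cross-linked, inter-protein, target-target PSMs whose q-value does not exceed the threshold. In the sketch below the numeric codes (`3` for cross-linked peptides, `2` for inter-protein, `2` for target-target) are assumptions about the pLink export and should be verified against your own files; they are exposed as parameters for that reason. The function name and the example rows are illustrative, not real data.

```python
def retain_tt_interprotein(rows, fdr_percent=1.0,
                           crosslink_code="3", inter_code="2", tt_code="2"):
    """Keep target-target inter-protein cross-link PSMs with q-value <= FDR.

    rows: dicts keyed by column name with string values, as produced by
    csv.DictReader. The three *_code defaults are ASSUMED pLink conventions.
    """
    cutoff = fdr_percent / 100.0  # the prompt takes percent; q-values are fractions
    kept = []
    for row in rows:
        if row["Peptide_Type"] != crosslink_code:
            continue  # not a cross-linked peptide pair
        if row["Protein_Type"] != inter_code:
            continue  # not an inter-protein cross-link
        if row["Target_Decoy"] != tt_code:
            continue  # at least one peptide is a decoy
        if float(row["Q-value"]) <= cutoff:
            kept.append(row)
    return kept

# Illustrative rows (made up for this example).
example_rows = [
    {"Peptide_Type": "3", "Protein_Type": "2", "Target_Decoy": "2", "Q-value": "0.005"},
    {"Peptide_Type": "3", "Protein_Type": "2", "Target_Decoy": "2", "Q-value": "0.020"},
    {"Peptide_Type": "3", "Protein_Type": "1", "Target_Decoy": "2", "Q-value": "0.000"},
    {"Peptide_Type": "0", "Protein_Type": "0", "Target_Decoy": "2", "Q-value": "0.000"},
]
print(len(retain_tt_interprotein(example_rows, fdr_percent=1.0)))
print(len(retain_tt_interprotein(example_rows, fdr_percent=5.0)))
```

Raising the threshold from 1% to 5% admits the second row, which illustrates how a looser FDR retains more PSMs.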



Enter the minimum XL site-pair count required to keep a PPI [1]:
This defines how many unique crosslinked site pairs must support a protein pair before it is retained as a curated PPI.
- For example, `1` keeps any PPI supported by at least one unique XL site pair.
- A larger value such as `2` or `3` increases stringency by requiring multiple independent crosslink site pairs for the same PPI.
- This parameter mainly affects the confidence and size of the final curated file.
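As an illustration of how this threshold acts, the sketch below counts unique site pairs per protein pair and keeps only the pairs that reach the minimum. The function name, the tuple layout, and the accessions are assumptions; treating A(10)-B(20) and B(20)-A(10) as the same site pair is also an assumption about how uniqueness is defined.

```python
from collections import defaultdict

def filter_by_site_pairs(site_pairs, min_site_pairs=1):
    """Return {(protein1, protein2): unique_site_pair_count} above the minimum.

    site_pairs: iterable of (protein1, site1, protein2, site2) tuples.
    """
    unique_sites = defaultdict(set)
    for p1, s1, p2, s2 in site_pairs:
        # Order the two ends so that swapped duplicates collapse into one entry.
        end_a, end_b = sorted([(p1, s1), (p2, s2)])
        unique_sites[(end_a[0], end_b[0])].add((end_a[1], end_b[1]))
    return {ppi: len(sites) for ppi, sites in unique_sites.items()
            if len(sites) >= min_site_pairs}

# Hypothetical resolved cross-links; the second entry duplicates the first.
observed = [
    ("P1", 10, "P2", 20),
    ("P2", 20, "P1", 10),
    ("P1", 15, "P2", 30),
    ("P3", 5, "P4", 7),
]
print(filter_by_site_pairs(observed, min_site_pairs=1))
print(filter_by_site_pairs(observed, min_site_pairs=2))
```

With a minimum of 2, only the P1/P2 pair survives because P3/P4 is supported by a single site pair.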


Enter the STRING score threshold for Reported_PPI.dat (0-1) [0.0]:
This sets the filtering threshold for the STRING combined score. PPIs scoring below the threshold, together with PPIs not present in the STRING database, are written to Putative_PPI.dat; PPIs scoring above the threshold are written to Reported_PPI.dat.
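The routing between the two files can be sketched as follows. The function name and the example scores are illustrative. The text does not say how a score exactly equal to the threshold is handled; this sketch assumes it counts as reported, which should be checked against the program's actual behavior.

```python
def split_by_string_score(ppis, scores, threshold=0.0):
    """Split PPIs into (reported, putative) lists by STRING combined score.

    ppis: list of (protein1, protein2) tuples.
    scores: dict mapping a PPI to its score; PPIs absent from STRING are
    simply missing from the dict.
    """
    reported, putative = [], []
    for ppi in ppis:
        score = scores.get(ppi)
        # ASSUMPTION: a score equal to the threshold is treated as reported.
        if score is not None and score >= threshold:
            reported.append((ppi, score))
        else:
            # Unreported PPIs carry "NA" in place of a score.
            putative.append((ppi, "NA" if score is None else score))
    return reported, putative

# Hypothetical PPIs and scores.
ppis = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
scores = {("P1", "P2"): 0.95, ("P3", "P4"): 0.40}
reported, putative = split_by_string_score(ppis, scores, threshold=0.7)
print(reported)
print(putative)
```

With the default threshold of 0.0, only PPIs absent from STRING end up in the putative list under this reading.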

Enter the homologous PDB identity threshold (%) [30.0]:
For PPIs absent from the STRING database, the program automatically queries the PDB database to identify whether complex structures of homologous proteins have been experimentally determined. This parameter sets the sequence identity threshold.
- The value is entered as a percentage, for example `30` for 30% identity.
- For a STRING-unreported PPI to be reported in `PDB_homology_results.dat`, both proteins must exceed this threshold against homologous chains found in the same experimental PDB entry.
- Lower values are more permissive and may return more structural-support candidates.
- Higher values are more conservative and retain only closer homologues.
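The strict-hit rule described above reduces to a two-sided comparison per PDB entry. The sketch below assumes candidate hits are available as dictionaries keyed by the column names of `PDB_homology_results.dat`; the function name, the PDB identifiers, and the identity values are made up for illustration.

```python
def strict_pdb_hits(hits, identity_threshold=30.0):
    """Keep hits in which BOTH proteins exceed the identity threshold.

    hits: list of dicts with keys "PDB_ID", "Protein1_IdentityPct" and
    "Protein2_IdentityPct" (percent values for chains in the same entry).
    """
    return [
        hit for hit in hits
        if hit["Protein1_IdentityPct"] > identity_threshold
        and hit["Protein2_IdentityPct"] > identity_threshold
    ]

# Hypothetical candidate entries for one STRING-unreported PPI.
candidates = [
    {"PDB_ID": "XXX1", "Protein1_IdentityPct": 85.0, "Protein2_IdentityPct": 72.5},
    {"PDB_ID": "XXX2", "Protein1_IdentityPct": 95.0, "Protein2_IdentityPct": 22.0},
    {"PDB_ID": "XXX3", "Protein1_IdentityPct": 31.0, "Protein2_IdentityPct": 45.0},
]
print([hit["PDB_ID"] for hit in strict_pdb_hits(candidates, 30.0)])
print([hit["PDB_ID"] for hit in strict_pdb_hits(candidates, 50.0)])
```

The second entry is rejected at either threshold because one strong match cannot compensate for a weak match on the partner protein.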




Program Execution and Outputs
Once the inputs above have been entered and confirmed, the program runs the remaining steps automatically and prints progress output similar to the following.

============================================================
Step 1: Loading pLink result files
============================================================
Found 1 CSV file(s).
Loaded: pFind-TDS-FAIMS-DB-20240604_2024.06.05.csv (163748 rows)
Total rows loaded: 163748

============================================================
Step 2: Filtering inter-protein cross-link PSMs
============================================================
Cross-link PSMs: 85866
Inter-protein cross-link PSMs: 81816
Retained TT PSMs at FDR <= 1.00%: 282

============================================================
Step 3: Resolving PPIs and XL site pairs
============================================================
Unique resolved XL site pairs: 191
Unique PPIs before thresholding: 125

============================================================
Step 4: Applying XL site-pair threshold
============================================================
Minimum site-pair count: 3
Retained PPIs: 13
Retained XL site pairs: 59

============================================================
Step 5: Wrote XLMS summary files
============================================================
PPI.dat: output/PPI.dat
PPI_XL_Sites.dat: output/PPI_XL_Sites.dat

============================================================
Detected species from UniProt accessions
============================================================
Selected species: Homo sapiens (9606)

============================================================
Step 6: Evaluating PPIs against STRING
============================================================
STRING score threshold for Reported_PPI.dat: 0.000
Reported PPIs above threshold: 12
Putative PPIs for PDB follow-up: 1

============================================================
Step 7: Searching homologous PDB structures for putative PPIs
============================================================
[UniProt] 1/2 P00338
[UniProt] 2/2 P07195
[PDB Pair] 1/1 P00338 vs P07195
Putative PPIs with homologous PDB support: 1
Reported_PPI.dat: output/Reported_PPI.dat
Putative_PPI.dat: output/Putative_PPI.dat
PDB_homology_results.dat: output/PDB_homology_results.dat
Reported PPIs: 12
Putative PPIs: 1
Putative PPIs with homologous PDB support: 1

============================================================
Workflow complete
============================================================
Input CSV files processed: 1
PPI.dat: output/PPI.dat
PPI_XL_Sites.dat: output/PPI_XL_Sites.dat
Reported_PPI.dat: output/Reported_PPI.dat
Putative_PPI.dat: output/Putative_PPI.dat
PDB_homology_results.dat: output/PDB_homology_results.dat
Reported PPIs: 12
Putative PPIs: 1
Putative PPIs with homologous PDB support: 1

After execution, the following five result files will be generated in the output directory.

1. `PPI.dat`

All curated protein–protein interactions, with protein names and the number of unique inter-protein cross-linked site pairs supporting each. The file contains three columns, namely:

- `Protein1`
- `Protein2`
- `SitePairCount`

2. `PPI_XL_Sites.dat`

Intermolecular cross-link site information for each PPI, listed row by row. This file can be used directly as input for the X-PIE-modeling module. The file contains four columns, namely:

- `Protein1`
- `Site1`
- `Protein2`
- `Site2`

3. `Reported_PPI.dat`

PPI information already reported in the STRING database. The file contains four columns, namely:

- `Protein1`
- `Protein2`
- `SitePairCount`
- `STRING_CombinedScore`

4. `Putative_PPI.dat`

PPI information below the threshold, as well as PPIs not present in the STRING database. While outputting the PPI information, the corresponding STRING score will also be output (displayed as NA if the PPI is not present in the STRING database), along with an indication of whether homologous structure information exists in the PDB database. The file contains five columns, namely:

- `Protein1`
- `Protein2`
- `SitePairCount`
- `STRING_CombinedScore`
- `HasHomologousPDB`

5. `PDB_homology_results.dat`

Information on homologous complex structures in the PDB database for unreported PPIs (when available), including protein names, corresponding PDB entry identifiers, sequence identity, and the chain IDs in the corresponding PDB entries. The file contains nine columns, namely:

- `Protein1`
- `Protein2`
- `PDB_ID`
- `Protein1_Homologue_UniProt`
- `Protein1_IdentityPct`
- `Protein1_Chain`
- `Protein2_Homologue_UniProt`
- `Protein2_IdentityPct`
- `Protein2_Chain`
Batch Execution Using Command-Line Options


In addition to interactive mode, X-PIE curation supports reproducible batch execution through command-line arguments.

The supported options are:


--input-dir input CSV file or folder containing pLink CSV files
--output-dir output directory
--fdr crosslink PSM FDR threshold in percent
--min-site-pairs minimum number of unique site pairs per PPI
--string-score-threshold minimum STRING combined score
--identity-threshold minimum local identity required for strict PDB support
--network-failure-mode behavior when the network is unavailable: exit, or run locally (skipping STRING/PDB annotation)


Example:


python xlms_ppi_evaluation.py --non-interactive \
--input-dir ./input \
--output-dir ./output \
--fdr 1 \
--min-site-pairs 1 \
--string-score-threshold 0.7 \
--identity-threshold 30 \
--network-failure-mode local-only

Use `exit` in place of `local-only` to stop the run when the network is unavailable.


By changing the input folder contents and rerunning the command, multiple curation jobs can be processed in a consistent batch mode.
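For several datasets, the command line can be assembled programmatically so that every job uses identical thresholds. The sketch below only builds and prints the commands. The script name and `--non-interactive` flag are copied from the example above, the `exit` and `local-only` values are inferred from it, and the job folder names are hypothetical.

```python
import shlex

def build_command(input_dir, output_dir, fdr=1.0, min_site_pairs=1,
                  string_score_threshold=0.7, identity_threshold=30.0,
                  network_failure_mode="exit",
                  script="xlms_ppi_evaluation.py"):
    """Assemble the argument list for one non-interactive curation run."""
    return [
        "python", script, "--non-interactive",
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
        "--fdr", str(fdr),
        "--min-site-pairs", str(min_site_pairs),
        "--string-score-threshold", str(string_score_threshold),
        "--identity-threshold", str(identity_threshold),
        "--network-failure-mode", network_failure_mode,
    ]

# Hypothetical input/output folder pairs, one per dataset.
jobs = [("./input_rep1", "./output_rep1"), ("./input_rep2", "./output_rep2")]

commands = [build_command(in_dir, out_dir) for in_dir, out_dir in jobs]
for command in commands:
    # To execute instead of print: subprocess.run(command, check=True)
    print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when folder names contain spaces, and writing each job to its own output folder keeps results from overwriting each other.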


Note
All associated source code, README documentation, and representative input/output files are publicly available on Zenodo (10.5281/zenodo.20523994).