X-PIE curation

Zhou Gong; Beirong Zhang; Xiaofang He; Bowen Zhong; Yueling Zhu; Lichun He; Kangning Tan; Zhu Liu; Jing Chen; Zhen Liang; Xu Zhang; Yukui Zhang; Lihua Zhang; Maili Liu; Qun Zhao

Jun 15, 2026

DOI

https://dx.doi.org/10.17504/protocols.io.n2bvj5r6bgk5/v1

Zhou Gong¹,
Beirong Zhang²,
Xiaofang He¹,
Bowen Zhong²,
Yueling Zhu¹,
Lichun He¹,
Kangning Tan³,
Zhu Liu³,
Jing Chen²,
Zhen Liang²,
Xu Zhang¹,
Yukui Zhang²,
Lihua Zhang²,
Maili Liu¹,
Qun Zhao²

¹Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences;
²Dalian Institute of Chemical Physics, Chinese Academy of Sciences;
³Huazhong Agricultural University

Protocol Citation: Zhou Gong, Beirong Zhang, Xiaofang He, Bowen Zhong, Yueling Zhu, Lichun He, Kangning Tan, Zhu Liu, Jing Chen, Zhen Liang, Xu Zhang, Yukui Zhang, Lihua Zhang, Maili Liu, Qun Zhao 2026. X-PIE curation. protocols.io https://dx.doi.org/10.17504/protocols.io.n2bvj5r6bgk5/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: June 06, 2026

Last Modified: June 15, 2026

Protocol Integer ID: 318639

Keywords: crosslinking mass spectrometry (XL-MS), protein-protein interaction, STRING, PDB database, data filtering, protein crosslink psm, crosslinking mass spectrometry, ambiguous protein, protein interaction, confidence protein, protein, experimental rcsb pdb entries for homologous structural support, mass spectrometry, annotating plink, crosslinked site pair, experimental rcsb pdb entry, homologous structural support

Abstract

X-PIE curation is a computational pipeline for filtering, validating, and annotating pLink crosslinking mass spectrometry (XL-MS) results to generate high-confidence protein-protein interaction (PPI) datasets. The workflow filters target-target inter-protein crosslink PSMs, resolves ambiguous protein-pair assignments by dataset-level frequency voting, summarizes unique crosslinked site pairs, and separates STRING-reported PPIs from candidate novel PPIs. For STRING-unreported PPIs, the workflow further searches experimental RCSB PDB entries for homologous structural support and retains strict hits only when both proteins exceed a user-defined local sequence identity threshold. This protocol describes the complete workflow from input data preparation to final output generation.

Guidelines

Overview & Scope
X-PIE curation is a computational module designed to process raw pLink cross-linking mass spectrometry (XL-MS) results for protein-protein interaction (PPI) discovery and validation. The workflow filters inter-protein cross-linked peptide-spectrum matches (PSMs), resolves cross-link site pairs, queries the STRING database for functional association evidence, and searches the RCSB PDB for homologous experimental complex structures. This protocol is intended for users who wish to systematically prioritize high-confidence PPIs from XL-MS datasets and obtain structural templates for downstream integrative modeling (e.g., the X-PIE modeling pipeline).

Prerequisites & Skills
Operating system: Linux/Unix environment with Python 3.13 installed.
Programming skills: Basic command-line proficiency; ability to edit configuration paths and run Python scripts.
Domain knowledge: Familiarity with pLink output formats, XL-MS FDR control, and UniProt accession conventions.

Network Access Requirements
This workflow is network-dependent. It requires active internet connections to:
STRING (functional association scores)
RCSB PDB (homologous structure retrieval)
If the network is unavailable, the workflow can either:
Exit immediately (recommended for automated pipelines)
Run in local-only mode, skipping STRING and PDB annotation

Materials

CPU: Single-core execution is sufficient; no GPU required.
Memory: < 4 GB RAM for typical datasets (≤10⁵ PSMs).
Disk space: Output is minimal (~MB scale); log files are timestamped.
Runtime: Seconds to minutes, depending on API response times from STRING and RCSB PDB.

Troubleshooting

Problem

Missing columns in the input CSV

Solution

Check whether the pLink export contains all required columns:
- `Peptide_Type`
- `Protein_Type`
- `Score`
- `Target_Decoy`
- `Q-value`
- `Proteins`

Problem

No rows remain after filtering

Solution

Possible reasons include:
- very few target-target inter-protein crosslinks in the source data
- an overly strict FDR threshold

Suggested actions:
- inspect `Target_Decoy`
- confirm `Peptide_Type` and `Protein_Type`
- relax `--fdr` only if scientifically justified

Problem

The `Proteins` field cannot be parsed

Solution

Check whether the annotation format matches one of the supported accession-site patterns.

Problem

Many proteins fail STRING mapping

Solution

Check whether:
- the input contains mixed organisms
- the UniProt accessions are valid
- isoform-heavy accessions require manual review

Problem

The PDB searching process is slow

Solution

The PDB stage depends on external UniProt and RCSB services and may require substantially more time for large sets of STRING-unreported PPIs.

Safety warnings

The database retrieval step may be time-consuming, depending on the size of the dataset to be analyzed and the network access speed. Please monitor network connectivity and response status periodically.

Before start

The following software or components must be installed and confirmed to be functioning correctly prior to running X-PIE curation.
The following component versions have been tested and confirmed to operate normally. Other versions may work but have not been validated.

Python     3.13.2
Biopython  1.85
NumPy      1.26.4
pandas     2.3.1
requests   2.32.4


Internet access is required for the UniProt, STRING, and RCSB PDB web services used during annotation and homologous structure search.

Prepare Input Files

Place one or more cross-linking result files (CSV) exported from pLink in the input directory (default: ./input).

Each input CSV file should contain the following columns:
- Peptide_Type
- Protein_Type
- Score
- Target_Decoy
- Q-value
- Proteins
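
Before running the workflow, the header of each export can be checked for these columns. The following is a minimal sketch using only the Python standard library; the helper name `missing_columns` is hypothetical and is not part of X-PIE curation.

```python
import csv
import io

# Required columns as listed in this protocol.
REQUIRED_COLUMNS = ["Peptide_Type", "Protein_Type", "Score",
                    "Target_Decoy", "Q-value", "Proteins"]


def missing_columns(csv_handle):
    """Return the required columns that are absent from the CSV header row."""
    header = next(csv.reader(csv_handle), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Illustrative header only; a real pLink export has many more columns.
example = io.StringIO("Order,Peptide_Type,Score,Target_Decoy,Q-value,Proteins\n")
print(missing_columns(example))  # prints ['Protein_Type']
```

For a file on disk, pass an open handle, for example `open(path, newline="")`, in place of the `io.StringIO` object.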

The `Proteins` column must contain protein-site annotations in one of the following supported formats:

1)  sp|P12345|PROT1(100)-sp|P67890|PROT2(200) 
2)  P12345(100)-P67890(200)
If multiple candidate protein pairs are reported in a single row, separated by `/`, the workflow resolves them by frequency voting across the complete dataset.
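
The two annotation formats and the voting step can be illustrated as follows. This is a sketch under stated assumptions, not the X-PIE curation implementation: the regular expression, the function names, and the tie-breaking rule (first listed candidate wins) are all hypothetical, and each protein pair is counted once per row in which it appears as a candidate.

```python
import re
from collections import Counter

# One end of a crosslink: optional "sp|" style prefix, accession,
# optional "|ENTRY_NAME", then the site in parentheses.
_SIDE = r"(?:[a-z]{2}\|)?([A-Za-z0-9]+(?:-\d+)?)(?:\|[^(|]+)?\((\d+)\)"
_PAIR = re.compile(_SIDE + "-" + _SIDE)


def parse_candidates(proteins_field):
    """Parse a Proteins value into (acc1, site1, acc2, site2) tuples."""
    candidates = []
    for chunk in proteins_field.split("/"):
        match = _PAIR.fullmatch(chunk.strip())
        if match:
            acc1, site1, acc2, site2 = match.groups()
            candidates.append((acc1, int(site1), acc2, int(site2)))
    return candidates


def resolve_by_voting(rows):
    """Pick one candidate per row, favouring the most frequent protein pair."""
    parsed = [parse_candidates(row) for row in rows]
    votes = Counter(
        pair
        for cands in parsed
        for pair in {frozenset((c[0], c[2])) for c in cands}
    )
    resolved = []
    for cands in parsed:
        if cands:
            # max() keeps the first listed candidate when vote counts tie.
            best = max(cands, key=lambda c: votes[frozenset((c[0], c[2]))])
            resolved.append(best)
    return resolved


# Accessions below are the placeholder examples used in this protocol.
rows = [
    "sp|P12345|PROT1(100)-sp|Q11111|PROT3(50)/sp|P12345|PROT1(100)-sp|P67890|PROT2(200)/",
    "P12345(120)-P67890(210)",
    "P12345(130)-P67890(220)",
]
print(resolve_by_voting(rows)[0])  # prints ('P12345', 100, 'P67890', 200)
```

In this example the first row is ambiguous, and the P12345/P67890 assignment is chosen because that pair is supported by three rows, whereas P12345/Q11111 is supported by only one.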

A complete example of the CSV content is shown below; the first line is the header row.

Order,Title,Charge,Precursor_MH,Peptide_Type,Peptide,Peptide_MH,Modifications,Refined_Score,SVM_Score,Re-score_CSM,Score,E-value,Precursor_Mass_Error(Da),Precursor_Mass_Error(ppm),Target_Decoy,Q-value,Q-value_CSM,isRetainHighLevel,Proteins,Protein_Type,FileID,isComplexSatisfied,isFilterIn
1,Hela-TDS-G2-90min-FAIMS-40CV-1.19104.19104.4.0.dta,4,1588.88,0,THLNLVVLGHVDSGK,1588.88,null,84.1477,13.0125,13.0125,2.232246e-006,1.000000e+000,-0.001322,-0.832032,2,0.000000,0.000000,1,sp|P68104|EF1A1_HUMAN /sp|Q05639|EF1A2_HUMAN /,0,1,1,1
2,Hela-TDS-G2-90min-FAIMS-40CV-1.15490.15490.4.0.dta,4,2406.187283,0,HGEVCPAGWKPGSDTLKPDVQK,2406.186944,Carbamidomethyl[C](5),98.382610,12.903400,12.903400,2.489565e-006,1.000000e+000,0.000339,0.140887,2,0.000000,0.000000,1,sp|Q06830|PRDX1_HUMAN /,0,1,1,1
3,Hela-TDS-G2-90min-FAIMS-40CV-1.19063.19063.4.0.dta,4,1588.879355,0,THLNLVVLGHVDSGK,1588.880489,null,83.276366,12.843100,12.843100,2.644304e-006,1.000000e+000,-0.001134,-0.713710,2,0.000000,0.000000,1,sp|P68104|EF1A1_HUMAN /sp|Q05639|EF1A2_HUMAN /,0,1,1,1
4,Hela-TDS-G2-90min-FAIMS-40CV-1.28169.28169.4.0.dta,4,2477.368347,0,KGVNLPGAAVDLPAVSEKDLQDLK,2477.360859,null,95.635368,12.778600,12.778600,2.820482e-006,1.000000e+000,0.007488,3.022571,2,0.000000,0.000000,1,sp|P14618|KPYM_HUMAN /,0,1,1,1
5,Hela-TDS-G2-90min-FAIMS-40CV-1.16954.16954.4.0.dta,4,1562.894183,0,GFLYVHKPPVHLR,1562.895356,null,81.691361,12.753200,12.753200,2.893039e-006,1.000000e+000,-0.001173,-0.750530,2,0.000000,0.000000,1,sp|Q08945|SSRP1_HUMAN /,0,1,1,1

Run X-PIE curation

Execute Command

python x-pie-curation.py

This command runs the program in interactive mode. At each step, the program will prompt you to enter relevant information while simultaneously displaying the default values for these parameters.

The program will then proceed to Step 0: Checking dependencies. 
It automatically verifies whether the aforementioned software has been properly installed. If any dependency is missing, a warning message will be displayed and the program will exit. 
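
The same check can be run manually before starting the program. The sketch below uses `importlib.metadata` from the standard library to compare installed packages against the versions tested in this protocol; the function name `check_dependencies` is hypothetical, and the check performed by X-PIE curation itself may differ.

```python
from importlib import metadata

# Distribution names and the versions validated in this protocol.
TESTED_VERSIONS = {
    "biopython": "1.85",
    "numpy": "1.26.4",
    "pandas": "2.3.1",
    "requests": "2.32.4",
}


def check_dependencies(tested=TESTED_VERSIONS):
    """Return {package: installed version, or None if the package is missing}."""
    found = {}
    for name in tested:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


for name, version in check_dependencies().items():
    status = "MISSING" if version is None else version
    print(f"{name}: {status} (tested with {TESTED_VERSIONS[name]})")
```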

Note
At startup, the workflow also checks Internet access to UniProt, STRING, and RCSB PDB.
- If all three services are reachable, the workflow continues normally.
- If internet access is unavailable, the program prints: `Internet connection is required for STRING and PDB annotation`
- In interactive mode, you can then choose to:
  - exit immediately
  - continue with local XL-MS filtering only and skip STRING/PDB annotation
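
A reachability check of this kind could look like the sketch below. The base URLs are assumptions chosen for illustration (the endpoints actually queried by X-PIE curation may differ), and `is_reachable` and `unreachable_services` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Assumed base URLs, for illustration only.
SERVICES = {
    "UniProt": "https://rest.uniprot.org",
    "STRING": "https://string-db.org",
    "RCSB PDB": "https://data.rcsb.org",
}


def is_reachable(url, timeout=10):
    """Return True if the server answers at all, even with an HTTP error code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, so the network path works
    except (urllib.error.URLError, OSError):
        return False


def unreachable_services(services=SERVICES, probe=is_reachable):
    """Return the names of services that could not be reached."""
    return [name for name, url in services.items() if not probe(url)]


# Offline demonstration with a stub probe that pretends STRING is down.
print(unreachable_services(probe=lambda url: "string-db" not in url))
# prints ['STRING']
```

Calling `unreachable_services()` without arguments performs the real network probes; an empty list means all three services answered.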

If all dependencies are correctly installed, the program will prompt you to enter the directory containing the input pLink CSV files, and the output directory:

Enter the input CSV file or folder containing pLink CSV files [input]:
Enter the output directory [output]:

The program will then prompt you to enter the relevant parameter thresholds sequentially.

Enter the cross-link PSM FDR threshold (%) [1.0]:
This controls the maximum q-value allowed for retaining a crosslink PSM (peptide-spectrum match).
- The value is entered as a percentage, for example `1` for 1% FDR.
- Lower values are more stringent and retain fewer but more confident PSMs.
- Higher values retain more PSMs but may introduce lower-confidence identifications.
- This parameter directly affects how many inter-protein crosslink PSMs survive the quality-control step.
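
The filtering rule can be sketched as follows. The numeric codes for `Peptide_Type`, `Protein_Type`, and `Target_Decoy` are assumptions about the pLink export that are not stated in this protocol and should be verified against your own files; the sketch also assumes that `Q-value` is stored as a fraction, so the percentage entered at the prompt is divided by 100. The function name `keep_psm` is hypothetical.

```python
# Assumed pLink codes (verify against your export before relying on them).
CROSSLINK = "3"      # Peptide_Type: cross-linked peptide pair
INTER_PROTEIN = "2"  # Protein_Type: inter-protein
TARGET_TARGET = "2"  # Target_Decoy: target-target


def keep_psm(row, fdr_percent=1.0):
    """Return True for a target-target inter-protein crosslink PSM within the FDR."""
    q_max = fdr_percent / 100.0  # prompt value is in percent, Q-value is a fraction
    return (
        row["Peptide_Type"].strip() == CROSSLINK
        and row["Protein_Type"].strip() == INTER_PROTEIN
        and row["Target_Decoy"].strip() == TARGET_TARGET
        and float(row["Q-value"]) <= q_max
    )


# Rows as produced by csv.DictReader (all values are strings); data are made up.
psms = [
    {"Peptide_Type": "3", "Protein_Type": "2", "Target_Decoy": "2", "Q-value": "0.004"},
    {"Peptide_Type": "3", "Protein_Type": "2", "Target_Decoy": "2", "Q-value": "0.030"},
    {"Peptide_Type": "3", "Protein_Type": "1", "Target_Decoy": "2", "Q-value": "0.000"},
    {"Peptide_Type": "3", "Protein_Type": "2", "Target_Decoy": "1", "Q-value": "0.000"},
]
print([keep_psm(row) for row in psms])  # prints [True, False, False, False]
```

Raising the threshold to 5% would additionally retain the second row, while the intra-protein and target-decoy rows remain excluded regardless of the threshold.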

Enter the minimum XL site-pair count required to keep a PPI [1]:
This defines how many unique crosslinked site pairs must support a protein pair before it is retained as a curated PPI.
- For example, `1` keeps any PPI supported by at least one unique XL site pair.
- A larger value such as `2` or `3` increases stringency by requiring multiple independent crosslink site pairs for the same PPI.
- This parameter mainly affects the confidence and size of the final curated file.
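
As an illustration of how this threshold acts, the sketch below counts unique site pairs per protein pair and drops PPIs with too few. It assumes that A-B and B-A orientations describe the same PPI; the function name `count_site_pairs` and the accessions are hypothetical.

```python
from collections import defaultdict


def count_site_pairs(resolved, min_site_pairs=1):
    """Map each protein pair to its unique site pairs, keeping well-supported PPIs."""
    sites = defaultdict(set)
    for acc1, site1, acc2, site2 in resolved:
        # Order the two ends so that A-B and B-A count as the same site pair.
        (p1, s1), (p2, s2) = sorted([(acc1, site1), (acc2, site2)])
        sites[(p1, p2)].add((s1, s2))
    return {ppi: pairs for ppi, pairs in sites.items()
            if len(pairs) >= min_site_pairs}


# Made-up resolved crosslinks: (accession1, site1, accession2, site2).
resolved = [
    ("P12345", 100, "P67890", 200),
    ("P67890", 200, "P12345", 100),  # same site pair, reversed orientation
    ("P12345", 120, "P67890", 210),
    ("Q11111", 5, "Q22222", 8),
]
kept = count_site_pairs(resolved, min_site_pairs=2)
print({ppi: len(pairs) for ppi, pairs in kept.items()})
# prints {('P12345', 'P67890'): 2}
```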

Enter the STRING score threshold for Reported_PPI.dat (0-1) [0.0]:
This sets the filtering threshold for the STRING combined score. PPIs scoring above the threshold are written to Reported_PPI.dat; PPIs scoring below the threshold, as well as PPIs not present in the STRING database, are written to Putative_PPI.dat.
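
The split can be sketched as below. How a score exactly equal to the threshold is handled is not stated in this protocol; the sketch assumes it counts as reported. The function name `split_by_string_score` and the example records are hypothetical.

```python
def split_by_string_score(ppis, threshold=0.0):
    """Split (protein1, protein2, score) records into reported and putative lists.

    score is a float in [0, 1], or None when STRING has no record of the pair.
    """
    reported, putative = [], []
    for protein1, protein2, score in ppis:
        # Assumption: a score equal to the threshold counts as reported.
        if score is not None and score >= threshold:
            reported.append((protein1, protein2, score))
        else:
            label = "NA" if score is None else score
            putative.append((protein1, protein2, label))
    return reported, putative


# Made-up records for illustration.
ppis = [("P12345", "P67890", 0.92), ("Q11111", "Q22222", 0.35), ("Q33333", "Q44444", None)]
reported, putative = split_by_string_score(ppis, threshold=0.7)
print(reported)  # prints [('P12345', 'P67890', 0.92)]
print(putative)  # prints [('Q11111', 'Q22222', 0.35), ('Q33333', 'Q44444', 'NA')]
```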

Enter the homologous PDB identity threshold (%) [30.0]:
For PPIs absent from the STRING database, the program automatically queries the PDB database to determine whether complex structures of homologous proteins have been experimentally determined. This parameter sets the local sequence identity threshold.
- The value is entered as a percentage, for example `30` for 30% identity.
- For a STRING-unreported PPI to be reported in `PDB_homology_results.dat`, both proteins must exceed this threshold against homologous chains found in the same experimental PDB entry.
- Lower values are more permissive and may return more structural-support candidates.
- Higher values are more conservative and retain only closer homologues.
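
The "both proteins must exceed the threshold in the same entry" rule can be sketched as follows. The function name `strict_pdb_hits`, the tuple layout, and the PDB identifiers are hypothetical placeholders.

```python
def strict_pdb_hits(hits, identity_threshold=30.0):
    """Keep PDB entries in which both proteins exceed the identity threshold.

    Each hit is (pdb_id, protein1_identity_pct, protein2_identity_pct), where
    both identities refer to chains found in the same PDB entry.
    """
    return [
        pdb_id
        for pdb_id, identity1, identity2 in hits
        if identity1 > identity_threshold and identity2 > identity_threshold
    ]


# Placeholder identifiers and identities, for illustration only.
hits = [
    ("1ABC", 75.0, 62.5),  # both above 30%: kept
    ("2DEF", 88.0, 24.0),  # second protein too distant: dropped
    ("3GHI", 30.0, 45.0),  # exactly 30% does not exceed the threshold: dropped
]
print(strict_pdb_hits(hits))  # prints ['1ABC']
```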

Program Execution and Outputs

Once the inputs above have been entered and confirmed, the program runs the remaining steps automatically and prints output like the following.

============================================================
Step 1: Loading pLink result files
============================================================
  Found 1 CSV file(s).
  Loaded: pFind-TDS-FAIMS-DB-20240604_2024.06.05.csv (163748 rows)
  Total rows loaded: 163748

============================================================
Step 2: Filtering inter-protein cross-link PSMs
============================================================
  Cross-link PSMs: 85866
  Inter-protein cross-link PSMs: 81816
  Retained TT PSMs at FDR <= 1.00%: 282

============================================================
Step 3: Resolving PPIs and XL site pairs
============================================================
  Unique resolved XL site pairs: 191
  Unique PPIs before thresholding: 125

============================================================
Step 4: Applying XL site-pair threshold
============================================================
  Minimum site-pair count: 3
  Retained PPIs: 13
  Retained XL site pairs: 59

============================================================
Step 5: Wrote XLMS summary files
============================================================
  PPI.dat: output/PPI.dat
  PPI_XL_Sites.dat: output/PPI_XL_Sites.dat

============================================================
Detected species from UniProt accessions
============================================================
  Selected species: Homo sapiens (9606)

============================================================
Step 6: Evaluating PPIs against STRING
============================================================
  STRING score threshold for Reported_PPI.dat: 0.000
  Reported PPIs above threshold: 12
  Putative PPIs for PDB follow-up: 1

============================================================
Step 7: Searching homologous PDB structures for putative PPIs
============================================================
[UniProt] 1/2 P00338
[UniProt] 2/2 P07195
[PDB Pair] 1/1 P00338 vs P07195
  Putative PPIs with homologous PDB support: 1
  Reported_PPI.dat: output/Reported_PPI.dat
  Putative_PPI.dat: output/Putative_PPI.dat
  PDB_homology_results.dat: output/PDB_homology_results.dat
  Reported PPIs: 12
  Putative PPIs: 1
  Putative PPIs with homologous PDB support: 1

============================================================
Workflow complete
============================================================
  Input CSV files processed: 1
  PPI.dat: output/PPI.dat
  PPI_XL_Sites.dat: output/PPI_XL_Sites.dat
  Reported_PPI.dat: output/Reported_PPI.dat
  Putative_PPI.dat: output/Putative_PPI.dat
  PDB_homology_results.dat: output/PDB_homology_results.dat
  Reported PPIs: 12
  Putative PPIs: 1
  Putative PPIs with homologous PDB support: 1

After execution, the following five result files will be generated in the output directory.

1. `PPI.dat`

Complete information on protein–protein interactions, including protein names and the total number of unique intermolecular cross-linked site pairs. The file contains three columns, namely:

- `Protein1`
- `Protein2`
- `SitePairCount`

2. `PPI_XL_Sites.dat`

Intermolecular cross-link site information for each PPI is listed row by row and can be used directly as input for the X-PIE-modeling module. The file contains four columns, namely:

- `Protein1`
- `Site1`
- `Protein2`
- `Site2`
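
For downstream scripting, the site pairs can be grouped by PPI. The sketch below assumes that the file is whitespace-delimited with a header row naming the four columns above; the delimiter is not specified in this protocol, so check a generated file first. The function name `load_site_pairs` is hypothetical.

```python
import io
from collections import defaultdict


def load_site_pairs(handle):
    """Group (Site1, Site2) tuples by (Protein1, Protein2) from a PPI_XL_Sites table."""
    header = handle.readline().split()
    grouped = defaultdict(list)
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        record = dict(zip(header, fields))
        key = (record["Protein1"], record["Protein2"])
        grouped[key].append((int(record["Site1"]), int(record["Site2"])))
    return dict(grouped)


# Illustrative content only; accessions and sites are made up.
example = io.StringIO(
    "Protein1 Site1 Protein2 Site2\n"
    "P12345 100 P67890 200\n"
    "P12345 120 P67890 210\n"
)
print(load_site_pairs(example))
# prints {('P12345', 'P67890'): [(100, 200), (120, 210)]}
```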

3. `Reported_PPI.dat`

PPI information already reported in the STRING database. The file contains four columns, namely:

- `Protein1`
- `Protein2`
- `SitePairCount`
- `STRING_CombinedScore`

4. `Putative_PPI.dat`

PPI information below the threshold, as well as PPIs not present in the STRING database. While outputting the PPI information, the corresponding STRING score will also be output (displayed as NA if the PPI is not present in the STRING database), along with an indication of whether homologous structure information exists in the PDB database. The file contains five columns, namely:

- `Protein1`
- `Protein2`
- `SitePairCount`
- `STRING_CombinedScore`
- `HasHomologousPDB`

5. `PDB_homology_results.dat`

Information on homologous complex structures in the PDB database for unreported PPIs (when available), including protein names, the corresponding PDB entry identifiers, sequence identity, and the chain IDs in the corresponding PDB files. The file contains nine columns, namely:

- `Protein1`
- `Protein2`
- `PDB_ID`
- `Protein1_Homologue_UniProt`
- `Protein1_IdentityPct`
- `Protein1_Chain`
- `Protein2_Homologue_UniProt`
- `Protein2_IdentityPct`
- `Protein2_Chain`

Batch Execution Using Command-Line Options

In addition to interactive mode, X-PIE curation supports reproducible batch execution through command-line arguments.

The supported options are:


--non-interactive           run without interactive prompts (as used in the example below)
--input-dir                 input CSV file or folder containing pLink CSV files
--output-dir                output directory
--fdr                       crosslink PSM FDR threshold in percent
--min-site-pairs            minimum number of unique site pairs per PPI
--string-score-threshold    minimum STRING combined score
--identity-threshold        minimum local identity (%) required for strict PDB support
--network-failure-mode      behavior when the network is unavailable: exit, or run locally (skip STRING/PDB annotation)


Example:


python xlms_ppi_evaluation.py --non-interactive \
    --input-dir ./input \
    --output-dir ./output \
    --fdr 1 \
    --min-site-pairs 1 \
    --string-score-threshold 0.7 \
    --identity-threshold 30 \
    --network-failure-mode local-only

Use `exit` in place of `local-only` to stop immediately when the network is unavailable.


By changing the input folder contents and rerunning the command, multiple curation jobs can be processed in a consistent batch mode.
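
Several input folders can also be processed in one go by wrapping the command in a short driver script. The sketch below is an illustration, not part of the distributed code: it reuses the script name and options from the batch example above, the value `exit` for `--network-failure-mode` is inferred from the option description, and the folder names and the helpers `build_command` and `run_batch` are hypothetical.

```python
import subprocess


def build_command(input_dir, output_dir, fdr=1, min_site_pairs=1,
                  string_score_threshold=0.7, identity_threshold=30,
                  network_failure_mode="exit"):
    """Assemble the non-interactive command line as an argument list."""
    return [
        "python", "xlms_ppi_evaluation.py", "--non-interactive",
        "--input-dir", str(input_dir),
        "--output-dir", str(output_dir),
        "--fdr", str(fdr),
        "--min-site-pairs", str(min_site_pairs),
        "--string-score-threshold", str(string_score_threshold),
        "--identity-threshold", str(identity_threshold),
        "--network-failure-mode", network_failure_mode,
    ]


def run_batch(jobs, runner=subprocess.run):
    """Run one curation job per (input_dir, output_dir) pair, stopping on failure."""
    for input_dir, output_dir in jobs:
        runner(build_command(input_dir, output_dir), check=True)


# Hypothetical folder names; printing the command lets you inspect it first.
jobs = [("./input_rep1", "./output_rep1"), ("./input_rep2", "./output_rep2")]
for input_dir, output_dir in jobs:
    print(" ".join(build_command(input_dir, output_dir)))
```

Calling `run_batch(jobs)` then executes the jobs sequentially with identical thresholds and writes each result set to its own output directory.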


Note
All associated source code, README documentation, and representative input/output files are publicly available on Zenodo (10.5281/zenodo.20523994).