May 04, 2026

Dual-Method Convergence Pipeline for LC-MS Metabolite Annotation: R IPA and ipaPy2

  • Lita Doolan¹
  • ¹KCL
  • LDolanLDolan
Protocol Citation: Lita Doolan 2026. Dual-Method Convergence Pipeline for LC-MS Metabolite Annotation: R IPA and ipaPy2. protocols.io https://dx.doi.org/10.17504/protocols.io.bp2l6jqkkvqe/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: May 03, 2026
Last Modified: May 04, 2026
Protocol Integer ID: 316224
Keywords: metabolomics, LC-MS, annotation, ipaPy2, IPA, probabilistic annotation, dual-method validation, HMDB, ECMDB, KEGG, E. coli, untargeted metabolomics, convergence, false positive reduction, untargeted metabolomics data, metabolomics data, ms metabolite annotation, method convergence pipeline for lc, reproducible pipeline, method convergence pipeline, confidence annotation set
Abstract
This protocol describes a reproducible pipeline that applies R IPA and ipaPy2 independently to untargeted LC-MS metabolomics data and intersects their outputs to produce a high-confidence annotation set. The pipeline is demonstrated on E. coli data (Del Carratore et al., Zenodo DOI: 10.5281/zenodo.3414903). Code is available at: https://github.com/LDolanLDolan/metabolomics-dual-validation-pipeline
Guidelines
1. Paths
- Define project root and output directory.
- Specify input, database, adducts, and results file paths.

2. Check files exist
- Verify existence of input, database, and adducts files.

3. Load files
- Load input, database, and adducts files using `pd.read_csv`.

4. Clean DB columns
- Strip whitespace from column names.
- Rename specific columns for consistency.
- Remove duplicated columns.

5. Build required DB columns
- Ensure required columns exist and are correctly formatted.
- Assign IDs based on available columns.
- Handle missing columns with default values.

6. Validate DB formulas
- Define `is_valid_formula` function to check formula validity.
- Apply validation and remove invalid entries.

7. Clean adducts
- Define required adduct columns.
- Raise error if columns are missing.
- Clean and format adduct data.

8. Reset indices
- Reset indices for DB, adductsAll, and df_ipa.

9. Run simpleIPA
- Execute `ipa.simpleIPA` with specified parameters.
- Note: This may take a few minutes for 1000 features.

10. Save results
- Handle both dict and dataframe outputs.
- Convert dict result to dataframe if necessary.
- Insert feature_id into annotations.
- Concatenate rows into final dataframe.
- Save final dataframe to CSV.
- Print the top annotations, or the first 20 rows if there is no 'post' column.

11. Handle annotations
- Check if annotations are a DataFrame or dict.
- Insert feature_id and append to rows.
- Concatenate rows into final_df.
- Save final_df to CSV and print top annotations.
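Steps 1 to 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: the helper names `check_files_exist`, `load_inputs`, and `clean_db_columns` are hypothetical, and the rename map in the example is invented because the protocol does not list which columns are renamed.

```python
from pathlib import Path

import pandas as pd


def check_files_exist(paths):
    """Raise FileNotFoundError naming every path that is not an existing file."""
    missing = [str(p) for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))


def load_inputs(input_path, db_path, adducts_path):
    """Check the three files exist, then load each with pd.read_csv."""
    check_files_exist([input_path, db_path, adducts_path])
    return pd.read_csv(input_path), pd.read_csv(db_path), pd.read_csv(adducts_path)


def clean_db_columns(db, rename_map=None):
    """Strip whitespace from column names, rename, and drop duplicated columns."""
    db = db.copy()
    db.columns = [str(c).strip() for c in db.columns]
    if rename_map:
        db = db.rename(columns=rename_map)
    # Keep only the first occurrence of any repeated column name.
    return db.loc[:, ~db.columns.duplicated()]


# Small in-memory table standing in for the database CSV (illustrative values).
raw = pd.DataFrame(
    [["C00031", "Glucose", "C6H12O6", "dup"]],
    columns=[" id", "Name", "Formula ", "id"],
)
demo_db = clean_db_columns(raw, rename_map={"Name": "name", "Formula": "formula"})
print(list(demo_db.columns))
```

In a real run, `load_inputs` would be called with the input, database, and adducts paths defined in step 1, and `clean_db_columns` applied to the loaded database.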
5. Build required DB columns
If 'pk' is not in DB.columns, set DB['pk'] to the integers from 1 to the number of rows in DB.
If 'MS2' is not in DB.columns, set DB['MS2'] to an empty string.
If 'reactions' is not in DB.columns, set DB['reactions'] to an empty string.
Remove duplicated columns from DB.
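A minimal sketch of this step, assuming the database is already loaded as a pandas DataFrame. The function name `build_required_columns` and the example rows are hypothetical; the defaults follow the text above literally.

```python
import pandas as pd


def build_required_columns(db):
    """Add 'pk', 'MS2', and 'reactions' when absent, then drop duplicated columns."""
    db = db.copy()
    if "pk" not in db.columns:
        # Integers 1..N, reading "a range from 1 to the length of DB" as inclusive.
        db["pk"] = list(range(1, len(db) + 1))
    for col in ("MS2", "reactions"):
        if col not in db.columns:
            db[col] = ""
    # Keep only the first occurrence of any repeated column name.
    return db.loc[:, ~db.columns.duplicated()]


# Illustrative two-row database without the optional columns.
DB = pd.DataFrame({"id": ["C00031", "C00025"], "formula": ["C6H12O6", "C5H9NO4"]})
DB = build_required_columns(DB)
print(DB[["id", "pk", "MS2", "reactions"]])
```

The text treats `pk` as a running row key. ipaPy2's database format describes `pk` as a prior-knowledge score between 0 and 1, so it may be worth confirming that a running integer is the intended default.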
6. Validate DB formulas
Define `is_valid_formula` function to check the validity of a formula.
Apply `is_valid_formula` to the 'formula' column in DB and store results in a new column '_valid'.
Remove rows with invalid formulas from DB and reset the index.
Print the number of rows remaining in DB after validation.
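The protocol does not spell out the rule that `is_valid_formula` applies. The sketch below assumes one plausible rule: a formula is valid if it is a non-empty string made only of element symbols (an uppercase letter, optionally followed by a lowercase letter) with optional integer counts. The regular expression and the example rows are illustrative assumptions, not the author's implementation.

```python
import re

import pandas as pd

# Assumed rule: one or more element symbols, each with an optional count.
_FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def is_valid_formula(formula):
    """Return True for strings such as 'C6H12O6'; False for NaN, '', or 'C6H12O6.HCl'."""
    if not isinstance(formula, str):
        return False
    return _FORMULA_RE.fullmatch(formula.strip()) is not None


# Illustrative database with two valid and three invalid formulas.
DB = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d", "e"],
        "formula": ["C6H12O6", "C5H9NO4", None, "C6H12O6.HCl", ""],
    }
)
DB["_valid"] = DB["formula"].apply(is_valid_formula)
# Dropping the helper column afterwards is an assumption; the text only says to
# remove the invalid rows and reset the index.
DB = DB[DB["_valid"]].drop(columns="_valid").reset_index(drop=True)
print(f"DB rows after formula validation: {len(DB)}")
```

This check is purely syntactic: it does not verify that each symbol is a real element, so a stricter rule may be needed for noisy databases.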
7. Clean adducts
Define required adduct columns: 'name', 'calc', 'Charge', 'Mult', 'Mass', 'Ion_mode', 'Formula_add', 'Formula_ded', 'Multi'.
Raise a ValueError if any required adduct columns are missing from adductsAll.columns.
Clean the 'Formula_add' and 'Formula_ded' columns in adductsAll: fill missing values, convert to string, strip whitespace, and replace 'H0' and 'nan' with an empty string.
Fill missing values in 'name', 'calc', and 'Ion_mode' columns in adductsAll with empty strings and convert to string type.
Convert 'Charge', 'Mult', 'Mass', and 'Multi' columns in adductsAll to numeric, coercing errors.
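The cleaning rules above can be sketched as a single function. The name `clean_adducts` and the two example rows are hypothetical, and the sketch assumes that 'H0' and 'nan' are replaced only when they make up the whole cell, not when they appear inside a longer formula.

```python
import pandas as pd

REQUIRED_ADDUCT_COLS = [
    "name", "calc", "Charge", "Mult", "Mass",
    "Ion_mode", "Formula_add", "Formula_ded", "Multi",
]


def clean_adducts(adducts):
    """Validate and normalise an adducts table following the rules in step 7."""
    missing = [c for c in REQUIRED_ADDUCT_COLS if c not in adducts.columns]
    if missing:
        raise ValueError(f"Adducts table is missing columns: {missing}")
    adducts = adducts.copy()
    for col in ("Formula_add", "Formula_ded"):
        cleaned = adducts[col].fillna("").astype(str).str.strip()
        # Whole-cell replacement only (assumption): 'H0' and 'nan' become ''.
        adducts[col] = cleaned.replace({"H0": "", "nan": ""})
    for col in ("name", "calc", "Ion_mode"):
        adducts[col] = adducts[col].fillna("").astype(str)
    for col in ("Charge", "Mult", "Mass", "Multi"):
        # Unparseable entries become NaN instead of raising.
        adducts[col] = pd.to_numeric(adducts[col], errors="coerce")
    return adducts


# Illustrative rows only; a real adducts file will contain many more entries.
demo = pd.DataFrame(
    {
        "name": ["M+H", "M+Na"],
        "calc": ["M+1.007276", "M+22.989218"],
        "Charge": [1, "1"],
        "Mult": [1, 1],
        "Mass": ["1.007276", "22.989218"],
        "Ion_mode": ["positive", "positive"],
        "Formula_add": ["H1", " Na1 "],
        "Formula_ded": ["H0", None],
        "Multi": [1, 1],
    }
)
adductsAll = clean_adducts(demo)
print(adductsAll[["name", "Mass", "Formula_add", "Formula_ded"]])
```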
8. Reset indices
Reset indices for DB, adductsAll, and df_ipa.
9. Run simpleIPA
Execute `ipa.simpleIPA` with specified parameters: df=df_ipa, ionisation=1, DB=DB, adductsAll=adductsAll, ppm=10.
Note: This may take a few minutes for 1000 features.
10. Save results
Print the result type using `type(result)`.
The result may be either a dict or a DataFrame, so handle both.
If `result` is a dict, convert each entry to a DataFrame, insert `feature_id` into its annotations, and concatenate the rows into `final_df`.
If `result` is a DataFrame, assign it to `final_df`.
Save `final_df` to CSV if it contains data.
Print the top annotations, or the first 20 rows if no 'post' column is present.
11. Handle annotations
If `annotations` is a DataFrame, copy it to `df_ann`, insert `feature_id`, and append to `rows`.
If `annotations` is a dict, create a row with `feature_id`, update with `annotations`, and append to `rows`.
Concatenate `rows` into `final_df` if not empty, else create an empty DataFrame.
If `result` is a DataFrame, assign it to `final_df`.
If `result` is neither a dict nor a DataFrame, print the unexpected result type and create an empty DataFrame.
If `final_df` is not empty, save it to CSV and print the top annotations.
If 'post' is in the `final_df` columns, sort by it and print the top annotations; otherwise, print the first 20 rows.
If `final_df` is empty, print 'No annotations to save'.
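Steps 10 and 11 together amount to normalising whatever the annotation call returns into one table. The sketch below assumes a dict result maps each feature ID to either a DataFrame of candidates or a dict of fields. The function names `collect_annotations` and `report`, the descending sort on 'post', and the example data are all assumptions made for illustration.

```python
import pandas as pd


def collect_annotations(result):
    """Return one DataFrame from a dict-of-annotations or a DataFrame result."""
    if isinstance(result, pd.DataFrame):
        return result
    if not isinstance(result, dict):
        print(f"Unexpected result type: {type(result)}")
        return pd.DataFrame()
    rows = []
    for feature_id, annotations in result.items():
        if isinstance(annotations, pd.DataFrame):
            df_ann = annotations.copy()
            if "feature_id" not in df_ann.columns:
                df_ann.insert(0, "feature_id", feature_id)
            rows.append(df_ann)
        elif isinstance(annotations, dict):
            row = {"feature_id": feature_id}
            row.update(annotations)
            rows.append(pd.DataFrame([row]))
    return pd.concat(rows, ignore_index=True) if rows else pd.DataFrame()


def report(final_df, out_path=None, n=20):
    """Save non-empty results to CSV (when a path is given) and print the top rows."""
    if len(final_df) == 0:
        print("No annotations to save")
        return
    if out_path is not None:
        final_df.to_csv(out_path, index=False)
    if "post" in final_df.columns:
        # Highest 'post' value first (assumed sort direction).
        print(final_df.sort_values("post", ascending=False).head(n))
    else:
        print(final_df.head(n))


# Illustrative result mixing both annotation shapes.
demo_result = {
    1: pd.DataFrame({"id": ["C00031", "C00095"], "post": [0.7, 0.3]}),
    2: {"id": "C00025", "post": 0.9},
}
final_df = collect_annotations(demo_result)
report(final_df)
```

Keeping the normalisation separate from the reporting makes it easy to reuse `final_df` later, for example when intersecting the ipaPy2 annotations with the R IPA output.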