Dual-Method Convergence Pipeline for LC-MS Metabolite Annotation: R IPA and ipaPy2

Lita Doolan

May 04, 2026

DOI

https://dx.doi.org/10.17504/protocols.io.bp2l6jqkkvqe/v1

Lita Doolan¹

¹KCL

Protocol Citation: Lita Doolan 2026. Dual-Method Convergence Pipeline for LC-MS Metabolite Annotation: R IPA and ipaPy2. protocols.io https://dx.doi.org/10.17504/protocols.io.bp2l6jqkkvqe/v1

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Protocol status: Working

We use this protocol and it's working

Created: May 03, 2026

Last Modified: May 04, 2026

Protocol Integer ID: 316224

Keywords: metabolomics, LC-MS, annotation, ipaPy2, IPA, probabilistic annotation, dual-method validation, HMDB, ECMDB, KEGG, E. coli, untargeted metabolomics, convergence, false positive reduction, reproducible pipeline

Abstract

This protocol describes a reproducible pipeline that applies R IPA and ipaPy2 independently to untargeted LC-MS metabolomics data and intersects their outputs to produce a high-confidence annotation set. The pipeline is demonstrated on E. coli data (Del Carratore et al., Zenodo DOI: 10.5281/zenodo.3414903). Code is available at: https://github.com/LDolanLDolan/metabolomics-dual-validation-pipeline
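
The intersection step can be illustrated with a minimal pandas sketch. It assumes both tools' outputs have already been exported to tables sharing a feature identifier and a compound identifier. The column names `feature_id` and `compound_id`, the function name `converge`, and the rule "keep a feature/compound pair only if both methods report it" are assumptions made for illustration, not details taken from the protocol or its repository.

```python
import pandas as pd


def converge(r_ipa, ipapy2, keys=("feature_id", "compound_id")):
    """Keep only the (feature, compound) pairs reported by both methods.

    Column names in `keys` are hypothetical; adapt them to the real exports.
    """
    keys = list(keys)
    # Drop repeated pairs so the inner join cannot multiply rows.
    left = r_ipa.drop_duplicates(subset=keys)
    right = ipapy2.drop_duplicates(subset=keys)
    # Inner join = set intersection on the key columns; other columns
    # are kept from both sides with a suffix naming their origin.
    return left.merge(right, on=keys, how="inner", suffixes=("_r", "_py"))
```

An inner join keeps the scores from both methods side by side, which makes it easy to inspect how strongly each tool supports a shared annotation.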

Guidelines

1. Paths
   - Define project root and output directory.
   - Specify input, database, adducts, and results file paths.

2. Check files exist
   - Verify existence of input, database, and adducts files.

3. Load files
   - Load input, database, and adducts files using `pd.read_csv`.

4. Clean DB columns
   - Strip whitespace from column names.
   - Rename specific columns for consistency.
   - Remove duplicated columns.

5. Build required DB columns
   - Ensure required columns exist and are correctly formatted.
   - Assign IDs based on available columns.
   - Handle missing columns with default values.

6. Validate DB formulas
   - Define `is_valid_formula` function to check formula validity.
   - Apply validation and remove invalid entries.

7. Clean adducts
   - Define required adduct columns.
   - Raise error if columns are missing.
   - Clean and format adduct data.

8. Reset indices
   - Reset indices for `DB`, `adductsAll`, and `df_ipa`.

9. Run simpleIPA
   - Execute `ipa.simpleIPA` with specified parameters.
   - Note: This may take a few minutes for 1000 features.

10. Save results
   - Handle both dict and dataframe outputs.
   - Convert dict result to dataframe if necessary.
   - Insert `feature_id` into annotations.
   - Concatenate rows into final dataframe.
   - Save final dataframe to CSV.
   - Print top annotations or first 20 rows if no 'post' column.

11. Handle annotations
   - Check if annotations are a DataFrame or dict.
   - Insert `feature_id` and append to rows.
   - Concatenate rows into `final_df`.
   - Save `final_df` to CSV and print top annotations.
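
Steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the repository's code: the directory layout, the file names, the helper names (`check_files_exist`, `load_inputs`, `clean_db_columns`), and the entries in `RENAME_MAP` are all hypothetical, since the protocol does not state which columns are renamed.

```python
from pathlib import Path

import pandas as pd

# Step 1: paths (all locations and file names below are hypothetical).
PROJECT_ROOT = Path(".")
OUTPUT_DIR = PROJECT_ROOT / "results"
INPUT_PATH = PROJECT_ROOT / "data" / "features.csv"
DB_PATH = PROJECT_ROOT / "data" / "database.csv"
ADDUCTS_PATH = PROJECT_ROOT / "data" / "adducts.csv"
RESULTS_PATH = OUTPUT_DIR / "ipapy2_annotations.csv"

# Hypothetical mapping; replace with the renames your database needs.
RENAME_MAP = {"Name": "name", "Formula": "formula"}


def check_files_exist(paths):
    """Step 2: raise one error that lists every missing file."""
    missing = [str(p) for p in paths if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError("Missing input file(s): " + ", ".join(missing))


def load_inputs(input_path, db_path, adducts_path):
    """Step 3: load the three CSV files after checking that they exist."""
    check_files_exist([input_path, db_path, adducts_path])
    return pd.read_csv(input_path), pd.read_csv(db_path), pd.read_csv(adducts_path)


def clean_db_columns(DB, rename_map=None):
    """Step 4: strip header whitespace, rename, drop duplicated columns."""
    DB = DB.copy()
    DB.columns = DB.columns.astype(str).str.strip()
    if rename_map:
        DB = DB.rename(columns=rename_map)
    # Keep the first occurrence of any column name that appears twice.
    return DB.loc[:, ~DB.columns.duplicated()]


# Typical use, once the files are in place:
# OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
# df_ipa, DB, adductsAll = load_inputs(INPUT_PATH, DB_PATH, ADDUCTS_PATH)
# DB = clean_db_columns(DB, RENAME_MAP)
```

Checking all files before loading any of them reports every missing path in a single error rather than one per run.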

5. Build required DB columns

If 'pk' is not in DB.columns, assign a range from 1 to the length of DB to DB['pk'].
If 'MS2' is not in DB.columns, set DB['MS2'] to an empty string.
If 'reactions' is not in DB.columns, set DB['reactions'] to an empty string.
Remove duplicated columns from DB.
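
A minimal pandas sketch of this step is shown below. The function name `build_required_db_columns` is hypothetical; the column names and defaults follow the description above.

```python
import pandas as pd


def build_required_db_columns(DB):
    """Add 'pk', 'MS2' and 'reactions' when absent, then drop duplicate columns."""
    DB = DB.copy()
    if "pk" not in DB.columns:
        # Sequential key running from 1 to the number of rows.
        DB["pk"] = range(1, len(DB) + 1)
    for col in ("MS2", "reactions"):
        if col not in DB.columns:
            DB[col] = ""
    # Keep the first occurrence of any repeated column name.
    return DB.loc[:, ~DB.columns.duplicated()]
```

Working on a copy leaves the loaded database untouched, so the step can be rerun safely in an interactive session.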

6. Validate DB formulas

Define `is_valid_formula` function to check the validity of a formula.
Apply `is_valid_formula` to the 'formula' column in DB and store results in a new column '_valid'.
Remove rows with invalid formulas from DB and reset the index.
Print the number of rows remaining in DB after validation.
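
The protocol does not state what `is_valid_formula` checks, so the sketch below assumes one simple rule: a formula is valid if it consists only of element-like tokens (an uppercase letter, an optional lowercase letter, an optional count). It does not verify that each token is a real element. The helper `drop_invalid_formulas` is a hypothetical name, and dropping the `_valid` column after filtering is a choice made here rather than something the protocol specifies.

```python
import re

import pandas as pd

# Assumed rule: one or more tokens such as "C6", "H12", "Na", "Cl".
_FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def is_valid_formula(formula):
    """Return True for strings that look like a plain molecular formula."""
    if not isinstance(formula, str):
        return False  # NaN, None, numbers
    return _FORMULA_RE.fullmatch(formula.strip()) is not None


def drop_invalid_formulas(DB):
    """Flag each row in '_valid', keep the valid ones, and reset the index."""
    DB = DB.copy()
    DB["_valid"] = DB["formula"].apply(is_valid_formula)
    DB = DB[DB["_valid"]].drop(columns="_valid").reset_index(drop=True)
    print(f"DB rows after formula validation: {len(DB)}")
    return DB
```

Under this rule, empty strings, missing values, and the literal text "nan" are all rejected, which covers the most common artefacts of CSV round trips.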

7. Clean adducts

Define required adduct columns: 'name', 'calc', 'Charge', 'Mult', 'Mass', 'Ion_mode', 'Formula_add', 'Formula_ded', 'Multi'.
Raise a ValueError if any required adduct columns are missing from adductsAll.columns.
Clean the 'Formula_add' and 'Formula_ded' columns in adductsAll in the same way: fill empty values, convert to string, strip whitespace, and replace 'H0' and 'nan' with an empty string.
Fill missing values in 'name', 'calc', and 'Ion_mode' columns in adductsAll with empty strings and convert to string type.
Convert 'Charge', 'Mult', 'Mass', and 'Multi' columns in adductsAll to numeric, coercing errors.
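
The adduct cleaning described above can be sketched as below. The function name `clean_adducts` is hypothetical; the required columns and the cleaning rules are taken from this step.

```python
import pandas as pd

REQUIRED_ADDUCT_COLS = [
    "name", "calc", "Charge", "Mult", "Mass",
    "Ion_mode", "Formula_add", "Formula_ded", "Multi",
]


def clean_adducts(adductsAll):
    """Validate the adduct table and normalise its column types."""
    missing = [c for c in REQUIRED_ADDUCT_COLS if c not in adductsAll.columns]
    if missing:
        raise ValueError(f"Adducts file is missing columns: {missing}")
    out = adductsAll.copy()
    for col in ("Formula_add", "Formula_ded"):
        cleaned = out[col].fillna("").astype(str).str.strip()
        # Whole-cell matches only: 'H0' and the text 'nan' become empty.
        out[col] = cleaned.replace({"H0": "", "nan": ""})
    for col in ("name", "calc", "Ion_mode"):
        out[col] = out[col].fillna("").astype(str)
    for col in ("Charge", "Mult", "Mass", "Multi"):
        # Unparseable entries become NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out
```

Because `errors="coerce"` silently turns bad numbers into NaN, it may be worth inspecting `out[["Charge", "Mult", "Mass", "Multi"]].isna().sum()` after cleaning.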

8. Reset indices

Reset indices for DB, adductsAll, and df_ipa.
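
A short sketch of this step follows. The helper name `reset_all` is hypothetical. The protocol does not say why the reset is needed; a plausible reason, stated here as an assumption, is that earlier row filtering leaves gaps in the index labels.

```python
import pandas as pd


def reset_all(*frames):
    """Return copies of the given DataFrames, each with a fresh 0..n-1 index."""
    return tuple(frame.reset_index(drop=True) for frame in frames)


# Typical use:
# DB, adductsAll, df_ipa = reset_all(DB, adductsAll, df_ipa)
```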

9. Run simpleIPA

Execute `ipa.simpleIPA` with the parameters `df=df_ipa`, `ionisation=1`, `DB=DB`, `adductsAll=adductsAll`, and `ppm=10`.
Note: This may take a few minutes for 1000 features.
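
The call can be wrapped as in the sketch below, which also reports the run time. The wrapper `run_simple_ipa` and its `annotate` argument are hypothetical additions made so the function can be exercised without ipaPy2 installed; the keyword arguments passed to `ipa.simpleIPA` are the ones listed in this step, and the import assumes the package is installed under the name `ipaPy2`.

```python
import time


def run_simple_ipa(df_ipa, DB, adductsAll, ppm=10, ionisation=1, annotate=None):
    """Call ipa.simpleIPA with the protocol's parameters and time the run.

    `annotate` is a hypothetical hook: pass any callable with the same
    keyword arguments to test the surrounding code without ipaPy2.
    """
    if annotate is None:
        # Imported lazily so this module loads even without ipaPy2.
        from ipaPy2 import ipa
        annotate = ipa.simpleIPA
    start = time.perf_counter()
    result = annotate(
        df=df_ipa, ionisation=ionisation, DB=DB, adductsAll=adductsAll, ppm=ppm
    )
    elapsed = time.perf_counter() - start
    print(f"simpleIPA finished in {elapsed:.1f} s; result type: {type(result).__name__}")
    return result


# Typical use:
# result = run_simple_ipa(df_ipa, DB, adductsAll)
```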

10. Save results

Print the result type using `type(result)`.
Handle both dict and dataframe outputs.
If `result` is a dict, convert it to a dataframe and insert `feature_id` into annotations.
Concatenate rows into `final_df`.
If `result` is a dataframe, assign it to `final_df`.
Save `final_df` to CSV if it contains data.
Print top annotations or first 20 rows if no 'post' column is present.

11. Handle annotations

When `result` is a dict, handle each entry's `annotations` as follows.
If `annotations` is a DataFrame, copy it to `df_ann`, insert `feature_id`, and append it to `rows`.
If `annotations` is a dict, create a row with `feature_id`, update it with `annotations`, and append it to `rows`.
Concatenate `rows` into `final_df` if `rows` is not empty; otherwise create an empty DataFrame.
If `result` is a DataFrame, assign it to `final_df` directly.
If `result` is neither a dict nor a DataFrame, print the unexpected result type and create an empty DataFrame.
If `final_df` has at least one row, save it to CSV and print the top annotations: if 'post' is in `final_df` columns, sort by it before printing; otherwise, print the first 20 rows.
If `final_df` is empty, print 'No annotations to save'.
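
Steps 10 and 11 can be combined into two small functions, sketched below. The names `collect_annotations` and `save_and_preview` are hypothetical, and sorting 'post' in descending order (highest value first) is an assumption, since the protocol only says the annotations are sorted.

```python
import pandas as pd


def collect_annotations(result):
    """Normalise a simpleIPA result (dict or DataFrame) into one DataFrame."""
    if isinstance(result, pd.DataFrame):
        return result.copy()
    if not isinstance(result, dict):
        print(f"Unexpected result type: {type(result)}")
        return pd.DataFrame()
    rows = []
    for feature_id, annotations in result.items():
        if isinstance(annotations, pd.DataFrame):
            df_ann = annotations.copy()
            if "feature_id" not in df_ann.columns:
                df_ann.insert(0, "feature_id", feature_id)
            rows.append(df_ann)
        elif isinstance(annotations, dict):
            row = {"feature_id": feature_id}
            row.update(annotations)
            rows.append(pd.DataFrame([row]))
    return pd.concat(rows, ignore_index=True) if rows else pd.DataFrame()


def save_and_preview(final_df, results_path, top_n=20):
    """Write the annotations to CSV and print a short preview."""
    if len(final_df) == 0:
        print("No annotations to save")
        return
    final_df.to_csv(results_path, index=False)
    if "post" in final_df.columns:
        # Assumed direction: largest 'post' value first.
        print(final_df.sort_values("post", ascending=False).head(top_n))
    else:
        print(final_df.head(20))


# Typical use:
# final_df = collect_annotations(result)
# save_and_preview(final_df, "ipapy2_annotations.csv")  # hypothetical file name
```

Separating normalisation from saving keeps the type handling in one place, so the same `final_df` can be passed on to the comparison with the R IPA output.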