Jul 03, 2026

Identifying methods citations in Parkinsons journal articles

  • Francis Crick Institute
Protocol Citation: Beth Montague-Hellen 2026. Identifying methods citations in Parkinsons journal articles. protocols.io https://dx.doi.org/10.17504/protocols.io.8epv5k5y6v1b/v1
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working
Created: November 04, 2025
Last Modified: July 03, 2026
Protocol Integer ID: 231505
Keywords: Open Methods, Parkinsons, Pubmed, Parkinson's Disease, Open Access Corpus
Funders Acknowledgements:
Cancer Research UK
Grant ID: CC0103
Wellcome Trust
Grant ID: CC0103
UK Medical Research Council
Grant ID: CC0103
Abstract
Protocol for creating a corpus of Parkinson's disease articles from PubMed and identifying whether these articles mention protocols.io or cite open methods journals.
Guidelines
As with all computer algorithms, the code included here may introduce errors. The only way to be certain of the results is to check them manually.
Download PubMed data
Search PubMed for: "Parkinson Disease"[MeSH Terms] AND ((excludepreprints[Filter]) AND (english[Filter]) AND (2023:2025[pdat])) NOT Review[ptyp]

Click the Save button. Select "All results" as the selection and "CSV" as the format.
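Before moving on, it can help to confirm that the export covers the expected years. The following is a minimal sketch, not part of the protocol's own scripts. It assumes the export has a "Publication Year" column; the sample rows are made up for illustration.

```python
import csv
import io


def read_pubmed_export(handle):
    """Read a PubMed CSV export from an open text handle into a list of dicts."""
    return list(csv.DictReader(handle))


def count_by_year(rows, year_column="Publication Year"):
    """Count rows per publication year; the column name is an assumption."""
    counts = {}
    for row in rows:
        year = (row.get(year_column) or "").strip()
        counts[year] = counts.get(year, 0) + 1
    return counts


# Tiny in-memory stand-in for the export (made-up rows, not real records).
SAMPLE_EXPORT = (
    "PMID,Title,Publication Year,PMCID\n"
    "1,A,2023,PMC1\n"
    "2,B,2024,\n"
    "3,C,2024,PMC3\n"
)
sample_rows = read_pubmed_export(io.StringIO(SAMPLE_EXPORT))
print(count_by_year(sample_rows))
```

For a real export, pass a file opened with open(path, newline="", encoding="utf-8-sig") instead of the in-memory sample, where path is wherever the CSV was saved. Any year outside 2023 to 2025 in the counts would suggest the date filter was not applied.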
Identify records with fully available text

Note
Whilst there are other ways to download large text corpora, this process ensures that every article may legally be text mined.


Run retrievePMCID_papers.py
Output file is Parkinson_papers_with_pmcids.csv
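The contents of retrievePMCID_papers.py are not reproduced here, so the following is only a sketch of one way this step could work. It assumes the PubMed export already carries a "PMCID" column and simply keeps the rows where that column is filled in; the actual script may instead look IDs up through an NCBI service.

```python
import csv
import io


def rows_with_pmcid(rows, pmcid_column="PMCID"):
    """Keep only rows whose PMCID cell is non-empty (column name is assumed)."""
    return [row for row in rows if (row.get(pmcid_column) or "").strip()]


def write_rows(rows, handle, fieldnames):
    """Write the selected rows as CSV, ignoring any columns not in fieldnames."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up example rows standing in for the PubMed export.
example_rows = [
    {"PMID": "1", "PMCID": "PMC1"},
    {"PMID": "2", "PMCID": ""},
    {"PMID": "3", "PMCID": "PMC3"},
]
kept = rows_with_pmcid(example_rows)
buffer = io.StringIO()
write_rows(kept, buffer, fieldnames=["PMID", "PMCID"])
```

Only articles with a PMCID can have full text in PubMed Central, which is why the remaining steps work from this reduced list.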
Download OA file list at https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt
Run CheckOAstatus.py
Output:
- Parkinsons_papers_with_pmcids_and_oa.csv
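The open access check can be sketched as a set lookup. This is an illustration under stated assumptions, not the code in CheckOAstatus.py: it assumes oa_file_list.txt is tab-separated with the PMCID as one whole field per line, and the "is_oa" column name is hypothetical.

```python
import re

# A field counts as a PMCID only if it is exactly "PMC" followed by digits.
PMCID_FIELD = re.compile(r"PMC[0-9]+")


def load_oa_pmcids(lines):
    """Collect every PMCID that appears as a whole tab-separated field."""
    oa_ids = set()
    for line in lines:
        for field in line.rstrip("\n").split("\t"):
            field = field.strip()
            if PMCID_FIELD.fullmatch(field):
                oa_ids.add(field)
    return oa_ids


def flag_open_access(rows, oa_ids, pmcid_column="PMCID"):
    """Return copies of the rows with a hypothetical 'is_oa' column added."""
    flagged = []
    for row in rows:
        pmcid = (row.get(pmcid_column) or "").strip()
        flagged.append(dict(row, is_oa="1" if pmcid in oa_ids else "0"))
    return flagged
```

Because load_oa_pmcids accepts any iterable of lines, an open file handle can be passed directly, so the full list never needs to be held in memory as text. Header or timestamp lines are skipped naturally because they contain no field that is exactly a PMCID.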
Run download_pmc_xml.py
The output file for each paper is written to a directory called pmc_bioc_xml/
Identify papers from a set of funders
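The directory name suggests the full text is fetched in BioC XML format. The sketch below assumes the NCBI BioC API for PMC open access articles is the source; that endpoint choice and the one-file-per-PMCID naming are assumptions, not details confirmed by the protocol.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumed endpoint: the NCBI BioC RESTful API for PMC open access articles.
BIOC_URL = (
    "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/"
    "pmcoa.cgi/BioC_xml/{pmcid}/unicode"
)


def bioc_url(pmcid):
    """Build the request URL for one PMCID."""
    return BIOC_URL.format(pmcid=pmcid)


def output_path(pmcid, out_dir="pmc_bioc_xml"):
    """Hypothetical naming scheme: one XML file per PMCID."""
    return Path(out_dir) / f"{pmcid}.xml"


def download_bioc_xml(pmcid, out_dir="pmc_bioc_xml", timeout=30):
    """Fetch one article and save it, skipping files that already exist."""
    target = output_path(pmcid, out_dir)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(bioc_url(pmcid), timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

Skipping existing files means an interrupted run can be restarted without downloading everything again. When looping over many PMCIDs, a short pause between requests is a reasonable courtesy to the server.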
Run scan_funders_global.py
Output:
- funder_counts_global.csv - a table with the funder in column 1 and the number of articles in column 2
- funder_matches_global.csv - a binary matrix with the PMCID in column 1 and one column per funder: 1 where the funder is acknowledged in the paper, 0 where it is not.
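The two outputs described above can be illustrated with a small sketch. It is not the code in scan_funders_global.py: the funder names, their spelling variants, and the plain case-insensitive text matching are all assumptions chosen for illustration.

```python
import re


def funder_patterns(funders):
    """Compile one case-insensitive pattern per funder from its name variants."""
    return {
        name: re.compile("|".join(re.escape(v) for v in variants), re.IGNORECASE)
        for name, variants in funders.items()
    }


def match_funders(texts, funders):
    """Return (matrix, counts): per-paper 0/1 flags and per-funder article totals."""
    patterns = funder_patterns(funders)
    matrix = {
        pmcid: {name: int(bool(p.search(text))) for name, p in patterns.items()}
        for pmcid, text in texts.items()
    }
    counts = {name: sum(row[name] for row in matrix.values()) for name in patterns}
    return matrix, counts


# Hypothetical funder list and made-up article texts.
FUNDERS = {
    "Wellcome Trust": ["Wellcome Trust", "Wellcome"],
    "Cancer Research UK": ["Cancer Research UK", "CRUK"],
}
TEXTS = {
    "PMC1": "Funded by the Wellcome Trust.",
    "PMC2": "Supported by CRUK and Wellcome.",
    "PMC3": "No external funding.",
}
matrix, counts = match_funders(TEXTS, FUNDERS)
```

Here matrix corresponds to the layout of funder_matches_global.csv and counts to funder_counts_global.csv. Substring matching of this kind can pick up a funder name that appears outside the acknowledgements, so spot checks remain worthwhile.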
Run normalise_pmcid.py to ensure that every PMCID column contains a correctly formatted ID.
Output:
- funder_matches_global_w_pmc.csv
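What counts as a correct ID is not spelled out in the protocol. A plausible reading, sketched below, is that every value should end up in the form "PMC" followed by digits, whether it started as a file name, a lower-case ID, or a bare number. The function here is hypothetical and may differ from normalise_pmcid.py.

```python
import re


def normalise_pmcid(value):
    """Return 'PMC<digits>', or '' when no ID can be recovered from the value."""
    text = str(value).strip()
    # Prefer an explicit PMC-prefixed ID anywhere in the value (e.g. a file name).
    prefixed = re.search(r"PMC\s*([0-9]+)", text, re.IGNORECASE)
    if prefixed:
        return f"PMC{prefixed.group(1)}"
    # Otherwise accept a value that consists only of digits.
    if re.fullmatch(r"[0-9]+", text):
        return f"PMC{text}"
    return ""
```

Returning an empty string for unrecognised values, rather than guessing, makes rows that need manual attention easy to find in the output.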
Identify links or citations to protocols.io or methods journals
Run protocolsio_ref_counts.py
Output:
- funder_matches_global_with_protocolsio_flags.csv
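One way to make manual checking faster is to separate a bare mention of protocols.io from an actual protocol DOI. The sketch below is an illustration only and is not the logic of protocolsio_ref_counts.py; it assumes protocol DOIs use the 10.17504/protocols.io prefix seen in this protocol's own citation, and the flag names are hypothetical.

```python
import re

# Any mention of the site name.
PROTOCOLS_IO = re.compile(r"protocols\.io", re.IGNORECASE)
# A protocol DOI: the prefix followed by an alphanumeric identifier.
PROTOCOLS_DOI = re.compile(r"10\.17504/protocols\.io\.[a-z0-9]+", re.IGNORECASE)


def protocolsio_flags(text):
    """Flag a bare mention and list any distinct protocol DOIs found in the text."""
    dois = sorted({match.lower() for match in PROTOCOLS_DOI.findall(text)})
    return {
        "mentions_protocolsio": int(bool(PROTOCOLS_IO.search(text))),
        "protocol_dois": dois,
    }
```

A paper that mentions protocols.io but has no DOI in protocol_dois is a natural candidate for closer human review, since the mention may come from journal boilerplate rather than from a cited protocol.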

Note
This step overpredicts protocols.io mentions, particularly in PLOS ONE articles, so the results should be checked by a human. The references.txt file lists the references that were identified, which makes this check quicker.