Jun 26, 2025

Open Science 24/25 - Protocol V.3

  • Alberto Ciarrocca1,
  • Anna Nicoletti1,
  • Ahmadreza Nazari1,
  • Pietro Tisci1,
  • Lucrezia Pograri1,
  • Martina Pensalfini1,
  • Sergei Slinkin1
  • 1University of Bologna
  • Open Science
Protocol Citation: Alberto Ciarrocca, Anna Nicoletti, Ahmadreza Nazari, Pietro Tisci, Lucrezia Pograri, Martina Pensalfini, Sergei Slinkin 2025. Open Science 24/25 - Protocol. protocols.io https://dx.doi.org/10.17504/protocols.io.4r3l264ppv1y/v3. Version created by Alberto Ciarrocca
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License,  which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: June 25, 2025
Last Modified: June 26, 2025
Protocol Integer ID: 221022
Keywords: open science, University of Bologna, UNIBO, Alma Mater Studiorum, IRIS, research objects, repositories, repository overlap, research outputs, citation analysis, citation impact, citation patterns, metadata, OpenCitations, Current Research Information System, dissemination
Abstract
Purpose
This study aims to evaluate the representation and dissemination of diverse research outputs - such as software, databases, exhibitions, audio-visual materials, and others - produced by the University of Bologna's (UNIBO) researchers across various repositories. It seeks to assess the extent of overlap among these repositories, analyze citation dynamics involving these research objects, and determine their integration within UNIBO's [Current Research Information System (IRIS)](https://cris.unibo.it).
The Research Questions (RQs) are formulated as follows:
1. What is the current coverage of these kinds of research objects created by UNIBO personnel in existing repositories?
2. Is there any overlap among these repositories (i.e., research objects deposited in more than one)?
3. How many citations (incoming and outgoing), as in OpenCitations, are these research objects involved in?
4. How much of such research objects are actually mapped in IRIS?

Methodology
The project team has systematically collected and analyzed metadata from selected institutional, disciplinary, and generalist repositories, including [AMS Acta](https://amsacta.unibo.it/), [Software Heritage](https://www.softwareheritage.org/), and [Zenodo](https://zenodo.org/). Relevant data and metadata were extracted using APIs and web scraping techniques. The analysis identified cross-repository depositions to assess overlaps and employed citation analysis tools, such as [OpenCitations](https://opencitations.net/), to examine citation patterns. Additionally, the collected data were cross-referenced with IRIS to evaluate its coverage of these research outputs.

Value
This study sheds light on the dissemination patterns and citation impact of UNIBO's diverse research outputs, providing insights to enhance their visibility and accessibility. The findings can inform strategies to optimize repository usage and improve the integration of these research objects within IRIS, ultimately strengthening open science practices at the University of Bologna and promoting a more cohesive approach for maximizing the impact of its research outputs.
Before start
Note: Python 3.13.2 or an earlier version is required to run the software.
Prepare the working environment by executing the following commands:

# Clone the repository (replace <repository-url> with the project's URL)
git clone <repository-url>
# Move to the repository folder
cd 2024-2025
# Install dependencies
pip install -r requirements.txt


Data Gathering
Use the Zenodo RESTful API to harvest metadata in JSON format.


All the records were retrieved through the "records" operation of the API (https://zenodo.org/api/records), by querying for affiliation (of creators and contributors) and ORCID IDs extracted from IRIS (for creators and contributors).

The script used for the query is in [ZenodoDataExtraction.ipynb]

Expected result
A ZenodoData.json file following the Zenodo JSON metadata schema, containing the properties that describe the research objects, such as:

'title', 'doi', 'publication_date', 'description', 'access_right', 'license', 'creators', 'contributors', 'keywords', 'language', 'resource_type', 'relations', 'swh'
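The query step above can be sketched as follows; this is a minimal illustration, not the notebook's actual code, and the exact affiliation query syntax used in ZenodoDataExtraction.ipynb is an assumption here:

```python
ZENODO_API = "https://zenodo.org/api/records"

def build_zenodo_params(affiliation="University of Bologna", size=100, page=1):
    """Build query parameters for the Zenodo 'records' operation.

    The affiliation search string below is an assumption about the
    query syntax; the real queries live in ZenodoDataExtraction.ipynb.
    """
    query = (f'creators.affiliation:"{affiliation}" '
             f'OR contributors.affiliation:"{affiliation}"')
    return {"q": query, "size": size, "page": page}

# Example (network call, not executed here):
# import requests
# response = requests.get(ZENODO_API, params=build_zenodo_params())
# records = response.json()["hits"]["hits"]
```

The same pattern applies when querying by the ORCID IDs extracted from IRIS, with the query string built from the identifier list instead of the affiliation.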



Metadata was harvested and filtered from the AMS Acta repository using the OAI-PMH API and the Python library Sickle.


Metadata was collected to obtain ePrint IDs, which were then used to retrieve enriched JSON records. A two-step filtering process - based on author affiliations and ORCID identifiers - was applied to isolate records associated with the University of Bologna. The final dataset includes approximately 1,000 publications, highlighting various research outputs including datasets and software.
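The two-step filter can be sketched as below. The field names ('creators', 'affiliation', 'orcid') and the AMS Acta OAI-PMH endpoint URL are assumptions (the endpoint follows the EPrints default); the real logic is in filteringAMSActa.ipynb:

```python
def is_unibo_record(record, orcid_whitelist):
    """Heuristic two-step filter (sketch): keep a record when any creator
    declares a Bologna affiliation or appears in the IRIS ORCID list."""
    for creator in record.get("creators", []):
        affiliation = (creator.get("affiliation") or "").lower()
        if "bologna" in affiliation:
            return True
        if creator.get("orcid") in orcid_whitelist:
            return True
    return False

# Harvesting sketch with Sickle (network call, not executed here):
# from sickle import Sickle
# sickle = Sickle("https://amsacta.unibo.it/cgi/oai2")  # assumed endpoint
# for rec in sickle.ListRecords(metadataPrefix="oai_dc"):
#     ...
```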

The script used for the query is in [AMSActaReduced.ipynb] and the script for the filtering is in [filteringAMSActa.ipynb]

Expected result
amsacta_filtered_affiliation_or_orcid_doubles.json file in Dublin Core format containing the properties to describe research objects, such as:

'uri', 'creators', 'monograph_type', 'publisher', 'title', 'date', 'doi', 'eprintid', 'keywords', 'type', 'abstract'

Dataset
amsacta_filtered_affiliation_or_orcid_doubles.json

Query the Software Heritage API endpoint to identify software repositories affiliated with the University of Bologna.


An initial query of the API endpoint was performed with relevant keywords; the results were later filtered using heuristic keyword searches and email-domain analysis: the pipeline filters candidate origins by inspecting README content and commit metadata for institutional markers.

The script used for the query and filtering is in [swh_data_extraction.ipynb]

Expected result
unibo_repositories_swh.json file of validated repositories containing the following properties:

'url', 'rev', 'dir_id'

The 'authors' object is added during the processing and contains the 'name' and 'email' properties.
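The search and validation steps can be sketched as follows; the endpoint path follows the public Software Heritage Web API, while the marker list and helper names are illustrative stand-ins for the pipeline in swh_data_extraction.ipynb:

```python
SWH_API = "https://archive.softwareheritage.org/api/1"

def origin_search_url(keyword, limit=100):
    """URL for the Software Heritage origin-search endpoint."""
    return f"{SWH_API}/origin/search/{keyword}/?limit={limit}"

# Institutional markers used for the heuristic filter (assumed examples)
UNIBO_MARKERS = ("unibo.it", "university of bologna", "università di bologna")

def looks_like_unibo(readme_text, commit_emails):
    """Validation sketch: institutional markers in the README content,
    or an @unibo.it committer email in the commit metadata."""
    text = (readme_text or "").lower()
    if any(marker in text for marker in UNIBO_MARKERS):
        return True
    return any(email.lower().endswith("@unibo.it") for email in commit_emails)
```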


The newest version (May 30th, 2025) of the IRIS data dump was retrieved for later use in our research.
Dataset
UNIBO IRIS bibliographic data dump
It comprises six CSV files describing 412,272 bibliographic entities, with a total uncompressed size of 413 MB (125 MB zipped).
Convert and Normalize Data across datasets
Read all datasets obtained in the previous steps with pandas and extract the necessary information from nested key-value pairs.
Normalize the text (trimming whitespace, removing HTML tags, etc.), drop duplicates, and normalize all the column labels.
Additionally, each entry is flagged with its origin dataset in the 'src_repo' column.
Vertically merge all four datasets and output a single CSV file of 419,829 records with a size of 120 MB.
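The normalization and merging steps can be sketched as follows; this is a minimal illustration under assumed helper names, not the code of mashup.ipynb:

```python
import re
import pandas as pd

def normalize_text(value):
    """Trim whitespace and strip HTML tags from a metadata field."""
    if not isinstance(value, str):
        return value
    return re.sub(r"<[^>]+>", "", value).strip()

def prepare(df, src_repo):
    """Normalize one dataset and flag its origin repository."""
    df = df.copy()
    df.columns = [c.strip().lower() for c in df.columns]  # normalize labels
    df = df.apply(lambda col: col.map(normalize_text))    # clean every cell
    df = df.drop_duplicates()
    df["src_repo"] = src_repo                             # flag the origin
    return df

# Vertical merge of the four normalized datasets:
# mashup = pd.concat([zenodo, amsacta, swh, iris], ignore_index=True)
# mashup.to_csv("mashup.csv", index=False)
```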

The script for the conversion, normalization and merging is in [mashup.ipynb]

Expected result

Columns: title, id, doi, creators, orcid, date, description, resource_type, url, type, rights, publisher, relation, communities, swh_id, keywords, src_repo, issn, pmid
mashup.csv



Data Analysis
Filter the mash-up data to obtain research objects meaningful to our research purposes.

To do so we created the dictionary 'iris_map', which lists the 15 IRIS categories relevant to our project. For every incoming row the helper 'classify_row()' concatenates the 'type' and 'resource_type' fields, lowercases the result, and tests the string against the regular-expression list attached to each IRIS code.

Rows that match any pattern receive a new label 'iris_cat'; the others are discarded on the fly.

All labelled chunks are concatenated.
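The classification helper can be sketched as below; the two IRIS codes and patterns shown are hypothetical stand-ins for the 15 real entries in RQ_1.py:

```python
import re

# Hypothetical examples; the real iris_map holds 15 IRIS categories.
iris_map = {
    "07.01": [r"\bsoftware\b", r"\bcode\b"],
    "07.03": [r"\bdataset\b", r"\bdatabase\b"],
}

def classify_row(row):
    """Concatenate 'type' and 'resource_type', lowercase the result, and
    return the first IRIS code whose pattern list matches (else None)."""
    text = f"{row.get('type', '')} {row.get('resource_type', '')}".lower()
    for code, patterns in iris_map.items():
        if any(re.search(p, text) for p in patterns):
            return code
    return None

# Rows with no matching pattern are discarded:
# mashup["iris_cat"] = mashup.apply(classify_row, axis=1)
# subset = mashup.dropna(subset=["iris_cat"])
```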

Expected result
mashup_IRIS_subset_v3.csv (8,699 rows, 20 columns: the original metadata plus iris_cat)

Answer RQ1 ("What is the current coverage of these kinds of research objects created by UNIBO personnel in existing repositories?") by counting in mashup_IRIS_subset_v3.csv how many research objects we have extracted from each dataset.

The script for the filtering of the data and for answering RQ1 is [RQ_1.py]
Answer RQ2 ("Is there any overlap among these repositories (i.e., research objects deposited in more than one)?") by detecting duplicates among all repositories in mashup_IRIS_subset_v3.csv.

We accomplished this by comparing both the 'doi' and 'title' columns and producing a new table outlining the amount of overlapping objects for each pair of datasets.
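The pairwise comparison can be sketched as below; this is an illustration under assumed column names, not the code of RQ_2.py:

```python
from itertools import combinations

import pandas as pd

def overlap_table(df):
    """Count overlapping objects for each pair of repositories (sketch).

    Two rows are treated as the same object when they share a DOI or an
    identical title; df needs 'doi', 'title' and 'src_repo' columns.
    """
    rows = []
    for repo_a, repo_b in combinations(df["src_repo"].unique(), 2):
        a = df[df["src_repo"] == repo_a]
        b = df[df["src_repo"] == repo_b]
        shared_dois = set(a["doi"].dropna()) & set(b["doi"].dropna())
        shared_titles = set(a["title"].dropna()) & set(b["title"].dropna())
        rows.append({"repo_a": repo_a, "repo_b": repo_b,
                     "shared_dois": len(shared_dois),
                     "shared_titles": len(shared_titles)})
    return pd.DataFrame(rows)
```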

Expected result
duplicate_objects_v3.csv

The script for answering RQ2 is [RQ_2.py]

The answer to RQ4 ("How much of such research objects are actually mapped in IRIS?") is already present in duplicate_objects_v3.csv, since IRIS is part of mashup_IRIS_subset_v3.csv.
Answer RQ3 ("How many citations (incoming and outgoing), as in OpenCitations, are these research objects involved in?")

The script for answering RQ3 is [RQ_3.py]
We query the OpenCitations API to retrieve the number of ingoing and outgoing citations for each available DOI.

We used the 'citations' operation to retrieve ingoing citations and 'references' to retrieve outgoing citations; the resulting JSON file was then converted to CSV format.
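The two operations can be sketched as below; the endpoint paths follow the public OpenCitations COCI API, and the example DOI is a hypothetical placeholder:

```python
OC_API = "https://opencitations.net/index/coci/api/v1"

def citations_url(doi):
    """'citations' operation: ingoing citations for a DOI."""
    return f"{OC_API}/citations/{doi}"

def references_url(doi):
    """'references' operation: outgoing citations for a DOI."""
    return f"{OC_API}/references/{doi}"

# Example (network call, not executed here; placeholder DOI):
# import requests
# incoming = requests.get(citations_url("10.1234/example")).json()
# cit_num_ingoing = len(incoming)
```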

Expected result

Columns: doi, cit_num_outgoing, cit_num_ingoing, oci_outgoing, oci_ingoing, status
CitationsCount.csv


After collecting all the citations from the API we needed to reassign each DOI to its original repository, connecting the list of DOIs used to the corresponding 'src_repo' value from mashup_IRIS_subset_v3.csv.
With the method '.iterrows()' we calculated the total ingoing and outgoing citations for each repository and stored them in two dictionaries, 'ingoing' and 'outgoing', where the keys are the four repositories and the values are the sums of the citations.
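The aggregation step can be sketched as follows; the function name and the shape of the DOI-to-repository mapping are assumptions:

```python
import pandas as pd

def totals_per_repo(counts, doi_to_repo):
    """Sum ingoing and outgoing citations per repository (sketch).

    counts: DataFrame with 'doi', 'cit_num_ingoing', 'cit_num_outgoing';
    doi_to_repo: mapping from each DOI to its 'src_repo' value.
    """
    ingoing, outgoing = {}, {}
    for _, row in counts.iterrows():          # mirrors the .iterrows() step
        repo = doi_to_repo.get(row["doi"])
        if repo is None:
            continue
        ingoing[repo] = ingoing.get(repo, 0) + row["cit_num_ingoing"]
        outgoing[repo] = outgoing.get(repo, 0) + row["cit_num_outgoing"]
    return ingoing, outgoing
```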
Data Visualisation
In order to visualise key findings from our research, we chose to represent the data obtained in the Data Analysis section in bar charts, using pandas and matplotlib.pyplot.
A Jupyter notebook was created to present the results with visualisations.
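A bar chart of this kind can be sketched as below; the function name and the example figures are placeholders, not results from the notebook:

```python
import matplotlib
matplotlib.use("Agg")                      # headless backend for scripts
import matplotlib.pyplot as plt

def citation_bar_chart(totals, outpath="citations_per_repo.png"):
    """Bar chart of citation totals per repository (sketch)."""
    fig, ax = plt.subplots()
    ax.bar(list(totals), list(totals.values()))
    ax.set_xlabel("Repository")
    ax.set_ylabel("Citations")
    ax.set_title("Citations per repository")
    fig.savefig(outpath)
    return ax

# Example with placeholder numbers:
# citation_bar_chart({"zenodo": 120, "amsacta": 40, "swh": 5, "iris": 300})
```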
Expected result
DataVisualisation.ipynb

Data Publication
All software has been published here
All datasets have been published here