License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: May 27, 2026
Last Modified: June 15, 2026
Protocol Integer ID: 318021
Keywords: OpenCitations, ROR, OpenAIRE, map of italian science, citations network, italian universities through opencitation, italian university, institutional network, citation link, bloom map, cited publication, research output, university, research, publication, author affiliation, opencitation, international reach of italian academic research, italian academic research, research organization registry, iris publication, tracing citation link, bloom project, part of the bloom project, institution, data from opencitation, specific institution, international reach, mapping, publication with author affiliation, research question, country, bloom map of italian science protocol, italian science protocol, specific italian institution, italian institution, university citation, outgoing citation relationship, iris dataset, reproducible workflow, compressed archive
Abstract
As part of the BLOOM project, this protocol describes a reproducible workflow to map the international reach of Italian academic research. The workflow addresses the research question: what are the institutions and countries that cite, or are cited by, the IRIS publications of a specific Italian institution as indexed in OpenCitations?
The protocol starts from the IRIS dataset for six Italian institutions and combines it with OpenCitations, OpenAIRE and ROR data. The workflow uses timestamped data dumps and streaming scripts that scan compressed archives, keep only the records needed by the project, and write explicit intermediate files.
The final outputs are per-university citation-count tables for incoming and outgoing citation relationships, aggregated both by organization and by country.
Prepare the source datasets
The IRIS dump contains the publication records for the six Italian institutions considered in the project. It is structured by university, with one folder per institution. The key input for this workflow is each university’s iris_in_oc_index/iris_in_oc_index.csv, which lists citation pairs involving IRIS publications already recognized in OpenCitations. Each row includes the citing and cited OMIDs, the citation identifier, and flags indicating whether each side belongs to the IRIS set.
The ROR dump contains Research Organization Registry records. It is provided as a JSON dataset where each record describes an organization, including its ROR identifier, names, and geographic location data. In this workflow, it is used to normalize organization names and assign country names and country codes to OpenAIRE organizations that expose a ROR identifier.
The OpenCitations Meta dump contains bibliographic metadata and persistent identifiers for publications indexed by OpenCitations. It is distributed as a compressed tar archive containing CSV files. In this workflow, it is streamed to resolve OMIDs from the IRIS citation pairs into DOI, PMID, ISBN and publication date values.
We download the version-specific compressed archive but do not extract it, because the extracted data is too large; instead, we stream its content piece by piece using standard Python tools.
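As an illustration of this streaming approach, here is a minimal sketch that reads CSV rows from a tar.gz archive without extracting it. It is not the project's script: the function name stream_csv_rows is hypothetical, and we assume the archive members are UTF-8 CSV files with a header row.

```python
import csv
import tarfile


def stream_csv_rows(archive_path):
    """Yield (member_name, row_dict) for every CSV file in a .tar.gz archive.

    Mode "r|gz" reads the archive as a forward-only stream, so nothing is
    extracted to disk and members are visited one at a time, in order.
    """
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile() or not member.name.endswith(".csv"):
                continue
            raw = archive.extractfile(member)
            if raw is None:
                continue
            # Decode line by line; csv handles quoted fields that span lines.
            lines = (line.decode("utf-8") for line in raw)
            for row in csv.DictReader(lines):
                yield member.name, row
```

Because the stream is forward-only, each member has to be consumed before moving on to the next one, which is what the nested loops above do.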
Finally, the OpenAIRE Graph dump is the largest of the four, but we only need its publication, organization and relation records. These are distributed as multiple tar archives: 14 publication dumps, 14 relation dumps and a single organization dump. Publication records are used to match DOI/PMID values to OpenAIRE publication IDs; relation records are used to extract author-affiliation links between publications and organizations; organization records provide OpenAIRE organization identifiers and metadata.
The data processing workflow requires a recent version of Python and one external library for streaming JSON files that are too large to load into memory.
We can simply use requirements.txt to install the dependencies for this project.
Command
Install dependencies
pip install -r requirements.txt
Map IRIS citation pairs to OpenCitations PIDs
The first step is enriching the IRIS index file (iris_in_oc_index.csv) with the PIDs of each publication using the script build_iris_oc_pids.py.
The script performs three phases:
It reads all six IRIS `iris_in_oc_index.csv` files and collects every citing and cited OMID appearing in the IRIS citation pairs.
It streams the OpenCitations Meta tar.gz dump directly, without extracting the full archive to disk, and extracts DOI, PMID, ISBN and publication date only for the OMIDs collected in phase 1.
It re-reads the IRIS citation-pair CSVs and writes enriched per-university rows containing the citation direction and the PIDs for both sides of each citation pair.
Citation direction is derived from the IRIS membership flags in the input index:
incoming: the cited publication is in the university's IRIS set and the citing publication is external.
outgoing: the citing publication is in the university's IRIS set and the cited publication is external.
internal: both sides of the citation are in the same university's IRIS set.
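The three direction rules can be sketched as a small function. This is only an illustration: the function name and the two boolean arguments are hypothetical stand-ins for the IRIS membership flags in the index file, whose actual column names are not given here.

```python
from typing import Optional


def citation_direction(citing_in_iris: bool, cited_in_iris: bool) -> Optional[str]:
    """Classify one citation pair from the university's point of view."""
    if citing_in_iris and cited_in_iris:
        return "internal"   # both sides belong to the same IRIS set
    if cited_in_iris:
        return "incoming"   # an external publication cites an IRIS one
    if citing_in_iris:
        return "outgoing"   # an IRIS publication cites an external one
    return None             # neither side is in the IRIS set (assumed not to occur)
```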
The script also creates a deduplicated CSV index of unique PIDs. Publications that share at least one PID are grouped together so that the OpenAIRE lookup in the next stages does not repeat work unnecessarily.
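One way to group publications that share at least one PID is a union-find (disjoint-set) structure. The sketch below is an assumption about how such grouping could be done, not a description of build_iris_oc_pids.py; the function name and the input shape (a dict from OMID to a list of PID strings) are hypothetical.

```python
def group_by_shared_pid(records):
    """Group OMIDs that are linked, directly or transitively, by a shared PID.

    records: dict mapping each OMID to an iterable of PID strings.
    Returns a list of sets of OMIDs.
    """
    parent = {omid: omid for omid in records}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # PID -> first OMID seen with that PID
    for omid, pids in records.items():
        for pid in pids:
            if pid in first_owner:
                root_a, root_b = find(first_owner[pid]), find(omid)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_owner[pid] = omid

    groups = {}
    for omid in records:
        groups.setdefault(find(omid), set()).add(omid)
    return list(groups.values())
```

Grouping is transitive: if A and B share a DOI and B and C share a PMID, all three end up in the same group.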
The following files are produced and written in data/iris_oc_pids/:
<university>/iris_oc_pids.csv: enriched citation-pair table with OCI, direction, citing OMID/PIDs/date and cited OMID/PIDs/date.
<university>/iris_oc_pids.missing.csv: citation pairs where one or both OMIDs could not be resolved in the OpenCitations dump.
<university>/iris_oc_pids.metadata.json: runtime and row-count metadata for that university.
unique_pids.csv: deduplicated list of OMID, DOI, PMID and ISBN groups across all universities.
unique_pids.metadata.json: metadata for the unique PID generation step.
Map OpenAIRE organizations to ROR and countries
We then create a mapping from OpenAIRE organizations to ROR organizations, whose records also contain country data.
The script match_organizations_countries.py first loads the local ROR dump and builds an index from ROR identifier to display name and country. It then streams the OpenAIRE organization.tar dump and creates a lookup from OpenAIRE organization ID to a normalized organization record.
Country resolution follows this order:
If the OpenAIRE organization has a ROR PID and that ROR ID is present in the local ROR dump, the country and display name are taken from ROR.
If no ROR PID is available, the script falls back to OpenAIRE's own country field, when present.
Organizations without a usable ROR match and without an OpenAIRE country are excluded from the output lookup.
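The resolution order can be sketched as follows. The field names (ror_id, legal_name, country_code) and the function name are hypothetical, and the sketch assumes that the OpenAIRE fallback also applies when a ROR PID is present but missing from the local ROR dump.

```python
from typing import Optional


def resolve_organization(org: dict, ror_index: dict) -> Optional[dict]:
    """Return a normalized record, or None if no country can be resolved."""
    ror_id = org.get("ror_id")
    ror_record = ror_index.get(ror_id) if ror_id else None
    if ror_record is not None:
        # 1. Usable ROR match: name and country come from ROR.
        return {
            "name": ror_record["name"],
            "country_code": ror_record["country_code"],
            "country_source": "ror",
            "ror_id": ror_id,
        }
    if org.get("country_code"):
        # 2. Fallback: OpenAIRE's own country field.
        return {
            "name": org["legal_name"],
            "country_code": org["country_code"],
            "country_source": "openaire",
            "ror_id": ror_id,
        }
    # 3. No usable ROR match and no OpenAIRE country: excluded from the lookup.
    return None
```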
The following files are produced and written in data/openaire_ror_countries/:
openaire_ror_countries.json: mapping from OpenAIRE organization ID to legal/display name, country name, country code, country source and ROR ID.
openaire_ror_countries.metadata.json: counts for ROR matches, OpenAIRE-country fallbacks, missing records and output size.
Resolve publication PIDs to OpenAIRE organizations
We can now map publications to their authors' affiliations, using the script resolve_pids_organizations.py.
This is the most computationally intensive stage. It performs a streaming scan of the OpenAIRE publication and relation dumps.
The script performs four phases:
It reads data/iris_oc_pids/unique_pids.csv, normalizes DOI and PMID values, and builds lookup indexes. Rows without DOI or PMID are written to a missing file because they are not searchable in the OpenAIRE publication dump.
It streams all publication_*.tar files and matches OpenAIRE publication records to the needed publications by DOI first and PMID second (ISBNs are not used by OpenAIRE).
It streams all relation_*.tar files and keeps only author-affiliation relations, specifically hasAuthorInstitution and isAuthorInstitutionOf, where the publication endpoint is one of the matched OpenAIRE publications.
It resolves the collected OpenAIRE organization IDs through openaire_ror_countries.json from step 2, attaching organization names, ROR IDs and countries.
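The lookup in the first two phases can be sketched as below. The normalization rules (lowercasing, stripping resolver prefixes) and all names are assumptions made for illustration; the actual rules in resolve_pids_organizations.py may differ.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(value):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    doi = value.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def normalize_pmid(value):
    """Strip whitespace and an optional 'pmid:' prefix."""
    pmid = value.strip().lower()
    return pmid[len("pmid:"):] if pmid.startswith("pmid:") else pmid


def match_publication(record_pids, doi_index, pmid_index):
    """Return the PID-group key for a publication record, or None.

    record_pids: list of (scheme, value) pairs from one publication record.
    All DOIs are tried before any PMID, mirroring the DOI-first order.
    """
    attempts = (("doi", normalize_doi, doi_index), ("pmid", normalize_pmid, pmid_index))
    for scheme, normalize, index in attempts:
        for pid_scheme, value in record_pids:
            if pid_scheme == scheme:
                key = index.get(normalize(value))
                if key is not None:
                    return key
    return None
```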
Because this stage can take several hours, it writes checkpoint files after each publication and relation tar file has been processed.
On a successful run the checkpoints are removed. If the process is interrupted, rerunning the script resumes from the checkpointed state.
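The checkpoint-and-resume behaviour can be sketched like this. The checkpoint layout (a single JSON file listing finished archives and accumulated results) and the names used are assumptions for illustration, not the project's actual checkpoint format.

```python
import json
import os


def process_with_checkpoint(archives, handle_archive, checkpoint_path):
    """Process archives in order, saving progress after each one.

    handle_archive(name) must return a dict of results for that archive.
    If a checkpoint exists, archives already listed in it are skipped.
    """
    state = {"done": [], "results": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            state = json.load(fh)

    done = set(state["done"])
    for archive in archives:
        if archive in done:
            continue
        state["results"].update(handle_archive(archive))
        state["done"].append(archive)
        # Write to a temporary file first so an interruption during the
        # write cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, checkpoint_path)

    # Successful run: the checkpoint is no longer needed.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return state["results"]
```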
The following files are produced under data/iris_openaire_organizations/:
omid_organizations.json: OMID-keyed mapping containing publication PIDs, the matched OpenAIRE publication ID and the resolved affiliated organizations.
missing_no_searchable_pid.csv: PID groups from unique_pids.csv that have no DOI or PMID and therefore cannot be searched in OpenAIRE.
Count incoming and outgoing citations by organization and country
The final script count_citations.py performs the counting.
This step creates the main tabular outputs used for analysis and visualization. It reads the per-university iris_oc_pids.csv files from step 1 and streams omid_organizations.json from step 3 using an incremental JSON parser.
For each university, the script determines which external publication OMIDs contribute to incoming and outgoing citation counts:
For an incoming row, the citing publication's organizations are counted as incoming sources for the university.
For an outgoing row, the cited publication's organizations are counted as outgoing targets for the university.
For an internal row, the cited side contributes to incoming counts and the citing side contributes to outgoing counts.
Organization lists are deduplicated per publication, preferring ROR IDs when available and otherwise merging by case-insensitive organization name plus country code. Final organization counters are also merged by ROR and then by name/country to reduce duplicate organization rows.
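A minimal sketch of the per-publication deduplication, assuming each organization is a dict with hypothetical keys name, ror_id and country_code. It does not cover the final merge step, where a record without a ROR ID may still be folded into a ROR-identified record with the same name and country.

```python
def organization_key(org):
    """Prefer the ROR ID; otherwise use case-insensitive name plus country code."""
    if org.get("ror_id"):
        return ("ror", org["ror_id"])
    return ("name", org["name"].casefold(), org.get("country_code"))


def deduplicate_organizations(orgs):
    """Keep the first record seen for each organization key, in input order."""
    unique = {}
    for org in orgs:
        unique.setdefault(organization_key(org), org)
    return list(unique.values())
```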
The following files are produced under data/citation_counts/:
<university>/citation_counts_organizations_incoming.csv: incoming citation counts by citing organization.
<university>/citation_counts_organizations_outgoing.csv: outgoing citation counts by cited organization.
<university>/citation_counts_countries_incoming.csv: incoming citation counts aggregated by citing country.
<university>/citation_counts_countries_outgoing.csv: outgoing citation counts aggregated by cited country.
The analysis uses integer counting, which assigns a full weight of one to every unique research organization or country involved in a citation event. We chose this approach for transparency and reproducibility: it avoids the fractional values, and the interpretational complexity, that result from splitting a citation among its participants. It also suits the OpenAIRE dataset, where records can be fragmented and some publications carry very long affiliation lists. Under fractional counting, each organization on such a publication would receive a near-zero share and effectively disappear from the results. With integer weights, every institutional and geographic connection remains visible in the output tables.
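The difference between the two counting models can be shown with a small sketch. Both function names are hypothetical, and fractional_counts is included only for contrast; it is not part of the workflow.

```python
from collections import Counter


def integer_counts(citations):
    """Each unique organization in a citation event receives a weight of 1."""
    counts = Counter()
    for orgs in citations:
        for org in set(orgs):
            counts[org] += 1
    return counts


def fractional_counts(citations):
    """For contrast: each citation event's weight of 1 is split equally."""
    counts = Counter()
    for orgs in citations:
        unique = set(orgs)
        for org in unique:
            counts[org] += 1 / len(unique)
    return counts
```

For three citation events with organization lists [A, B], [A] and [A, B, C, D], integer counting gives A=3, B=2, C=1, D=1, while fractional counting gives A=1.75, B=0.75, C=0.25, D=0.25.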
Produce data visualizations
With the final dataset of 24 CSV files (6 universities × 4 tables each: incoming and outgoing counts, by organization and by country), we can now visualize the data.
Data loading and setup
First, we developed the data_utils.py script to load, normalize, and clean the datasets. The processed datasets are then validated using validation.py, exported, and reused across all visualizations. For each type of visualization, we create a separate version for each Italian institution and, when relevant, for each citation direction (incoming and outgoing).
Visualizations were developed using the following Python libraries:
pandas for data loading, cleaning and manipulation
plotly for creating interactive charts and visualizations
This notebook explores citation patterns at the country level. The analysis begins from a macro perspective: a stacked bar chart shows the geographic distribution of both incoming and outgoing citations. To uncover institution-specific behaviors that may be hidden within aggregate patterns, several complementary visualizations were developed.
The data are explored to highlight relationships between institutions and countries, as well as similarities and differences between incoming and outgoing citation flows. In particular, an asymmetry map is introduced to investigate whether the six Italian institutions act primarily as citation receivers or citation consumers. Additionally, a heatmap is used to assess how the behavior of each institution diverges from the overall average, revealing differences that may be obscured by the shared structural patterns.
This notebook analyzes citation flows at the organizational level. It includes diverging bar charts to compare incoming and outgoing citation patterns among the top 15 partner organizations. Further analyses, particularly through scatter plots, investigate the reciprocity of citation relationships between institutions and their partners.
The notebook also provides a concrete example of how including or excluding Italian organizations affects the interpretation and visualization of the data.
Country and organization citation analysis - countries_organisations_analysis.ipynb
This notebook combines the country-level and organization-level perspectives to explore deeper connections between them. The analysis begins by measuring the concentration ratio of the top three organizations within each country. The main objective is to understand how citations are distributed across partner countries and their associated institutions.
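As a sketch of the concentration measure, we assume here that the ratio is the share of a country's citations held by its three most-cited organizations; the notebook may define it differently. The function name and the input shape are hypothetical, and plain Python is used instead of pandas to keep the example dependency-free.

```python
from collections import defaultdict


def top3_concentration(rows):
    """Compute, per country, the citation share of its top three organizations.

    rows: iterable of (country_code, organization, citation_count) tuples.
    Returns a dict mapping country_code to a ratio between 0 and 1.
    """
    per_country = defaultdict(list)
    for country, _organization, count in rows:
        per_country[country].append(count)

    ratios = {}
    for country, counts in per_country.items():
        total = sum(counts)
        if total > 0:
            top_three = sorted(counts, reverse=True)[:3]
            ratios[country] = sum(top_three) / total
    return ratios
```

A ratio close to 1 means that a country's citations come almost entirely from a few organizations; a lower ratio means they are spread across many.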
The notebook also incorporates a temporal dimension by extending the scatter plot analysis to yearly datasets, examining the relationship between citation concentration and citation volume over time. Finally, a sunburst chart is included to visualize the top 50 organizations within the top 20 countries contributing to the citation volume of each institution.
All notebooks are available in the GitHub repository. In addition, a complementary website has been developed featuring interactive visualizations and a narrative framework that presents the analysis and key findings.