BLOOM Disciplinary Flow Protocol v.3

Regina Manyara; Yi Hua Li; Yiğit Ak; Ilaria De Dominicis; Shiho Nakamura; Tianchi Yang; Qinghao Chen

Jun 15, 2026

Version 3


DOI

https://dx.doi.org/10.17504/protocols.io.dm6gp7nw1gzp/v3

Regina Manyara¹,
Yi Hua Li¹,
Yiğit Ak¹,
Ilaria De Dominicis¹,
Shiho Nakamura¹,
Tianchi Yang¹,
Qinghao Chen¹

¹University of Bologna

External link: https://github.com/open-sci/2025-2026/tree/1b

Protocol Citation: Regina Manyara, Yi Hua Li, Yiğit Ak, Ilaria De Dominicis, Shiho Nakamura, Tianchi Yang, Qinghao Chen 2026. BLOOM Disciplinary Flow Protocol v.3. protocols.io https://dx.doi.org/10.17504/protocols.io.dm6gp7nw1gzp/v3

Version created by Yigit Ak

License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Protocol status: Working

We use this protocol and it's working

Created: May 30, 2026

Last Modified: June 15, 2026

Protocol Integer ID: 318235

Keywords: Open Science, Bibliometrics, Citation Analysis, Digital Humanities, Open Citations, institutional citation dataset, reusing institutional citation dataset, raw citation link, raw citation links with journal issn, citation link, knowledge flow across different academic discipline, network visualization, journal issn, standard disciplinary classification, visualization, different academic discipline, tagged dataset, specific institution, dataset, bloom project, institution, knowledge flow, bloom disciplinary flow protocol, part of the bloom project

Abstract

As part of the BLOOM project, this protocol outlines a data processing workflow for analyzing cross-disciplinary citation patterns. By reusing institutional citation datasets (IRIS) and querying the OpenCitations infrastructure, we enrich raw citation links with journal ISSNs and map them to standard disciplinary classifications. The expected result is a cleaned, discipline-tagged dataset that can be directly used to generate network visualizations. These visualizations will reveal the knowledge flow across different academic disciplines for specific institutions.

Materials

Dataset:

IRIS to OpenCitations Mappings: This dataset contains the scholarly information available in both IRIS (six institutions in Italy) and OpenCitations. It is divided into six folders, one per institution, each with two files: iris_oc_meta, which contains the scholarly metadata, and iris_oc_index, which contains the citation information. From https://zenodo.org/records/18202530.

Scimago: This dataset is downloaded directly from https://www.scimagojr.com/journalrank.php. It lists the metadata of all journals in the SCImago Journal Rank, ranked by SJR (a measure of a journal's impact, influence or prestige). A copy is available on our GitHub: https://github.com/open-sci/2025-2026/blob/1b/bloom/disciplinary_flow/journal_data/.scimago.csv.icloud.

DOAJ: The Journal CSV provides a full list of all public journals in the DOAJ database, available on our GitHub: https://github.com/open-sci/2025-2026/blob/1b/bloom/disciplinary_flow/journal_data/.doaj.csv.icloud

Alignment with LOC Categories: a pre-existing SKOS alignment dataset that maps multiple classification systems to the Library of Congress scheme: https://github.com/open-sci/2025-2026/tree/main/bloom/categories.

Troubleshooting

Problem

The dataset is too big to send all rows in a single API request

Solution

Download the OpenCitations data dump instead of using the API

Before start

Before starting, please check the list of requirements from our GitHub repository: https://github.com/open-sci/2025-2026/blob/main/requirements.txt

Data Gathering and Cleaning

Preparation of Existing Dataset 

This workflow builds upon the IRIS-OC mapping dataset produced by Andreose et al. (2026), publicly available on Zenodo at https://doi.org/10.5281/zenodo.18202530. The dataset contains one folder per institution, each including two files: “iris_oc_meta”, which provides bibliographic metadata for publications recognized in both IRIS and OpenCitations, and “iris_oc_index”, which contains the citation relationships between those publications.

For this workflow, only “iris_oc_index” was used as input, as it contains the citation pairs required for disciplinary flow analysis. 

As a preliminary step, a flow column was added to each “iris_oc_index” file to classify each citation as Incoming, Outgoing, or Internal, defined from the perspective of the IRIS institution and based on the existing “is_citing_iris” and “is_cited_iris” boolean flags. This classification was introduced to make subsequent analysis and visualization more straightforward, and is preserved throughout all subsequent processing steps.

Step 1: Retrieve venue data to expand IRIS_OC_Index 

We initially intended to approach this by calling the OC Index API endpoint, but the number of IDs we needed to pass made the request too long. To address this, we instead downloaded a copy of the OpenCitations Meta data dump (created on 2025-06-06) and wrote a script that reads through it without unpacking the whole dataset.

Dataset: META
Link: https://opencitations.net/download#meta

The script uses Python's tarfile library to read through the data dump without extracting all of its content.
The script runs on each institution's file in turn, using the following workflow:

Import iris_oc_index from the original IRIS_OC mapping dataset as a dataframe
Before expanding the data with the venue information from the data dump, we added an intermediary step that adds a "flow" column with the values Outgoing, Incoming and Internal. This was intended to make the subsequent analysis and visualisation of the data more straightforward.
The citation flow is defined from the perspective of the IRIS institution, summarising the "is_citing_iris" and "is_cited_iris" columns into a single column.

def classify_flow(row):
    if row["is_citing_iris"] and row["is_cited_iris"]:
        return "Internal"
    elif row["is_citing_iris"] and not row["is_cited_iris"]:
        return "Outgoing"
    elif not row["is_citing_iris"] and row["is_cited_iris"]:
        return "Incoming"

iris_oc_index["flow"] = iris_oc_index.apply(classify_flow, axis=1)

Get unique OMIDs to look up in the OC data dump
Creates a set of all unique omid values from both the citing and cited columns in the input dataset.

citing_omids_set = set(iris_oc_index["citing"].unique())
cited_omids_set = set(iris_oc_index["cited"].unique())
all_omids_needed = citing_omids_set | cited_omids_set

Extract venue data directly from tar.gz data dump
For each CSV in the compressed archive, the script converts it to a JSON file
It then extracts the OMID from the larger ID string, which contains multiple PIDs
Once the ID matches the format of the citing and cited columns of the input dataset, the OMID is used as a key to locate the corresponding venue data of that entity
The resulting mapping is saved to a dictionary
A separate dictionary stores the OMIDs that have no venue data
The full script for this part can be found in the GitHub repository
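
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's script (which is in the GitHub repository): it assumes the dump is a .tar.gz of CSV files with "id" and "venue" columns, that OMIDs look like omid:br/ followed by digits, and that this form matches the citing and cited columns. The function name extract_venues is hypothetical, and the sketch reads CSV rows directly rather than converting them to JSON.

```python
import csv
import io
import re
import tarfile

# Assumed OMID shape inside the space-separated "id" field of the dump
OMID_PATTERN = re.compile(r"omid:br/\d+")


def extract_venues(dump_path, omids_needed):
    """Stream a .tar.gz of CSV files and collect venues for the requested OMIDs."""
    venue_by_omid = {}
    missing_venue = set()

    with tarfile.open(dump_path, "r:gz") as archive:
        for member in archive:
            # Skip directories and non-CSV members
            if not (member.isfile() and member.name.endswith(".csv")):
                continue

            # Read the member as a stream; nothing is written to disk
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")

            for row in csv.DictReader(text):
                match = OMID_PATTERN.search(row.get("id", ""))
                if match is None or match.group(0) not in omids_needed:
                    continue

                omid = match.group(0)
                venue = (row.get("venue") or "").strip()
                if venue:
                    venue_by_omid[omid] = venue
                else:
                    missing_venue.add(omid)

    return venue_by_omid, missing_venue
```

Keeping the OMIDs without venue data in a separate collection makes it possible to report how many records could not be expanded.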

Use omid-venue mapping to expand original dataset

Once all corresponding venues had been found, we used the mapping to create two new columns in the input dataset: "citing_venue" and "cited_venue". The expanded dataset was then saved to a separate CSV file.
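
As a rough illustration of this expansion step, the sketch below maps an OMID-to-venue dictionary onto the citing and cited columns with pandas. The function name add_venue_columns and the sample values are hypothetical; only the column names come from the protocol.

```python
import pandas as pd


def add_venue_columns(index_df, venue_by_omid):
    """Return a copy of the index with citing_venue and cited_venue columns."""
    expanded = index_df.copy()
    # Series.map looks each OMID up in the dictionary; unknown OMIDs become NaN
    expanded["citing_venue"] = expanded["citing"].map(venue_by_omid)
    expanded["cited_venue"] = expanded["cited"].map(venue_by_omid)
    return expanded


# Toy data, for illustration only
sample_index = pd.DataFrame({
    "citing": ["omid:br/1", "omid:br/2"],
    "cited": ["omid:br/2", "omid:br/3"],
})
sample_venues = {"omid:br/1": "Journal A", "omid:br/2": "Journal B"}
expanded = add_venue_columns(sample_index, sample_venues)
```

Rows whose OMID has no venue keep an empty value, so they can be filtered out or audited later.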

Step 2: Use venue PIDs (e.g. ISSN) to extract subject info from external data dumps (DOAJ and Scimago)

This is the disciplinary enrichment section. 

The script takes the iris_oc_venues_matched.csv file produced by the previous script and maps each venue to a scholarly discipline by matching ISSNs.

The script imports pandas and Python's re (regular expression) module.

All file paths are declared at the top, in the configuration section. This makes it quick to reuse the script for a different institution: only the path strings need to change, with no need to search through the code.

import pandas as pd
import re


# ==================================================
# CONFIGURATION
# ==================================================

#Update your paths

INPUT_FILE = r"D:\Downloads\Open Science\data\UNITO\iris_oc_venues_matched.csv"
SCIMAGO_FILE = r"D:\Downloads\Open Science\journal_data\scimago.csv"
DOAJ_FILE = r"D:\Downloads\Open Science\journal_data\doaj.csv"
OUTPUT_FILE = r"D:\Downloads\Open Science\data\UNITO\disciplinary_map_matched.csv"
NO_MATCH_FILE = r"D:\Downloads\Open Science\data\UNITO\disciplinary_map_no_match.csv"
NO_ISSN_FILE = r"D:\Downloads\Open Science\data\UNITO\disciplinary_map_no_issn.csv"

Extraction of ISSN from Venue Metadata

ISSNs are standardized, stable and unique, so we use them as persistent identifiers (PIDs) for journals.

The venue field contains all of its information in a single string. Since the ISSN has a fixed format (four digits, a hyphen, three digits, and a final digit or X), we used a regular expression to extract the ISSNs from the venue string.

Each match was normalized by removing the hyphen and uppercasing the result.
This normalization is critical because the lookup dictionaries use the same format. If one side has a hyphen and the other does not, a valid ISSN would fail to match.

The function returns a list rather than a single string because a journal can have two ISSNs (print and electronic), and both need to be tried during lookup.

def extract_issns(text):
    """
    Extract all ISSNs from venue metadata.
    Returns a list of normalized ISSNs.
    Example:
        1234-5678 -> 12345678
    """

    if pd.isna(text):
        return []

    # IGNORECASE also catches a lowercase "x" check character, uppercased below
    matches = re.findall(r"\d{4}-\d{3}[\dX]", str(text), flags=re.IGNORECASE)

    cleaned = []

    for match in matches:
        cleaned.append(
            match.replace("-", "").upper()
        )

    return cleaned

Two new fields, citing_issns and cited_issns, are created.

The function below loads the datasets.

The SCImago CSV is semicolon-separated rather than comma-separated, so reading it with the default comma separator would produce broken columns. SCImago also contains accented characters in journal titles and country names (e.g. Université), so the file is read with the Latin-1 encoding.

def load_datasets():

    print("Step 1: Loading datasets...")

    try:
        main_df = pd.read_csv(INPUT_FILE)

        # SCImago files are generally semicolon-separated
        scimago = pd.read_csv(
            SCIMAGO_FILE,
            sep=";",
            encoding="latin1"
        )

        doaj = pd.read_csv(DOAJ_FILE)

        print("✓ Files loaded successfully.")

        return main_df, scimago, doaj

    except Exception as e:

        print(f"✗ Loading Error: {e}")

        return None, None, None

OpenCitations provides citation relationships and venue metadata, but it does not directly provide subject classifications or disciplinary categories.

To gather disciplinary information about journals, we relied on the SCImago Journal Rank (https://www.scimagojr.com/) and the Directory of Open Access Journals (https://doaj.org/docs/journal-csv).

Dataset: SCImago Journal Rank
Link: https://www.scimagojr.com

Dataset: Directory of Open Access Journals
Link: https://doaj.org/docs/journal-csv

We collected both data dumps and performed two processing steps for cleaning the ISSNs.

DOAJ stores print and electronic ISSNs in two separate columns and many journals have both. If we only kept the print ISSN in the lookup dictionary, every record where the electronic ISSN was used would fail to match. The coverage of the matching step was increased by normalizing both issn_p and issn_e.

def clean_doaj(doaj):

    print("Step 2: Cleaning DOAJ ISSNs...")

    doaj["issn_p"] = (
        doaj["Journal ISSN (print version)"]
        .fillna("")
        .astype(str)
        .str.replace("-", "", regex=False)
        .str.upper()
    )

    doaj["issn_e"] = (
        doaj["Journal EISSN (online version)"]
        .fillna("")
        .astype(str)
        .str.replace("-", "", regex=False)
        .str.upper()
    )

    return doaj

SCImago stores multiple ISSNs per journal as a single comma-separated string in one cell. If you use the whole string as a dictionary key, it would never match anything because the incoming ISSN from the citation data is just a substring, not the full key.

We split the string and then used .explode() which takes each element of the list and creates a separate row for it while duplicating all other columns. One journal with three ISSNs becomes three rows, each with a single ISSN as its key. This makes every individual ISSN independently usable for lookup.

def clean_scimago(scimago):

    print("Step 3: Cleaning SCImago ISSNs...")

    scimago["Issn"] = (
        scimago["Issn"]
        .fillna("")
        .astype(str)
        .str.upper()
    )

    scimago["issn_list"] = (
        scimago["Issn"]
        .str.replace("-", "", regex=False)
        .str.split(",")
    )

    # remove whitespace
    cleaned_lists = []

    for items in scimago["issn_list"]:

        cleaned = []

        for item in items:
            cleaned.append(item.strip())

        cleaned_lists.append(cleaned)

    scimago["issn_list"] = cleaned_lists

    return scimago.explode("issn_list")

We created four new fields for each side of a citation (citing and cited): "scimago_area" and "scimago_category" from the SCImago data dump, and "doaj_lcc" and "doaj_subject" from the DOAJ data dump.

Keeping all four rather than collapsing them into one is important for coverage and flexibility.

We used the pandas library to perform the mapping. The citing_issns and cited_issns fields were matched against the collected journal datasets.
Software: pandas
Repository: https://github.com/pandas-dev/pandas
Source link: https://pypi.org/project/pandas/
Lookup maps were constructed to support efficient ISSN-based metadata enrichment across external datasets.

From the SCImago dataset, ISSN identifiers were paired with corresponding disciplinary Areas and Categories metadata. Separate lookup dictionaries were created for each metadata field in order to enable journal-level disciplinary classification.

From the DOAJ dataset, both print ISSNs and electronic ISSNs were mapped to LCC Codes and Subjects metadata. Print and electronic ISSN mappings were then combined into unified lookup structures to maximize journal matching coverage across datasets.

These lookup maps enabled rapid ISSN-to-metadata retrieval during the enrichment process and reduced the computational cost of repeated dataset joins or merges on large-scale citation data.

# ==================================================
# BUILD LOOKUP MAPS
# ==================================================

def build_lookup_maps(scimago, doaj):

    print("Step 4: Building lookup maps...")

    # ---------- SCImago ----------
    s_area_map = dict(
        zip(scimago["issn_list"], scimago["Areas"])
    )

    s_category_map = dict(
        zip(scimago["issn_list"], scimago["Categories"])
    )

    # ---------- DOAJ ----------
    doaj_area_map = dict(
        zip(doaj["issn_p"], doaj["LCC Codes"])
    )

    doaj_area_map.update(
        dict(zip(doaj["issn_e"], doaj["LCC Codes"]))
    )

    doaj_subject_map = dict(
        zip(doaj["issn_p"], doaj["Subjects"])
    )

    doaj_subject_map.update(
        dict(zip(doaj["issn_e"], doaj["Subjects"]))
    )

    return (
        s_area_map,
        s_category_map,
        doaj_area_map,
        doaj_subject_map
    )


For each journal record, all extracted ISSN identifiers were checked in turn against the lookup dictionaries derived from the SCImago and DOAJ datasets. The enrichment functions collect every distinct metadata value found for the journal's ISSNs and join them into a single string (using ";" for SCImago fields and " | " for DOAJ fields), so both print and electronic identifiers can contribute to a successful match.

An enrichment pipeline was then applied across citation records to systematically populate disciplinary metadata fields, including subject classifications, disciplinary areas, and journal categories.

# ==================================================
#  ENRICHMENT FUNCTIONS
# ==================================================
def get_from_map_scimago(issn_list, target_map):
    """
    All-match for SCImago area and category fields.
    Joins with ';' -- the separator resolve_scimago splits on.
    """
    if not isinstance(issn_list, list) or not issn_list:
        return None
    seen = []
    for issn in issn_list:
        result = target_map.get(issn)
        if result is not None and not pd.isna(result) and result not in seen:
            seen.append(result)
    return ";".join(seen) if seen else None


def get_from_map_doaj(issn_list, target_map):
    """
    All-match for DOAJ subject and LCC fields.
    Joins with ' | ' -- the separator resolve_doaj splits on.
    """
    if not isinstance(issn_list, list) or not issn_list:
        return None
    seen = []
    for issn in issn_list:
        result = target_map.get(issn)
        if result is not None and not pd.isna(result) and result not in seen:
            seen.append(result)
    return " | ".join(seen) if seen else None

# ==================================================
# APPLY ENRICHMENT
# ==================================================
def enrich_column(series, map_func, *maps):

    results = []

    for value in series:

        result = map_func(value, *maps)

        results.append(result)

    return results

This is the main pipeline where all the created functions were used.

# ==================================================
#  MAIN PIPELINE
# ==================================================
def run_enrichment():

    # Load datasets
    main_df, scimago, doaj = load_datasets()

    if main_df is None:
        return

    # Clean datasets
    doaj = clean_doaj(doaj)
    scimago = clean_scimago(scimago)

    # Extract ISSNs
    print("Step 5: Extracting ISSNs from venue metadata...")

    main_df["citing_issns"] = (
        main_df["citing_venue"]
        .apply(extract_issns)
    )

    main_df["cited_issns"] = (
        main_df["cited_venue"]
        .apply(extract_issns)
    )

    # ==================================================
    # SPLIT RECORDS WITH / WITHOUT ISSNs
    # ==================================================
    print("Step 5.1: Separating ISSN and non-ISSN records...")

    initial_issn_count = len(main_df)

    # ---------- RECORDS WITH ISSNs ----------
    with_issn_df = main_df[
        (
            main_df["citing_issns"].apply(len) > 0
        ) &
        (
            main_df["cited_issns"].apply(len) > 0
        )
    ]

    # ---------- RECORDS WITHOUT ISSNs ----------
    without_issn_df = main_df[
        ~(
            (
                main_df["citing_issns"].apply(len) > 0
            ) &
            (
                main_df["cited_issns"].apply(len) > 0
            )
        )
    ]

    # Save no-ISSN records
    without_issn_df.to_csv(NO_ISSN_FILE, index=False)

    # Continue pipeline only with ISSN records
    main_df = with_issn_df

    filtered_issn_count = len(main_df)

    removed_issn_count = (
        initial_issn_count - filtered_issn_count
    )

    print(
        f"Records with ISSNs on both sides: "
        f"{filtered_issn_count}"
    )

    print(
        f"Records removed due to missing ISSNs: "
        f"{removed_issn_count}"
    )

    # Build maps
    (
        s_area_map,
        s_category_map,
        doaj_area_map,
        doaj_subject_map
    ) = build_lookup_maps(scimago, doaj)

    # Apply enrichment
    print("Step 6: Applying independent disciplinary enrichment...")

    # ---------- CITING COLUMNS ----------
    main_df["citing_scimago_area"] = enrich_column(
        main_df["citing_issns"], get_from_map_scimago, s_area_map
    )
    main_df["citing_scimago_category"] = enrich_column(
        main_df["citing_issns"], get_from_map_scimago, s_category_map
    )
    main_df["citing_doaj_lcc"] = enrich_column(
        main_df["citing_issns"], get_from_map_doaj, doaj_area_map
    )
    main_df["citing_doaj_subject"] = enrich_column(
        main_df["citing_issns"], get_from_map_doaj, doaj_subject_map
    )

    # ---------- CITED COLUMNS ----------
    main_df["cited_scimago_area"] = enrich_column(
        main_df["cited_issns"], get_from_map_scimago, s_area_map
    )
    main_df["cited_scimago_category"] = enrich_column(
        main_df["cited_issns"], get_from_map_scimago, s_category_map
    )
    main_df["cited_doaj_lcc"] = enrich_column(
        main_df["cited_issns"], get_from_map_doaj, doaj_area_map
    )
    main_df["cited_doaj_subject"] = enrich_column(
        main_df["cited_issns"], get_from_map_doaj, doaj_subject_map
    )

    # ==================================================
    # SPLIT MATCHED / UNMATCHED RECORDS
    # ==================================================
    print("Step 7: Separating matched and unmatched records...")

    initial_count = len(main_df)

    # ---------- MATCHED ----------
    matched_df = main_df[
        (
            main_df["citing_scimago_area"].notna()
        ) |
        (
            main_df["citing_scimago_category"].notna()
        ) |
        (
            main_df["citing_doaj_lcc"].notna()
        ) |
        (
            main_df["citing_doaj_subject"].notna()
        ) |
        (
            main_df["cited_scimago_area"].notna()
        ) |
        (
            main_df["cited_scimago_category"].notna()
        ) |
        (
            main_df["cited_doaj_lcc"].notna()
        ) |
        (
            main_df["cited_doaj_subject"].notna()
        )
    ]

    # ---------- UNMATCHED ----------
    unmatched_df = main_df[
        ~(
            (
                main_df["citing_scimago_area"].notna()
            ) |
            (
                main_df["citing_scimago_category"].notna()
            ) |
            (
                main_df["citing_doaj_lcc"].notna()
            ) |
            (
                main_df["citing_doaj_subject"].notna()
            ) |
            (
                main_df["cited_scimago_area"].notna()
            ) |
            (
                main_df["cited_scimago_category"].notna()
            ) |
            (
                main_df["cited_doaj_lcc"].notna()
            ) |
            (
                main_df["cited_doaj_subject"].notna()
            )
        )
    ]

    matched_count = len(matched_df)
    unmatched_count = len(unmatched_df)

    # ==================================================
    # SAVE OUTPUTS
    # ==================================================
    print("Step 8: Saving datasets...")

    matched_df.to_csv(OUTPUT_FILE, index=False)

    unmatched_df.to_csv(NO_MATCH_FILE, index=False)

    filtered_count = matched_count

    removed_count = initial_count - filtered_count

The dataset is transformed into a map of disciplinary interaction. Each citation record contains the disciplinary fields of both the citing and the cited venue (for example, "citing_scimago_area" and "cited_scimago_area") within the same row. This allows us to track cross-disciplinary connectivity.

SCImago and DOAJ have different coverage: a journal might be in DOAJ but not in SCImago, or vice versa. Records for which neither source returned a match on either side are saved to the file set in NO_MATCH_FILE.

Records that lack an ISSN on at least one side are saved to the separate file set in NO_ISSN_FILE. This file lets you audit exactly how many records were lost at this step and why, which matters for honest open science reporting.

The script below builds a report showing the citing and cited match rates.

# ==================================================
# FINAL REPORT & LOGGING
# ==================================================
    report_text = (
        "\n" + "=" * 60 + "\n"
        "PARALLEL DISCIPLINARY ENRICHMENT COMPLETE (v3 — all-match)\n" +
        "=" * 60 + "\n"

        f"Initial Records: {initial_issn_count}\n\n"

        f"Records With ISSNs On Both Sides: "
        f"{filtered_issn_count}\n"

        f"Records Without ISSNs: "
        f"{removed_issn_count}\n\n"

        f"Matched Records: {matched_count}\n"

        f"Unmatched Records: {unmatched_count}\n\n"

        f"Verification: No duplicates created during mapping.\n\n"

        f"Citing Match Rate (SCImago): "
        f"{matched_df['citing_scimago_area'].notna().mean():.2%}\n"

        f"Citing Match Rate (DOAJ): "
        f"{matched_df['citing_doaj_lcc'].notna().mean():.2%}\n\n"

        f"Cited Match Rate (SCImago): "
        f"{matched_df['cited_scimago_area'].notna().mean():.2%}\n"

        f"Cited Match Rate (DOAJ): "
        f"{matched_df['cited_doaj_lcc'].notna().mean():.2%}\n\n"

        f"Matched Dataset Saved: {OUTPUT_FILE}\n"

        f"Unmatched Dataset Saved: {NO_MATCH_FILE}\n"

        f"No ISSN Dataset Saved: {NO_ISSN_FILE}\n"

        + "=" * 60 + "\n"
    )

    print(report_text)

    with open(r"D:\Downloads\Open Science\data\UNITO\matching_summary.txt", "w", encoding="utf-8") as f:
        f.write(report_text)


# ==================================================
# RUN PIPELINE
# ==================================================
if __name__ == "__main__":
    run_enrichment()
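
To illustrate how the matched dataset written by the pipeline above can be read as a map of disciplinary interaction, the sketch below counts (citing area, cited area) pairs. It is a minimal example under assumptions: the helper names are hypothetical, the rows are toy data standing in for the matched CSV, and multi-valued cells are assumed to be joined with ";" as in the SCImago enrichment step.

```python
from collections import Counter


def split_values(cell, sep=";"):
    """Split a joined cell such as "Medicine; Nursing" into clean labels."""
    return [value.strip() for value in (cell or "").split(sep) if value.strip()]


def count_area_flows(rows):
    """Count (citing area, cited area) pairs over an iterable of row dicts."""
    flows = Counter()
    for row in rows:
        for citing in split_values(row.get("citing_scimago_area")):
            for cited in split_values(row.get("cited_scimago_area")):
                flows[(citing, cited)] += 1
    return flows


# Toy rows, for illustration only (not real results)
sample_rows = [
    {"citing_scimago_area": "Medicine; Nursing", "cited_scimago_area": "Medicine"},
    {"citing_scimago_area": "Medicine", "cited_scimago_area": "Medicine"},
    {"citing_scimago_area": "Arts and Humanities", "cited_scimago_area": ""},
]
flows = count_area_flows(sample_rows)
```

Note that in this sketch a citation whose venue belongs to several areas contributes one count per area pair; whether to weight such citations fractionally instead is a design choice left open here.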

Step 3: Use LOC classifications to align and standardise subjects across all iris_oc resources

At this stage, we needed external sources to retrieve disciplinary information about journals. Once again, as in the previous step, we relied on two datasets: the Directory of Open Access Journals (DOAJ) and the SCImago Journal Rank (SJR). The use of both sources was chosen to maximize coverage and reduce the number of journals without disciplinary classification, since the two datasets provide complementary indexing of journal subject areas.
Dataset: Scimago Journal Rank
Link: https://www.scimagojr.com/journalsearch.php?q=21100924372&tip=sid

Dataset: Directory of Open Access Journals
Link: https://doaj.org/docs/journal-csv

In order to investigate disciplinary flows, we adopted the Library of Congress Classification (LCC) system to standardize categories across all journals. At this stage, we relied on a pre-existing alignment produced by Nicole Liggeri as part of an internship at OpenCitations within the Digital Humanities Advanced Research Centre (/DH.arc), which provided the basis for linking Scopus subject areas to LOC classes.
Dataset: LOC Alignment
Link: https://github.com/open-sci/2025-2026/tree/main/bloom/categories
The alignment dataset is organised into three main components:

A. LoC + alignments (RDF/XML): a set of RDF files representing Library of Congress (LoC) classification concepts enriched with semantic alignments (e.g. skos:closeMatch, skos:narrowMatch) to categories from other classification systems.

B. Original categories (CSV): a collection of CSV files describing the hierarchical structure (broad and narrow categories) of the original classification systems (e.g. Scopus).

C. SKOS vocabularies of external systems (TTL): a set of SKOS vocabularies, expressed in Turtle format, modelling the classification systems of external sources as structured concept schemes, derived from the original CSV data.

We conducted an analysis of both the SCImago and DOAJ datasets to understand how their category classifications are structured. 

DOAJ: The categories in the "Subjects" column are already aligned with LOC. However, the formatting differs: in LOC the separator between category and subcategory is '--' (e.g., "Philosophy. Psychology. Religion--Philosophy (General)"), whereas DOAJ uses ':' (e.g., "Philosophy. Psychology. Religion: Philosophy (General)").
import csv

def doaj_cats(doaj_csv):
    # Print the "Subjects" value of the first rows to inspect the format
    with open(doaj_csv, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)

        for i, row in enumerate(reader):
            print(row["Subjects"])

            if i == 49:   # first 50 rows
                break

doaj_cats(doaj_csv)
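
The separator difference described above can be bridged with a small normalisation helper. The sketch below is only an illustration: the function name is hypothetical, it assumes that multiple subjects in one cell are separated by "|", and it assumes that ":" never occurs inside a single label.

```python
def doaj_subject_to_loc_labels(subjects):
    """Split a DOAJ "Subjects" cell and rewrite each path with '--' separators."""
    if not subjects:
        return []

    labels = []
    for path in subjects.split("|"):
        # Each path is a chain of levels, e.g. "Category: Subcategory"
        parts = [part.strip() for part in path.split(":") if part.strip()]
        if parts:
            labels.append("--".join(parts))
    return labels


example = doaj_subject_to_loc_labels(
    "Philosophy. Psychology. Religion: Philosophy (General)"
)
# example == ["Philosophy. Psychology. Religion--Philosophy (General)"]
```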


SCImago: We focused on the “Categories” and “Areas” columns. Categories include quartile information, which is likely not relevant for our purposes. 
import re

def clean_category(cat: str):
    # Remove the quartile suffix, e.g. "History (Q1)" -> "History"
    return re.sub(r"\s*\(Q\d\)", "", cat).strip()

Each SCImago journal can have one or more values in both fields; when multiple values are present, they are separated by ";" (e.g., "Medicine; Nursing"). In principle, the first category should correspond to the first area, the second to the second, and so on. We matched and printed all the categories and the corresponding areas.
However, this alignment is not always consistent, and mismatches do occur (e.g., "Economics, Econometrics and Finance (miscellaneous) (Q1); History (Q1)" vs. "Arts and Humanities; Economics, Econometrics and Finance").
def map_areas_categories(input_csv: str):
    rows = set()

    with open(input_csv, encoding="utf-8") as infile:
        reader = csv.DictReader(infile, delimiter=";")

        for row in reader:
            areas = [a.strip() for a in row.get("Areas", "").split(";") if a.strip()]
            categories = [clean_category(c) for c in row.get("Categories", "").split(";") if c.strip()]

            if not areas or not categories:
                continue

            if len(areas) == 1:
                for c in categories:
                    rows.add((areas[0], c))
            else:
                for area, category in zip(areas, categories):
                    rows.add((area, category))

    # alphabetical order to better compare with the file scopus.csv
    rows = sorted(rows, key=lambda x: x[0])

    for area, category in rows:
        print(f"{area};{category}")

map_areas_categories(scimago_csv)


By manually comparing the scopus.csv file on GitHub (which includes the columns “Broad” and “Narrow”) with the SCImago dataset, we observed that:

- Areas correspond to SCOPUS “Broad” concepts
- Categories correspond to SCOPUS “Narrow” concepts
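
Because positional pairing of areas and categories is not always reliable, one possible alternative is to resolve each category to its area by name, using the Broad and Narrow columns of scopus.csv. The sketch below is a hedged illustration: the helper names are hypothetical, the file is assumed to be comma-separated, and the two sample rows are a small stand-in for the real file.

```python
import csv
import io
import re


def load_narrow_to_broad(csv_file):
    """Map each Narrow label to the set of Broad labels it appears under."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        lookup.setdefault(row["Narrow"].strip(), set()).add(row["Broad"].strip())
    return lookup


def areas_for_categories(categories_cell, lookup):
    """Resolve each SCImago category to its area(s) by name, ignoring position."""
    resolved = {}
    for raw in categories_cell.split(";"):
        # Drop the quartile suffix such as " (Q1)" before the lookup
        category = re.sub(r"\s*\(Q\d\)", "", raw).strip()
        if category:
            resolved[category] = sorted(lookup.get(category, set()))
    return resolved


# Stand-in for scopus.csv (assumed layout: Broad,Narrow)
sample_scopus = io.StringIO(
    "Broad,Narrow\n"
    "Arts and Humanities,History\n"
    '"Economics, Econometrics and Finance",'
    '"Economics, Econometrics and Finance (miscellaneous)"\n'
)
lookup = load_narrow_to_broad(sample_scopus)
result = areas_for_categories(
    "Economics, Econometrics and Finance (miscellaneous) (Q1); History (Q1)",
    lookup,
)
```

With this approach the order of the values in the "Areas" column no longer matters; categories missing from the lookup come back with an empty list and can be reviewed manually.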

As noted above, the GitHub repository contains a set of RDF files, each representing a LOC class aligned with other vocabularies, including SCOPUS. In this dataset, the alignment between a LOC category (A) and another vocabulary category (B) is expressed using SKOS properties:

- skos:closeMatch if A is semantically similar to B;
- skos:narrowMatch if A contains B;
- skos:broadMatch if A is contained by B;
- skos:relatedMatch if A and B overlap, but only partially. 

JSON file for the alignment

After the preliminary analysis of the alignments between Library of Congress Classification (LCC) concepts and the Scopus taxonomy, we generated a consolidated JSON file (merged_loc_scopus.json) to facilitate further processing and inspection of the mappings.

The workflow consisted of three main phases:

1. Parsing the Scopus taxonomy

The Scopus taxonomy, provided as a SKOS Turtle file, was first parsed using RDFLib. For each Scopus concept, we extracted:
- the preferred label (skos:prefLabel);
- the broader concepts, interpreted as disciplinary areas;
- the narrower concepts, interpreted as categories belonging to the concept;
- the labels associated with both areas and categories.

The extraction relies on the SKOS hierarchical relation (skos:broader).

label = g.value(s, SKOS.prefLabel)

for o in g.objects(s, SKOS.broader):
    graph[s_str]["area"].add(str(o))

for o in g.subjects(SKOS.broader, s):
    graph[s_str]["categories"].add(str(o))
The resulting structure stores, for each Scopus URI, its label, related areas, categories, and their corresponding labels.

2. Parsing the LOC–Scopus alignment files
The alignment data consisted of multiple RDF/XML files, each describing mappings between Library of Congress concepts and Scopus concepts.
For every LOC concept, the script extracted:
- the LOC URI;
- the preferred label (skos:prefLabel);
- alternative labels (skos:altLabel and skosxl:altLabel);
- semantic alignment relations linking the LOC concept to Scopus concepts.

The following SKOS mapping predicates were considered:

skos:closeMatch
skos:narrowMatch
skos:broadMatch
skos:relatedMatch

For each mapping relation, the target Scopus URI and the relation type were stored:

for pred, match_type in [
    (SKOS.closeMatch, "closeMatch"),
    (SKOS.narrowMatch, "narrowMatch"),
    (SKOS.broadMatch, "broadMatch"),
    (SKOS.relatedMatch, "relatedMatch"),
]:
    for o in g.objects(s, pred):
        node["scopus_alignments"].append({
            "uri": str(create_safe_uri(str(o))),
            "type": match_type,
        })


This step produced a structured representation of all LOC concepts and their associated Scopus mappings.

3. Enrichment and merging

In the final phase, the alignment information extracted from the LOC files was merged with the Scopus taxonomy graph.
For each mapped Scopus URI, the script retrieved the corresponding information from the Scopus taxonomy and enriched the alignment with:
- Scopus preferred label;
- broader disciplinary areas;
- area labels;
- narrower categories;
- category labels.
info = scopus_graph.get(uri)

enriched.update({
    "label": info.get("label"),
    "area": info.get("area", []),
    "area_labels": info.get("area_labels", []),
    "categories": info.get("categories", []),
    "category_labels": info.get("category_labels", []),
})
If a mapped URI could not be found in the Scopus taxonomy, a warning was added to the record for quality-control purposes.
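
The merging step, including the quality-control warning, can be sketched as follows. This is a minimal illustration and not the project's script: the function name `enrich_alignment`, the `warning` key, and the example.org URIs are assumptions made for the example; only the keys passed to `update` come from the snippet above.

```python
def enrich_alignment(alignment: dict, scopus_graph: dict) -> dict:
    """Return a copy of one LOC-to-Scopus alignment enriched with Scopus data.

    `alignment` is assumed to look like {"uri": ..., "type": ...}.
    """
    enriched = dict(alignment)
    info = scopus_graph.get(alignment["uri"])
    if info is None:
        # Hypothetical warning field: flags a mapping whose target URI
        # is absent from the parsed Scopus taxonomy.
        enriched["warning"] = "URI not found in Scopus taxonomy"
        return enriched
    enriched.update({
        "label": info.get("label"),
        # Sets are converted to sorted lists so the record is JSON-serializable.
        "area": sorted(info.get("area", [])),
        "area_labels": sorted(info.get("area_labels", [])),
        "categories": sorted(info.get("categories", [])),
        "category_labels": sorted(info.get("category_labels", [])),
    })
    return enriched


# Toy taxonomy with a single concept (illustrative URIs, not real ones).
scopus_graph = {
    "http://example.org/scopus/2700": {
        "label": "Medicine",
        "area": set(),
        "area_labels": set(),
        "categories": {"http://example.org/scopus/2730"},
        "category_labels": {"Oncology"},
    }
}

found = enrich_alignment(
    {"uri": "http://example.org/scopus/2700", "type": "closeMatch"}, scopus_graph
)
missing = enrich_alignment(
    {"uri": "http://example.org/scopus/9999", "type": "relatedMatch"}, scopus_graph
)
```

Here `found` carries the Scopus label and categories, while `missing` keeps its original fields and gains only the warning, so unresolved mappings stay visible in the output instead of being silently dropped.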

Output
The resulting file, merged_loc_scopus.json, contains one entry for each LOC concept. Each entry includes:

- the LOC identifier and labels;
- alternative labels;
- all semantic mappings to Scopus concepts;
- the corresponding Scopus labels;
- associated broad disciplinary areas;
- associated narrow categories.

This enriched JSON representation provides a consolidated view of the LOC–Scopus alignments and serves as the basis for subsequent analysis and validation activities.

Please note: The materials for this preliminary analysis and for the generation of merged_loc_scopus.json (used in the final stage of step 3) are available in the project's GitHub repository, including the Jupyter notebook, source datasets, and all files required to reproduce them.

Note
As part of the preparation process, a small number of LOC RDF alignment files were manually corrected to fix minor XML syntax errors (primarily mismatched tags) that prevented successful parsing. The validated versions included in the repository are the ones used to generate the final output.

After organizing the different files, we decided to perform the journal alignment starting from DOAJ (since it is already aligned with LOC), then falling back to SCImago for the unmatched journals.

DOAJ
Add the LOC_URI column to the dataset
Match the Journal ISSN (and EISSN) from the original DOAJ dataset to the doaj_jorunals_loc.json and retrieve the loc_uri. 
Add the URI to the corresponding journal. 
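
The DOAJ lookup above can be sketched as follows. The protocol does not describe the internal layout of doaj_jorunals_loc.json, so this sketch assumes a list of records with hypothetical `issn`, `eissn`, and `loc_uri` keys; the helper names, the ISSN values, and the example.org URIs are likewise illustrative.

```python
def normalize_issn(issn: str) -> str:
    """Drop the hyphen and unify case so '1234-567x' and '1234567X' match."""
    return issn.replace("-", "").strip().upper()


def build_issn_index(doaj_loc_records: list[dict]) -> dict[str, str]:
    """Map every print ISSN and EISSN to its LOC URI."""
    index = {}
    for record in doaj_loc_records:
        for key in ("issn", "eissn"):
            value = record.get(key)
            if value:
                index[normalize_issn(value)] = record["loc_uri"]
    return index


def lookup_loc_uri(issn, eissn, index):
    """Try the print ISSN first, then the EISSN; None means 'unmatched'."""
    for candidate in (issn, eissn):
        if candidate and normalize_issn(candidate) in index:
            return index[normalize_issn(candidate)]
    return None


# Toy records (made-up ISSNs and URIs).
index = build_issn_index([
    {"issn": "1234-5678", "eissn": "8765-432X", "loc_uri": "http://example.org/loc/R"},
    {"issn": None, "eissn": "1111-2222", "loc_uri": "http://example.org/loc/Q"},
])

by_issn = lookup_loc_uri("1234-5678", None, index)
by_eissn = lookup_loc_uri(None, "8765-432x", index)
unmatched = lookup_loc_uri("0000-0000", None, index)
```

Journals for which the lookup returns `None` are the ones that would be passed on to the SCImago-based matching.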

SCImago
Add the LOC_URI column to the dataset. 
Take the info about Categories and Areas for every journal. 
Go through the merged_loc_scopus.json and match the string. 
Take the loc_uri.
Add it to the corresponding row. 

Although the next step of the workflow did not require all semantic relations, the merged JSON was designed to preserve the complete alignment information. Therefore, all available SKOS mapping relations, including close matches, exact matches, narrow matches, broad matches, and related matches, were retained and stored together with their alignment type.
This choice ensured that no information was lost during the transformation process and allowed different filtering strategies to be applied in later stages of the analysis.

Assigning LOC Main Classes to Citation Flows

This step enriches the citation-flow dataset with Library of Congress (LOC) main classification labels for both citing and cited publications.
The classification relies on Scimago subject areas and categories as the primary source, with DOAJ subjects used as a fallback when Scimago information is missing or cannot be resolved. The mapping is based on the previously generated merged_loc_scopus.json. Only the top-level LOC classes are assigned.
This design choice was made to ensure interpretability of citation flows. The full LOC classification system is extremely fine-grained and would produce a highly fragmented set of categories, making cross-field comparisons and network-level analysis difficult. Using only top-level classes provides a controlled level of granularity that is better suited for aggregate analysis.

1. Scopus-based resolution
Scimago areas and categories are matched against the alignment index extracted from the JSON file:

area_index, cat_index = load_scopus_index(JSON_PATH)
Each publication is then resolved using:


labels, uris = resolve_scimago(scimago_area, scimago_cat, area_index, cat_index)

Category values are cleaned from quartile annotations:

def strip_quartile(cat: str) -> str:
    return re.sub(r"\s*\(Q\d+\)\s*$", "", cat).strip()
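
As a usage sketch, the cleaning can be applied to a cell that lists several categories. The semicolon separator and the helper `split_categories` are assumptions made for illustration; the protocol only shows `strip_quartile` itself.

```python
import re


def strip_quartile(cat: str) -> str:
    # Same pattern as above: drop a trailing "(Q<n>)" marker.
    return re.sub(r"\s*\(Q\d+\)\s*$", "", cat).strip()


def split_categories(raw: str, sep: str = ";") -> list[str]:
    """Split a multi-category cell and strip the quartile from each part."""
    return [strip_quartile(part) for part in raw.split(sep) if part.strip()]


cleaned = split_categories("Oncology (Q1); Cancer Research (Q2)")
print(cleaned)  # ['Oncology', 'Cancer Research']
```

Categories without a quartile annotation pass through unchanged, so the function is safe to apply to every value.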

Only trusted alignment relations are used for area-level matching:

TRUSTED_MATCH_TYPES = {"closeMatch", "exactMatch"}
2. DOAJ fallback

When Scimago data is missing or unresolved, DOAJ subjects are used:

general_category = part.strip().split(":")[0].strip().upper()
letter = LOC_LABEL_TO_LETTER.get(general_category)
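
A self-contained sketch of this fallback is given below. It assumes that a DOAJ subject cell holds several "General: Specific" entries separated by "|", and it uses only a two-entry subset of the label table (Q for Science and R for Medicine, which are LOC main classes); the wrapper `doaj_subjects_to_letters` is a hypothetical name.

```python
# Illustrative subset of the label-to-letter table.
LOC_LABEL_TO_LETTER = {"SCIENCE": "Q", "MEDICINE": "R"}


def doaj_subjects_to_letters(subjects: str, sep: str = "|") -> set[str]:
    """Collect the LOC main-class letters named in a DOAJ subject cell."""
    letters = set()
    for part in subjects.split(sep):
        # Keep only the top-level term before the first colon.
        general_category = part.strip().split(":")[0].strip().upper()
        letter = LOC_LABEL_TO_LETTER.get(general_category)
        if letter:
            letters.add(letter)
    return letters


letters = doaj_subjects_to_letters("Medicine: Internal medicine | Science: Chemistry")
```

Subjects whose top-level term is not in the table contribute no letter, which is what later allows such records to be labelled Others.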

Mapping to LOC main classes

Once LOC letters are identified, they are converted into human-readable labels and URIs:

def letters_to_results(letters: set) -> tuple[list, list]:
    labels, uris = [], []
    for letter in sorted(letters):
        entry = LOC_MAIN_CLASSES.get(letter.upper())
        if entry is None:
            continue  # skip letters that are not LOC main classes
        labels.append(entry["label"])
        uris.append(entry["uri"])
    return labels, uris
Multiple matches are preserved and joined with " | ":

return " | ".join(unique) if unique else None

3. Special categories and filtering rules

- Publications containing “multidisciplinary” are directly labelled as Multidisciplinary.
- Records with disciplinary information but no LOC match are labelled Others.
- Rows without any Scimago or DOAJ data remain unresolved.
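
The three rules can be read as a small decision function. The sketch below is an interpretation and not the project's code: the function name, its arguments, and the order in which the rules are checked are assumptions.

```python
def classify(scimago_text, doaj_text, matched_labels):
    """Apply the special-category rules to one publication.

    scimago_text / doaj_text: raw disciplinary strings (or None).
    matched_labels: LOC main-class labels already resolved for the record.
    """
    source_text = " ".join(t for t in (scimago_text, doaj_text) if t)
    if not source_text.strip():
        return None  # no Scimago or DOAJ data: stays unresolved
    if "multidisciplinary" in source_text.lower():
        return "Multidisciplinary"
    if not matched_labels:
        return "Others"  # disciplinary information exists, but no LOC match
    unique = sorted(set(matched_labels))
    return " | ".join(unique)


multi = classify("Multidisciplinary (Q1)", None, [])
others = classify("Obscure Field", None, [])
unresolved = classify(None, None, [])
combined = classify("Oncology", "Science: Chemistry", ["SCIENCE", "MEDICINE", "SCIENCE"])
```

Returning `None` for rows without any data keeps them distinguishable from Others, so they can be routed to the file of unresolved rows.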

Output structure

The script generates two outputs:

- output_cat.csv: only rows where both citing and cited publications are successfully classified
- miss_loc.csv: rows where at least one side could not be resolved

Filtering is applied as follows:

both_resolved = df["citing_loc_label"].notna() & df["cited_loc_label"].notna()
df_resolved   = df[both_resolved]
df_missing    = df[~both_resolved]

Using only LOC main classes was a deliberate simplification to reduce dimensionality. This allows citation flows to be analyzed at a stable disciplinary level, avoiding sparsity and over-fragmentation that would arise from full LOC subclass resolution.

Data Aggregation

Step 4: Aggregate disciplinary flow data for individual institutions.

We take the "citing_loc_label", "cited_loc_label", and "flow" columns from "output_cat.csv", the output of the previous step, and use the "Integer Counting" method to calculate discipline-to-discipline citations.

Note
Integer Counting: If a journal belongs to multiple disciplines (separated by ' | '), we split them and expand them into all possible pairs (Cartesian product). Each pair counts as 1 full citation action, inheriting the original flow type.
Fractional Counting: Each paper counts as 1, and its multiple disciplines share that unit fractionally.

Our unit of analysis is the disciplinary citation relationship, not the paper. Fractional counting would dilute the interdisciplinary edges we are trying to surface, and when comparing across institutions it could reflect paper counts more than disciplinary relationships. Integer counting is also considered a better fit for the transparency and reproducibility principles, as it avoids fractional artifacts that are harder to explain and read.

df_clean['citing_loc_label'] = df_clean['citing_loc_label'].str.split(' | ', regex=False)
df_clean['cited_loc_label'] = df_clean['cited_loc_label'].str.split(' | ', regex=False)

df_exploded = df_clean.explode('citing_loc_label').explode('cited_loc_label')
E.g., An 'Incoming' citation where a Medicine | Science journal cites a Philosophy | History journal is broken down into 4 separate actions (each weight = 1, flow = Incoming):
    1. Incoming | Medicine -> Philosophy
    2. Incoming | Medicine -> History
    3. Incoming | Science  -> Philosophy
    4. Incoming | Science  -> History
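
The difference between the two counting methods can be made concrete on this example. The sketch below uses only the standard library rather than pandas; the helper `expand_pairs` is a hypothetical name, and the fractional variant shown (one unit split equally over all pairs) is one common reading of fractional counting, given only for comparison.

```python
from collections import Counter
from itertools import product


def expand_pairs(citing_label: str, cited_label: str, sep: str = " | "):
    """Cartesian product of the citing and cited disciplines."""
    return list(product(citing_label.split(sep), cited_label.split(sep)))


pairs = expand_pairs("Medicine | Science", "Philosophy | History")

integer_weights = Counter()
fractional_weights = Counter()
for pair in pairs:
    integer_weights[pair] += 1                    # each pair is a full action
    fractional_weights[pair] += 1 / len(pairs)    # pairs share a single unit

print(sum(integer_weights.values()))      # 4
print(sum(fractional_weights.values()))   # 1.0
```

Under integer counting the citation contributes a weight of 1 to each of the four discipline pairs, whereas under fractional counting each pair would receive only 0.25.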

Create an edge list [flow, citing_loc_label, cited_loc_label, weight] with weights representing the number of citation actions between different disciplines, broken down by flow type (Incoming, Outgoing, Internal)

edge_list = df_exploded.groupby(['flow','citing_loc_label', 'cited_loc_label']).size().reset_index(name='weight')
edge_list = edge_list.sort_values(by='weight', ascending=False)
The result gives the number of citation actions for each citing-cited discipline pair, per flow type.

Create an institutional profile matrix [Discipline, cited_count, citing_count, total_count] sorted by total activity. Ideal for comparative bar charts.

    # 1. Count Cited: (Incoming and Internal)
    df_cited_side = df_exploded[df_exploded['flow'].isin(['Incoming', 'Internal'])]
    cited_counts = df_cited_side.groupby('cited_loc_label').size().reset_index(name='cited_count')
    cited_counts.rename(columns={'cited_loc_label': 'Discipline'}, inplace=True)

    # 2. Count Citing: (Outgoing and Internal)
    df_citing_side = df_exploded[df_exploded['flow'].isin(['Outgoing', 'Internal'])]
    citing_counts = df_citing_side.groupby('citing_loc_label').size().reset_index(name='citing_count')
    citing_counts.rename(columns={'citing_loc_label': 'Discipline'}, inplace=True)

    subject_profile = pd.merge(cited_counts, citing_counts, on='Discipline', how='outer').fillna(0)
    
    subject_profile['cited_count'] = subject_profile['cited_count'].astype(int)
    subject_profile['citing_count'] = subject_profile['citing_count'].astype(int)

    subject_profile['total_count'] = subject_profile['cited_count'] + subject_profile['citing_count']
    subject_profile = subject_profile.sort_values(by='total_count', ascending=False)
The result, "[INSTITUTION]_profile_output.csv", shows the citing and cited counts for each discipline in an institution.
E.g.,

Discipline,cited_count,citing_count,total_count
SCIENCE,3248289,3473353,6721642
MEDICINE,2841492,2864034,5705526
GEOGRAPHY. ANTHROPOLOGY. RECREATION,556575,825657,1382232

Data Analysis and Visualization

For visualization and analysis, we produced two Jupyter notebook files:
- OpenCitation Dataset Overview (iris_oc_data_overview.ipynb) - an overview of the OpenCitations datasets of the six institutions.
- Disciplinary Citation Flow Analysis (viz_discipline_flow.ipynb) - the main analysis of the disciplinary flow.

The visualizations can be produced by running the cells of each notebook in order from the top. The following sections explain the structure of each notebook.

Notebook 1 - OpenCitation Dataset Overview (iris_oc_data_overview.ipynb) 
This notebook is an overview analytical report of the datasets across the six institutions. It documents the total citation counts per institution and summarises the data at each processing phase, from the initial data dump through missing data identification, external data matching, and final dataset consolidation.

As a preliminary step, the number of publication records input to and output from each data processing step is recorded in a JSON file: iris_data_summary.json

The Python library used is:
- plotly

Steps:
1. Load the JSON data and set global variables.
2. Create the following charts:
- Bar chart of initial citation counts - to get an overall sense of the citation volume of the six institutions.
- Sankey diagram of the citation processing pipeline - to visualize the matched and missing data through the data processing steps. This chart is produced for each institution.

Notebook 2 - Disciplinary Citation Flow Analysis (viz_discipline_flow.ipynb) 

This notebook takes the aggregated datasets as input and addresses the two research questions:
- "What are the scholarly disciplines that either cite or are cited by the IRIS publications included in OpenCitations for a given institution?"
- "Do different institutions show different disciplinary citation patterns?"

It does so through a series of visualizations and quantitative analyses. It examines the distribution of disciplinary citation flows across institutions, the volume and directionality of citing–cited discipline pairs, and cross-institutional differences in disciplines.

Data Loading and Setup

In this subsection we install libraries/packages, and set data paths, global variables and color charts.

The Python libraries used are:
- pandas for data manipulation
- plotly for visualizing charts
- ipywidgets for applying interactive widgets (dropdowns and buttons)

The CSV datasets for each institution, produced in the previous steps, are named:
- [INSTITUTION]_profile_output.csv
- [INSTITUTION]_agg_output.csv

Identify Overall Disciplinary Distribution

In this subsection we explore the overall distribution of disciplines that either cite or are cited by each institution, in order to identify the share of each discipline and how these shares differ between institutions.

Steps:
1. Load the `[INSTITUTION]_profile_output.csv` file for all six institutions.
2. For each institution, count and rank disciplines by total citation volume (by "citing IRIS" or "cited by IRIS" publications).
3. Produce the following visualizations:
- Summary table - to display the total of citation records by discipline and their share per institution.
- Butterfly chart - to compare the volume per discipline per institution.
- Grouped bar chart - to visualize actual counts of citing and cited side by side.
- 100% stacked bar chart - to understand the proportional share of each discipline independent of institution size.
4. From these initial charts, we noticed a skewed distribution in all six institutions: "Medicine" and "Science" account for the majority of citation records, while some disciplines, such as "Law" and "Military Science", account for less than 1% of the whole dataset. To further examine the discipline shares without the influence of outliers, we divided the list of disciplines into three groups based on their share:

Note
High Range: Medicine, Science
Middle Range: Geography/Anthropology, Agriculture, Social Sciences, Education, Political Science, Philosophy/Psychology/Religion, Language & Literature, Multidisciplinary, Others
Low Range: Fine Arts, Naval Science, Military Science, General Works, Bibliography/Library Science, Music, Law, Auxiliary Sciences of History

5. Based on the grouping, produce a set of 100% stacked bar charts by discipline range.

Examine the Disciplinary Citation Flow
This subsection aims to identify which disciplines cite which other disciplines, and to examine the degree of within-discipline and cross-discipline citation behavior.

Steps:
1. Load `[INSTITUTION]_agg_output.csv` for all six institutions.
2. For each citing-cited discipline pair, distinguish the flow type as follows:

Note
Incoming: external publications citing the institution's work
Internal: citations between publications within the same institution
Outgoing: citations from the institution's publications to external work

3. Classify each pair as "Self" (a discipline cites itself) or "Cross" (one discipline cites another).
4. Produce the following visualisations:
- Bar chart of citing–cited pair ranking - to show the top N pairs per institution, colour-coded by flow type (Incoming / Internal / Outgoing).
- 100% stacked bar chart of self vs. cross citation by discipline - to show the proportion of self vs. cross-disciplinary citations.
- 100% stacked bar chart of self vs. cross by range - to show the same data in terms of the volume range of disciplines (All / High / Mid / Low).
- Proportion sunburst chart - a two-ring interactive radial chart showing the proportional structure of citing→cited flows, in which the inner ring is the citing discipline and the outer ring is the cited discipline.
- Static Sankey diagram - a directional flow diagram for a selected institution for the top N citation pairs, with link width proportional to citation volume.
- Interactive Sankey diagram - a filterable version allowing selection of institution(s), citing disciplines, cited disciplines, and flow type.
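
The Self/Cross classification behind the stacked bar charts can be sketched as below. The edge values are toy numbers, the function name is hypothetical, and grouping the shares by citing discipline is an assumption; the notebook may aggregate differently.

```python
from collections import defaultdict

# Toy edge list: (flow, citing discipline, cited discipline, weight).
edges = [
    ("Outgoing", "MEDICINE", "MEDICINE", 60),
    ("Outgoing", "MEDICINE", "SCIENCE", 40),
    ("Incoming", "SCIENCE", "SCIENCE", 30),
    ("Internal", "SCIENCE", "MEDICINE", 10),
]


def self_cross_share(edges):
    """Percentage of Self vs. Cross citation weight per citing discipline."""
    totals = defaultdict(lambda: {"Self": 0, "Cross": 0})
    for _flow, citing, cited, weight in edges:
        kind = "Self" if citing == cited else "Cross"
        totals[citing][kind] += weight
    shares = {}
    for discipline, counts in totals.items():
        total = counts["Self"] + counts["Cross"]
        shares[discipline] = {k: 100 * v / total for k, v in counts.items()}
    return shares


shares = self_cross_share(edges)
print(shares["MEDICINE"])  # {'Self': 60.0, 'Cross': 40.0}
print(shares["SCIENCE"])   # {'Self': 75.0, 'Cross': 25.0}
```

Because the shares are normalised per discipline, small and large disciplines can be compared on the same 100% scale.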

Compare Cross-Institutional Figures

This subsection aims to compare disciplinary citation profiles across institutions relative to the cohort, identifying disciplines whose share at an institution is above or below the cross-institution average.

Steps:
1. For each institution, compute the proportional share of each discipline relative to total citation activity (both citing and cited).
2. Calculate the deviation of each institution's disciplinary proportion from the cross-institution average (in percentage points).
3. Produce the following visualisation:
- Relative disciplinary heatmap - a matrix of institutions × disciplines, where cell color encodes the degree of deviation from the mean. Produced separately for citing counts and cited counts.
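
The deviation computation in steps 1 and 2 can be illustrated with a small sketch. The institution names and counts are invented, the function names are hypothetical, and the cross-institution average is assumed here to be the unweighted mean of the institutional shares.

```python
def discipline_shares(counts: dict[str, int]) -> dict[str, float]:
    """Share of each discipline in percent of the institution's total."""
    total = sum(counts.values())
    return {d: 100 * c / total for d, c in counts.items()}


def deviation_from_mean(profiles: dict[str, dict[str, int]]):
    """Deviation of each share from the cross-institution mean (in pp)."""
    shares = {inst: discipline_shares(c) for inst, c in profiles.items()}
    disciplines = sorted({d for s in shares.values() for d in s})
    mean = {
        d: sum(s.get(d, 0.0) for s in shares.values()) / len(shares)
        for d in disciplines
    }
    return {
        inst: {d: s.get(d, 0.0) - mean[d] for d in disciplines}
        for inst, s in shares.items()
    }


# Toy counts for two fictional institutions.
profiles = {
    "Institution A": {"MEDICINE": 60, "SCIENCE": 40},
    "Institution B": {"MEDICINE": 20, "SCIENCE": 80},
}
deviation = deviation_from_mean(profiles)
print(deviation["Institution A"])  # {'MEDICINE': 20.0, 'SCIENCE': -20.0}
```

A positive value means the discipline is over-represented at that institution relative to the cohort, a negative value that it is under-represented; these are the quantities a heatmap cell would encode.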

Finalization

The output data will be published on Zenodo, and all scripts and interim datasets will be available in the GitHub repository:
https://github.com/open-sci/2025-2026/tree/main/bloom

Protocol references

Andreose, E., Di Marzo, S., Heibi, I., Peroni, S., & Zilli, L. (2025). Analysing the coverage of the University of Bologna's bibliographic and citation metadata in OpenCitations collections. arXiv. https://doi.org/10.48550/arXiv.2501.05821

Heibi, I., Moretti, A., Peroni, S., & Soricetti, M. (2024). The OpenCitations Index: description of a database providing open citation data. Scientometrics, 129, 7923–7942. https://doi.org/10.1007/s11192-024-05160-7

Massari, A., Mariani, F., Heibi, I., Peroni, S., & Shotton, D. (2024). OpenCitations Meta. Quantitative Science Studies, 5(1), 50–75. https://doi.org/10.1162/qss_a_00292

Peroni, S., & Shotton, D. (2019). OpenCitations, an infrastructure organization for open scholarship. Quantitative Science Studies, 1(1), 428–444. https://doi.org/10.1162/qss_a_00023

Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3. https://doi.org/10.1038/sdata.2016.18

Cioffi, A., Coppini, S., Massari, A., Moretti, A., Peroni, S., Santini, C., & Shahidzadeh Asadi, N. (2022). Identifying and correcting invalid citations due to DOI errors in Crossref data. Scientometrics, 127(6), 3593–3612. https://doi.org/10.1007/s11192-022-04367-w
