License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited
Protocol status: Working
We use this protocol and it's working
Created: June 24, 2026
Last Modified: June 26, 2026
Protocol Integer ID: 319685
Keywords: cumulative probabilities at specific age threshold, specific age threshold, summary quartile, sample quartile, distribution optimization framework for diverse fgm, epidemiological dataset, known quartile, multiple sample cohort, distribution optimization framework, cumulative distribution function, such as age, generalized gamma, probability density function, cumulative probability, distribution, cdf, quality pdf plot, generalised gamma, pdf, diverse fgm
Disclaimer
DISCLAIMER – FOR INFORMATIONAL PURPOSES ONLY; USE AT YOUR OWN RISK
The protocol content here is for informational purposes only and does not constitute legal, medical, clinical, or safety advice, or otherwise; content added to protocols.io is not peer reviewed and may not have undergone a formal approval of any kind. Information presented in this protocol should not substitute for independent professional judgment, advice, diagnosis, or treatment. Any action you take or refrain from taking using or relying upon the information presented here is strictly at your own risk. You agree that neither the Company nor any of the authors, contributors, administrators, or anyone else associated with protocols.io, can be held responsible for your use of the information contained in or linked to this protocol or any of our Sites/Apps and Services.
Abstract
This protocol describes a computational pipeline to fit a Generalised Gamma (Stacy) distribution using three known quartiles (Q1, Median, and Q3). It is designed for demographic or epidemiological datasets (such as age-at-event data) where individual-level data may be missing, but summary quartiles are available. The protocol estimates shape, scale, and threshold parameters, calculates cumulative probabilities at specific age thresholds (5, 10, 15, and 18 years), and exports publication-quality PDF plots of both Probability Density Functions (PDF) and Cumulative Distribution Functions (CDF) for multiple sample cohorts or countries.
Environment: R version 4.0.0 or higher is recommended.
Core Packages Used:
tidyverse (v2.0.0+) – Specifically leverages dplyr for row-wise transformations, tidyrfor unnesting data structures, and readr/purrr utilities.
stats (Built-in R library) – Specifically utilizes nlminb() for bounded optimization and uniroot() for exact root-finding.
2. Required Input File & Directory Setup
File Name: Age_of_Cutting.csv located in the working directory main_dir <- "."
Directory Creation: The script automatically creates a /PLOTS subdirectory for graphical output.
3. Pre-processing & Treatment of Quartiles Equal or Close to Zero
Mathematical Constraints: The mathematical framework strictly requires strictly positive, ordered quartiles: 0 < Q1 < Median < Q3. If this condition is violated (e.g., if a country has a Q1 equal to zero), the script throws a fatal error via stop (Require 0 < q1 < m < q3)
Zero/Tiny Value Defence (Log-Space): To prevent log of zero errors (log(0) = – infinity) during parameter search and density plotting, the script applies a defence floor using a maximum comparison function: pmax(qh, 1e-12) and pmax(tgt, 1e-12) as well as pmax(x, .Machine$double.eps)
Protocol Steps
The protocol is desribed below in six steps:
Step 1: Environment Setup and Library Initialisation
Initialise the R workspace by suppressing startup messages and loading the required data manipulation libraries.
Step 2: Ingest Data and Perform Defensive Type Coercion
Read the external dataset into the R environment. The script executes defensive programming checks to ensure the required columns (q1, m, q3) exist and forces their values into numeric data types to prevent execution errors down the line.
The pipeline handles parameter estimation using an automated, fallback hierarchy across three sequential mathematical approaches inside fit_gengamma_stacy_from_quartiles
Approach A: Exact Algebraic Search
The script calculates log-ratios of the empirical data (r25 = log(Q1 / M) and r75 = log(Q3 / M)). It sets up an inner target function g(k) where:
It screens an exponential sequence of k across a search range bounded by [10-4, 103]. If a sign change is found, uniroot() solves for g(k) = 0 to an exact tolerance of tol = 1e-10
Approach B: Near-Exact Vector Sweep
If Approach A fails to find a root bracket or produces invalid parameters, a high-density grid search evaluates g(k)across 800 exponential intervals. The k minimising |g(k)| is selected. The result is accepted only if the final log-space relative error is strictly less than 10% (relerr0 < 0.10)
Approach C: Bounded Non-Linear Optimiser
If both analytical sweeps fail, the protocol switches to a bounded optimisation algorithm utilising stats::nlminb()
Parameter Transformations: To ensure parameters remain strictly positive, the optimiser operates in log-transformed space:
θ1=log(d), θ2=log(p), θ3=log(a)
Search Ranges / Parameter Constraints: The bounds are explicitly mapped to θ-space using the following parameter-space boundaries:
Starting Values Matrix: To avoid local minima traps, a grid of 5 data-adaptive initialisation profiles (starts) is used to run the optimiser iteratively:
Standard Profile: d≈2,p≈1,a≈M
Alternative Profile: d≈3,p≈0.7,a≈M
Infant Profile (Spiky/L-shaped): d=0.5,p=2.0,a≈M
Broad Tail Profile: d≈4,p≈0.4,a≈2M
Extreme Infant Profile: d=0.1,p=3.0,a≈M⋅1.5
Step 4: Define the Exact Objective Function
During Step 3's Approach C optimisation routine, nlminb(), minimises a custom penalised loss function.
The Mathematical Objective Formula:
Where:
Goodness-of-Fit Criteria: The core metric is the weighted log-relative squared error between the fitted quartiles (q^) and target input quartiles (q).
Weights (w): Asymmetric weights are assigned via w <- c(1, 1.5, 1), placing a heavier penalty (1.5×) on deviations from the Median to anchor the central distribution tendency.
Ridge Penalty: A mild ridge regularization penalty (λ=10−4) is applied to ∑θ2 to stabilize parameter drift in flat optimization landscapes.
Soft Boundary Penalties (Pensoft): Quadratic penalty terms are triggered dynamically if parameters venture within 10−3 or 10−4 units of their defined hard absolute bounds.
Step 5: Execute Model Fitting and Cumulative Probability Extractions
Pass the quartile values row-by-row using a tidyverse optimisation loop. Once the parameters (d, p, a) are settled, the script computes the final parameter variations (μ,σ,Q) matching flexsurv standards and saves individual calculation tracks. It automatically computes milestones representing cumulative probabilities (F(x)) at ages 5, 10, 15, and 18.
Step 6: Export Results & Graphics Compilation
Save the finalised data table to a CSV file. Then, initialise a graphical looping device to generate PDF plots for every country.
Density PDF Toggling: The plot function evaluates the peak height (y_max). If a country exhibits an L-shaped or highly spiky profile (d≤1+10−12 or ymax>50), the function automatically swaps the graphic layout to a logarithmic Y-scale (log="y") using a baseline floor of 10-320 to prevent visualisation crashes.
Acknowledgements
The author used Gemini (accessed in October 2025 and June 2026) to assist with writing the R code used to fit the Generalised Gamma distributions to the age of FGM/C data. The author subsequently reviewed the code, methodology, and results and takes full responsibility for the manuscript's content.