Apr 26, 2026

LSTM-Based Forecasting of Dengue Hospitalizations in Brazil Using Climate and Physician Digital Search Data

This protocol is a draft, published without a DOI.
  • Dayanna Quintanilha Palmer1,2,
  • Marcela Motta2,
  • Eduardo Moura2,
  • Danielly Xavier2,
  • Guilherme Schittine2,
  • Angélica Caseri3,
  • Ronaldo Gismondi1,2
  • 1Department of Clinical Medicine, Universidade Federal Fluminense (UFF), Niterói, Rio de Janeiro, Brazil;
  • 2Research & Innovation Center (Afya), São Paulo, São Paulo, Brazil;
  • 3Institute of Mathematical and Computer Sciences (ICMC), Universidade de São Paulo (USP), Brazil
Protocol Citation: Dayanna Quintanilha Palmer, Marcela Motta, Eduardo Moura, Danielly Xavier, Guilherme Schittine, Angélica Caseri, Ronaldo Gismondi 2026. LSTM-Based Forecasting of Dengue Hospitalizations in Brazil Using Climate and Physician Digital Search Data. protocols.io https://dx.doi.org/
License: This is an open access protocol distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Protocol status: Working
We use this protocol and it's working.
Created: April 13, 2026
Last Modified: April 26, 2026
Protocol Integer ID: 314955
Keywords: dengue hospitalization forecasting framework, based forecasting of dengue hospitalization, lstm, based forecasting, dengue hospitalization, physician digital search data, ahead forecast, model training, physician digital search data this protocol, neural network, climate data, term memory, official hospitalization record, physician digital search behavior
Funders Acknowledgements:
Afya
Google PhD Fellowship Research Grant
Disclaimer
DISCLAIMER – FOR INFORMATIONAL PURPOSES ONLY; USE AT YOUR OWN RISK

The protocol content here is for informational purposes only and does not constitute legal, medical, clinical, or safety advice, or otherwise; content added to protocols.io is not peer reviewed and may not have undergone a formal approval of any kind. Information presented in this protocol should not substitute for independent professional judgment, advice, diagnosis, or treatment. Any action you take or refrain from taking using or relying upon the information presented here is strictly at your own risk. You agree that neither the Company nor any of the authors, contributors, administrators, or anyone else associated with protocols.io, can be held responsible for your use of the information contained in or linked to this protocol or any of our Sites/Apps and Services.
Abstract
This protocol describes the full analytical workflow for developing a dengue hospitalization forecasting framework using Long Short-Term Memory (LSTM) neural networks applied at the Immediate Geographic Region (IGR) level across Brazil. The pipeline integrates climate data, physician digital search behavior, and official hospitalization records to generate 8-week-ahead forecasts. The protocol covers data acquisition, preprocessing, feature engineering, model training, evaluation, and interpretability analysis using SHAP values.
Materials
5.1 Hospitalization Data — SIH/SUS
- Source: SIH/SUS database (Sistema de Informações Hospitalares do SUS), maintained by DATASUS (Department of Informatics of the Unified Health System).
- Access: Publicly available; anonymized.
- Update frequency: Monthly, with a typical release delay of 1–2 months after hospital care.
- Final consolidation may occur up to 4 months after the reference month (per Portaria SAES nº 1.110/2021, Brazilian Ministry of Health).
- Study period: 2021–2024.
- Access URL: https://datasus.saude.gov.br/

5.2 Clinical Search Data — Afya Whitebook®
- Source: Afya Whitebook®, a clinical decision-support platform widely used by Brazilian healthcare professionals.
- Data type: Metadata on physician search behavior (dengue-related queries), collected in real time and accessible retrospectively via a structured query interface.
- Inclusion criterion: Only search data generated by verified licensed physicians with active professional credentials (CRM) were included.
- Population context: ~150,000 monthly active physician users as of end of 2024, out of 597,428 practicing physicians in Brazil (January 2024 census).
- Coverage: 46 IGRs had available clinical search data; 27 passed all quality filters.
- All data are anonymized in compliance with ethical standards.

5.3 Climate Data — INMET
- Source: Brazilian National Institute of Meteorology (Instituto Nacional de Meteorologia — INMET) automatic weather stations.
- Variables collected: Daily precipitation, maximum temperature, mean temperature, minimum temperature, mean relative humidity.
- Access URL: https://portal.inmet.gov.br/
- Study period: 2021–2024.

5.4 Software and Computing Environment
- Language: Python
- Key libraries: TensorFlow/Keras (LSTM modeling), pandas (data preprocessing), NumPy (numerical operations), scikit-learn (evaluation metrics), SHAP (model interpretability), statsmodels (SARIMAX benchmark).
- All preprocessing and modeling were conducted in a Python-based analytical environment.
Study Design
Integration of climate, physician clinical search, and hospitalization datasets from public sources.
Data preprocessing: weekly aggregation, imputation of missing values, and consolidation at the IGR level.
Derivation of additional climate indicators (42 climate-related variables total).
Variable screening using LSTM-based relevance evaluation.
Model training and interpretability analysis using SHAP (SHapley Additive exPlanations) values.
Set the forecast horizon to 8 weeks ahead. Implement a sliding-window approach in which recent historical sequences are used to predict the subsequent hospitalization counts, enabling the model to learn short-term and seasonal temporal patterns.
Study Site
Geographic unit of analysis: Immediate Geographic Region (IGR), as defined by the Brazilian Institute of Geography and Statistics (IBGE).
An IGR is a group of neighboring municipalities organized around urban centers that function as local hubs for goods, services, and labor.
The current territorial division comprises 510 IGRs across Brazil.
IGRs represent an intermediate geographic scale between municipalities and states.
Final analytical sample: 27 IGRs retained after all filtering stages (the full list appears in the Data Preprocessing Procedure).
Study Variables
Outcome Variable: Primary outcome is the weekly number of hospitalizations for dengue fever per IGR, extracted from the SIH/SUS database.
Predictor Variables: Predictor variables span two domains: meteorological and digital health.
Meteorological Predictors: Temperature, precipitation, humidity, and season indicators with specific units and lagged values.
Digital Health Predictor: Normalized search rate, defined as the weekly number of clinical searches by physicians on Afya Whitebook® divided by the number of active physicians in the IGR in that week. Unit: searches per 10,000 active physicians.
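The normalization can be illustrated with a minimal Python sketch. The function name, the record layout, and the example counts are hypothetical and are not taken from the study data; only the unit (searches per 10,000 active physicians) comes from the protocol.

```python
def normalized_search_rate(searches, active_physicians, per=10_000):
    """Weekly searches per `per` active physicians; None when there is no denominator."""
    if active_physicians <= 0:
        return None
    # Multiply before dividing to limit floating-point rounding.
    return searches * per / active_physicians

# Hypothetical IGR-week records: (igr, epi_week, dengue_searches, active_physicians)
records = [
    ("Maringá", "2023-W10", 42, 1200),
    ("Maringá", "2023-W11", 63, 1260),
]
rates = {(igr, week): normalized_search_rate(s, a) for igr, week, s, a in records}
```

Returning None for weeks without active physicians is one possible convention; such weeks could equally be dropped before modeling.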
Data Preprocessing Procedure
Convert date fields to datetime format.
Standardize variable names across all station files.
Append station metadata (name, geographic coordinates, operational period) to each record.
Retain only observations between January 2021 and December 2024.
Exclude weather stations with more than 30 consecutive days of missing data.
For IGRs containing multiple meteorological stations, retain only the station with the highest percentage of data completeness.
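A minimal sketch of the two station filters above, in plain Python. It assumes each station is a list of daily values with None marking a missing day, and that the 30-day rule is applied to one series at a time; the protocol does not state which variable(s) the rule was checked on. Station names and values are hypothetical.

```python
def longest_missing_run(values):
    """Length of the longest run of consecutive None entries in a daily series."""
    longest = current = 0
    for v in values:
        current = current + 1 if v is None else 0
        longest = max(longest, current)
    return longest

def completeness(values):
    """Share of non-missing daily observations."""
    return sum(v is not None for v in values) / len(values)

def pick_station(stations, max_gap=30):
    """Drop stations whose longest gap exceeds max_gap; return the most complete one (or None)."""
    eligible = {name: vals for name, vals in stations.items()
                if longest_missing_run(vals) <= max_gap}
    if not eligible:
        return None
    return max(eligible, key=lambda name: completeness(eligible[name]))

# Hypothetical stations located in the same IGR
stations = {
    "A": [20.1] * 50 + [None] * 31 + [21.0] * 50,  # 31-day gap: excluded
    "B": [19.5] * 60 + [None] * 10 + [22.3] * 61,  # eligible, 10 missing days
    "C": [18.0] * 128 + [None] * 3,                # eligible, 3 missing days
}
best = pick_station(stations)
```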
Aggregate daily climate observations into weekly series.
Calculate weekly averages for temperature and relative humidity.
Calculate weekly sums for precipitation.
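The weekly aggregation could look like the following sketch. It groups days by ISO week (Monday to Sunday) purely for illustration; epidemiological weeks commonly run Sunday to Saturday, so an actual pipeline would likely substitute its own week calendar. The tuple layout and variable names are assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

def weekly_aggregate(daily):
    """daily: list of (date, temp_mean, humidity, precip); missing values are None.
    Returns {(iso_year, iso_week): {"temp_mean", "humidity", "precip"}}."""
    buckets = defaultdict(lambda: {"temp_mean": [], "humidity": [], "precip": []})
    for day, temp, hum, rain in daily:
        iso = day.isocalendar()
        key = (iso[0], iso[1])
        for name, value in (("temp_mean", temp), ("humidity", hum), ("precip", rain)):
            if value is not None:
                buckets[key][name].append(value)
    weekly = {}
    for key, cols in sorted(buckets.items()):
        weekly[key] = {
            # Averages for temperature and humidity, sum for precipitation.
            "temp_mean": mean(cols["temp_mean"]) if cols["temp_mean"] else None,
            "humidity": mean(cols["humidity"]) if cols["humidity"] else None,
            "precip": sum(cols["precip"]) if cols["precip"] else None,
        }
    return weekly

start = date(2021, 1, 4)  # a Monday, so these 7 days share one ISO week
daily = [(start + timedelta(days=i), 25.0 + i, 70.0, 2.0) for i in range(7)]
weekly = weekly_aggregate(daily)
```

Weeks with no valid observation for a variable come out as None, which leaves them visible to the imputation step.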
Identify missing weekly values in the climate time series.
Impute missing values using a centered moving average: calculate the mean using two weeks forward and two weeks backward (minimum of one available neighbor required).
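The centered moving-average rule can be sketched as follows. The function and parameter names are illustrative. One assumption is made explicit in the code: neighbors are read from the original series, so previously imputed values are never reused.

```python
def impute_centered(series, half_window=2, min_neighbors=1):
    """Fill None entries with the mean of observed values within +/- half_window weeks.
    Neighbors come from the original series, so imputed values are not reused."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        lo, hi = max(0, i - half_window), min(len(series), i + half_window + 1)
        neighbors = [series[j] for j in range(lo, hi) if j != i and series[j] is not None]
        if len(neighbors) >= min_neighbors:
            filled[i] = sum(neighbors) / len(neighbors)
    return filled

weekly_temp = [24.0, 26.0, None, 28.0, 30.0, None, None, None, None, None, 22.0]
imputed = impute_centered(weekly_temp)
```

In this toy series the gap at index 2 has four observed neighbors, while the week at index 7 has none within two weeks on either side and therefore stays missing.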
Merge hospitalization data (SIH/SUS), climate data (INMET), and clinical search data (Afya Whitebook®) at the IGR–week level to form a unified analytic panel.
Verify data completeness across all three domains for each IGR.
1. Start from 573 INMET weather stations. Of these, 54 met the completeness threshold (no more than 30 consecutive days of missing data), covering 46 IGRs. Cross-reference these with the 510 IGRs with dengue hospitalizations recorded in SIH/SUS (2021–2024).
2. Exclude IGRs with fewer than 5 weekly hospitalizations in either the training or test set. This reduced the sample from 46 to 28 IGRs before the R² filter.
3. From the 28 IGRs with available clinical search data, one additional IGR was removed due to non-positive R².
4. Verify the final retained set of 27 IGRs: Alegre, Belo Horizonte, Campina Grande, Campos Dos Goytacazes, Catalão, Cruz Alta, Distrito Federal, Frederico Westphalen, Ijuí, Juiz De Fora, Linhares, Marília, Maringá, Oliveira, Passo Fundo, Passos, Pirapora, Porto Alegre, Ribeirão Preto, Rio De Janeiro, Salvador, Santa Cruz Do Sul, Santa Maria, São Miguel Do Oeste, São Paulo, Uberaba, Uberlândia.
Derive 42 climate-related variables from the clean weekly climate series, including:
Lagged indicators for precipitation, temperature, and humidity (lags 1–4 weeks).
Weekly change flags (binary indicators for week-over-week changes above the 75th percentile).
Interaction terms between climate variables.
Extreme climate indicators (binary flags for values ≥95th percentile).
Seasonal dummy variables (Summer, Autumn, Winter, Spring).
Temperature category indicators.
Precipitation occurrence and pattern variables.
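A few of the derived variables listed above can be sketched in plain Python. This is an illustration only: the nearest-rank percentile definition, the month-to-season mapping (Southern Hemisphere meteorological seasons), and all variable names are assumptions, since the protocol does not specify them.

```python
import math

def lagged(series, lag):
    """Shift a weekly series by `lag` weeks; the first `lag` entries have no value."""
    return [None] * lag + list(series[:-lag])

def percentile(values, q):
    """Nearest-rank percentile (q in 0-100) of the observed values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def extreme_flags(series, q=95):
    """1 where the value is at or above the q-th percentile of the series, else 0."""
    threshold = percentile(series, q)
    return [int(v >= threshold) for v in series]

def season(month):
    """Season for a calendar month (assumed convention: Summer = Dec-Feb)."""
    return ("Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter",
            "Winter", "Winter", "Spring", "Spring", "Spring", "Summer")[month - 1]

# Hypothetical weekly precipitation totals (mm)
precip = [0.0, 5.0, 12.0, 3.0, 80.0, 7.0, 1.0, 0.0, 9.0, 4.0]
features = {f"precip_lag{k}": lagged(precip, k) for k in range(1, 5)}
features["precip_extreme"] = extreme_flags(precip)
```

The four lagged columns start with one to four missing entries, which is why the first 4 weeks of the study period cannot be used for modeling.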
Feature Selection Procedure
For each of the four climatic categories (temperature, humidity, precipitation, seasonality), train a separate LSTM model using the hospitalization time series combined with one candidate climate variable at a time.
This strategy was intended to reduce redundancy among correlated climatic predictors before model fitting and to improve the interpretability of subsequent SHAP analyses.
Record the Root Mean Square Error (RMSE) for each single-variable model.
For each climatic category, retain the single variable yielding the lowest RMSE.
Result: Each IGR retains four climatic input variables — one from each category.
Additionally assess feature importance using SHAP values to quantify the relative contribution of climatic vs. digital health variables per IGR.
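The per-category selection rule reduces to taking the minimum RMSE within each category. In the sketch below, the candidate names and RMSE values are placeholders invented for illustration; they are not study results.

```python
# Placeholder RMSE values from single-variable LSTM runs in one IGR (not study results).
rmse_by_candidate = {
    "temperature": {"temp_mean_lag1": 4.8, "temp_max_lag2": 4.2, "temp_min_lag3": 5.1},
    "humidity": {"humidity_lag1": 4.9, "humidity_lag4": 4.6},
    "precipitation": {"precip_lag2": 4.4, "precip_extreme": 5.0},
    "seasonality": {"summer": 4.7, "winter": 4.5},
}

def select_best_per_category(rmse_table):
    """Return, for each category, the candidate variable with the lowest RMSE."""
    return {category: min(scores, key=scores.get) for category, scores in rmse_table.items()}

selected = select_best_per_category(rmse_by_candidate)
```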
LSTM Model Training Procedure
The study period spans 208 epidemiological weeks (2021–2024). The first 4 weeks are discarded because the lagged predictors are undefined there, leaving 204 usable weeks.
Split the series 70/30 in chronological order: training from epidemiological week 5/2021 to week 43/2023 (143 weeks); testing from epidemiological week 44/2023 to week 52/2024 (61 weeks).
Apply a walk-forward expanding-window validation approach during training to mimic real-time forecasting and assess model robustness across temporal segments.
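The split arithmetic and one possible expanding-window scheme can be sketched as follows. The week counts (208, 4, 143, 61) come from the protocol; the number of folds and the minimum training length are assumptions chosen only to make the example concrete.

```python
def chronological_split(n_weeks, n_train):
    """Index ranges for a time-ordered train/test split (no shuffling)."""
    return range(0, n_train), range(n_train, n_weeks)

def expanding_window_folds(n_train, n_folds, min_train):
    """Yield (train_idx, val_idx) pairs in which the training window grows each fold."""
    val_size = (n_train - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * val_size
        yield range(0, train_end), range(train_end, train_end + val_size)

total_weeks = 208 - 4  # first 4 weeks dropped for the lagged predictors
train_idx, test_idx = chronological_split(total_weeks, n_train=143)

# Assumed settings: 3 folds, at least 80 training weeks in the first fold.
folds = list(expanding_window_folds(n_train=143, n_folds=3, min_train=80))
```

Each validation block lies strictly after its training block, so no fold uses future weeks to predict past ones.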
LSTM networks are a special type of recurrent neural network (RNN) designed to overcome the vanishing/exploding gradient problem in long sequences.
An LSTM unit contains a memory cell with a Constant Error Carousel (CEC) preserving information across time steps, while multiplicative gates regulate information flow (input gate, forget gate, output gate).
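For reference, the gating mechanism is commonly written as below. This is the standard textbook formulation with a forget gate, not notation taken from this protocol: x_t is the input at week t, h_t the hidden state, c_t the memory cell, σ the logistic sigmoid, ⊙ the element-wise product, and W, U, b are learned weights and biases (all symbols are assumptions).

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive update of c_t is what lets error signals pass across many time steps without repeated squashing.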
Architecture parameters used in this study:
Input window (seq_len): 24 weeks
Prediction horizon (output_size): 8 weeks
Number of hidden layers: 2–4 (evaluated via grid search)
Hidden units: 20 or 40 (evaluated via grid search)
Learning rate: 0.01
Loss function: Mean Squared Error (MSE)
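The input window and prediction horizon translate into supervised samples roughly as in this sketch. The function name and the toy series are hypothetical; only seq_len = 24 and the 8-week horizon come from the parameters above.

```python
def make_windows(features, target, seq_len=24, horizon=8):
    """Build (input window, future targets) pairs from aligned weekly series.
    features: list of per-week feature vectors; target: list of weekly counts."""
    X, y = [], []
    for start in range(len(target) - seq_len - horizon + 1):
        X.append(features[start:start + seq_len])
        y.append(target[start + seq_len:start + seq_len + horizon])
    return X, y

# Toy series: 143 training weeks, 2 features per week
target = list(range(143))
features = [[t, t * 0.5] for t in target]
X, y = make_windows(features, target)
```

With 143 training weeks this yields 143 - 24 - 8 + 1 = 112 samples, each an input of shape (24, number of features) paired with 8 target values.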
Hyperparameter Optimization
Perform a grid search evaluating combinations of: number of hidden layers (2, 3, 4) and hidden units per layer (20, 40).
Train each configuration in triplicate to account for stochastic variability during optimization.
Select the configuration with the lowest RMSE on the test set.
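The search loop can be sketched as below. The scoring function is a stand-in that returns arbitrary numbers rather than fitting a network, and averaging the three repeats is an assumption: the protocol does not say how the triplicate runs are combined.

```python
from itertools import product
from statistics import mean

def grid_search(train_and_score, layers=(2, 3, 4), units=(20, 40), repeats=3):
    """Score every (layers, units) pair `repeats` times; return the best pair and all mean RMSEs."""
    results = {}
    for n_layers, n_units in product(layers, units):
        scores = [train_and_score(n_layers, n_units, seed) for seed in range(repeats)]
        results[(n_layers, n_units)] = mean(scores)
    best = min(results, key=results.get)
    return best, results

def fake_train_and_score(n_layers, n_units, seed):
    """Stand-in for fitting an LSTM and returning its RMSE; the values are arbitrary."""
    return abs(n_layers - 3) + abs(n_units - 40) / 20 + 0.01 * seed

best, results = grid_search(fake_train_and_score)
```

In practice `train_and_score` would build and fit the model with the given depth and width, using the seed to vary initialization across repeats.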
Model Variants
Train the following five model structures to evaluate the contribution of each data domain:
Hospitalization-only model (LSTM — Hospitalization): Trained exclusively on past hospitalization counts.
Clinical-search model (LSTM — Hospitalization + Clinical Search): Trained using hospitalization time series combined with Afya Whitebook® physician search data.
Climate model (LSTM — Hospitalization + Climate): Trained using hospitalization data and the best-performing climatic variable from each category.
Integrated model (LSTM — Hospitalization + Clinical Search + Climate): Combines hospitalization data, climatic indicators, and real-time physician search behavior.
SARIMAX benchmark model (SARIMAX — Hospitalization): Classical time series benchmark trained only on hospitalization counts.
Ideal-Data vs. Real-World Scenarios
Ideal-data model: Predictor variables and hospitalization data are temporally aligned (no lag correction).
Real-world model: Hospitalization data are shifted forward by up to 8 weeks (~2 months) to reflect the average reporting lag observed in official SIH/SUS records.
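One way to read the real-world scenario is that, at any given week, the most recent hospitalization count available to the model is the one from 8 weeks earlier, while climate and search inputs remain current. The sketch below implements that reading; it is an interpretation, and the function name and example counts are hypothetical.

```python
def apply_reporting_delay(hospitalizations, delay_weeks=8):
    """Series as known at each week if records arrive `delay_weeks` late:
    position t holds the count from week t - delay_weeks (None if not yet available)."""
    if delay_weeks == 0:
        return list(hospitalizations)
    return [None] * delay_weeks + list(hospitalizations[:-delay_weeks])

observed = [3, 5, 8, 13, 21, 34, 55, 89, 144, 233]
ideal = apply_reporting_delay(observed, delay_weeks=0)       # temporally aligned
real_world = apply_reporting_delay(observed, delay_weeks=8)  # delayed by 8 weeks
```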
Model Evaluation Procedure
Calculate the following metrics for each model and each IGR:
Root Mean Square Error (RMSE): sqrt( (1/n) * sum( (y_t - ŷ_t)^2 ) )
Mean Squared Error (MSE): (1/n) * sum( (y_t - ŷ_t)^2 )
Mean Absolute Error (MAE): (1/n) * sum( |y_t - ŷ_t| )
Coefficient of Determination (R²): R² = 1 – (RSS / TSS)
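Where RSS = sum( (y_t – ŷ_t)^2 ) and TSS = sum( (y_t – ȳ)^2 ). In all formulas, y_t is the observed count in week t, ŷ_t is the predicted count, ȳ is the mean of the observed counts, and n is the number of evaluated weeks.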
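The four metrics can be computed directly from their definitions, as in this standard-library sketch (scikit-learn offers equivalent functions). The function name and the example values are illustrative only.

```python
import math

def regression_metrics(y_true, y_pred):
    """RMSE, MSE, MAE, and R² computed from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rss = sum(e * e for e in errors)                # residual sum of squares
    mse = rss / n
    mae = sum(abs(e) for e in errors) / n
    y_bar = sum(y_true) / n
    tss = sum((t - y_bar) ** 2 for t in y_true)     # total sum of squares
    r2 = 1 - rss / tss if tss > 0 else float("nan")
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae, "R2": r2}

# Hypothetical observed and predicted weekly counts
m = regression_metrics([10, 12, 15, 11], [9, 14, 15, 10])
```

R² becomes zero or negative when the model does no better than predicting the mean of the observed series, which is the condition used by the non-positive R² exclusion rule.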
Assess normality of paired RMSE differences using the Shapiro–Wilk test. Apply a paired t-test when normality is supported; use the Wilcoxon signed-rank test otherwise. Report statistical significance at p < 0.05, with additional thresholds at p < 0.01 and p < 0.001.
Visually inspect predicted versus observed hospitalization counts to evaluate numerical accuracy and temporal alignment.
SHAP Interpretability Analysis
After training each LSTM model, compute SHAP (SHapley Additive exPlanations) values for the test set predictions.
Use SHAP values to identify the most influential predictors for each IGR.
Quantify the relative contribution of climatic variables versus digital health variables (Afya Whitebook® searches) to forecasting performance.
Perform SHAP analysis separately for the ideal-data and real-world model scenarios.
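Once SHAP values are available, the climate versus digital health comparison can be summarized as shares of mean absolute SHAP value. The sketch below assumes the values have already been reduced to a samples × features matrix (for sequence models they are typically also indexed by time step and would need to be aggregated first). The feature names, domain mapping, and numbers are placeholders, not study results.

```python
def mean_abs_shap(shap_values, feature_names):
    """Mean absolute SHAP value per feature across samples."""
    n = len(shap_values)
    return {name: sum(abs(row[j]) for row in shap_values) / n
            for j, name in enumerate(feature_names)}

def domain_shares(importance, domains):
    """Share of total mean |SHAP| attributable to each data domain."""
    totals = {}
    for name, value in importance.items():
        totals[domains[name]] = totals.get(domains[name], 0.0) + value
    grand_total = sum(totals.values())
    return {domain: value / grand_total for domain, value in totals.items()}

feature_names = ["hospitalizations", "search_rate", "temp_max_lag2", "precip_lag2"]
domains = {"hospitalizations": "hospitalization", "search_rate": "digital",
           "temp_max_lag2": "climate", "precip_lag2": "climate"}
# Placeholder SHAP matrix: 2 samples x 4 features
shap_values = [[0.5, -0.2, 0.1, 0.2],
               [-0.3, 0.4, -0.1, 0.0]]
importance = mean_abs_shap(shap_values, feature_names)
shares = domain_shares(importance, domains)
```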
Expected Outputs
8-week-ahead dengue hospitalization forecasts for 27 IGRs across Brazil.
Performance metrics (RMSE, MSE, MAE, R²) for each model variant and IGR.
SHAP-based feature importance rankings per IGR and model.
Comparative performance analysis between LSTM model variants and the SARIMAX benchmark.
Comparative analysis between ideal-data and real-world forecasting scenarios.