New Brazilian Dengue Hospitalization Dataset Offers Weekly Granularity for AI Forecasting
An international team of researchers has released a publicly accessible dataset on Zenodo (DOI 10.5281/zenodo.18189192) that aggregates municipal-level dengue hospitalization records across Brazil. The data span 1999 through 2021 and are intended to support more precise epidemiological forecasting with artificial intelligence models. The release, described in a preprint posted to arXiv in January 2026, seeks to address the limited temporal resolution of existing monthly datasets.
Construction and Temporal Disaggregation
The dataset, version v1.0.0, employs an interpolation protocol that converts monthly counts into epidemiological weeks, then applies a correction step so that the weekly values still add up to the original monthly totals. This keeps the aggregate counts consistent with the source records while providing weekly resolution.
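The idea of interpolating and then correcting can be illustrated with a minimal Python sketch. It is not the authors' protocol: the function name `disaggregate_monthly`, the fixed `weeks_per_month` value, and the use of simple linear interpolation are all assumptions made for illustration, and real epidemiological weeks do not align neatly with calendar months.

```python
import math


def disaggregate_monthly(monthly_totals, weeks_per_month=4):
    """Split monthly totals into weekly values that sum back to each total.

    Hypothetical helper: assumes every month holds exactly `weeks_per_month`
    weeks and uses linear interpolation between month midpoints.
    """
    n = len(monthly_totals)
    # Treat each monthly total as an average weekly rate at the month's midpoint.
    rates = [total / weeks_per_month for total in monthly_totals]
    weekly = []
    for m, total in enumerate(monthly_totals):
        raw = []
        for w in range(weeks_per_month):
            # Week midpoint measured in months, relative to the month midpoints.
            pos = m + (w + 0.5) / weeks_per_month - 0.5
            pos = max(0.0, min(n - 1.0, pos))  # hold values flat at the edges
            left = int(math.floor(pos))
            right = min(left + 1, n - 1)
            frac = pos - left
            raw.append(rates[left] * (1 - frac) + rates[right] * frac)
        # Correction step: rescale so the weeks of this month sum to its total.
        raw_sum = sum(raw)
        if raw_sum > 0:
            corrected = [value * total / raw_sum for value in raw]
        else:
            corrected = [total / weeks_per_month] * weeks_per_month
        weekly.extend(corrected)
    return weekly


# Made-up monthly counts, used only to exercise the function.
monthly = [40, 80, 20]
weekly = disaggregate_monthly(monthly)
```

Rescaling within each month is one simple way to preserve totals; the release may implement its correction differently.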
Validation Against High‑Resolution Reference Data
To assess the statistical and temporal validity of the disaggregation, the authors compared the weekly series against a high‑resolution reference dataset from the state of São Paulo collected in 2024. Three methods were evaluated: linear interpolation, jittering, and cubic spline interpolation. According to the study, cubic spline interpolation aligned most closely with the reference data and was therefore selected for the final weekly series.
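A comparison of this kind can be sketched as scoring each candidate series against the reference and keeping the one with the lowest error. The snippet below uses RMSE as the single criterion, which is an assumption; the helper names, the placeholder method labels, and the numbers are invented for illustration and do not come from the dataset or the paper.

```python
import math


def rmse(candidate, reference):
    """Root-mean-square error between two equal-length series."""
    squared = [(c - r) ** 2 for c, r in zip(candidate, reference)]
    return math.sqrt(sum(squared) / len(reference))


def best_method(candidates, reference):
    """Return the name of the lowest-RMSE candidate and all scores."""
    scores = {name: rmse(series, reference) for name, series in candidates.items()}
    return min(scores, key=scores.get), scores


# Made-up weekly values; the method names are placeholders.
reference = [10.0, 14.0, 19.0, 17.0]
candidates = {
    "method_a": [11.0, 13.0, 17.0, 18.0],
    "method_b": [8.0, 16.0, 22.0, 14.0],
    "method_c": [10.5, 14.0, 18.5, 17.0],
}
winner, scores = best_method(candidates, reference)
```

In practice a selection would likely weigh several metrics together rather than a single score.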
Comprehensive Variable Set
Beyond hospitalization counts, the release bundles a broad array of explanatory variables commonly used in epidemiological and environmental modeling. These include population density, emissions of CH₄, CO₂, and NO₂, poverty and urbanization indices, maximum temperature, mean monthly precipitation, minimum relative humidity, and geographic coordinates (latitude and longitude). All variables follow the same weekly disaggregation scheme to ensure multivariate compatibility.
Documentation and Quality Metrics
The accompanying documentation details the dataset’s provenance, structure, file formats, licensing terms, and known limitations. Quality assessments are provided using metrics such as mean absolute error (MAE), root‑mean‑square error (RMSE), coefficient of determination (R²), Kullback‑Leibler divergence (KL), Jensen‑Shannon divergence (JSD), dynamic time warping (DTW), and the Kolmogorov‑Smirnov (KS) test.
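For readers less familiar with these metrics, the following Python sketch shows textbook definitions of three of them: MAE, R², and Jensen-Shannon divergence. The function names are hypothetical, and the choices to normalize each series to sum to one and to use the natural logarithm are assumptions; the documentation may define its variants differently.

```python
import math


def mae(pred, ref):
    """Mean absolute error between two equal-length series."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def r_squared(pred, ref):
    """Coefficient of determination; undefined when `ref` is constant."""
    mean_ref = sum(ref) / len(ref)
    ss_res = sum((r - p) ** 2 for p, r in zip(pred, ref))
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)
    return 1.0 - ss_res / ss_tot


def _normalize(series):
    """Scale a non-negative series so it sums to one (assumed convention)."""
    total = sum(series)
    return [value / total for value in series]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; skips terms where p is 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(pred, ref):
    """Jensen-Shannon divergence between two series treated as distributions."""
    p, q = _normalize(pred), _normalize(ref)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)
```

With the natural logarithm, the Jensen-Shannon divergence is bounded between 0 (identical distributions) and ln 2 (non-overlapping distributions), which makes it easier to compare across series than the unbounded KL divergence.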
Intended Use Cases
According to the authors, the dataset is suited for multivariate time‑series analysis, environmental health investigations, and the development of machine‑learning and deep‑learning models aimed at forecasting dengue outbreaks. Researchers are encouraged to follow the usage recommendations outlined in the release.
This report is based on the abstract of a research paper posted to arXiv, licensed under Academic Preprint / Open Access. The full text is available via arXiv.