Exploiting synthetic data generation to enhance pollution prediction

Abstract

Pollution in urban areas is becoming a primary concern for local governments in developed nations around the globe. Large volumes of data are now collected for this purpose from smart developments equipped with atmospheric and climatic sensors. An active line of research exploits such data to extract patterns and predict pollution levels, so that countermeasures can be taken in advance and exposure to harmful concentrations avoided. However, a key issue is the lack of sufficient data, caused by incomplete smart infrastructures or sensor calibration problems. To address this, we propose exploiting synthetic data generation to enhance pollution prediction from limited data sources, specifically by extending two weeks of real measurements with up to ten additional years of data. We present a data generation approach based on Generative Adversarial Networks (GANs), with a particular model focused on generating artificial pollution data, which is then exploited using different Machine Learning (ML) algorithms. Results indicate that synthetic data further improves prediction when it is used as the base dataset for a model that is later fine-tuned on real records. Of the five data-mixing approaches evaluated, this one provides the best results for 62% of pollutants. This effect stems from the extra model robustness provided by data regularization and from better generalization capabilities, achieved by avoiding the sensor limitations of real deployments.

Publication
Applied Soft Computing, Vol. 175, PP. 113076, DOI: https://doi.org/10.1016/j.asoc.2025.113076