Bugs in Citizen-Science Data: Robust Biodiversity AI Begins with Clean Images (Papers Track)

Nikita Gavrilov (Fontys University of Applied Science); Gerard Schouten (Fontys University of Applied Science); Georgiana Manolache (Fontys University of Applied Science)

Earth Observation & Monitoring; Climate Science & Modeling; Local and Indigenous Knowledge Systems; Meta- and Transfer Learning

Abstract

Despite bold claims that AI will accelerate scientific discovery, domains like climate change research still face challenges in learning from real-world data. We propose a data preprocessing pipeline that addresses a key bottleneck in biodiversity monitoring: the lack of standardized image quality control in large-scale species datasets. As climate change drives shifts in ecosystems, accurate species identification is critical. Yet citizen science images, though rich in species diversity, are often noisy and inconsistent. We systematically filter such data using classical heuristics and Vision-Language Model (VLM)-based image quality assessment to detect poor composition, human presence, and multiple-species interference. Zero-shot benchmarks on filtered datasets of visually similar plant species, run with state-of-the-art foundation models fine-tuned for biodiversity, demonstrate that data quality significantly affects AI reliability. With this work, we highlight a core limitation in biodiversity AI and encourage broader exploration of quality-related bottlenecks in biodiversity monitoring. Code is available at the project website.
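
To make the "classical heuristics" stage concrete, the following is a minimal sketch of one way such a filter could look. The abstract does not say which heuristics the pipeline uses, so the two checks here (a minimum image side and a variance-of-Laplacian sharpness score) are assumptions chosen for illustration, as are the function names `laplacian_variance` and `passes_heuristics` and both threshold defaults. The VLM-based assessment is not sketched.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Sharpness score: variance of the 4-neighbour Laplacian.

    `gray` is a 2D list of grayscale intensities (rows of equal length),
    at least 3x3 so that interior pixels exist. Blurry images have weak
    edges, so their Laplacian responses cluster near zero and the
    variance is low.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def passes_heuristics(gray, min_side=224, min_sharpness=100.0):
    """Return True if the image clears both illustrative checks.

    Both thresholds are hypothetical defaults, not values from the paper.
    `min_side` should be at least 3 so the sharpness score is defined.
    """
    h = len(gray)
    w = len(gray[0]) if h else 0
    if min(h, w) < min_side:
        return False  # too small to carry reliable species detail
    return laplacian_variance(gray) >= min_sharpness
```

In a setup like this, images rejected by the cheap checks would never reach the more expensive VLM-based stage; whether the authors order the stages that way is not stated in the abstract.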