Balancing quantity and representativeness in constrained geospatial dataset design (Papers Track)

Livia Betti (University of Colorado at Boulder); Esther Rolf (CU Boulder)

Active Learning

Abstract

Effective geospatial machine learning (GeoML) relies on high-quality, large-scale datasets, yet geospatial data collection is often costly and logistically challenging. Creating new geospatial datasets frequently requires on-site labeling, such as collecting data through surveys or scientific instruments, so collection costs vary across regions or groups. To address this, we propose a sampling method that jointly maximizes dataset size and representative composition subject to cost constraints. We evaluate the method by training GeoML models on the selected subsets and comparing their performance to that of models trained on randomly sampled data. We find that our method improves performance over standard data collection baselines. These findings provide guidance on when to prioritize representation over dataset size, and vice versa, and highlight the need for further research into how sampling strategies can enhance model performance.
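The trade-off described in the abstract can be illustrated with a toy budget allocation. This is only a sketch under stated assumptions, not the authors' sampling method: the function name `allocate`, the blend parameter `lam`, and the convex-combination rule are all hypothetical, chosen to show how per-group costs pull a fixed budget toward either a larger or a more representative sample.

```python
def allocate(costs, shares, budget, lam):
    """Toy allocation of a labeling budget across groups (hypothetical, not the paper's method).

    costs:  per-sample collection cost for each group
    shares: each group's share of the target population (sums to 1)
    lam:    1.0 = fully representative, 0.0 = maximize sample count
    """
    # Representative endpoint: sample each group in proportion to its share,
    # scaled so the total cost equals the budget.
    n_rep_total = budget / sum(s * c for s, c in zip(shares, costs))
    representative = [s * n_rep_total for s in shares]

    # Size-maximizing endpoint: spend the whole budget on the cheapest group.
    cheapest = min(range(len(costs)), key=lambda g: costs[g])
    largest = [budget / costs[g] if g == cheapest else 0.0
               for g in range(len(costs))]

    # Both endpoints cost exactly `budget`, so any convex blend does too;
    # truncating to whole samples can only lower the cost.
    blend = [lam * r + (1.0 - lam) * b for r, b in zip(representative, largest)]
    return [int(x) for x in blend]


def total_cost(counts, costs):
    """Cost of collecting `counts[g]` samples from each group g."""
    return sum(n * c for n, c in zip(counts, costs))
```

With two equally sized groups, one of which is four times as expensive to sample, `allocate([1.0, 4.0], [0.5, 0.5], 100.0, lam)` moves from 100 samples drawn entirely from the cheap group (`lam = 0.0`) to 40 samples split evenly (`lam = 1.0`), which is the quantity-versus-representativeness tension the abstract refers to.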