Automating the creation of LULC datasets for semantic segmentation (Tutorials Track)

Sambhav S Rohatgi (; Anthony Mucia (

Tutorial Slides PDF Recorded Talk NeurIPS 2022 Poster Topia Link Cite
Computer Vision & Remote Sensing Cities & Urban Planning Climate Science & Modeling Forests Land Use


High resolution and accurate Land Use and Land Cover mapping (LULC) datasets are increasingly important and can be widely used in monitoring climate change impacts in agriculture, deforestation, and the carbon cycle. These datasets represent physical classifications of land types and spatial information over the surface of the Earth. These LULC datasets can be leveraged in a plethora of research topics and industries to mitigate and adapt to environmental changes. High resolution urban mappings can be used to better monitor and estimate building albedo and urban heat island impacts, and accurate representation of forests and vegetation can even be leveraged to better monitor the carbon cycle and climate change through improved land surface modelling. The advent of machine learning (ML) based CV techniques over the past decade provides a viable option to automate LULC mapping. One impediment to this has been the lack of large ML datasets. Large vector datasets for LULC are available, but can’t be used directly by ML practitioners due to a knowledge gap in transforming the input into a dataset of paired satellite images and segmentation masks. We demonstrate a novel end-to-end pipeline for LULC dataset creation that takes vector land cover data and provides a training-ready dataset. We will use Sentinel-2 satellite imagery and the European Urban Atlas LULC data. The pipeline manages everything from downloading satellite data, to creating and storing encoded segmentation masks and automating data checks. We then use the resulting dataset to train a semantic segmentation model. The aim of the pipeline is to provide a way for users to create their own custom datasets using various combinations of multispectral satellite and vector data. In addition to presenting the pipeline, we aim to provide an introduction to multispectral imagery, geospatial data and some of the challenges in using it for ML.

Recorded Talk (direct link)