Data Mining

Workshop Papers

Venue Title
ICLR 2024 Using expired weather forecasts to supply 10 000y of data for accurate planning of a renewable European energy system (Papers Track)

Abstract: Expanding renewable energy generation and electrifying heating to address climate change will heighten the exposure of our power systems to the variability of weather. Planning and assessing these future systems typically lean on past weather data. We spotlight the pitfalls of this approach, chiefly its reliance on what we claim is a limited weather record, and propose a novel approach: to evaluate these systems on two orders of magnitude more weather scenarios. By repurposing past ensemble weather predictions, we not only drastically expand the known weather distribution (notably its extreme tails) for traditional power system modeling but also unveil its potential to enable data-intensive self-supervised, diffusion-based and optimization ML techniques. Building on our methodology, we introduce a dataset collected from ECMWF ENS forecasts, encompassing power-system relevant variables over Europe, and detail the intricate process behind its assembly.

Authors: Petr Dolezal (AI4ER CDT, University of Cambridge); Emily Shuckburgh (University of Cambridge)

ICLR 2024 Reconstructing the Breathless Ocean with Spatio-Temporal Graph Learning (Papers Track)

Abstract: The ocean is currently undergoing severe deoxygenation. Accurately reconstructing the breathless ocean is crucial for assessing and protecting marine ecosystems in response to climate change. Existing expert-dominated numerical simulations fail to keep up with the dynamic variation caused by global warming and human activities. Moreover, because data collection is costly, historical observations are severely sparse, which poses a major challenge for precise reconstruction. In this work, we propose OxyGenerator, the first spatio-temporal graph learning model, to reconstruct the global ocean deoxygenation from 1920 to 2023. Specifically, to address the heterogeneity across large temporal and spatial scales, we propose zoning-varying graph message-passing to capture the complex oceanographic correlations between missing values and sparse observations. Additionally, to further calibrate the uncertainty, we incorporate inductive bias from dissolved oxygen (DO) variations and chemical effects. Compared with in-situ DO observations, OxyGenerator significantly outperforms CMIP6 numerical simulations, reducing MAPE by 38.77%, demonstrating promising potential for understanding ocean deoxygenation in a data-driven manner.

Authors: Bin Lu (Shanghai Jiao Tong University); Ze Zhao (Shanghai Jiao Tong University); Luyu Han (Shanghai Jiao Tong University); Xiaoying Gan (Shanghai Jiao Tong University); Yuntao Zhou (Shanghai Jiao Tong University); Lei Zhou (Shanghai Jiao Tong University); Luoyi Fu (Shanghai Jiao Tong University); Xinbing Wang (Shanghai Jiao Tong University); Chenghu Zhou (Institute of Geographic Sciences and Natural Resources Research, CAS); Jing Zhang (Shanghai Jiao Tong University)
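
Note on the metric: the abstract above reports its result as a reduction in MAPE (mean absolute percentage error). As a reminder of how that metric is usually computed, here is a minimal sketch; the function name and the sample values are made up for illustration and are not taken from the paper.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(observed) != len(predicted):
        raise ValueError("inputs must have the same length")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    total = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


# Made-up dissolved-oxygen values, for illustration only.
obs = [200.0, 250.0, 100.0]
pred = [180.0, 275.0, 110.0]
score = mape(obs, pred)  # about 10, since every prediction is off by 10%
```

Because MAPE divides by the observed value, it is undefined where an observation is zero, which is why the sketch rejects such inputs.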

ICLR 2024 Model Failure or Data Corruption? Exploring Inconsistencies in Building Energy Ratings with Self-Supervised Contrastive Learning (Papers Track)

Abstract: Building Energy Rating (BER) stands as a pivotal metric, enabling building owners, policymakers, and urban planners to understand the energy-saving potential through improving building energy efficiency. As such, enhancing buildings' BER levels is expected to directly contribute to the reduction of carbon emissions and promote climate improvement. Nonetheless, the BER assessment process is vulnerable to missing and inaccurate measurements. In this study, we introduce CLEAR, a data-driven approach designed to scrutinize the inconsistencies in BER assessments through self-supervised contrastive learning. We validated the effectiveness of CLEAR using a dataset representing Irish building stocks. Our experiments uncovered evidence of inconsistent BER assessments, highlighting measurement data corruption within this real-world dataset.

Authors: Qian Xiao (Trinity College Dublin); Dan Liu (Trinity College Dublin); Kevin Credit (Maynooth University)

ICLR 2024 Global Vegetation Modeling With Pre-Trained Weather Transformers (Papers Track)

Abstract: Accurate vegetation models can produce further insights into the complex interaction between vegetation activity and ecosystem processes. Previous research has established that long-term trends and short-term variability of temperature and precipitation affect vegetation activity. Motivated by the recent success of Transformer-based Deep Learning models for medium-range weather forecasting, we adapt the publicly available pre-trained FourCastNet to model vegetation activity while accounting for the short-term dynamics of climate variability. We investigate how the learned global representation of the atmosphere’s state can be transferred to model the normalized difference vegetation index (NDVI). Our model globally estimates vegetation activity at a resolution of 0.25° while relying only on meteorological data. We demonstrate that leveraging pre-trained weather models improves the NDVI estimates compared to learning an NDVI model from scratch. Additionally, we compare our results to other recent data-driven NDVI modeling approaches from machine learning and ecology literature. We further provide experimental evidence on how much data and training time is necessary to turn FourCastNet into an effective vegetation model. Code and models are available at

Authors: Pascal Janetzky (University Wuerzburg); Florian Gallusser (Universität Würzburg); Simon Hentschel (Julius-Maximilians-Universität of Würzburg); Andreas Hotho (University of Wuerzburg); Anna Krause (Universität Würzburg, Department of Computer Science, Chair X Data Science)

ICLR 2024 Valuation and Profit Allocation for Electric Vehicle Battery Data in a Data Market (Proposals Track)

Abstract: This paper delves into the realm of electric vehicle (EV) battery data trading markets, focusing on data valuation and revenue allocation. In the face of fast-developing electric mobility, the safety of EV batteries becomes more and more important, driving the need for robust anomaly detection models. For newly founded EV companies lacking extensive data, data markets offer a solution, facilitated by trading platforms. We map this landscape and outline a transaction process involving data buyers, data sellers, and platforms. Our exploration extends to data valuation methodologies, encompassing the classic Shapley value and the least core algorithm. Considering the complicated mechanisms in EV batteries, we unveil a deep learning framework for anomaly detection, treating EV batteries as dynamic systems. To explain data value from an economic perspective, we utilize a utility function considering the direct economic costs saved for the EV company to refine the evaluation process. Based on data value, we further propose revenue allocation schemes to allocate part of the EV company's revenue to data sellers, offering diverse perspectives on fair and equitable profit distribution. A case study is conducted based on a real-world EV battery dataset to illustrate how the different revenue allocation schemes allocate payoffs to data sellers.

Authors: Junkang Chen (Peking University); Guannan He (Peking University)
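
Note on the method: the abstract above names the classic Shapley value as one of its data valuation methods. The sketch below shows the textbook definition (averaging each seller's marginal contribution over all orderings) on a toy example. The seller names and the utility table are hypothetical and are not the paper's data or utility function.

```python
from itertools import permutations


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical utility: value to the buyer of a model trained on the
# pooled data of each coalition of sellers (made-up numbers).
TOY_UTILITY = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.5,
    frozenset({"B"}): 0.4,
    frozenset({"C"}): 0.1,
    frozenset({"A", "B"}): 0.7,
    frozenset({"A", "C"}): 0.6,
    frozenset({"B", "C"}): 0.5,
    frozenset({"A", "B", "C"}): 0.8,
}

values = shapley_values(["A", "B", "C"], lambda s: TOY_UTILITY[s])
```

The values sum to the utility of the full coalition, which is the property that makes them usable as a revenue split. Exact enumeration grows factorially with the number of sellers, so realistic markets would need sampling-based approximations.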

ICLR 2024 Planning for Floods & Droughts: Intro to AI-Driven Hydrological Modeling (Tutorials Track)

Abstract: This tutorial presents an AI-driven hydrological modeling approach to advance predictions of extreme hydrological events, including floods and droughts, which are of significant socioeconomic concern. Traditionally, physics-based hydrological models have been the mainstay for simulating rainfall-runoff dynamics and forecasting streamflow. These models, while effective, are constrained by limitations in our systematic understanding and an inability to incorporate heterogeneous data. Recently, the surge in availability of multi-scale, multi-modal hydrological data has spurred the adoption of data-driven machine learning (ML) techniques. These methods have shown promising predictive performance. However, they often struggle with generalization and reliability, especially under climate change. This tutorial introduces physics-informed ML, which leverages both data and domain knowledge to improve prediction accuracy and trustworthiness. We will delve into uncertainty quantification methods for probabilistic predictions that are vital for climate-resilient planning in managing floods and droughts. Participants will be guided through a comprehensive workflow, encompassing data analysis, model construction, and model evaluation. This tutorial is designed to elevate researchers’ understanding of hydrological systems and provide practitioners with robust, climate-resilient water management tools. These tools are instrumental in facilitating informed decision-making, crucial in the context of climate adaptation strategies. Participants will learn:
● Heterogeneous climate and hydrology data analysis
● State-of-the-art neural network models for rainfall-runoff modeling
● ML model construction, training, validation, and testing
● Multiple ways to build a physics-informed ML model
● Uncertainty quantification in ML model predictions
All code and data will be publicly available for researchers/practitioners to build their own models.

Authors: Kshitij Tayal (Oak Ridge National Lab); Arvind Renganathan (University of Minnesota); Dan Lu (Oak Ridge National Laboratory)

NeurIPS 2023 Understanding Opinions Towards Climate Change on Social Media (Papers Track)

Abstract: Social media platforms such as Twitter (now known as X) have revolutionized how the public engage with important societal and political topics. Recently, climate change discussions on social media became a catalyst for political polarization and the spreading of misinformation. In this work, we aim to understand how real world events influence the opinions of individuals towards climate change related topics on social media. To this end, we extracted and analyzed a dataset of 13.6 million tweets sent by 3.6 million users from 2006 to 2019. Then, we construct a temporal graph from the user-user mentions network and utilize the Louvain community detection algorithm to analyze the changes in community structure around Conference of the Parties on Climate Change (COP) events. Next, we also apply tools from the Natural Language Processing literature to perform sentiment analysis and topic modeling on the tweets. Our work acts as a first step towards understanding the evolution of pro-climate change communities around COP events. Answering these questions helps us understand how to raise people's awareness towards climate change thus hopefully calling on more individuals to join the collaborative effort in slowing down climate change.

Authors: Yashaswi Pupneja (University of Montreal); Yuesong Zou (McGill University); Sacha Levy (Yale University); Shenyang Huang (Mila/McGill University)

ICLR 2023 Accuracy is not the only Metric that matters: Estimating the Energy Consumption of Deep Learning Models (Papers Track)

Abstract: Modern machine learning models have started to consume incredible amounts of energy, thus incurring large carbon footprints (Strubell et al., 2019). To address this issue, we have created an energy estimation pipeline, which allows practitioners to estimate the energy needs of their models in advance, without actually running or training them. We accomplished this by collecting high-quality energy data and building a first baseline model capable of predicting the energy consumption of DL models by accumulating their estimated layer-wise energies.

Authors: Johannes Getzner (Technical University of Munich); Bertrand Charpentier (Technical University of Munich); Stephan Günnemann (Technical University of Munich)

ICLR 2023 A High-Resolution, Data-Driven Model of Urban Carbon Emissions (Papers Track) Best Pathway to Impact

Abstract: Cities represent both a fundamental contributor to greenhouse gas (GHG) emissions and a catalyst for climate action. Many global cities have outlined sustainability and climate change mitigation plans, focusing on energy efficiency, shifting away from fossil fuels, and prioritizing environmental and social justice. To achieve broad-based and equitable carbon emissions reductions and sustainability goals, new data-driven methodologies are needed to identify and target efficiency and carbon reduction opportunities in the built environment at the building, neighborhood, and city-scale. Our methodology integrates data from numerous data sources and develops data-driven and physical models of energy use and carbon emissions from buildings and transportation to generate a high spatiotemporal resolution model of urban greenhouse gas emissions. The method and data tool are designed to support city leaders and urban policymakers with an unprecedented view of localized carbon emissions to enable data-driven and evidence-based climate action.

Authors: Bartosz Bonczak (New York University); Boyeong Hong (New York University); Constantine E. Kontokosta (New York University)

ICLR 2023 Mining Effective Strategies for Climate Change Communication (Papers Track)

Abstract: With the goal of understanding effective strategies to communicate about climate change, we build interpretable models to rank tweets related to climate change with respect to the engagement they generate. Our models are based on the Bradley-Terry model of pairwise comparison outcomes and use a combination of the tweets’ topic and metadata features to do the ranking. To remove confounding factors related to author popularity and minimise noise, they are trained on pairs of tweets that are from the same author and around the same time period and have a sufficiently large difference in engagement. The models achieve good accuracy on a held-out set of pairs. We show that we can interpret the parameters of the trained model to identify the topic and metadata features that contribute to high engagement. Among other observations, we see that topics related to climate projections, human cost and deaths tend to have low engagement while those related to mitigation and adaptation strategies have high engagement. We hope the insights gained from this study will help craft effective climate communication to promote engagement, thereby lending strength to efforts to tackle climate change.

Authors: Aswin Suresh (EPFL); Lazar Milikic (EPFL); Francis Murray (EPFL); Yurui Zhu (EPFL); Matthias Grossglauser (École Polytechnique Fédérale de Lausanne (EPFL))
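
Note on the method: the abstract above builds on the Bradley-Terry model of pairwise comparison outcomes. The sketch below shows one common feature-based form of that model, in which an item's log-strength is a weighted sum of its features and the probability that item i beats item j is a logistic function of the difference. The two features and the toy pairs are hypothetical, and the plain gradient-ascent fit is an assumption rather than the authors' training procedure.

```python
import math


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def win_probability(w, x_i, x_j):
    """Bradley-Terry probability that item i beats item j (log-strength = w . x)."""
    return 1.0 / (1.0 + math.exp(-(dot(w, x_i) - dot(w, x_j))))


def fit(pairs, n_features, lr=0.5, epochs=500):
    """Fit weights by gradient ascent on the mean log-likelihood.

    pairs: list of (winner_features, loser_features).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for win, lose in pairs:
            p = win_probability(w, win, lose)
            for k in range(n_features):
                grad[k] += (1.0 - p) * (win[k] - lose[k])
        w = [wk + lr * gk / len(pairs) for wk, gk in zip(w, grad)]
    return w


# Hypothetical features: [mentions_mitigation, has_image].
# Each pair is (more engaging tweet, less engaging tweet).
pairs = [
    ([1, 0], [0, 0]),
    ([1, 0], [0, 0]),
    ([1, 0], [0, 0]),
    ([0, 0], [1, 0]),
    ([0, 1], [0, 0]),
    ([0, 0], [0, 1]),
]
weights = fit(pairs, n_features=2)
p_mitigation_wins = win_probability(weights, [1, 0], [0, 0])
```

In this toy data the first feature wins three of its four comparisons and the second wins one of two, so the fitted weight is positive for the first and stays at zero for the second. Reading off the sign and size of each weight is the kind of interpretation the abstract describes.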

ICLR 2023 Graph-Based Deep Learning for Sea Surface Temperature Forecasts (Papers Track)

Abstract: Sea surface temperature (SST) forecasts help with managing the marine ecosystem and the aquaculture impacted by anthropogenic climate change. Numerical dynamical models are resource intensive for SST forecasts; machine learning (ML) models could reduce high computational requirements and have recently been a focus of the research community. ML models normally require a large amount of data for training. Environmental data are collected on regularly-spaced grids, so early work mainly used grid-based deep learning (DL) for prediction. However, both grid data and the corresponding DL approaches have inherent problems. As geometric DL has emerged, graphs as a more generalized data structure and graph neural networks (GNNs) have been introduced to the spatiotemporal domains. In this work, we preliminarily explored graph re-sampling and GNNs for global SST forecasts, and GNNs show better one-month-ahead SST prediction than the persistence model in most oceans in terms of root mean square errors.

Authors: Ding Ning (University of Canterbury); Varvara Vetrova (University of Canterbury); Karin Bryan (University of Waikato)
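
Note on the baseline: the abstract above compares against the persistence model in terms of root mean square error. A persistence forecast simply repeats the most recent observation as the prediction for the target step. The sketch below scores such a baseline on a made-up monthly series; the values and function names are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def persistence_forecast(series, lead=1):
    """Predict the value at step t + lead with the value observed at step t."""
    return series[:-lead]


# Made-up monthly SST values in degrees Celsius, for illustration only.
sst = [18.0, 18.5, 19.5, 19.0, 18.0]
lead = 1
targets = sst[lead:]
baseline = persistence_forecast(sst, lead=lead)
baseline_rmse = rmse(targets, baseline)
```

Any learned forecaster evaluated on the same targets can then be compared against `baseline_rmse`; a model is only useful at a given lead time if it scores lower than this trivial baseline.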

ICLR 2023 Uncovering the Spatial and Temporal Variability of Wind Resources in Europe: A Web-Based Data-Mining Tool (Papers Track)

Abstract: We introduce, a web-based data-mining visualization tool of the spatial and temporal variability of wind resources. It uses the latest open-access dataset of the daily wind capacity factor in 28 European countries between 1979 and 2019 and proposes several user-configurable visualizations of the temporal and spatial variations of the wind power capacity factor. The platform allows for a deep analysis of the distribution, the cross-country correlation, and the drivers of low wind power events. It offers an easy-to-use interface that makes it suitable for the needs of researchers and stakeholders. The tool is expected to be useful in identifying areas of high wind potential and possible challenges that may impact the large-scale deployment of wind turbines in Europe. Particular importance is given to the visualization of low wind power events and to the potential of cross-border cooperation in mitigating the variability of wind in the context of increasing reliance on weather-sensitive renewable energy sources.

Authors: Alban Puech (École Polytechnique); Jesse Read (Ecole Polytechnique)

NeurIPS 2022 Temperature impacts on hate speech online: evidence from four billion tweets (Papers Track)

Abstract: Human aggression is no longer limited to the physical space but exists in the form of hate speech on social media. Here, we examine the effect of temperature on the occurrence of hate speech on Twitter and interpret the results in the context of climate change, human behavior and mental health. Employing supervised machine learning models, we identify hate speech in a data set of four billion geolocated tweets from over 750 US cities (2014 – 2020). We statistically evaluate the changes in daily hate tweets against changes in local temperature, isolating the temperature influence from confounding factors using binned panel-regression models. We find a low prevalence of hate tweets in moderate temperatures and observe sharp increases of up to 12% for colder and up to 22% for hotter temperatures, indicating that not only hot but also cold temperatures increase aggressive tendencies. Further, we observe that for extreme temperatures hate speech also increases as a percentage of total tweeting activity, crowding out non-hate speech. The quasi-quadratic shape of the temperature-hate tweet curve is robust across varying climate zones, income groups, religious and political beliefs. The prevalence of the results across climatic and socioeconomic splits points to limits in adaptation. Our results illuminate hate speech online as an impact channel through which temperature alters societal aggression.

Authors: Annika Stechemesser (Potsdam Institute for Climate Impact Research); Anders Levermann (Potsdam Institute for Climate Impact Research); Leonie Wenz (Potsdam Institute for Climate Impact Research)

NeurIPS 2022 A Global Classification Model for Cities using ML (Papers Track)

Abstract: This paper develops a novel data set on the use of three key resources, namely food, water, and energy, for 9000 cities globally. The data set is then utilized to develop a clustering approach as a starting point towards a global classification model. This novel clustering approach aims to contribute to developing an inclusive view of resource efficiency for all urban centers globally. The proposed clustering algorithm consists of three steps: first, outlier detection to address specific city characteristics, then a Variational Autoencoder (VAE), and finally, Agglomerative Clustering (AC) to improve the classification results. Our results show that this approach is more robust and yields better results in creating delimited clusters with high Calinski-Harabasz Index scores and Silhouette Coefficient than other baseline clustering methods.

Authors: Doron Hazan (MIT); Mohamed Habashy (Massachusetts Institute of Technology); Mohanned ElKholy (Massachusetts Institute of Technology); Omer Mousa (American University in Cairo); Norhan M Bayomi (MIT Environmental Solutions Initiative); Matias Williams (Massachusetts Institute of Technology); John Fernandez (Massachusetts Institute of Technology)

NeurIPS 2022 Learning Surrogates for Diverse Emission Models (Papers Track)

Abstract: Transportation plays a major role in global CO2 emission levels, a factor that directly connects with climate change. Roadway interventions that reduce CO2 emission levels have thus become a timely requirement. An integral need in assessing the impact of such roadway interventions is access to industry-standard programmatic and instantaneous emission models with various emission conditions such as fuel types, vehicle types, cities of interest, etc. However, currently, there is a lack of well-calibrated emission models with all these properties. Addressing these limitations, this paper presents 1100 programmatic and instantaneous vehicular CO2 emission models with varying fuel types, vehicle types, road grades, vehicle ages, and cities of interest. We hope the presented emission models will facilitate future research in tackling transportation-related climate impact. The released version of the emission models can be found here.

Authors: Edgar Ramirez Sanchez (MIT); Catherine H Tang (Massachusetts Institute of Technology); Vindula Jayawardana (MIT); Cathy Wu (MIT)

NeurIPS 2022 Modelling the performance of delivery vehicles across urban micro-regions to accelerate the transition to cargo-bike logistics (Proposals Track)

Abstract: Light goods vehicles (LGV) used extensively in the last mile of delivery are one of the leading polluters in cities. Cargo-bike logistics has been put forward as a high impact candidate for replacing LGVs, with experts estimating over half of urban van deliveries being replaceable by cargo bikes, due to their faster speeds, shorter parking times and more efficient routes across cities. By modelling the relative delivery performance of different vehicle types across urban micro-regions, machine learning can help operators evaluate the business and environmental impact of adding cargo-bikes to their fleets. In this paper, we introduce two datasets, and present initial progress in modelling urban delivery service time (e.g. cruising for parking, unloading, walking). Using Uber’s H3 index to divide the cities into hexagonal cells, and aggregating OpenStreetMap tags for each cell, we show that urban context is a critical predictor of delivery performance.

Authors: Max C Schrader (University of Alabama); Navish Kumar (IIT Kharagpur); Nicolas Collignon (University of Edinburgh); Maria S Astefanoaei (IT University of Copenhagen); Esben Sørig (Kale Collective); Soonmyeong Yoon (Kale Collective); Kai Xu (University of Edinburgh); Akash Srivastava (MIT-IBM)

NeurIPS 2022 Forecasting Global Drought Severity and Duration Using Deep Learning (Proposals Track)

Abstract: Drought detection and prediction are challenging due to the slow onset of the event and varying degrees of dependence on numerous physical and socio-economic factors that differentiate droughts from other natural disasters. In this work, we propose DeepXD (Deep learning for Droughts), a deep learning model with 26 physics-informed input features for SPI (Standardised Precipitation Index) forecasting to identify and classify droughts using monthly oceanic indices, global meteorological and vegetation data, location (latitude, longitude) and land cover for the years 1982 to 2018. In our work, we propose extracting features by considering the atmosphere and land moisture and energy budgets and forecasting global droughts on a seasonal and an annual scale at 1, 3, 6, 9, 12 and 24 months lead times. SPI helps us to identify the severity and the duration of the drought to classify them as meteorological, agricultural and hydrological.

Authors: Akanksha Ahuja (NOA); Xin Rong Chua (Centre for Climate Research Singapore)

NeurIPS 2022 Personalizing Sustainable Agriculture with Causal Machine Learning (Proposals Track) Best Paper: Proposals

Abstract: To fight climate change and accommodate the increasing population, global crop production has to be strengthened. To achieve the "sustainable intensification" of agriculture, transforming it from carbon emitter to carbon sink is a priority, and understanding the environmental impact of agricultural management practices is a fundamental prerequisite to that. At the same time, the global agricultural landscape is deeply heterogeneous, with differences in climate, soil, and land use inducing variations in how agricultural systems respond to farmer actions. The "personalization" of sustainable agriculture with the provision of locally adapted management advice is thus a necessary condition for the efficient uplift of green metrics, and an integral development in imminent policies. Here, we formulate personalized sustainable agriculture as a Conditional Average Treatment Effect estimation task and use Causal Machine Learning for tackling it. Leveraging climate data, land use information and employing Double Machine Learning, we estimate the heterogeneous effect of sustainable practices on the field-level Soil Organic Carbon content in Lithuania. We thus provide a data-driven perspective for targeting sustainable practices and effectively expanding the global carbon sink.

Authors: Georgios Giannarakis (National Observatory of Athens); Vasileios Sitokonstantinou (National Observatory of Athens); Roxanne Suzette Lorilla (National Observatory of Athens); Charalampos Kontoes (National Observatory of Athens)

NeurIPS 2022 Disaster Risk Monitoring Using Satellite Imagery (Tutorials Track)

Abstract: Natural disasters such as flood, wildfire, drought, and severe storms wreak havoc throughout the world, causing billions of dollars in damages, and uprooting communities, ecosystems, and economies. Unfortunately, flooding events are on the rise due to climate change and sea level rise. The ability to detect and quantify them can help us minimize their adverse impacts on the economy and human lives. Using satellites to study floods is advantageous since physical access to flooded areas is limited and deploying instruments in potential flood zones can be dangerous. We are proposing a hands-on tutorial to highlight the use of satellite imagery and computer vision to study natural disasters. Specifically, we aim to demonstrate the development and deployment of a flood detection model using Sentinel-1 satellite data. The tutorial will cover relevant fundamental concepts as well as the full development workflow of a deep learning-based application. We will include important considerations such as common pitfalls, data scarcity, augmentation, transfer learning, fine-tuning, and details of each step in the workflow. Importantly, the tutorial will also include a case study on how the application was used by authorities in response to a flood event. We believe this tutorial will enable machine learning practitioners of all levels to develop new technologies that tackle the risks posed by climate change. We expect to deliver the following learning outcomes:
• Develop various deep learning-based computer vision solutions using hardware-accelerated open-source tools that are optimized for real-time deployment
• Create an optimized pipeline for the machine learning development workflow
• Understand different performance metrics for model evaluation that are relevant for real world datasets and data imbalances
• Understand the public sector’s efforts to support climate action initiatives and point out where the audience can contribute

Authors: Kevin Lee (NVIDIA); Siddha Ganju (NVIDIA); Edoardo Nemni (UNOSAT)

NeurIPS 2022 Machine Learning for Predicting Climate Extremes (Tutorials Track)

Abstract: Climate change has led to a rapid increase in the occurrence of extreme weather events globally, including floods, droughts, and wildfires. In the longer term, some regions will experience aridification while others will risk sinking due to rising sea levels. Typically, such predictions are done via weather and climate models that simulate the physical interactions between the atmospheric, oceanic, and land surface processes that operate at different scales. Due to the inherent complexity, these climate models can be inaccurate or computationally expensive to run, especially for detecting climate extremes at high spatiotemporal resolutions. In this tutorial, we aim to introduce the participants to machine learning approaches for addressing two fundamental challenges. We will walk the participants through a hands-on tutorial for predicting climate extremes relating to temperature and precipitation in 2 setups: (1) temporal forecasting: the goal is to predict climate variables into the future (both direct single step approaches and iterative approaches that roll out the model for several timesteps), and (2) spatial downscaling: the goal is to learn a mapping that transforms low-resolution outputs of climate models into high-resolution regional forecasts. Through introductory presentations and colab notebooks, we aim to expose the participants to (a) APIs for accessing and navigating popular repositories that host global climate data, such as the Copernicus data store, (b) identifying relevant datasets, including auxiliary data (e.g., other climate variables such as geopotential), (c) scripts for downloading and preprocessing relevant datasets, (d) algorithms for training machine learning models, (e) metrics for evaluating model performance, and (f) visualization tools for both the dataset and predicted outputs. The coding notebooks will be in Python. No prior knowledge of climate science is required.

Authors: Hritik Bansal (UCLA); Shashank Goel (University of California Los Angeles); Tung Nguyen (University of California, Los Angeles); Aditya Grover (UCLA)

AAAI FSS 2022 NADBenchmarks - a compilation of Benchmark Datasets for Machine Learning Tasks related to Natural Disasters

Abstract: Climate change has increased the intensity, frequency, and duration of extreme weather events and natural disasters across the world. While the increased data on natural disasters improves the scope of machine learning (ML) for this field, progress is relatively slow. One bottleneck is the lack of benchmark datasets that would allow ML researchers to quantify their progress against a standard metric. The objective of this short paper is to explore the state of benchmark datasets for ML tasks related to natural disasters, categorizing the datasets according to the disaster management cycle. We compile a list of existing benchmark datasets that have been introduced in the past five years. We propose a web platform where researchers can search for benchmark datasets in this domain, and develop a preliminary version of such a platform using our compiled list. This paper is intended to aid researchers in finding benchmark datasets to train their ML models on, and provide general directions for topics where they can contribute new benchmark datasets.

Authors: Adiba Proma (University of Rochester), Md Saiful Islam (University of Rochester), Stela Ciko (University of Rochester), Raiyan Abdul Baten (University of Rochester) and Ehsan Hoque (University of Rochester)

AAAI FSS 2022 Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning

Abstract: Applying machine learning to domains like Earth Sciences is impeded by the lack of labeled data, despite a large corpus of raw data available in such domains. For instance, training a wildfire classifier on satellite imagery requires curating a massive and diverse dataset, which is an expensive and time-consuming process that can span from weeks to months. Searching for relevant examples in over 40 petabytes of unlabelled data requires researchers to manually hunt for such images, much like finding a needle in a haystack. We present a no-code end-to-end pipeline, Curator, which dramatically minimizes the time taken to curate an exhaustive labeled dataset. Curator is able to search massive amounts of unlabelled data by combining self-supervision, scalable nearest neighbor search, and active learning to learn and differentiate image representations. The pipeline can also be readily applied to solve problems across different domains. Overall, the pipeline makes it practical for researchers to go from just one reference image to a comprehensive dataset in a short span of time.

Authors: Tarun Narayanan (SpaceML), Ajay Krishnan (SpaceML), Anirudh Koul (Pinterest, SpaceML, FDL) and Siddha Ganju (NVIDIA, SpaceML, FDL)
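
The idea of going from one reference image to similar candidates via nearest neighbor search over learned representations can be sketched as follows. This is an illustrative brute-force version, not Curator's implementation: the embedding vectors, the tile names, and the helper `nearest_neighbors` are hypothetical, and a pipeline at the scale described would presumably rely on an approximate, scalable index rather than an exhaustive scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query: Sequence[float],
                      index: Dict[str, List[float]],
                      k: int) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query (exhaustive scan)."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical 2-D embeddings; real self-supervised embeddings would
# have hundreds of dimensions.
embeddings = {
    "tile_a": [1.0, 0.0],
    "tile_b": [0.8, 0.6],
    "tile_c": [0.0, 1.0],
}
reference = [1.0, 0.1]  # embedding of the single reference image

top = nearest_neighbors(reference, embeddings, k=2)
print([name for name, _ in top])
```

In an active-learning loop, the retrieved candidates would then be shown to a person for labeling, and the labels used to refine the next round of retrieval.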

AAAI FSS 2022 Rethinking Machine Learning for Climate Science: A Dataset Perspective

Abstract: The growing availability of data sources is a predominant factor enabling the widespread success of machine learning (ML) systems across a wide range of applications. Typically, training data in such systems constitutes a source of ground-truth, such as measurements about a physical object (e.g., natural images) or a human artifact (e.g., natural language). In this position paper, we take a critical look at the validity of this assumption for datasets for climate science. We argue that many such climate datasets are uniquely biased due to the pervasive use of external simulation models (e.g., general circulation models) and proxy variables (e.g., satellite measurements) for imputing and extrapolating in-situ observational data. We discuss opportunities for mitigating the bias in the training and deployment of ML systems using such datasets. Finally, we share views on improving the reliability and accountability of ML systems for climate science applications.

Authors: Aditya Grover (UCLA)

NeurIPS 2021 High-resolution rainfall-runoff modeling using graph neural network (Papers Track)

Abstract: Time-series modeling has shown great promise in recent studies using the latest deep learning algorithms such as LSTM (Long Short-Term Memory). These studies primarily focused on watershed-scale rainfall-runoff modeling or streamflow forecasting, but the majority of them only considered a single watershed as a unit. Although this simplification is very effective, it does not take into account spatial information, which could result in significant errors in large watersheds. Several studies investigated the use of GNN (Graph Neural Networks) for data integration by decomposing a large watershed into multiple sub-watersheds, but each sub-watershed is still treated as a whole, and the geoinformation contained within the watershed is not fully utilized. In this paper, we propose the GNRRM (Graph Neural Rainfall-Runoff Model), a novel deep learning model that makes full use of spatial information from high-resolution precipitation data, including flow direction and geographic information. When compared to baseline models, GNRRM has less over-fitting and significantly improves model performance. Our findings support the importance of hydrological data in deep learning-based rainfall-runoff modeling, and we encourage researchers to include more domain knowledge in their models.

Authors: Zhongrun Xiang (University of Iowa); Ibrahim Demir (The University of Iowa)

NeurIPS 2021 Towards Automatic Transformer-based Cloud Classification and Segmentation (Papers Track)

Abstract: Clouds have been demonstrated to have a huge impact on the energy balance, temperature, and weather of the Earth. Classification and segmentation of clouds and coverage factors are crucial for climate modelling, meteorological studies, the solar energy industry, and satellite communication. For example, clouds have a tremendous impact on short-term predictions or 'nowcasts' of solar irradiance and can be used to optimize solar power plants and effectively exploit solar energy. However, even today, cloud observation requires the intervention of highly-trained professionals to document their findings, which introduces bias. To overcome these issues and contribute to climate change technology, we propose, to the best of our knowledge, the first two transformer-based models applied to cloud data tasks. We use the CCSD cloud classification dataset and achieve 90.06% accuracy, outperforming all other methods. To demonstrate the robustness of transformers in this domain, we perform cloud segmentation on the SWIMSEG dataset and achieve 83.2% IoU, also outperforming other methods. With this, we signal a potential shift away from pure CNN networks.

Authors: Roshan Roy (Birla Institute of Technology and Science, Pilani); Ahan M R (BITS Pilani); Vaibhav Soni (MANIT Bhopal); Ashish Chittora (BITS Pilani)
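
For readers unfamiliar with the IoU (intersection over union) metric reported for segmentation, a minimal sketch for binary masks follows. The masks are made-up toy data, and the paper may use a different variant (for example, a mean over classes or images), so this shows only the basic definition.

```python
from typing import Sequence


def iou(pred: Sequence[Sequence[int]],
        target: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Toy 2x3 masks: 1 = cloud pixel, 0 = sky pixel.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]

score = iou(pred_mask, true_mask)
print(score)  # 2 shared cloud pixels out of 4 marked in either mask
```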

NeurIPS 2021 A Risk Model for Predicting Powerline-induced Wildfires in Distribution System (Proposals Track)

Abstract: The power grid is one of the most common causes of wildfires that result in tremendous economic loss and significant life risk. In this study, we propose to use machine learning techniques to build a risk model for predicting powerline-induced wildfires in distribution systems. We collect weather, vegetation, and infrastructure data for all feeders in Pacific Gas & Electric territory. This study will contribute to a deeper understanding of powerline-induced wildfire prediction and provide valuable suggestions for wildfire mitigation planning.

Authors: Mengqi Yao (University of California Berkeley)

NeurIPS 2021 Predicting Cascading Failures in Power Systems using Graph Convolutional Networks (Proposals Track)

Abstract: Worldwide targets are set for the increase of renewable power generation in electricity networks on the way to combat climate change. Consequently, a secure power system that can handle the complexities resulting from the increased renewable power integration is crucial. One particular complexity is the possibility of cascading failures: a quick succession of multiple component failures that takes down the system and might also lead to a blackout. Viewing the prediction of cascading failures as a binary classification task, we explore the efficacy of Graph Convolutional Networks (GCNs) to detect the early onset of a cascading failure. We perform experiments based on simulated data from a benchmark IEEE test system. Our preliminary findings show that GCNs achieve higher accuracy scores than other baselines, which bodes well for detecting cascading failures. It also motivates a more comprehensive study of graph-based deep learning techniques for the current problem.

Authors: Tabia Ahmad (University of Strathclyde); Yongli Zhu (Texas A&M University); Panagiotis Papadopoulos (University of Strathclyde)

ICML 2021 DroughtED: A dataset and methodology for drought forecasting spanning multiple climate zones (Papers Track)

Abstract: Climate change exacerbates the frequency, duration and extent of extreme weather events such as drought. Previous attempts to forecast drought conditions using machine learning have focused on regional models which have two major limitations for national drought management: (i) they are trained on localised climate data and (ii) their architectures prevent them from being applied to new heterogeneous regions. In this work, we present a new large-scale dataset for training machine learning models to forecast national drought conditions, named DroughtED. The dataset consists of globally available meteorological features widely used for drought prediction, paired with location meta-data which has not previously been utilised for drought forecasting. Here we also establish a baseline on DroughtED and present the first research to apply deep learning models - Long Short-Term Memory (LSTMs) and Transformers - to predict county-level drought conditions across the full extent of the United States. Our results indicate that DroughtED enables deep learning models to learn cross-region patterns in climate data that contribute to drought conditions and models trained on DroughtED compare favourably to state-of-the-art drought prediction models trained on individual regions.

Authors: Christoph D Minixhofer (The University of Edinburgh); Mark Swan (The University of Edinburgh); Calum McMeekin (The University of Edinburgh); Pavlos Andreadis (The University of Edinburgh)

ICML 2021 Online LSTM Framework for Hurricane Trajectory Prediction (Papers Track)

Abstract: Hurricanes are high-intensity tropical cyclones that can cause severe damage when the storms make landfall. Accurate long-range prediction of hurricane trajectories is an important but challenging problem due to the complex interactions between the ocean and atmosphere systems. In this paper, we present a deep learning framework for hurricane trajectory forecasting by leveraging the outputs from an ensemble of dynamical (physical) models. The proposed framework employs a temporal decay memory unit for imputing missing values in the ensemble member outputs, coupled with an LSTM architecture for dynamic path prediction. The framework is extended to an online learning setting to capture concept drift present in the data. Empirical results suggest that the proposed framework significantly outperforms various baselines including the official forecasts from the U.S. National Hurricane Center (NHC).

Authors: Ding Wang (Michigan State University); Pang-Ning Tan (MSU)

ICML 2021 IowaRain: A Statewide Rain Event Dataset Based on Weather Radars and Quantitative Precipitation Estimation (Papers Track)

Abstract: Effective environmental planning and management to address climate change could be achieved through extensive environmental modeling with machine learning and conventional physical models. In order to develop and improve these models, practitioners and researchers need comprehensive benchmark datasets that are prepared and processed with environmental expertise that they can rely on. This study presents an extensive dataset of rainfall events for the state of Iowa (2016-2019) acquired from the National Weather Service Next Generation Weather Radar (NEXRAD) system and processed by a quantitative precipitation estimation system. The dataset presented in this study could be used for better disaster monitoring, response and recovery by paving the way for both predictive and prescriptive modeling.

Authors: Muhammed A Sit (The University of Iowa); Bongchul Seo (IIHR—Hydroscience & Engineering, The University of Iowa); Ibrahim Demir (The University of Iowa)

NeurIPS 2020 Quantifying the presence of air pollutants over a road network in high spatio-temporal resolution (Papers Track)

Abstract: Monitoring air pollution plays a key role when trying to reduce its impact on the environment and on human health. Traditionally, two main sources of information about the quantity of pollutants over a city are used: monitoring stations at ground level (when available), and satellite remote sensing. In addition to these two, other methods have been developed in recent years that aim at understanding how traffic emissions behave in space and time at a finer scale, taking into account human mobility patterns. We present a simple and versatile framework for estimating the quantity of four air pollutants (CO2, NOx, PM, VOC) emitted by private vehicles moving on a road network, starting from raw GPS traces and information about vehicles' fuel type, and use this framework to analyse how such pollutants are distributed over the road networks of different cities.

Authors: Matteo Bohm (Sapienza University of Rome); Mirco Nanni (ISTI-CNR Pisa, Italy); Luca Pappalardo (ISTI)

NeurIPS 2020 FlowDB: A new large scale river flow, flash flood, and precipitation dataset (Papers Track)

Abstract: Flooding results in 8 billion dollars of damage annually in the US and causes the most deaths of any weather-related event. Due to climate change, scientists expect more heavy precipitation events in the future. However, no current datasets exist that contain both hourly precipitation and river flow data. We introduce a novel hourly river flow and precipitation dataset and a second subset of flash flood events with damage estimates and injury counts. Using these datasets, we create two challenges: (1) general stream flow forecasting and (2) flash flood damage estimation. We also create a public benchmark and a Python package to enable easy adding of new models. Additionally, in the future we aim to augment our dataset with snow pack data and soil index moisture data to improve predictions.

Authors: Isaac Godfried (CoronaWhy)

NeurIPS 2020 EarthNet2021: A novel large-scale dataset and challenge for forecasting localized climate impacts (Papers Track)

Abstract: Climate change is global, yet its concrete impacts can strongly vary between different locations in the same region. Seasonal weather forecasts currently operate at the mesoscale (> 1 km). For more targeted mitigation and adaptation, modelling impacts to < 100 m is needed. Yet, the relationship between driving variables and Earth’s surface at such local scales remains unresolved by current physical models. Large Earth observation datasets now enable us to create machine learning models capable of translating coarse weather information into high-resolution Earth surface forecasts encompassing localized climate impacts. Here, we define high-resolution Earth surface forecasting as video prediction of satellite imagery conditional on mesoscale weather forecasts. Video prediction has been tackled with deep learning models. Developing such models requires analysis-ready datasets. We introduce EarthNet2021, a new, curated dataset containing target spatio-temporal Sentinel 2 satellite imagery at 20 m resolution, matched with high-resolution topography and mesoscale (1.28 km) weather variables. With over 32000 samples it is suitable for training deep neural networks. Comparing multiple Earth surface forecasts is not trivial. Hence, we define the EarthNetScore, a novel ranking criterion for models forecasting Earth surface reflectance. For model intercomparison we frame EarthNet2021 as a challenge with four tracks based on different test sets. These allow evaluation of model validity and robustness as well as model applicability to extreme events and the complete annual vegetation cycle. In addition to forecasting directly observable weather impacts through satellite-derived vegetation indices, capable Earth surface models will enable downstream applications such as crop yield prediction, forest health assessments, coastline management, or biodiversity monitoring. Find data, code, and how to participate at .

Authors: Christian Requena-Mesa (Computer Vision Group, Friedrich Schiller University Jena; DLR Institute of Data Science, Jena; Max Planck Institute for Biogeochemistry, Jena); Vitus Benson (Max-Planck-Institute for Biogeochemistry); Jakob Runge (Institute of Data Science, German Aerospace Center (DLR)); Joachim Denzler (Computer Vision Group, Friedrich Schiller University Jena, Germany); Markus Reichstein (Max Planck Institute for Biogeochemistry, Jena; Michael Stifel Center Jena for Data-Driven and Simulation Science, Jena)

NeurIPS 2020 Emerging Trends of Sustainability Reporting in the ICT Industry: Insights from Discriminative Topic Mining (Papers Track)

Abstract: The Information and Communication Technologies (ICT) industry has a considerable climate change impact and accounts for approximately 3 percent of global carbon emissions. Despite the increasing availability of sustainability reports provided by ICT companies, we still lack a systematic understanding of what has been disclosed at an industry level. In this paper, we make the first major effort to use modern unsupervised learning methods to investigate the sustainability reporting themes and trends of the ICT industry over the past two decades. We build a cross-sector dataset containing 22,534 environmental reports from 1999 to 2019, of which 2,187 are ICT specific. We then apply CatE, a text embedding based topic modeling method, to mine specific keywords that ICT companies use to report on climate change and energy. As a result, we identify (1) important shifts in ICT companies' climate change narratives from physical metrics towards climate-related disasters, (2) key organizations with large influence on ICT companies, and (3) ICT companies' increasing focus on data center and server energy efficiency.

Authors: Lin Shi (Stanford University); Nhi Truong Vu (Stanford University)