Data Mining

Data Extraction and Modelling from Plant Trait Literature

Richard Reeve (University of Glasgow); Neil A. Brummitt (Natural History Museum); Claire L. Harris (Biomathematics and Statistics Scotland); Ana Claudia Araujo (Natural History Museum); Ben Scott (Natural History Museum); Christina Cobbold (University of Glasgow); Glenn Marion (Biomathematics and Statistics Scotland), 2023

Venue	Title
ICLR 2025	ClimateChat: Designing Data and Methods for Instruction Tuning LLMs to Answer Climate Change Queries (Papers Track) Abstract: As the issue of global climate change becomes increasingly severe, the demand for research in climate science continues to grow. Natural language processing technologies, represented by Large Language Models (LLMs), have been widely applied to climate change-specific research, providing essential information support for decision-makers and the public. Some studies have improved model performance on relevant tasks by constructing climate change-related instruction data and instruction-tuning LLMs. However, current research remains inadequate in efficiently producing large volumes of high-precision instruction data for climate change, which limits further development of climate change LLMs. This study introduces an automated method for constructing instruction data. The method generates instructions using facts and background knowledge from documents and enhances the diversity of the instruction data through web scraping and the collection of seed instructions. Using this method, we constructed a climate change instruction dataset, named ClimateChat-Corpus, which was used to fine-tune open-source LLMs, resulting in an LLM named ClimateChat. Evaluation results show that ClimateChat significantly improves performance on climate change question-and-answer tasks. Additionally, we evaluated the impact of different base models and instruction data on LLM performance and demonstrated its capability to adapt to a wide range of climate change scientific discovery tasks, emphasizing the importance of selecting an appropriate base model for instruction tuning. This research provides valuable references and empirical support for constructing climate change instruction data and training climate change-specific LLMs. Authors: Zhou Chen (Tsinghua University); Xiao Wang (Tsinghua University); Liao Yuanhong (Tsinghua University); Ming Lin (Tsinghua University); Yuqi Bai (Tsinghua University)
ICLR 2025	FabAgent: An LLM-based Agentic Optimization Framework for Design of Sustainable Fabrics (Papers Track) Abstract: The fashion industry emits an estimated four billion tons of CO2 annually and nearly one-third of this is due to the choice of fibers used in clothing. Despite the critical role of fiber selection, limited research exists on the design of optimal fiber blends because of a lack of available datasets on fiber properties. This paper introduces FabAgent, the first large language model (LLM) based agentic optimization framework to discover novel sustainable fabric blends. FabAgent provides a scalable way to extract information from scientific publications and the Internet, compiling a structured data set of 101 fabric materials with 24 attributes each, making this one of the most comprehensive raw material data sets for sustainable clothing design. Next, FabAgent uses multi-objective evolutionary optimization to explore Pareto optimal solutions over a large design space of possible blends, balancing sustainability, durability, comfort, and cost, while accommodating constraints on allowable yarn compositions. The optimal blend found by FabAgent substantially outperforms many commercially available blends in leading fashion brands such as Banana Republic, Giorgio Armani, GAP, and Nike: a 30.46–52.71% improvement in environmental sustainability, 15.40–92.21% improvement in cost efficiency, and 68.29–83.49% improvement in comfort. Authors: Anusha Narayan (The Nueva School)
ICLR 2025	Predicting extreme weather impacts on physical activity and sleep patterns using real-world data from wrist-worn accelerometers (Papers Track) Abstract: The increasing frequency of extreme weather events, such as heat waves, is among the most pressing consequences of climate change, with profound implications for human health and well-being. Despite increasing incidence of extreme weather events globally, there is a lack of understanding of the impact of hot weather on health outcomes. In this study, we utilized machine learning techniques to explore how variations in outdoor temperature influence physical activity and sleep patterns, two critical determinants of physical and mental health. Using data from 90,434 participants in the UK Biobank, recorded via wrist-worn accelerometers, linked with meteorological data from the UK Met Office, we analysed the relationship between outdoor temperature (5°C to 30°C) and daily magnitudes and durations of a) physical activity and b) sleep, whilst adjusting for sociodemographic, clinical, lifestyle, seasonality, precipitation, and regional variables. Our results reveal that moderate-to-vigorous physical activity (MVPA) increases with temperature, reaching its peak at 25°C, but plateaus thereafter. Conversely, sedentary behaviour and sleep disturbances significantly intensify as temperatures reach 30°C. Here tested in UK settings, our approach is generalisable to other climatic regions and determinants of health and should be further investigated in regions with high climate-vulnerability. These findings emphasize the role of machine learning in identifying health risks associated with climate change and underscore the necessity of climate-adaptive public health strategies to mitigate these effects. Authors: Sara Khalid (University of Oxford)
ICLR 2025	Predicting out-of-domain performance under geographic distribution shifts (Papers Track) Abstract: In machine learning for geographic data, we often observe differences in data availability and distribution shifts across distinct geographic units, e.g., continents. This is a common challenge in remote sensing tasks, such as crop yield forecasting or flood mapping. In many of these scenarios, we have models trained on a data-rich region and apply domain adaptation to transfer predictive capabilities to the target region. However, the effectiveness of domain transfer can suffer from distribution shifts, posing critical challenges for model deployment. In this work, we show that, even in the absence of labels, certain domain distance measures, based on image and location embeddings, can serve as a proxy measure for transfer performance. We further highlight this capacity on a set of real-world geographic adaptation datasets, spatial splits for domains, and models for adaptation training. Authors: Haoran Zhang (Harvard University); Konstantin Klemmer (Microsoft Research); Esther Rolf (University of Colorado, Boulder); David Alvarez-Melis (Harvard University)
ICLR 2025	Large Language Models as a New Modality for Generalizable Earth Data Monitoring (Papers Track) Abstract: Earth observation data are critical for monitoring progress toward Sustainable Development Goals (SDGs), yet persistent challenges in accessibility, integration of multimodal data, and geographic bias hinder comprehensive global assessments. While satellite imagery paired with machine learning (SIML) offers cost-effective monitoring, it struggles with socioeconomic indicators, data inequity, and spatial biases. This paper presents a novel framework leveraging large language models (LLMs) as a complementary modality to address these limitations. By extracting geospatial knowledge from pretrained LLMs through structured prompting (encoding coordinates into rich, task-agnostic embeddings), we enable efficient prediction of diverse earth monitoring indicators using linear regression. Evaluated on 25 global tasks spanning from climate metrics (e.g., temperature) to socioeconomic variables (e.g., poverty rates), our method outperforms state-of-the-art SIML approaches, achieving higher accuracy and sample efficiency. Notably, LLM-derived representations exhibit reduced geographic bias compared to existing methods and inherently capture socioeconomic contexts that form semantically meaningful clusters aligned with regional development patterns. Authors: Tong Nie (Tongji University); Junlin He (The Hong Kong Polytechnic University); Wei Ma (The Hong Kong Polytechnic University)
ICLR 2025	Palimpsest: Bill of Materials Prediction - A Case Study with Solid State Drives (Papers Track) Abstract: Accurately quantifying product carbon footprints (PCFs) is critical for organizations to measure environmental impacts and develop decarbonization strategies. However, traditional methods require Bills of Materials (BOMs) data as a key input for PCF estimation, which is time-intensive and limits scalability. We present Palimpsest, an automated BOM generation algorithm given product specification as input using Large Language Models (LLMs) and a reference dataset. Palimpsest extracts data from teardown reports to build a BOM repository, retrieves reference products based on their attribute list, generates BOMs by systematically modifying reference BOMs based on attribute differences, and standardizes the output to enable automated PCF estimation. We also introduce a novel impact-based evaluation framework that compares predicted BOMs with ground truth, focusing on the accuracy in carbon impact. We benchmark our model against a naive LLM solution and a traditional PCF estimation approach for solid state drives and find it outperforms these methods with a weighted F1 of 99.5%. By streamlining and automating BOM prediction, our method reduces the manual effort required for PCF estimation, driving progress toward net-zero emissions targets across industries. Authors: Anran Wang (Amazon); Zaid Thanawala (Amazon); Harsh Gupta (Amazon); Jeremie Hakian (Amazon); Jared Kramer (Amazon); Kommy Weldemariam (Amazon); Bharathan Balaji (Amazon)
ICLR 2025	Uncertainty-Aware Deep Learning Framework for Forecasting Coastal Water Level in Virginia Beach (Papers Track) Abstract: Coastal areas like Virginia Beach, USA, are increasingly vulnerable to flooding. To mitigate the impact of flooding, it is crucial for the City of Virginia Beach to have reliable 72-hour-ahead (3 days) forecasts of water levels at key gauge locations. To support this effort, several sensors have been installed throughout the city to monitor water levels and other environmental parameters such as wind speed, precipitation, and atmospheric pressure. Leveraging sensor data from one of these locations, we developed an uncertainty-aware deep learning model to forecast water levels. We employed deep quantile regression (DQR) to quantify variability in the predictions and examined the performance of three different model architectures. In addition to exclusively including historical data, we investigated the improvement wind forecasts provide to the accuracy of 72-hour-ahead water level predictions. The results show a twelvefold improvement in the flood forecast for a real flooding event. Authors: Md Mahmudul Hasan (Thomas Jefferson National Accelerator Facility); Malachi Schram (Thomas Jefferson National Accelerator Facility); Sridhar Katragadda (City of Virginia Beach); Diana McSpadden (Thomas Jefferson National Accelerator Facility); Alisa N. Udomvisawakul (City of Virginia Beach); Heather Richter (Old Dominion University); Frank Liu (Old Dominion University)
ICLR 2025	Multivariate LSTM-Based Forecasting for Renewable Energy: Enhancing Climate Change Mitigation (Papers Track) Abstract: The increasing integration of renewable energy sources (RESs) into modern power systems presents significant opportunities but also notable challenges, primarily due to the inherent variability of RES generation. Accurate forecasting of RES generation is crucial for maintaining the reliability, stability, and economic efficiency of power system operations. Traditional approaches, such as deterministic methods and stochastic programming, frequently depend on representative scenarios generated through clustering techniques like K-means. However, these methods may fail to fully capture the complex temporal dependencies and non-linear patterns within RES data. This paper introduces a multivariate Long Short-Term Memory (LSTM)-based network designed to forecast RES generation using real-world historical data. The proposed model effectively captures long-term dependencies and interactions between different RESs, utilizing historical data from both local and neighboring areas to enhance predictive accuracy. In the case study, we showed that the proposed forecasting approach results in lower CO2 emissions and a more reliable supply of electric loads. Authors: Farshid Kamrani (Carleton University); Kristen Schell (Carleton University)
ICLR 2025	Large Language Models for Monitoring Dataset Mentions in Climate Research (Papers Track) Abstract: Effective climate change research relies on diverse datasets to inform mitigation and adaptation strategies and policies. However, the ways these datasets are cited, used, and distributed remain poorly understood. This paper presents a machine learning framework that automates the detection and classification of dataset mentions in climate research papers. Leveraging large language models (LLMs), we generate a weakly supervised dataset through zero-shot extraction, quality assessment via an LLM-as-a-Judge, and refinement by a reasoning agent. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a smaller manually annotated subset to specialize in extracting data mentions. At inference, a ModernBERT-based classifier filters for dataset mentions, optimizing computational efficiency. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. As a framework for monitoring dataset mentions in research papers, this approach enhances transparency, identifies data gaps, and enables researchers, funders, and policymakers to improve data discoverability and usage, leading to more informed decision-making. Authors: Aivin Solatorio (The World Bank); Rafael Macalaba (The World Bank); James Liounis (The World Bank)
ICLR 2025	Heterogenous graph neural networks for species distribution modeling (Papers Track) Abstract: Species distribution models (SDMs) are necessary for measuring and predicting occurrences and habitat suitability of species and their relationship with environmental factors. We introduce a novel presence-only SDM with graph neural networks (GNN). In our model, species and locations are treated as two distinct node sets, and the learning task is predicting detection records as the edges that connect locations to species. Using GNN for SDM allows us to model fine-grained interactions between species and the environment. We evaluate the potential of this methodology on the six-region dataset compiled by the National Center for Ecological Analysis and Synthesis (NCEAS) for benchmarking SDMs. For each of the regions, the heterogeneous GNN model is comparable to or outperforms previously-benchmarked single-species SDMs as well as a feed-forward neural network baseline model. Authors: Lauren Harrell (Google Research); Christine Kaeser-Chen (Google DeepMind); Burcu Karagol Ayan (Google DeepMind); Keith Anderson (Google DeepMind); Michelangelo Conserva (Google Research); Elise Kleeman (Google Research); Maxim Neumann (Google DeepMind); Matt Overlan (Google DeepMind); Melissa Chapman (Google Research); Drew Purves (Google DeepMind)
ICLR 2025	Tracking ESG Disclosures of European Companies with Retrieval-Augmented Generation (Proposals Track) Abstract: Corporations play a crucial role in mitigating climate change and accelerating progress toward environmental, social, and governance (ESG) objectives. However, structured information on the current state of corporate ESG efforts remains limited. In this paper, we propose a machine learning framework based on a retrieval-augmented generation (RAG) pipeline to track ESG indicators from N=9,200 corporate reports. Our analysis includes ESG indicators from 600 of the largest listed corporations in Europe between 2014 and 2023. We focus on two key dimensions: first, we identify gaps in corporate sustainability reporting in light of existing standards. Second, we provide comprehensive bottom-up estimates of key ESG indicators across European industries. Our findings enable policymakers and financial markets to effectively assess corporate ESG transparency and track progress toward global sustainability objectives. Authors: Kerstin Forster (LMU Munich & Munich Center for Machine Learning); Victor Wagner (LMU Munich & Sustainability Reporting Navigator); Lucas Elias Keil (University of Cologne & Sustainability Reporting Navigator); Maximilian A. Müller (University of Cologne & Sustainability Reporting Navigator); Thorsten Sellhorn (LMU Munich & Sustainability Reporting Navigator); Stefan Feuerriegel (LMU Munich & Munich Center for Machine Learning)
ICLR 2025	From Rumors to Risk: Mapping and Modeling Climate-Disaster Misinformation (Proposals Track) Abstract: Recent years have seen a surge in climate-disaster misinformation, with social media amplifying unfounded claims in the lead-up to and aftermath of major disasters. This misinformation has hindered disaster preparation and recovery while fueling harassment against meteorologists and government officials, eroding trust in scientific institutions. While tools exist for analyzing general climate-change misinformation, current datasets often overlook the rapidly shifting narratives tied to specific events like wildfires, floods, or hurricanes. This proposal addresses that gap by developing a dynamic, evolving dataset on climate-disaster misinformation. Built through targeted social media data collection and rigorous labeling, the dataset will adapt alongside AI/ML advancements through iterative feedback from model performance and emerging trends. This openly accessible resource will enable researchers and practitioners to refine detection algorithms, design interventions, and inform crisis communication strategies, ensuring both data and models remain aligned with the shifting misinformation landscape. Ultimately, this work seeks to clarify key drivers of misinformation propagation and support more effective climate disaster response. Authors: Tristan Ballard (Independent)
NeurIPS 2024	AI-Driven Predictive Modeling of PFAS Contamination in Aquatic Ecosystems: Exploring A Geospatial Approach (Papers Track) Abstract: Per- and polyfluoroalkyl substances (PFAS), a class of synthetic fluorinated compounds termed “forever chemicals”, have garnered significant attention due to their persistence, widespread environmental presence, bioaccumulative properties, and associated risks for human health. Their presence in aquatic ecosystems highlights the link between human activity and the hydrological cycle. They also disrupt aquatic life, interfere with gas exchange, and disturb the carbon cycle, contributing to greenhouse gas emissions and exacerbating climate change. Federal agencies, state governments and non-government research and public interest organizations have emphasized the need for documenting the sites and the extent of PFAS contamination. However, the time-consuming and expensive nature of data collection and analysis poses challenges. It hinders the rapid identification of locations at high risk of PFAS contamination, which may then require further sampling or remediation. To address this data limitation, our study leverages a novel geospatial dataset, machine learning models including frameworks such as Random Forest, IBM-NASA's Prithvi and UNet, and geospatial analysis to predict regions with high PFAS concentrations in surface water. Using fish data from the National Rivers and Streams Assessment (NRSA) dataset by the Environmental Protection Agency (EPA), our analysis suggests the potential value of machine learning based models for targeted deployment of sampling investigations and remediation efforts. Authors: Jowaria Khan (University of Michigan); David Andrews (Environmental Working Group); Kaley Beins (Environmental Working Group); Sydney Evans (Environmental Working Group); Alexa Friedman (Environmental Working Group); Elizabeth Bondi-Kelly (MIT)
NeurIPS 2024	Towards Using Machine Learning to Generatively Simulate EV Charging in Urban Areas (Papers Track) Abstract: This study addresses the challenge of predicting electric vehicle (EV) charging profiles in urban locations with limited data. Utilizing a neural network architecture, we aim to uncover latent charging profiles influenced by spatio-temporal factors. Our model focuses on peak power demand and daily load shapes, providing insights into charging behavior. Our results indicate significant impacts from the type of Basic Administrative Units on predicted load curves, which contributes to the understanding and optimization of EV charging infrastructure in urban settings and allows Distribution System Operators (DSOs) to more efficiently plan EV charging infrastructure expansion. Authors: Marek Miltner (Stanford University; Czech Technical University); Jakub Zíka (CTU); Daniel Vašata (Czech Technical University in Prague, Faculty of Information Technology); Artem Bryksa (CTU); Magda Friedjungová (Czech Technical University in Prague, Faculty of Information Technology); Ondřej Štogl (CTU); Ram Rajagopal (Stanford University); Oldřich Starý (CTU)
NeurIPS 2024	Classification of Snow Depth Measurements for tracking plant phenological shifts in Alpine regions (Papers Track) Abstract: Ground-based snow depth measurements are often realized using ultrasonic or laser technologies, which by their nature measure the height of any underlying object, whether it is snow or vegetation in snow-free periods. We propose a machine learning approach to the automated classification of snow depth measurements into a snow cover class and a class corresponding to everything else, which takes into account both the temporal context and the dependencies between snow depth and other sensor measurements. Through a series of experiments we demonstrate that our approach simplifies the detection of seasonal snowmelt and corresponding onset of plant growth, which we used to assess climate-change related phenological shifts in otherwise rather poorly monitored high alpine regions. Authors: Jan Svoboda (WSL Institute for Snow and Avalanche Research SLF); Michael Zehnder (WSL Institute for Snow and Avalanche Research SLF); Marc Ruesch (WSL Institute for Snow and Avalanche Research SLF); David Liechti (WSL Institute for Snow and Avalanche Research SLF); Corinne Jones (Swiss Data Science Center); Michele Volpi (Swiss Data Science Center, ETH Zurich); Christian Rixen (WSL Institute for Snow and Avalanche Research SLF); Jürg Schweizer (WSL Institute for Snow and Avalanche Research SLF)
NeurIPS 2024	Critical misalignments between climate action and sustainable development goals revealed (Papers Track) Abstract: A mere 12 percent of the Sustainable Development Goals (SDGs) is currently on track to meet the 2030 deadline in a world under climate change. Since their launch in 2015, the 2030 Agenda for Sustainable Development and the Paris Agreement have suffered persistent mismatches, which limit the potential for mutual gains. We use Artificial Intelligence (AI) to assess the degree and type of alignment between the Nationally Determined Contributions (NDCs) and the SDGs. While high income countries tackle the energy-infrastructure-community nexus in terms of opportunity, lower income countries make climate impacts more explicit and center their trade-offs around the water-energy-food nexus. These two approaches mark different development trajectories and have non-negligible implications on international financial flow architecture and climate governance. Authors: Francesca Larosa (Royal Institute for Technology); Sergio Hoyas (Universitat Politècnica de València); Fermin Mallor Franco (Royal Institute of Technology); J. Alberto Conejero (Universitat Politècnica de València); Javier García-Martinez (University of Alicante); Francesco Fuso Nerini (Royal Institute of Technology); Ricardo Vinuesa (KTH Royal Institute of Technology)
NeurIPS 2024	Large language model co-pilot for transparent and trusted life cycle assessment comparisons (Proposals Track) Abstract: Intercomparing life cycle assessments (LCA), a common type of sustainability and climate model, is difficult due to basic differences in fundamental assumptions, especially in the goal and scope definition stage. This complicates decision-making and the selection of climate-smart policies, as it becomes difficult to compare optimal products and processes between different studies. To aid policymakers and LCA practitioners alike, we plan to leverage large language models (LLMs) to build a database containing documented assumptions for LCAs across the agricultural sector, with a case study on livestock management. The articles for this database are identified in a systematic literature search, then processed to extract relevant assumptions about the goal and scope definition of the LCA and inserted into a vector database. We then leverage this database to develop an AI co-pilot by augmenting LLMs with retrieval augmented generation to be used by stakeholders and LCA practitioners alike. This co-pilot will accrue two major benefits: 1) enhance the decision-making process through facilitating comparisons among LCAs to enable policymakers to adopt data-driven climate policies and 2) encourage the use of common assumptions by LCA practitioners. Ultimately, we hope to create a foundational model for LCA tasks that can plug in to existing open source LCA software and tools. Authors: Nathan Preuss (Cornell University); Fengqi You (Cornell University)
ICLR 2024	Using expired weather forecasts to supply 10 000y of data for accurate planning of a renewable European energy system (Papers Track) Abstract: Expanding renewable energy generation and electrifying heating to address climate change will heighten the exposure of our power systems to the variability of weather. Planning and assessing these future systems typically lean on past weather data. We spotlight the pitfalls of this approach (chiefly its reliance on what we claim is a limited weather record) and propose a novel approach: to evaluate these systems on two orders of magnitude more weather scenarios. By repurposing past ensemble weather predictions, we not only drastically expand the known weather distribution (notably its extreme tails) for traditional power system modeling but also unveil its potential to enable data-intensive self-supervised, diffusion-based and optimization ML techniques. Building on our methodology, we introduce a dataset collected from ECMWF ENS forecasts, encompassing power-system relevant variables over Europe, and detail the intricate process behind its assembly. Authors: Petr Dolezal (AI4ER CDT, University of Cambridge); Emily Shuckburgh (University of Cambridge)
ICLR 2024	Reconstructing the Breathless Ocean with Spatio-Temporal Graph Learning (Papers Track) Abstract: The ocean is currently undergoing severe deoxygenation. Accurately reconstructing the breathless ocean is crucial for assessing and protecting marine ecosystems in response to climate change. Existing expert-dominated numerical simulations fail to keep up with the dynamic variation caused by global warming and human activities. Moreover, due to the high cost of data collection, historical observations are severely sparse, which poses a major challenge for precise reconstruction. In this work, we propose OxyGenerator, the first spatio-temporal graph learning model, to reconstruct the global ocean deoxygenation from 1920 to 2023. Specifically, to address the heterogeneity across large temporal and spatial scales, we propose zoning-varying graph message-passing to capture the complex oceanographic correlations between missing values and sparse observations. Additionally, to further calibrate the uncertainty, we incorporate inductive bias from dissolved oxygen (DO) variations and chemical effects. Compared with in-situ DO observations, OxyGenerator significantly outperforms CMIP6 numerical simulations, reducing MAPE by 38.77% and demonstrating promising potential for understanding ocean deoxygenation in a data-driven manner. Authors: Bin Lu (Shanghai Jiao Tong University); Ze Zhao (Shanghai Jiao Tong University); Luyu Han (Shanghai Jiao Tong University); Xiaoying Gan (Shanghai Jiao Tong University); Yuntao Zhou (Shanghai Jiao Tong University); Lei Zhou (Shanghai Jiao Tong University); Luoyi Fu (Shanghai Jiao Tong University); Xinbing Wang (Shanghai Jiao Tong University); Chenghu Zhou (Institute of Geographic Sciences and Natural Resources Research, CAS); Jing Zhang (Shanghai Jiao Tong University)
ICLR 2024	Model Failure or Data Corruption? Exploring Inconsistencies in Building Energy Ratings with Self-Supervised Contrastive Learning (Papers Track) Abstract: Building Energy Rating (BER) stands as a pivotal metric, enabling building owners, policymakers, and urban planners to understand the energy-saving potential through improving building energy efficiency. As such, enhancing buildings' BER levels is expected to directly contribute to the reduction of carbon emissions and promote climate improvement. Nonetheless, the BER assessment process is vulnerable to missing and inaccurate measurements. In this study, we introduce CLEAR, a data-driven approach designed to scrutinize the inconsistencies in BER assessments through self-supervised contrastive learning. We validated the effectiveness of CLEAR using a dataset representing Irish building stocks. Our experiments uncovered evidence of inconsistent BER assessments, highlighting measurement data corruption within this real-world dataset. Authors: Qian Xiao (Trinity College Dublin); Dan Liu (Trinity College Dublin); Kevin Credit (Maynooth University)
ICLR 2024	Global Vegetation Modeling With Pre-Trained Weather Transformers (Papers Track) Abstract: Accurate vegetation models can produce further insights into the complex interaction between vegetation activity and ecosystem processes. Previous research has established that long-term trends and short-term variability of temperature and precipitation affect vegetation activity. Motivated by the recent success of Transformer-based Deep Learning models for medium-range weather forecasting, we adapt the publicly available pre-trained FourCastNet to model vegetation activity while accounting for the short-term dynamics of climate variability. We investigate how the learned global representation of the atmosphere’s state can be transferred to model the normalized difference vegetation index (NDVI). Our model globally estimates vegetation activity at a resolution of 0.25° while relying only on meteorological data. We demonstrate that leveraging pre-trained weather models improves the NDVI estimates compared to learning an NDVI model from scratch. Additionally, we compare our results to other recent data-driven NDVI modeling approaches from machine learning and ecology literature. We further provide experimental evidence on how much data and training time is necessary to turn FourCastNet into an effective vegetation model. Code and models are available at https://github.com/LSX-UniWue/Global-Ecosystem-Modeling. Authors: Pascal Janetzky (University Wuerzburg); Florian Gallusser (Universität Würzburg); Simon Hentschel (Julius-Maximilians-Universität of Würzburg); Andreas Hotho (University of Wuerzburg); Anna Krause (Universität Würzburg, Department of Computer Science, Chair X Data Science)
ICLR 2024	Valuation and Profit Allocation for Electric Vehicle Battery Data in a Data Market (Proposals Track) Abstract and authors: (click to expand) Abstract: This paper delves into the realm of electric vehicle (EV) battery data trading markets, focusing on data valuation and revenue allocation. In the face of fast-developing electric mobility, the safety of EV batteries becomes more and more important, driving the need for robust anomaly detection models. For newly founded EV companies lacking extensive data, data markets offer a solution, facilitated by trading platforms. We shape this landscape and outline a transaction process involving data buyers, data sellers, and platforms. Our exploration extends to data valuation methodologies, encompassing the classic Shapley value and the least core algorithm. Considering the complicated mechanisms in EV batteries, we unveil a deep learning framework for anomaly detection, treating EV batteries as dynamic systems. To explain data value from an economic perspective, we utilize a utility function considering the direct economic costs saved for the EV company to refine the evaluation process. Based on data value, we further propose revenue allocation schemes to allocate part of the EV company's revenue to data sellers, offering diverse perspectives on fair and equitable profit distribution. A case study is conducted based on a real-world EV battery dataset to illustrate how the different revenue allocation schemes allocate payoffs to data sellers. Authors: Junkang Chen (Peking University); Guannan He (Peking University)
ICLR 2024	Planning for Floods & Droughts: Intro to AI-Driven Hydrological Modeling (Tutorials Track) Abstract and authors: (click to expand) Abstract: This tutorial presents an AI-driven hydrological modeling approach to advance predictions of extreme hydrological events, including floods and droughts, which are of significant socioeconomic concerns. Traditionally, physics-based hydrological models have been the mainstay for simulating rainfall-runoff dynamics and forecasting streamflow. These models, while effective, are constrained by limitations in our systematic understanding and an inability to incorporate heterogeneous data. Recently, the surge in availability of multi-scale, multi-modal hydrological data has spurred the adoption of data-driven machine learning (ML) techniques. These methods have shown promising predictive performance. However, they often struggle with generalization and reliability, especially under climate change. This tutorial introduces physics-informed ML, by leveraging data and domain knowledge, to improve prediction accuracy and trustworthiness. We will delve into uncertainty quantification methods for probabilistic predictions that are vital for climate-resilient planning in managing floods and droughts. Participants will be guided through a comprehensive workflow, encompassing data analysis, model construction, and model evaluation. This tutorial is designed to elevate researchers’ understanding of hydrological systems and provide practitioners with robust, climate-resilient water management tools. These tools are instrumental in facilitating informed decision-making, crucial in the context of climate adaptation strategies. Participants will learn: ● Heterogeneous climate and hydrology data analysis ● State-of-the-art neural network models for rainfall-runoff modeling. ● ML model construction, training, validating, and testing ● Multiple ways to build a physics-informed ML model ● Uncertainty quantification in ML model predictions. 
All code and data will be publicly available for researchers/practitioners to build their own models. Authors: Kshitij Tayal (Oak Ridge National Lab); Arvind Renganathan (University of Minnesota); Dan Lu (Oak Ridge National Laboratory)
NeurIPS 2023	Understanding Opinions Towards Climate Change on Social Media (Papers Track) Abstract and authors: (click to expand) Abstract: Social media platforms such as Twitter (now known as X) have revolutionized how the public engage with important societal and political topics. Recently, climate change discussions on social media became a catalyst for political polarization and the spreading of misinformation. In this work, we aim to understand how real world events influence the opinions of individuals towards climate change related topics on social media. To this end, we extracted and analyzed a dataset of 13.6 million tweets sent by 3.6 million users from 2006 to 2019. Then, we construct a temporal graph from the user-user mentions network and utilize the Louvain community detection algorithm to analyze the changes in community structure around Conference of the Parties on Climate Change (COP) events. Next, we also apply tools from the Natural Language Processing literature to perform sentiment analysis and topic modeling on the tweets. Our work acts as a first step towards understanding the evolution of pro-climate change communities around COP events. Answering these questions helps us understand how to raise people's awareness towards climate change thus hopefully calling on more individuals to join the collaborative effort in slowing down climate change. Authors: Yashaswi Pupneja (University of Montreal); Yuesong Zou (McGill University); Sacha Levy (Yale University); Shenyang Huang (Mila/McGill University)
ICLR 2023	Accuracy is not the only Metric that matters: Estimating the Energy Consumption of Deep Learning Models (Papers Track) Abstract and authors: (click to expand) Abstract: Modern machine learning models have started to consume incredible amounts of energy, thus incurring large carbon footprints (Strubell et al., 2019). To address this issue, we have created an energy estimation pipeline, which allows practitioners to estimate the energy needs of their models in advance, without actually running or training them. We accomplished this, by collecting high-quality energy data and building a first baseline model, capable of predicting the energy consumption of DL models by accumulating their estimated layer-wise energies. Authors: Johannes Getzner (Technical University of Munich); Bertrand Charpentier (Technical University of Munich); Stephan Günnemann (Technical University of Munich)
ICLR 2023	A High-Resolution, Data-Driven Model of Urban Carbon Emissions (Papers Track) Best Pathway to Impact Abstract and authors: (click to expand) Abstract: Cities represent both a fundamental contributor to greenhouse gas (GHG) emissions and a catalyst for climate action. Many global cities have outlined sustainability and climate change mitigation plans, focusing on energy efficiency, shifting away from fossil fuels, and prioritizing environmental and social justice. To achieve broad-based and equitable carbon emissions reductions and sustainability goals, new data-driven methodologies are needed to identify and target efficiency and carbon reduction opportunities in the built environment at the building, neighborhood, and city-scale. Our methodology integrates data from numerous data sources and develops data-driven and physical models of energy use and carbon emissions from buildings and transportation to generate a high spatiotemporal resolution model of urban greenhouse gas emissions. The method and data tool are designed to support city leaders and urban policymakers with an unprecedented view of localized carbon emissions to enable data-driven and evidence-based climate action. Authors: Bartosz Bonczak (New York University); Boyeong Hong (New York University); Constantine E. Kontokosta (New York University)
ICLR 2023	Mining Effective Strategies for Climate Change Communication (Papers Track) Abstract and authors: (click to expand) Abstract: With the goal of understanding effective strategies to communicate about climate change, we build interpretable models to rank tweets related to climate change with respect to the engagement they generate. Our models are based on the Bradley-Terry model of pairwise comparison outcomes and use a combination of the tweets’ topic and metadata features to do the ranking. To remove confounding factors related to author popularity and minimise noise, they are trained on pairs of tweets that are from the same author and around the same time period and have a sufficiently large difference in engagement. The models achieve good accuracy on a held-out set of pairs. We show that we can interpret the parameters of the trained model to identify the topic and metadata features that contribute to high engagement. Among other observations, we see that topics related to climate projections, human cost and deaths tend to have low engagement while those related to mitigation and adaptation strategies have high engagement. We hope the insights gained from this study will help craft effective climate communication to promote engagement, thereby lending strength to efforts to tackle climate change. Authors: Aswin Suresh (EPFL); Lazar Milikic (EPFL); Francis Murray (EPFL); Yurui Zhu (EPFL); Matthias Grossglauser (École Polytechnique Fédérale de Lausanne (EPFL))
ICLR 2023	Graph-Based Deep Learning for Sea Surface Temperature Forecasts (Papers Track) Abstract and authors: (click to expand) Abstract: Sea surface temperature (SST) forecasts help with managing the marine ecosystem and the aquaculture impacted by anthropogenic climate change. Numerical dynamical models are resource intensive for SST forecasts; machine learning (ML) models could reduce high computational requirements and have been in the focus of the research community recently. ML models normally require a large amount of data for training. Environmental data are collected on regularly-spaced grids, so early work mainly used grid-based deep learning (DL) for prediction. However, both grid data and the corresponding DL approaches have inherent problems. As geometric DL has emerged, graphs as a more generalized data structure and graph neural networks (GNNs) have been introduced to the spatiotemporal domains. In this work, we preliminarily explored graph re-sampling and GNNs for global SST forecasts, and GNNs show better one month ahead SST prediction than the persistence model in most oceans in terms of root mean square errors. Authors: Ding Ning (University of Canterbury); Varvara Vetrova (University of Canterbury); Karin Bryan (University of Waikato)
ICLR 2023	Uncovering the Spatial and Temporal Variability of Wind Resources in Europe: A Web-Based Data-Mining Tool (Papers Track) Abstract and authors: (click to expand) Abstract: We introduce REmap-eu.app, a web-based data-mining visualization tool of the spatial and temporal variability of wind resources. It uses the latest open-access dataset of the daily wind capacity factor in 28 European countries between 1979 and 2019 and proposes several user-configurable visualizations of the temporal and spatial variations of the wind power capacity factor. The platform allows for a deep analysis of the distribution, the cross-country correlation, and the drivers of low wind power events. It offers an easy-to-use interface that makes it suitable for the needs of researchers and stakeholders. The tool is expected to be useful in identifying areas of high wind potential and possible challenges that may impact the large-scale deployment of wind turbines in Europe. Particular importance is given to the visualization of low wind power events and to the potential of cross-border cooperations in mitigating the variability of wind in the context of increasing reliance on weather-sensitive renewable energy sources. Authors: Alban Puech (École Polytechnique); Jesse Read (Ecole Polytechnique)
NeurIPS 2022	Temperature impacts on hate speech online: evidence from four billion tweets (Papers Track) Abstract and authors: (click to expand) Abstract: Human aggression is no longer limited to the physical space but exists in the form of hate speech on social media. Here, we examine the effect of temperature on the occurrence of hate speech on Twitter and interpret the results in the context of climate change, human behavior and mental health. Employing supervised machine learning models, we identify hate speech in a data set of four billion geolocated tweets from over 750 US cities (2014 – 2020). We statistically evaluate the changes in daily hate tweets against changes in local temperature, isolating the temperature influence from confounding factors using binned panel-regression models. We find a low prevalence of hate tweets in moderate temperatures and observe sharp increases of up to 12% for colder and up to 22% for hotter temperatures, indicating that not only hot but also cold temperatures increase aggressive tendencies. Further, we observe that for extreme temperatures hate speech also increases as a percentage of total tweeting activity, crowding out non-hate speech. The quasi-quadratic shape of the temperature-hate tweet curve is robust across varying climate zones, income groups, religious and political beliefs. The prevalence of the results across climatic and socioeconomic splits points to limits in adaptation. Our results illuminate hate speech online as an impact channel through which temperature alters societal aggression. Authors: Annika Stechemesser (Potsdam Institute for Climate Impact Research); Anders Levermann (Potsdam Institute for Climate Impact Research); Leonie Wenz (Potsdam Institute for Climate Impact Research)
NeurIPS 2022	A Global Classification Model for Cities using ML (Papers Track) Abstract and authors: (click to expand) Abstract: This paper develops a novel data set for three key resources use; namely, food, water, and energy, for 9000 cities globally. The data set is then utilized to develop a clustering approach as a starting point towards a global classification model. This novel clustering approach aims to contribute to developing an inclusive view of resource efficiency for all urban centers globally. The proposed clustering algorithm is comprised of three steps: first, outlier detection to address specific city characteristics, then a Variational Autoencoder (VAE), and finally, Agglomerative Clustering (AC) to improve the classification results. Our results show that this approach is more robust and yields better results in creating delimited clusters with high Calinski-Harabasz Index scores and Silhouette Coefficient than other baseline clustering methods. Authors: Doron Hazan (MIT); Mohamed Habashy (Massachusetts Institute of Technology); Mohanned ElKholy (Massachusetts Institute of Technology); Omer Mousa (American University in Cairo); Norhan M Bayomi (MIT Environmental Solutions Initiative); Matias Williams (Massachusetts Institute of Technology); John Fernandez (Massachusetts Institute of Technology)
NeurIPS 2022	Learning Surrogates for Diverse Emission Models (Papers Track) Abstract and authors: (click to expand) Abstract: Transportation plays a major role in global CO2 emission levels, a factor that directly connects with climate change. Roadway interventions that reduce CO2 emission levels have thus become a timely requirement. An integral need in assessing the impact of such roadway interventions is access to industry-standard programmatic and instantaneous emission models with various emission conditions such as fuel types, vehicle types, cities of interest, etc. However, currently, there is a lack of well-calibrated emission models with all these properties. Addressing these limitations, this paper presents 1100 programmatic and instantaneous vehicular CO2 emission models with varying fuel types, vehicle types, road grades, vehicle ages, and cities of interest. We hope the presented emission models will facilitate future research in tackling transportation-related climate impact. The released version of the emission models can be found here. Authors: Edgar Ramirez Sanchez (MIT); Catherine H Tang (Massachusetts Institute of Technology); Vindula Jayawardana (MIT); Cathy Wu (MIT)
NeurIPS 2022	Modelling the performance of delivery vehicles across urban micro-regions to accelerate the transition to cargo-bike logistics (Proposals Track) Abstract and authors: (click to expand) Abstract: Light goods vehicles (LGV) used extensively in the last mile of delivery are one of the leading polluters in cities. Cargo-bike logistics has been put forward as a high impact candidate for replacing LGVs, with experts estimating over half of urban van deliveries being replaceable by cargo bikes, due to their faster speeds, shorter parking times and more efficient routes across cities. By modelling the relative delivery performance of different vehicle types across urban micro-regions, machine learning can help operators evaluate the business and environmental impact of adding cargo-bikes to their fleets. In this paper, we introduce two datasets, and present initial progress in modelling urban delivery service time (e.g. cruising for parking, unloading, walking). Using Uber’s H3 index to divide the cities into hexagonal cells, and aggregating OpenStreetMap tags for each cell, we show that urban context is a critical predictor of delivery performance. Authors: Max C Schrader (University of Alabama); Navish Kumar (IIT Kharagpur); Nicolas Collignon (University of Edinburgh); Maria S Astefanoaei (IT University of Copenhagen); Esben Sørig (Kale Collective); Soonmyeong Yoon (Kale Collective); Kai Xu (University of Edinburgh); Akash Srivastava (MIT-IBM)
NeurIPS 2022	Forecasting Global Drought Severity and Duration Using Deep Learning (Proposals Track) Abstract and authors: (click to expand) Abstract: Drought detection and prediction is challenging due to the slow onset of the event and varying degrees of dependence on numerous physical and socio-economic factors that differentiate droughts from other natural disasters. In this work, we propose DeepXD (Deep learning for Droughts), a deep learning model with 26 physics-informed input features for SPI (Standardised Precipitation Index) forecasting to identify and classify droughts using monthly oceanic indices, global meteorological and vegetation data, location (latitude, longitude) and land cover for the years 1982 to 2018. In our work, we propose extracting features by considering the atmosphere and land moisture and energy budgets and forecasting global droughts on a seasonal and an annual scale at 1, 3, 6, 9, 12 and 24 months lead times. SPI helps us to identify the severity and the duration of the drought to classify them as meteorological, agricultural and hydrological. Authors: Akanksha Ahuja (NOA); Xin Rong Chua (Centre for Climate Research Singapore)
NeurIPS 2022	Personalizing Sustainable Agriculture with Causal Machine Learning (Proposals Track) Best Paper: Proposals Abstract and authors: (click to expand) Abstract: To fight climate change and accommodate the increasing population, global crop production has to be strengthened. To achieve the "sustainable intensification" of agriculture, transforming it from carbon emitter to carbon sink is a priority, and understanding the environmental impact of agricultural management practices is a fundamental prerequisite to that. At the same time, the global agricultural landscape is deeply heterogeneous, with differences in climate, soil, and land use inducing variations in how agricultural systems respond to farmer actions. The "personalization" of sustainable agriculture with the provision of locally adapted management advice is thus a necessary condition for the efficient uplift of green metrics, and an integral development in imminent policies. Here, we formulate personalized sustainable agriculture as a Conditional Average Treatment Effect estimation task and use Causal Machine Learning for tackling it. Leveraging climate data, land use information and employing Double Machine Learning, we estimate the heterogeneous effect of sustainable practices on the field-level Soil Organic Carbon content in Lithuania. We thus provide a data-driven perspective for targeting sustainable practices and effectively expanding the global carbon sink. Authors: Georgios Giannarakis (National Observatory of Athens); Vasileios Sitokonstantinou (National Observatory of Athens); Roxanne Suzette Lorilla (National Observatory of Athens); Charalampos Kontoes (National Observatory of Athens)
NeurIPS 2022	Disaster Risk Monitoring Using Satellite Imagery (Tutorials Track) Abstract and authors: (click to expand) Abstract: Natural disasters such as flood, wildfire, drought, and severe storms wreak havoc throughout the world, causing billions of dollars in damages, and uprooting communities, ecosystems, and economies. Unfortunately, flooding events are on the rise due to climate change and sea level rise. The ability to detect and quantify them can help us minimize their adverse impacts on the economy and human lives. Using satellites to study flood is advantageous since physical access to flooded areas is limited and deploying instruments in potential flood zones can be dangerous. We are proposing a hands-on tutorial to highlight the use of satellite imagery and computer vision to study natural disasters. Specifically, we aim to demonstrate the development and deployment of a flood detection model using Sentinel-1 satellite data. The tutorial will cover relevant fundamental concepts as well as the full development workflow of a deep learning-based application. We will include important considerations such as common pitfalls, data scarcity, augmentation, transfer learning, fine-tuning, and details of each step in the workflow. Importantly, the tutorial will also include a case study on how the application was used by authorities in response to a flood event. We believe this tutorial will enable machine learning practitioners of all levels to develop new technologies that tackle the risks posed by climate change. 
We expect to deliver the below learning outcomes: • Develop various deep learning-based computer vision solutions using hardware-accelerated open-source tools that are optimized for real-time deployment • Create an optimized pipeline for the machine learning development workflow • Understand different performance metrics for model evaluation that are relevant for real world datasets and data imbalances • Understand the public sector’s efforts to support climate action initiatives and point out where the audience can contribute Authors: Kevin Lee (NVIDIA); Siddha Ganju (NVIDIA); Edoardo Nemni (UNOSAT)
NeurIPS 2022	Machine Learning for Predicting Climate Extremes (Tutorials Track) Abstract and authors: (click to expand) Abstract: Climate change has led to a rapid increase in the occurrence of extreme weather events globally, including floods, droughts, and wildfires. In the longer term, some regions will experience aridification while others will risk sinking due to rising sea levels. Typically, such predictions are done via weather and climate models that simulate the physical interactions between the atmospheric, oceanic, and land surface processes that operate at different scales. Due to the inherent complexity, these climate models can be inaccurate or computationally expensive to run, especially for detecting climate extremes at high spatiotemporal resolutions. In this tutorial, we aim to introduce the participants to machine learning approaches for addressing two fundamental challenges. We will walk the participants through a hands-on tutorial for predicting climate extremes relating to temperature and precipitation in 2 setups: (1) temporal forecasting: the goal is to predict climate variables into the future (both direct single step approaches and iterative approaches that roll out the model for several timesteps), and (2) spatial downscaling: the goal is to learn a mapping that transforms low-resolution outputs of climate models into high-resolution regional forecasts. Through introductory presentations and colab notebooks, we aim to expose the participants to (a) APIs for accessing and navigating popular repositories that host global climate data, such as the Copernicus data store, (b) identifying relevant datasets, including auxiliary data (e.g., other climate variables such as geopotential), (c) scripts for downloading and preprocessing relevant datasets, (d) algorithms for training machine learning models, (e) metrics for evaluating model performance, and (f) visualization tools for both the dataset and predicted outputs. 
The coding notebooks will be in Python. No prior knowledge of climate science is required. Authors: Hritik Bansal (UCLA); Shashank Goel (University of California Los Angeles); Tung Nguyen (University of California, Los Angeles); Aditya Grover (UCLA)
AAAI FSS 2022	NADBenchmarks - a compilation of Benchmark Datasets for Machine Learning Tasks related to Natural Disasters Abstract and authors: (click to expand) Abstract: Climate change has increased the intensity, frequency, and duration of extreme weather events and natural disasters across the world. While the increased data on natural disasters improves the scope of machine learning (ML) for this field, progress is relatively slow. One bottleneck is the lack of benchmark datasets that would allow ML researchers to quantify their progress against a standard metric. The objective of this short paper is to explore the state of benchmark datasets for ML tasks related to natural disasters, categorizing the datasets according to the disaster management cycle. We compile a list of existing benchmark datasets that have been introduced in the past five years. We propose a web platform where researchers can search for benchmark datasets in this domain, and develop a preliminary version of such a platform using our compiled list. This paper is intended to aid researchers in finding benchmark datasets to train their ML models on, and provide general directions for topics where they can contribute new benchmark datasets. Authors: Adiba Proma (University of Rochester), Md Saiful Islam (University of Rochester), Stela Ciko (University of Rochester), Raiyan Abdul Baten (University of Rochester) and Ehsan Hoque (University of Rochester)
AAAI FSS 2022	Curator: Creating Large-Scale Curated Labelled Datasets using Self-Supervised Learning Abstract and authors: (click to expand) Abstract: Applying Machine learning to domains like Earth Sciences is impeded by the lack of labeled data, despite a large corpus of raw data available in such domains. For instance, training a wildfire classifier on satellite imagery requires curating a massive and diverse dataset, which is an expensive and time-consuming process that can span from weeks to months. Searching for relevant examples in over 40 petabytes of unlabelled data requires researchers to manually hunt for such images, much like finding a needle in a haystack. We present a no-code end-to-end pipeline, Curator, which dramatically minimizes the time taken to curate an exhaustive labeled dataset. Curator is able to search massive amounts of unlabelled data by combining self-supervision, scalable nearest neighbor search, and active learning to learn and differentiate image representations. The pipeline can also be readily applied to solve problems across different domains. Overall, the pipeline makes it practical for researchers to go from just one reference image to a comprehensive dataset in a diminutive span of time. Authors: Tarun Narayanan (SpaceML), Ajay Krishnan (SpaceML), Anirudh Koul (Pinterest, SpaceML, FDL) and Siddha Ganju (NVIDIA, SpaceML, FDL)
AAAI FSS 2022	Rethinking Machine Learning for Climate Science: A Dataset Perspective Abstract and authors: (click to expand) Abstract: The growing availability of data sources is a predominant factor enabling the widespread success of machine learning (ML) systems across a wide range of applications. Typically, training data in such systems constitutes a source of ground-truth, such as measurements about a physical object (e.g., natural images) or a human artifact (e.g., natural language). In this position paper, we take a critical look at the validity of this assumption for datasets for climate science. We argue that many such climate datasets are uniquely biased due to the pervasive use of external simulation models (e.g., general circulation models) and proxy variables (e.g., satellite measurements) for imputing and extrapolating in-situ observational data. We discuss opportunities for mitigating the bias in the training and deployment of ML systems using such datasets. Finally, we share views on improving the reliability and accountability of ML systems for climate science applications. Authors: Aditya Grover (UCLA)
NeurIPS 2021	High-resolution rainfall-runoff modeling using graph neural network (Papers Track) Abstract and authors: (click to expand) Abstract: Time-series modeling has shown great promise in recent studies using the latest deep learning algorithms such as LSTM (Long Short-Term Memory). These studies primarily focused on watershed-scale rainfall-runoff modeling or streamflow forecasting, but the majority of them only considered a single watershed as a unit. Although this simplification is very effective, it does not take into account spatial information, which could result in significant errors in large watersheds. Several studies investigated the use of GNN (Graph Neural Networks) for data integration by decomposing a large watershed into multiple sub-watersheds, but each sub-watershed is still treated as a whole, and the geoinformation contained within the watershed is not fully utilized. In this paper, we propose the GNRRM (Graph Neural Rainfall-Runoff Model), a novel deep learning model that makes full use of spatial information from high-resolution precipitation data, including flow direction and geographic information. When compared to baseline models, GNRRM has less over-fitting and significantly improves model performance. Our findings support the importance of hydrological data in deep learning-based rainfall-runoff modeling, and we encourage researchers to include more domain knowledge in their models. Authors: Zhongrun Xiang (University of Iowa); Ibrahim Demir (The University of Iowa)
NeurIPS 2021	Towards Automatic Transformer-based Cloud Classification and Segmentation (Papers Track) Abstract and authors: (click to expand) Abstract: Clouds have been demonstrated to have a huge impact on the energy balance, temperature, and weather of the Earth. Classification and segmentation of clouds and coverage factors is crucial for climate modelling, meteorological studies, solar energy industry, and satellite communication. For example, clouds have a tremendous impact on short-term predictions or 'nowcasts' of solar irradiance and can be used to optimize solar power plants and effectively exploit solar energy. However, even today, cloud observation requires the intervention of highly-trained professionals to document their findings, which introduces bias. To overcome these issues and contribute to climate change technology, we propose, to the best of our knowledge, the first two transformer-based models applied to cloud data tasks. We use the CCSD Cloud classification dataset and achieve 90.06% accuracy, outperforming all other methods. To demonstrate the robustness of transformers in this domain, we perform Cloud segmentation on the SWIMSEG dataset and achieve 83.2% IoU, also outperforming other methods. With this, we signal a potential shift away from pure CNN networks. Authors: Roshan Roy (Birla Institute of Technology and Science, Pilani); Ahan M R (BITS Pilani); Vaibhav Soni (MANIT Bhopal); Ashish Chittora (BITS Pilani)
NeurIPS 2021	A Risk Model for Predicting Powerline-induced Wildfires in Distribution System (Proposals Track) Abstract and authors: (click to expand) Abstract: The power grid is one of the most common causes of wildfires that result in tremendous economic loss and significant life risk. In this study, we propose to use machine learning techniques to build a risk model for predicting powerline-induced wildfires in distribution system. We collect weather, vegetation, and infrastructure data for all feeders in Pacific Gas & Electricity territory. This study will contribute to a deeper understanding of powerline-induced wildfire prediction and provide valuable suggestions for wildfire mitigation planning. Authors: Mengqi Yao (University of California Berkeley)
NeurIPS 2021	Predicting Cascading Failures in Power Systems using Graph Convolutional Networks (Proposals Track) Abstract and authors: (click to expand) Abstract: Worldwide targets have been set to increase renewable power generation in electricity networks as part of efforts to combat climate change. Consequently, a secure power system that can handle the complexities resulting from increased renewable power integration is crucial. One particular complexity is the possibility of cascading failures: a quick succession of multiple component failures that takes down the system and might also lead to a blackout. Viewing the prediction of cascading failures as a binary classification task, we explore the efficacy of Graph Convolutional Networks (GCNs) in detecting the early onset of a cascading failure. We perform experiments based on simulated data from a benchmark IEEE test system. Our preliminary findings show that GCNs achieve higher accuracy scores than other baselines, which bodes well for detecting cascading failures. It also motivates a more comprehensive study of graph-based deep learning techniques for the current problem. Authors: Tabia Ahmad (University of Strathclyde); Yongli Zhu (Texas A&M University); Panagiotis Papadopoulos (University of Strathclyde)
ICML 2021	DroughtED: A dataset and methodology for drought forecasting spanning multiple climate zones (Papers Track) Abstract and authors: (click to expand) Abstract: Climate change exacerbates the frequency, duration and extent of extreme weather events such as drought. Previous attempts to forecast drought conditions using machine learning have focused on regional models which have two major limitations for national drought management: (i) they are trained on localised climate data and (ii) their architectures prevent them from being applied to new heterogeneous regions. In this work, we present a new large-scale dataset for training machine learning models to forecast national drought conditions, named DroughtED. The dataset consists of globally available meteorological features widely used for drought prediction, paired with location meta-data which has not previously been utilised for drought forecasting. Here we also establish a baseline on DroughtED and present the first research to apply deep learning models - Long Short-Term Memory (LSTMs) and Transformers - to predict county-level drought conditions across the full extent of the United States. Our results indicate that DroughtED enables deep learning models to learn cross-region patterns in climate data that contribute to drought conditions and models trained on DroughtED compare favourably to state-of-the-art drought prediction models trained on individual regions. Authors: Christoph D Minixhofer (The University of Edinburgh); Mark Swan (The University of Edinburgh); Calum McMeekin (The University of Edinburgh); Pavlos Andreadis (The University of Edinburgh)
ICML 2021	Online LSTM Framework for Hurricane Trajectory Prediction (Papers Track) Abstract and authors: (click to expand) Abstract: Hurricanes are high-intensity tropical cyclones that can cause severe damage when the storms make landfall. Accurate long-range prediction of hurricane trajectories is an important but challenging problem due to the complex interactions between the ocean and atmosphere systems. In this paper, we present a deep learning framework for hurricane trajectory forecasting by leveraging the outputs from an ensemble of dynamical (physical) models. The proposed framework employs a temporal decay memory unit for imputing missing values in the ensemble member outputs, coupled with an LSTM architecture for dynamic path prediction. The framework is extended to an online learning setting to capture concept drift present in the data. Empirical results suggest that the proposed framework significantly outperforms various baselines, including the official forecasts from the U.S. National Hurricane Center (NHC). Authors: Ding Wang (Michigan State University); Pang-Ning Tan (MSU)
ICML 2021	IowaRain: A Statewide Rain Event Dataset Based on Weather Radars and Quantitative Precipitation Estimation (Papers Track) Abstract and authors: (click to expand) Abstract: Effective environmental planning and management to address climate change could be achieved through extensive environmental modeling with machine learning and conventional physical models. In order to develop and improve these models, practitioners and researchers need comprehensive benchmark datasets that are prepared and processed with environmental expertise that they can rely on. This study presents an extensive dataset of rainfall events for the state of Iowa (2016-2019) acquired from the National Weather Service Next Generation Weather Radar (NEXRAD) system and processed by a quantitative precipitation estimation system. The dataset presented in this study could be used for better disaster monitoring, response and recovery by paving the way for both predictive and prescriptive modeling. Authors: Muhammed A Sit (The University of Iowa); Bongchul Seo (IIHR—Hydroscience & Engineering, The University of Iowa); Ibrahim Demir (The University of Iowa)
NeurIPS 2020	Quantifying the presence of air pollutants over a road network in high spatio-temporal resolution (Papers Track) Abstract and authors: (click to expand) Abstract: Monitoring air pollution plays a key role in reducing its impact on the environment and on human health. Traditionally, two main sources of information about the quantity of pollutants over a city are used: ground-level monitoring stations (when available) and satellite remote sensing. In addition to these two, other methods have been developed in recent years that aim at understanding how traffic emissions behave in space and time at a finer scale, taking into account human mobility patterns. We present a simple and versatile framework for estimating the quantity of four air pollutants (CO2, NOx, PM, VOC) emitted by private vehicles moving on a road network, starting from raw GPS traces and information about vehicles' fuel type, and use this framework to analyse how such pollutants are distributed over the road networks of different cities. Authors: Matteo Bohm (Sapienza University of Rome); Mirco Nanni (ISTI-CNR Pisa, Italy); Luca Pappalardo (ISTI)
NeurIPS 2020	FlowDB: A new large scale river flow, flash flood, and precipitation dataset (Papers Track) Abstract and authors: (click to expand) Abstract: Flooding results in 8 billion dollars of damage annually in the US and causes the most deaths of any weather-related event. Due to climate change, scientists expect more heavy precipitation events in the future. However, no current datasets exist that contain both hourly precipitation and river flow data. We introduce a novel hourly river flow and precipitation dataset and a second subset of flash flood events with damage estimates and injury counts. Using these datasets we create two challenges: (1) general stream flow forecasting and (2) flash flood damage estimation. We also create a public benchmark and a Python package to make it easy to add new models. Additionally, in the future we aim to augment our dataset with snow pack data and soil index moisture data to improve predictions. Authors: Isaac Godfried (CoronaWhy)
NeurIPS 2020	EarthNet2021: A novel large-scale dataset and challenge for forecasting localized climate impacts (Papers Track) Abstract and authors: (click to expand) Abstract: Climate change is global, yet its concrete impacts can strongly vary between different locations in the same region. Seasonal weather forecasts currently operate at the mesoscale (> 1 km). For more targeted mitigation and adaptation, modelling impacts to < 100 m is needed. Yet, the relationship between driving variables and Earth’s surface at such local scales remains unresolved by current physical models. Large Earth observation datasets now enable us to create machine learning models capable of translating coarse weather information into high-resolution Earth surface forecasts encompassing localized climate impacts. Here, we define high-resolution Earth surface forecasting as video prediction of satellite imagery conditional on mesoscale weather forecasts. Video prediction has been tackled with deep learning models. Developing such models requires analysis-ready datasets. We introduce EarthNet2021, a new, curated dataset containing target spatio-temporal Sentinel 2 satellite imagery at 20 m resolution, matched with high-resolution topography and mesoscale (1.28 km) weather variables. With over 32000 samples it is suitable for training deep neural networks. Comparing multiple Earth surface forecasts is not trivial. Hence, we define the EarthNetScore, a novel ranking criterion for models forecasting Earth surface reflectance. For model intercomparison we frame EarthNet2021 as a challenge with four tracks based on different test sets. These allow evaluation of model validity and robustness as well as model applicability to extreme events and the complete annual vegetation cycle. In addition to forecasting directly observable weather impacts through satellite-derived vegetation indices, capable Earth surface models will enable downstream applications such as crop yield prediction, forest health assessments, coastline management, or biodiversity monitoring. Find data, code, and how to participate at www.earthnet.tech. Authors: Christian Requena-Mesa (Computer Vision Group, Friedrich Schiller University Jena; DLR Institute of Data Science, Jena; Max Planck Institute for Biogeochemistry, Jena); Vitus Benson (Max-Planck-Institute for Biogeochemistry); Jakob Runge (Institute of Data Science, German Aerospace Center (DLR)); Joachim Denzler (Computer Vision Group, Friedrich Schiller University Jena, Germany); Markus Reichstein (Max Planck Institute for Biogeochemistry, Jena; Michael Stifel Center Jena for Data-Driven and Simulation Science, Jena)
NeurIPS 2020	Emerging Trends of Sustainability Reporting in the ICT Industry: Insights from Discriminative Topic Mining (Papers Track) Abstract and authors: (click to expand) Abstract: The Information and Communication Technologies (ICT) industry has a considerable climate change impact and accounts for approximately 3 percent of global carbon emissions. Despite the increasing availability of sustainability reports provided by ICT companies, we still lack a systematic understanding of what has been disclosed at an industry level. In this paper, we make the first major effort to use modern unsupervised learning methods to investigate the sustainability reporting themes and trends of the ICT industry over the past two decades. We build a cross-sector dataset containing 22,534 environmental reports from 1999 to 2019, of which 2,187 are ICT specific. We then apply CatE, a text embedding based topic modeling method, to mine specific keywords that ICT companies use to report on climate change and energy. As a result, we identify (1) important shifts in ICT companies' climate change narratives from physical metrics towards climate-related disasters, (2) key organizations with large influence on ICT companies, and (3) ICT companies' increasing focus on data center and server energy efficiency. Authors: Lin Shi (Stanford University); Nhi Truong Vu (Stanford University)