Data Gaps (Beta)

Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.

In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.

This project is currently in its beta phase, with ongoing improvements to content and usability. The information provided is not exhaustive, and may contain errors. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.

Use Case Gap Types Sectors
Accelerating and improving weather forecasting: Near-term (< 24 hours)

Accurate near-term (< 24 hours ahead) weather forecasting is critical for climate change mitigation (e.g., solar panel deployment) and adaptation (e.g., crisis management during disasters), with applications requiring high spatial and temporal resolution of temperature, precipitation, wind, and cloud coverage.

Machine learning can help make these forecasts more computationally efficient and accurate while maintaining or improving the high resolution needed for climate applications.

The main data gaps include limited geographic coverage (primarily US-centric data), extremely large data volumes that are difficult to transfer and process, and inconsistent data formats from different sources.

Addressing these gaps requires expanding coverage to global regions (especially the Global South), providing cloud-based computational resources alongside the data, and developing standardized formats for multi-source data integration.
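As a rough illustration of what a standardized format for multi-source integration could involve, the sketch below converts records from two differently structured sources into one shared schema. The provider names, field names, and units are hypothetical and chosen only for illustration; they do not describe the actual layout of ASOS, HRRR, or any other dataset listed here.

```python
def to_common_schema(record, source):
    """Convert one source-specific record into a shared schema (temperature in kelvin)."""
    if source == "provider_a":
        # Hypothetical layout: temperature in Fahrenheit under the key "temp_f".
        kelvin = (record["temp_f"] - 32.0) * 5.0 / 9.0 + 273.15
        station, time = record["station"], record["time"]
    elif source == "provider_b":
        # Hypothetical layout: temperature in Celsius under the key "t2m_c".
        kelvin = record["t2m_c"] + 273.15
        station, time = record["id"], record["valid_time"]
    else:
        raise ValueError(f"unknown source: {source}")
    # Every source ends up with the same keys and the same unit.
    return {"station_id": station, "time_utc": time, "temperature_k": round(kelvin, 2)}


records = [
    to_common_schema({"station": "A001", "time": "2024-01-01T00:00Z", "temp_f": 212.0}, "provider_a"),
    to_common_schema({"id": "B042", "valid_time": "2024-01-01T00:00Z", "t2m_c": 0.0}, "provider_b"),
]
```

Once every source is mapped to the same keys and units, downstream ML code no longer needs source-specific branches.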

Dataset | Data Gap Summary
Automated Surface Observing System (ASOS)

Data volume is large and only data specific to the US is available.

High-Resolution Rapid Refresh (HRRR) weather forecast

Data volume is large, and only data covering the US is available.

Regularly gridded high-resolution atmospheric observations

An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.

Satellite imagery – Multi-Radar/Multi-Sensor System

Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.

Accelerating building energy models

Building energy modeling (also called building performance simulation) is key across an array of use cases that can help reduce energy demand in buildings, including architectural design, heating, ventilation, and air conditioning (typically abbreviated HVAC) design and control, building performance rating, and building stock analysis. 

Traditional building energy modeling software, such as EnergyPlus, relies on detailed physics models with significant computational complexity and processing time. Machine learning models can significantly enhance evaluation by providing fast emulators for these models based on synthetic and real-world data, enabling faster prototyping and optimization of building design and operations along multiple comfort, consumption, and environmental objectives.

Traditional models and ML-based emulation both require precise inputs about the building design, its usage, and the physical and environmental conditions surrounding it. However, information on building usage and design is often kept in silos, while information about the surroundings is, when available, dispersed across various datasets. Very few benchmarks gather all of this information for given buildings.

Closing these gaps involves releasing anonymized usage data, working on building bridges between relevant datasets, and developing benchmark datasets. This may enable testing models across more geographies and building types to reduce existing biases and uncertainties attached to building energy models.

Dataset | Data Gap Summary
Benchmark datasets for building energy modeling

Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.

Computational fluid dynamics simulation for building energy models

Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, which makes them difficult to include in the early phase of the design process, when building morphology can be optimized to reduce the future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties that may not be present in traditional urban building types. Interpreting model outputs requires domain knowledge, and the large volumes of synthetic data generated for different wind directions are challenging to manage. Future data collection for verifying simulation outputs could benefit surrogate or proxy approaches to solving the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern building approaches, which excludes the passive building techniques of indigenous communities, known as vernacular architecture, from design consideration.

Residential daylight performance metric (DPM) data

While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings do not receive a similar focus, which is unusual given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics associated with direct sunlight access, rely on annual averages for seasons, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, the illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyle. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in a space and its effects on the thermal comfort and operational consumption of traditional urban residential spaces. Vernacular architecture, which is specific to a local region and culture, may not share these objectives and may instead favor more indoor-outdoor transitional spaces, earthen materials, and less focus on windows and incident natural sunlight.

Accelerating data-driven generation of climate simulations

Climate simulation using physics-based Earth system models is computationally intensive and time-consuming, limiting the exploration of different climate scenarios. 

ML can accelerate this process by creating surrogate models that approximate complex Earth system model simulations, enabling rapid generation of climate projections under various greenhouse gas emission scenarios.

Current ML approaches are limited by the availability of diverse training data from multiple climate models, with most datasets featuring only single-model simulations or inconsistent data structures across models.

Addressing these gaps requires standardizing data formats across climate models, making high-volume data more accessible through cloud-based solutions, and improving model quality to reduce biases and uncertainties in simulations. Closing these data gaps would enable more robust ML emulators capable of producing reliable climate projections at a fraction of the computational cost, accelerating climate research and supporting more informed policy decisions.

Dataset | Data Gap Summary
ClimateBench v1.0 (benchmark dataset for earth system models)

The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.

ClimateSet (ML-ready earth system model inputs/outputs)

No significant data gap identified yet.

CMIP6 (earth system model intercomparison data)

The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.

Accelerating distribution-side hosting capacity estimations

Transitioning power grids from carbon-based generation to renewable sources requires restructuring from unidirectional to bidirectional energy networks, which stresses existing systems—especially at the low-voltage distribution level. The hosting capacity of distribution feeders determines how much distributed renewable generation can be safely integrated without triggering safety equipment or compromising power quality.

Traditional methods for assessing distribution network hosting capacity rely on computationally expensive power flow simulations that are difficult to perform in real-time. Machine learning models can serve as surrogate models by capturing spatio-temporal patterns across multiple data streams, enabling real-time hosting capacity estimation and accelerated scenario evaluation through reinforcement learning.

A significant data gap is the limited availability of real distribution feeder data, requiring researchers to rely on simulations that may not accurately reflect actual grid conditions due to differences in load patterns, environmental factors, and distributed energy resource (DER) penetration levels.

Distribution system operators, utilities, and researchers can collaborate to improve data sharing while protecting sensitive information, thereby enabling more accurate hosting capacity assessments and facilitating higher renewable energy integration in distribution networks.

Dataset | Data Gap Summary
Distribution system simulators

While OpenDSS and GridLAB-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.

Accelerating post-disaster damage assessments

Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies.

ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.

Current datasets for ML-based damage assessment face significant geographic bias and granularity issues, limiting their effectiveness in global contexts and for detailed damage classification.

Expanding geographic coverage beyond North America and enhancing damage severity classifications would enable more accurate and globally applicable ML damage assessment models, improving disaster response worldwide.

Dataset | Data Gap Summary
Financial loss datasets related to the impacts of disasters

Financial loss data for disasters is primarily proprietary and inaccessible to researchers, limiting the development of comprehensive disaster impact assessment models.

Satellite imagery

Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.

xBD Dataset (pre- and post-disaster satellite imagery)

The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.

Accelerating the design of new carbon-absorbing materials

Carbon sequestration through absorption methods can effectively reduce CO2 levels in the atmosphere. Engineered molecules known as carbon sorbents can be designed to bind selectively to CO2. Traditionally, characterizing these molecules requires in-lab experimentation, which can be time- and resource-intensive because experiments must be replicated to identify adsorbent characteristics. Additionally, the search space of possible molecules can be very large and non-trivial to explore directly through experiment.

Machine learning can significantly accelerate materials discovery by systematically generating and evaluating candidate molecule properties based on structure, thereby facilitating rapid iteration.

There is a lack of openly accessible lab measurements to train ML simulation models.

Multiple initiatives could be taken to close this gap, including creating industry-research data sharing initiatives or establishing mandatory data sharing requirements for scientific publications.

Dataset | Data Gap Summary
Lab measurements of material property and carbon absorption

The major challenge is that data is not shared with the public.

Assessing rooftop solar photovoltaic potential

Accelerating residential solar PV deployment is essential for decarbonizing energy systems, yet systematically assessing rooftop solar PV potential at scale remains a significant challenge.

Machine learning can analyze aerial imagery and other data to estimate solar potential automatically, enabling faster and more targeted solar deployment. One example is Google's Project Sunroof.

Key data gaps include limited high-resolution imagery, incomplete rooftop metadata, and scarce historical data, which reduce model accuracy and coverage.

Addressing these gaps through standardized data collection, integration of diverse sources, and better validation can help scale and improve this use case.

Dataset | Data Gap Summary
Building stock – from cadaster and aerial imagery

This use case requires 3D models of buildings that include roof geometries (surfaces and angles), which only a few datasets, mostly in Europe, currently provide.

JRC PVGIS (solar radiation data)

This dataset does not have major data gaps for this use case, but there are some approximations and other errors in the data to be considered.

Automating individual re-identification for wildlife

Identifying individual animals within wildlife populations is critical for monitoring endangered species, understanding their behaviors, and developing effective conservation strategies for biodiversity preservation. 

Computer vision and machine learning techniques enable automatic individual identification at scale, helping researchers track specific animals over time without invasive tagging methods.

The scarcity of publicly available and well-annotated datasets poses a significant challenge for applying ML in wildlife identification, with the most valuable data scattered across individual research labs or organizations rather than centralized repositories.

Addressing this requires fostering a culture of data sharing in the ecological community through incentives like financial rewards and recognition for data collectors, while establishing standardized pipelines and infrastructures to aggregate existing annotated data for model training.

Dataset | Data Gap Summary
Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Environmental DNA (eDNA)

A significant challenge for eDNA-based monitoring is the incompleteness of barcoding reference databases, which limits the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.

Enabling 2D to 3D shape recovery and pose estimation of animals

3D shape recovery and pose estimation refer to the reconstruction of the 3D shapes and poses of animals from 2D images. This information can provide non-invasive insights into animals’ health, age, or reproductive status in their natural environment, which are important for biodiversity monitoring. 

ML-based computer vision techniques have been used to construct more accurate estimations of 3D animal shapes and poses. 

However, there is a lack of open annotated datasets to train models.

Greater effort to curate and release such datasets could be pivotal to unlocking this use case.

Dataset | Data Gap Summary
Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Enabling assessments of rest area capacity and use for electric truck charging

Electric trucks will play a critical role in decarbonizing freight transport, but they require reliable access to break-time and overnight charging infrastructure. Rest areas along highways are suitable locations for this infrastructure, yet their capacity and utilization are not known at scale. Utilization rates are needed to understand infrastructure and space needs and constraints.

There is currently little data on the capacity and use of rest areas. ML could help generate such information at scale from limited samples, for example by applying remote sensing techniques to satellite imagery.

The lack of ground truth data and the difficulties in accessing high-resolution satellite imagery are some of the most important bottlenecks for this use case.

Initiating projects with highway operators, who may find an economic interest in developing this monitoring technique, may help address both issues. 

Dataset | Data Gap Summary
Occupancy data from rest areas from cameras

This data is generally not shared and is accessible for only a few rest areas.

Satellite imagery

Obtaining truck counts requires satellite imagery with both high temporal and high spatial resolution that is cloud-free over several kilometers. The usual cloud-free products are not suitable, because the timestamp attached to each image matters and a single image should cover several kilometers of a street or highway.

Enabling inference of city-level transportation mode shares

Knowing modal shares in cities is crucial for climate change mitigation because it helps identify how people travel and the extent to which low-carbon options like walking, cycling, and public transit are being used. It also makes it possible to study what influences such choices in order to design effective policy interventions.

ML has the potential to predict modal share based on city characteristics, complementing traditional transportation surveys that are infrequent and not available for all cities. This enables new opportunities for tracking modal shifts and linking them to various policies and measures.

City-level modal share and EUROSTAT socio-economic data face challenges with inconsistent methodologies, outdated information, changing boundaries, and missing values, limiting their reliability, comparability, and usability.

Greater effort to harmonize data collection procedures at the source would reduce the need for after-the-fact harmonization and increase the robustness of the data.

Dataset | Data Gap Summary
City-level transportation mode share data

Both data quality and time-series consistency are problematic: no dataset covers multiple cities with a single methodology while also being directly usable and highly trustworthy for scientific research.

EUROSTAT city-level socio-economic data

EUROSTAT city-level socio-economic data faces challenges with inconsistent time series due to changing boundaries, incomplete validation and aggregation issues, and missing values that limit its reliability and usability.

Enabling non-intrusive electricity load monitoring

Non-intrusive load monitoring (NILM) is critical for disaggregating building electricity consumption into individual appliance profiles, enabling targeted energy efficiency strategies, demand response, and better supply/demand matching to reduce carbon emissions and maintain grid stability. 

AI techniques can analyze patterns in aggregate electricity data to identify individual appliance signatures without requiring separate meters for each device, providing cost-effective insights for both consumers and utilities.

The effectiveness of AI-based NILM is hindered by insufficient training data that represents diverse appliance types, usage patterns, and building characteristics across different regions, limiting model accuracy and generalizability in real-world settings.

Utilities, researchers, and manufacturers can collaborate to create standardized, privacy-preserving datasets through controlled data collection campaigns and by developing synthetic data generation techniques that capture the diversity of appliance signatures and usage patterns.
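To make the disaggregation task concrete, here is a minimal sketch of combinatorial matching, a simple classical NILM baseline rather than one of the learned approaches discussed above. It searches for the set of appliances whose rated powers best explain a single aggregate meter reading. The appliance names and wattages are hypothetical values chosen for illustration only.

```python
from itertools import combinations

# Hypothetical rated power draws in watts; illustrative values only.
APPLIANCES = {"fridge": 150, "kettle": 2000, "washer": 500, "tv": 100}


def disaggregate(aggregate_watts, appliances=APPLIANCES):
    """Return the appliance set whose summed rated power is closest to the reading."""
    names = sorted(appliances)
    # Start from the "nothing is on" hypothesis.
    best_subset, best_error = (), abs(aggregate_watts)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            error = abs(aggregate_watts - sum(appliances[n] for n in subset))
            if error < best_error:
                best_subset, best_error = subset, error
    return set(best_subset), best_error


active, residual = disaggregate(2150)  # fridge and kettle together draw 2150 W
```

The exhaustive search grows exponentially with the number of appliances, and it assumes every appliance signature is known in advance. That assumption is exactly what breaks down without diverse, representative training data.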

Dataset | Data Gap Summary
Pecan Street (appliance-level consumption data)

Pecan Street DataPort requires both academic and non-academic users to purchase access through licensing, the terms of which vary with the building data features requested. Coverage is concentrated primarily in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may need energy-efficiency upgrades and retrofits. For customer segmentation studies and consumer-in-the-loop load consumption modeling, the annual socio-demographic survey data may be too coarse to provide insight into how household members' behavior affects consumption profiles over time.

Sub-metered appliance-level data

For accurate NILM studies, benchmark datasets need to include not only consumption but also local power generation (e.g., from rooftop solar), as it can affect the overall aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are typically not included. Most of the buildings covered are single-family housing units, which limits the diversity of representation. Furthermore, most datasets are no longer maintained once the study closes.

Enabling predictions of materials optimised for filtration, catalysis, electrics & magnetics

A key climate-related challenge is the reduction of greenhouse gas emissions and the development of sustainable industrial processes. This includes improving gas and liquid filtration (e.g., using zeolites and metal-organic frameworks, or MOFs, for hydrogen separation and tailings reclamation), advancing catalysis to minimize industrial waste, and replacing carbon-intensive systems through electrification and novel material design for carbon capture.

AI accelerates progress by predicting novel material structures and their properties through generative models and machine learning, enabling faster discovery of effective materials. It also reduces the computational cost of traditional simulations, such as density functional theory (DFT), and refines interatomic potentials for molecular dynamics, making material optimization and synthesis route identification significantly more efficient.

Synthesis experimental data, both positive and negative, is necessary to train algorithms, but negative data is not publicly available.

Research on this use case would be facilitated if industrial organizations performing such experiments could share negative results.

Dataset | Data Gap Summary
Negative experimental synthesis data

Negative – not only positive – synthesis experimental data is necessary to train algorithms, but such data is not publicly available.

Enhancing bias-correction of climate projections

Climate projections provide essential information about future climate conditions, guiding critical mitigation and adaptation efforts such as disaster risk assessments and power grid optimization. 

ML enhances the accuracy of these projections by bias-correcting the outputs of physics-based climate models, such as those participating in CMIP6, by learning relationships between historical simulations and observed ground truth data.

Large uncertainties in climate projections and inconsistent data formats across models create significant barriers for developing robust ML bias-correction methods. 

Improved model ensemble techniques and standardized data formats can enhance projection reliability and enable more effective climate risk planning.
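For intuition about what bias correction does, the sketch below implements empirical quantile mapping, a classical statistical baseline that ML methods are often compared against. It is not the ML approach itself. Each new model value is replaced by the observed value at the same empirical quantile of the historical record. The function name and the toy numbers are illustrative assumptions.

```python
from bisect import bisect_right


def empirical_quantile_map(model_hist, obs_hist, model_new):
    """Map each new model value to the observation at the same empirical quantile."""
    sim_sorted = sorted(model_hist)
    obs_sorted = sorted(obs_hist)
    n_sim, n_obs = len(sim_sorted), len(obs_sorted)
    corrected = []
    for x in model_new:
        # Rank of x within the historical simulation (count of values <= x).
        rank = bisect_right(sim_sorted, x)
        # Matching position in the observations, clamped to the valid range,
        # so values outside the historical range map to the observed extremes.
        idx = min(max(round(rank * n_obs / n_sim) - 1, 0), n_obs - 1)
        corrected.append(obs_sorted[idx])
    return corrected


# Toy example: the simulation runs a constant 2 degrees warmer than observations.
obs = [10, 12, 14, 16, 18]
sim = [12, 14, 16, 18, 20]
print(empirical_quantile_map(sim, obs, [12, 16, 20]))  # [10, 14, 18]
```

Because the mapping clamps at the observed extremes, it cannot extrapolate beyond the historical range. This is one reason sparse or short observational records limit bias-correction quality.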

Dataset | Data Gap Summary
CMIP6 (earth system model intercomparison data)

Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.

ECMWF ERA5 Atmospheric Reanalysis

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

Ground-Based Weather Station Observations

Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.

Enhancing bias-correction of weather forecasts

Accurate weather forecasts are critical for agriculture, disaster preparedness, and energy management, yet physics-based numerical weather models contain systematic biases that reduce forecast reliability, especially for extreme weather events.

ML can improve forecast accuracy by post-processing outputs from numerical weather prediction models and learning to correct the systematic biases inherent in physics-based forecasting systems.

The primary data gap is limited public access to high-resolution real-time forecast data, as most operational forecast products are costly and proprietary, hindering development of bias-correction algorithms.

Increased data sharing partnerships between meteorological agencies and research institutions, along with development of accessible benchmark datasets, could democratize access to high-quality forecast data and accelerate ML-based improvements to weather prediction.

Dataset | Data Gap Summary
ECMWF ENS (global 9-km 15-day ahead weather model ensemble)

As with HRES, the biggest challenge with ENS is that only a portion of it is freely available to the public.

ECMWF HRES (global 9-km 10-day ahead weather model)

Limited public access to real-time high-resolution forecasts and computational challenges from large data volumes restrict ML model development and validation for operational weather bias correction.

Ground-Based Weather Station Observations

Sparse spatial coverage, restricted data access in many regions, and the need for gridding point measurements limit the effectiveness of station observations for training and validating ML bias-correction models.

Enhancing digital reconstructions of the environment

Digital reconstruction of the environment using remote sensing data is crucial for understanding habitat conditions and their impacts on wildlife, enabling more effective conservation strategies in the face of climate change.

ML enhances this process by efficiently analyzing large volumes of data from multiple sources, producing more detailed and accurate environmental reconstructions.

A key data gap is the limited availability of high-resolution imagery, with most high-quality data being commercial and not freely accessible, particularly affecting studies that require detailed environmental monitoring.

Fostering a data-sharing culture through incentives for collectors, creating standardized annotation pipelines, and making commercial high-resolution satellite imagery more accessible would significantly advance ML-enabled environmental monitoring for biodiversity conservation.

Dataset | Data Gap Summary
Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Environmental DNA (eDNA)

One data gap is the incompleteness of barcoding reference databases.

Satellite imagery

Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g., bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed, but most are not freely available to the public.

Enhancing energy policy and market analysis

Energy transition policies require comprehensive data on generation, emissions, and financial performance across power systems, but fragmented government datasets make evidence-based policymaking challenging.

AI and data fusion techniques can integrate scattered regulatory data from utilities and energy companies to create analysis-ready datasets that inform carbon pricing, renewable incentives, and grid modernization policies.

Inconsistent data formats, missing identifiers, and poor documentation across government agencies create significant barriers for automated data processing and analysis.

Standardized reporting formats, improved documentation, and centralized data platforms could enable more effective AI-driven policy analysis and accelerate evidence-based energy transitions.

Dataset | Data Gap Summary
The Public Utility Data Liberation Project (PUDL)

Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.

Enhancing estimations of methane emissions from rice paddies

Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially how they vary with different agricultural practices, is crucial for addressing climate change. 

ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.

Currently, there is a lack of direct observation of methane emissions from rice paddies that could be used to train ML models.

Real-world data collection is needed to unlock this use case.

Dataset | Data Gap Summary
Direct measurements of methane emissions from rice paddies

There is a lack of direct observation of methane emissions from rice paddies.

Enhancing marine wildlife detection and species classification

Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These tasks involve identifying and categorizing different marine species. 

ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.

Current bottlenecks due to data availability include the lack of sufficient labeled data and the lack of open data. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Although a lot of ocean data is collected, there are massive gaps in coverage, with heavy biases towards coastal regions. Collecting data from the deep ocean is technologically challenging, and financial incentives are lacking. High seas fall outside national jurisdictions, so data collection often occurs only through mining companies, military operations, or ad hoc research expeditions. The absence of marine protected areas on high seas and the migratory nature of species like phytoplankton further complicate data collection.

Open-source databases containing labeled data and label editors such as FathomNet can increase the amount of relevant data for training ML models. Initiatives like the Ocean Biodiversity Information System (OBIS) and the Integrated Ocean Observing System (IOOS) contribute to data availability more broadly. Data collection efforts may strategically target places where biodiversity is high but currently available data is sparse. Financial tools or regulations could incentivize data collection.

Dataset | Data Gap Summary
FathomNet (marine wildlife annotated imagery)

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Enhancing power grid-vegetation management for wildfire risk mitigation

Vegetation encroachment near high-voltage transmission lines can lead to outages and pose major fire risks, compromising the safety and reliability of the power grid and potentially igniting dangerous wildfires that release stored carbon and endanger wildlife.

Machine learning, especially computer vision applied to remote sensing imagery and historic management records, can accelerate vegetation management by identifying overgrowth areas and tracking dynamic seasonal vegetation growth near grid infrastructure.

Key data gaps include limited access to proprietary utility data, sparse LiDAR captures leading to incomplete scans, insufficient temporal and spatial coverage, and preprocessing requirements for imagery from multiple sensor platforms.

Solutions include establishing partnerships with utilities for data sharing, coordinating multiple robot/UAV inspection trips for improved coverage, developing preprocessing pipelines for diverse sensor data, and implementing regular monitoring schedules to capture seasonal vegetation changes.

Dataset | Data Gap Summary
Aerial power line corridor inspection data

UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.

Power line robot inspection imagery

Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration. It may also be limited by camera resolution when detecting subtle degradation patterns.

Enhancing the scalability and robustness of building stock assessments

Rapid decarbonization of the building sector requires understanding the current composition and differentials in energy performance across buildings to guide the deployment of solutions, including insulation retrofits, heat pump installations, or district heating system provision. 

ML can support the deployment of large-scale building stock models that include increasingly granular information on buildings, e.g., their geometry, construction period, materials, or energy performance certificates. Relevant existing applications of ML include inferring building characteristics across geographies as proxies for missing measured data, or processing satellite imagery to identify buildings in regions where up-to-date cadastral data is not openly accessible.

Reliable building-level information on energy performance – and strong predictors of it, such as precise floor space or walls’ insulation – remains very limited, even in high-income countries. This causes noise and uncertainty in assessments and makes training ML models that generalize well across geographies difficult.  

These data gaps can be alleviated by increased data releases from governments, energy operators, and real estate companies. Regulations requiring the disclosure of energy data and the use of standards to harmonize measurement approaches across countries can also play a big role.

Dataset | Data Gap Summary
Building energy performance certificates

Energy Performance Certificate (EPC) datasets face major gaps in aggregation, provenance, documentation, missing components, structure, and timeliness. Differences in formats and methodologies across countries, limited metadata, outdated records, and missing key attributes can affect their usability.

Building stock – from cadaster and aerial imagery

Datasets tend to face gaps in obtainability, reliability, usability, and sufficiency. These include challenges in finding and interpreting data due to inconsistent naming, poor documentation, variable quality, limited geographic and temporal coverage, and inconsistent data models requiring manual aggregation.

Building stock – satellite-derived

Building datasets generated through large-scale ML extraction, such as Microsoft ML buildings, face reliability and sufficiency issues due to limited validation, positional inaccuracies, and inferred heights with low accuracy. Usability is also hindered by missing documentation on methodologies and input imagery, while datasets with coarse raster resolution or missing key attributes like usage type or age reduce the data’s applicability for detailed energy analyses.

Material intensity data

Material intensity coefficient datasets face key gaps in aggregation, provenance, documentation, granularity, and timeliness. These issues stem from inconsistent formats, missing metadata, outdated or high-level data, and limited transparency on how values are derived, all of which can hinder reliable, comparative use in material and emissions modeling.

TABULA building typology

TABULA suffers from limited granularity, insufficient volume, and outdated information. Its typologies often provide only one representative value per archetype, lack typological diversity across countries, and include parameters with questionable accuracy.

Enhancing wind power grid integration and stability

The integration of low-inertia distributed energy resources like wind power into the grid creates critical stability and reliability challenges, particularly for maintaining system frequency at nominal levels to prevent damage and blackouts.

AI and machine learning can enhance wind power’s contribution to grid stability by optimizing synthetic inertial and primary frequency response capabilities through advanced modeling and control strategies.

Key data gaps include limited accessibility to simulation tools, insufficient temporal granularity in models that operate on hourly rather than sub-hourly scales, and reliability concerns due to the lack of real-world validation data for model outputs.

Grid operators and research institutions can collaborate to improve model accessibility, increase temporal resolution to capture sub-hourly dynamics, and validate simulations with operational data, enabling more effective AI-driven solutions for grid stability as renewable penetration increases.

Dataset | Data Gap Summary
NREL Wind Active Power Control Simulation Tools

Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.
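To illustrate why hourly resolution can hide the dynamics that matter for frequency response, here is a minimal sketch using made-up numbers. The frequency values, the one-minute sampling interval, and the `hourly_means` helper are illustrative assumptions and are not taken from FESTIV or any other NREL tool.

```python
from statistics import mean

def hourly_means(samples, samples_per_hour):
    """Average consecutive blocks of sub-hourly samples into hourly values."""
    return [
        mean(samples[i:i + samples_per_hour])
        for i in range(0, len(samples), samples_per_hour)
    ]

# One hour of hypothetical 1-minute grid-frequency samples (Hz) with a
# two-minute dip. Real frequency events unfold over seconds, so an hourly
# model would miss them even more completely than this toy case suggests.
freq = [60.0] * 60
freq[30] = freq[31] = 59.5

hourly = hourly_means(freq, samples_per_hour=60)
worst_sample = min(freq)
```

Here the hourly mean stays within about 0.02 Hz of nominal, while the minute-level series contains a 0.5 Hz excursion.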

Facilitating grid reliability events analysis

Due to rapid fluctuations in power generation, renewables introduce variability into the grid. These fluctuations can trigger safety monitoring systems related to grid stability. Power grid control centers receive multiple streams of data from these systems (e.g., alarms, sensors, and field reports) that are semi-structured and arrive at high volume. For operators, it can be overwhelming to rationalize, reduce, and contextualize these alarm triggers and associated data in order to diagnose grid conditions.

ML can assist in interpreting these data to better understand the sequence of events leading up to an incident as well as to identify and detect the causes behind system disturbances affecting grid reliability.

Access to grid reliability data remains limited, the amount of preprocessing needed constitutes a hurdle, and not all alarm triggers have been validated, which may introduce noise.

More open data releases and open community work regarding data preprocessing can help further advance this use case.

Dataset | Data Gap Summary
EPRI10 (transmission control center alarm and operational data set)

Access to EPRI10 grid alarm data is currently limited to within EPRI. Usability gaps result from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm triggers need field verification to determine whether a fault actually occurred.
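One simple form of the preprocessing mentioned above is collapsing bursts of repeated alarms. The sketch below is only an illustration: the field names (`code`, `location`, `timestamp`), the 60-second window, and the `collapse_repeated_alarms` helper are hypothetical and do not reflect the actual EPRI10 schema.

```python
from datetime import datetime, timedelta

def collapse_repeated_alarms(events, window=timedelta(seconds=60)):
    """Keep the first alarm of each burst.

    An alarm is dropped when the same (code, location) pair fired no more
    than `window` before it; the timer restarts on every repeat, so a
    chattering alarm stays suppressed until it goes quiet.
    """
    kept = []
    last_seen = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["code"], event["location"])
        previous = last_seen.get(key)
        if previous is None or event["timestamp"] - previous > window:
            kept.append(event)
        last_seen[key] = event["timestamp"]
    return kept

# Made-up alarm records for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [
    {"code": "A", "location": "S1", "timestamp": t0},
    {"code": "A", "location": "S1", "timestamp": t0 + timedelta(seconds=10)},
    {"code": "B", "location": "S1", "timestamp": t0 + timedelta(seconds=30)},
    {"code": "A", "location": "S1", "timestamp": t0 + timedelta(minutes=5)},
]
reduced = collapse_repeated_alarms(events)
```

A real pipeline would also need to reconcile codes that differ by sensor, asset, or line before grouping, which this sketch does not attempt.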

Facilitating disaster risk assessments

As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. This involves better mapping of which populations and assets are subject to given risks.

ML can be used to facilitate disaster risk assessments by helping analyze satellite imagery and geographic data in order to pinpoint vulnerable areas and produce more detailed risk maps. In this way, ML can overcome some limitations of traditional ground surveys, which are time- and cost-intensive.

There is a general lack of data from the Global South where, for many regions, collection capabilities are lower while climate impacts are projected to be disproportionately high. Existing data are typically incomplete, even in most high-income countries, limiting the depth of potential analyses and generating uncertainties in assessments, for example, about monetary losses due to disasters.

Closing these data gaps involves inter alia deploying ML techniques that perform well in the Global South, collecting high-quality data involving local knowledge in a variety of contexts, and making the best remote sensing and cadaster data available to these efforts.  

Dataset | Data Gap Summary
Building stock – from cadaster and aerial imagery

These datasets are mainly available in rich countries in Europe, North America, and Asia, leaving large parts of the world that face pressing challenges involving their building stock without appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data. Research efforts, in particular in Europe, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), are addressing several of the existing data gaps.

Building stock – satellite-derived

These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.

Digital elevation model

Very high-resolution reference data is currently not freely available to the public.

Financial loss datasets related to the impacts of disasters

Financial loss data is typically proprietary and not publicly accessible.

Natural hazards forecasts

The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.

OpenStreetMap (land use map)

The quality of OpenStreetMap is highly variable in terms of its coverage of geometries (e.g., buildings) and attributes. Roads are generally better mapped than buildings. The very permissive data model of OpenStreetMap enables users to provide a variety of information, but it is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.

Population and asset exposure to natural hazards

Accessibility and reliability are the most significant challenges with exposure data.

Facilitating fault detection in low voltage distribution grids

The low-voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources and dynamic loads (such as electric vehicles), low-voltage distribution systems are susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them. 

Machine learning methods can recognize patterns to automate fault diagnoses agnostic to specific line thresholds and topologies. If integrated into advanced monitoring systems, detecting and localizing faults can accelerate adaptive protection and network reconfiguration efforts to ensure reliability and stability.

Data gaps for this use case include lack of coverage (spatial and temporal), noise in the data and high data volume.

New data collection and further analyses of existing data to better understand its pitfalls have the potential to help mitigate the existing gaps for this use case.

Dataset | Data Gap Summary
Micro-synchrophasors (µPMU data)

For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.

Facilitating forest restoration monitoring

Efforts are being made to restore ecosystems like forests and mangroves. 

ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.

A significant data gap is the lack of standardized protocols to guide data collection for restoration projects, making it difficult to consistently assess biodiversity outcomes using ML across different restoration initiatives.

Developing standardized data collection protocols, fostering a culture of data sharing, and implementing incentives for data collectors would enable more effective ML applications, leading to better assessment of restoration successes and failures on a global scale.

Dataset | Data Gap Summary
Camera trap wildlife image collections

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Passive acoustic monitoring for biodiversity assessment

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Facilitating the detection of climate-induced ecosystem changes

Climate change is causing significant alterations in ecosystems worldwide, threatening biodiversity and ecosystem services that are critical for both nature and human well-being. 

Machine learning can analyze complex ecological data from multiple sources to detect climate change impacts, identify vulnerable regions, and inform targeted conservation efforts. 

Key data gaps include insufficient high-resolution climate and biodiversity data, restricted access to ground survey data, and limited institutional capacity to process collected data efficiently. 

Addressing these gaps requires establishing decentralized monitoring networks, improving data accessibility through legislative reforms, and developing sustainable funding models for long-term ecosystem monitoring initiatives.

Dataset | Data Gap Summary
Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Ground survey of land use and land management

Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.

Historical climate observations

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

Passive acoustic monitoring for biodiversity assessment

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Hybrid ML-physics climate models for enhanced simulations

Physics-based climate models incorporate numerous complex components that are computationally intensive, which limits the spatial resolution achievable in climate simulations. 

ML models can emulate these physical processes, providing a more efficient alternative to traditional methods, enabling faster simulations and enhanced model performance.

The most significant data gaps are the enormous volume of climate data, which creates challenges for storage, transfer, and processing, and insufficient granularity in existing datasets to resolve fine-scale physical processes like turbulence.

Developing improved computational infrastructure for handling large datasets and creating ultra-high-resolution benchmark simulations would significantly enhance hybrid climate modeling capabilities.

Dataset | Data Gap Summary
ClimSim (benchmark data for hybrid ML-physics research)

ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.

DYAMOND (global atmospheric circulation model intercomparison data)

DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.

ECMWF ERA5 Atmospheric Reanalysis

While ERA5 is widely used due to its good structure and global coverage, downloads can take days to months, and the sheer data volume makes processing difficult for many users.

Large-eddy simulations (atmospheric processes)

These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.

Regularly gridded high-resolution atmospheric observations

While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation. 

Improving assessments of climate impacts on public health

Climate change poses significant threats to public health through heat waves, extreme weather events, changing disease patterns, and air quality degradation, making it crucial to understand these relationships for effective health system preparedness.

Machine learning can analyze complex relationships between climate variables and health outcomes to predict disease outbreaks, assess vulnerability patterns, and inform public health interventions and adaptation strategies.

Key data gaps include the structural incompatibility between gridded climate data and tabular health records, limited accessibility of health datasets due to privacy restrictions, and lack of centralized platforms for discovering relevant climate-health data sources.

Creating standardized data integration frameworks, establishing secure health data sharing protocols, and developing centralized climate-health data repositories can enable more effective ML-driven public health preparedness and climate adaptation planning.
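The mismatch between gridded climate data and tabular health records is often bridged by aggregating grid cells to the administrative regions used in the health table. The sketch below shows that idea with toy data. The cell-to-region lookup, the field names, and the `regional_means` and `attach_climate` helpers are hypothetical, and a real workflow would typically weight cells by area or population rather than take a plain mean.

```python
from collections import defaultdict

def regional_means(grid_values, cell_region):
    """Average gridded values over the cells assigned to each region."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cell, value in grid_values.items():
        region = cell_region.get(cell)
        if region is None:
            continue  # cell lies outside every health region
        sums[region] += value
        counts[region] += 1
    return {region: sums[region] / counts[region] for region in sums}

def attach_climate(health_rows, region_temp):
    """Add the regional mean temperature to each tabular health record."""
    return [
        dict(row, mean_temp_c=region_temp.get(row["region"]))
        for row in health_rows
    ]

# Toy 2x2 temperature grid (deg C) keyed by (row, col) index.
grid_values = {(0, 0): 30.0, (0, 1): 32.0, (1, 0): 25.0, (1, 1): 99.0}
# Hypothetical lookup from grid cell to health-reporting region.
cell_region = {(0, 0): "A", (0, 1): "A", (1, 0): "B"}
health_rows = [
    {"region": "A", "admissions": 12},
    {"region": "B", "admissions": 7},
]

region_temp = regional_means(grid_values, cell_region)
joined = attach_climate(health_rows, region_temp)
```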

Dataset | Data Gap Summary
Historical climate observations

Climate data accessibility and integration challenges limit ML applications in climate-health research. Data exists in diverse formats that require significant preprocessing, and researchers without climate expertise struggle to identify appropriate datasets for their specific health applications.

Public health data

Limited accessibility and poor documentation of health datasets restrict their use in climate-health ML applications. Privacy concerns and institutional barriers prevent broader data sharing, while inconsistent documentation makes existing datasets difficult to use effectively.

Improving battery management systems

Battery storage is crucial for transitioning to renewable energy and electrifying transportation, with efficiency and lifetime directly impacting these sustainability efforts.

Machine learning can improve battery management systems by accurately estimating state of charge (SoC), state of health (SoH), and remaining useful life (RUL), and optimizing charging and discharging strategies.

Key data gaps include oversimplified battery models that don’t account for real-world operating conditions and insufficient validation data from physical battery systems in diverse operational environments.

Enhancing model complexity and collecting comprehensive real-world performance data can significantly improve battery management predictions, leading to extended battery lifetimes, more efficient energy use, and accelerated adoption of electric vehicles and renewable energy storage.

Dataset | Data Gap Summary
Equivalent circuit models

While ECMs enable real-time battery SoC predictions due to their computational efficiency, they often oversimplify real-life operating conditions which limits the accuracy of SoH and RUL estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.
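For readers unfamiliar with ECMs, the following is a minimal sketch of a first-order (single RC pair) Thevenin model with Coulomb-counting SoC. All parameter values, the linear open-circuit-voltage curve, and the `simulate_ecm` helper are assumptions chosen for illustration; they do not describe any specific cell or dataset.

```python
import math

def simulate_ecm(current_a, dt_s, capacity_ah=2.0, soc0=1.0,
                 r0=0.05, r1=0.02, c1=2000.0):
    """Simulate SoC and terminal voltage for a first-order Thevenin ECM.

    current_a: sequence of currents in amperes (positive = discharge).
    Returns a list of (soc, terminal_voltage) tuples, one per time step.
    """
    soc, v1 = soc0, 0.0
    decay = math.exp(-dt_s / (r1 * c1))  # RC branch relaxation per step
    trace = []
    for i in current_a:
        # Coulomb counting: integrate current over the step.
        soc -= i * dt_s / (3600.0 * capacity_ah)
        # Exact discretization of the RC branch for constant current.
        v1 = v1 * decay + r1 * (1.0 - decay) * i
        # Assumed linear OCV curve; real cells are nonlinear and
        # depend on temperature and aging.
        ocv = 3.0 + 1.2 * soc
        trace.append((soc, ocv - r0 * i - v1))
    return trace

# Discharge a hypothetical 2 Ah cell at 1 A for one hour (1 s steps).
trace = simulate_ecm([1.0] * 3600, dt_s=1.0)
final_soc, final_voltage = trace[-1]
```

The fixed resistances and the idealized OCV curve are exactly the kind of simplification that makes such models fast but limits their accuracy under real operating conditions.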

Improving estimations of forest carbon stock

Forests are one of Earth’s major carbon sinks, making accurate estimation of forest carbon stocks essential for climate change mitigation efforts and carbon accounting. 

ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery, LiDAR data, and ground surveys.

Ground truth data for forest carbon stock estimation is often limited in geographical coverage and temporal frequency due to the high costs and labor-intensive nature of manual data collection. Additionally, remotely sensed data (satellite, airborne LiDAR) requires significant domain expertise for proper preprocessing and interpretation.

Governments and research institutions can address these gaps by investing in more comprehensive ground survey programs, making airborne LiDAR data more widely available, and developing standardized preprocessing tools for non-experts to utilize remote sensing data effectively.

Dataset | Data Gap Summary
Ground-survey based forest inventory data

Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.

LiDAR point cloud – airborne

Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.

Satellite imagery – GEDI LiDAR

Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.

Satellite imagery – PALSAR radar images

Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.

Improving long-term extreme heat prediction

Extreme heat events are becoming more frequent and intense due to climate change, posing serious risks to human health, infrastructure, and ecosystems worldwide.

Machine learning can improve long-term extreme heat prediction by identifying complex patterns in climate data and enhancing the accuracy and resolution of projections beyond what traditional physics-based models can achieve.

Working with climate projection datasets presents significant challenges due to their massive size, which requires substantial computational resources for storage, transfer, and processing, limiting accessibility for many researchers and stakeholders.

Cloud computing providers, research institutions, and funding agencies can collaborate to develop accessible platforms and tools for efficiently managing large climate datasets, enabling broader use of AI for extreme heat prediction and adaptation planning.

Dataset | Data Gap Summary
NEX-GDDP-CMIP6 (Global daily downscaled long-term climate projections)

The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.
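One common way to cope with data that does not fit in memory is to stream it in chunks and keep only running statistics. The sketch below counts hot days and the longest hot spell for a single location. The 35 °C threshold, the `count_hot_days` helper, and the toy values are assumptions; the daily maximum temperatures are assumed to be already converted to °C and delivered chunk by chunk by some loader that is not shown.

```python
def count_hot_days(daily_tmax_chunks, threshold_c=35.0):
    """Count days above a threshold and the longest consecutive run.

    daily_tmax_chunks: iterable of chunks (lists) of daily maximum
    temperatures in deg C, in chronological order. Only two counters are
    kept, so memory use does not grow with the length of the record.
    """
    hot_days = 0
    longest_run = 0
    run = 0
    for chunk in daily_tmax_chunks:
        for tmax in chunk:
            if tmax > threshold_c:
                hot_days += 1
                run += 1  # the run carries over chunk boundaries
                longest_run = max(longest_run, run)
            else:
                run = 0
    return hot_days, longest_run

# Two made-up chunks; the hot spell spans the chunk boundary.
chunks = [[33.0, 36.0, 37.0], [38.0, 30.0, 36.0]]
hot_days, longest_run = count_hot_days(chunks)
```

The same pattern extends to gridded data by keeping one pair of counters per grid cell, at the cost of more memory.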

Give feedback
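As a rough illustration of what "extracting relevant extreme heat information" can mean once a subset of the data has been downloaded, the sketch below counts extreme heat days and the longest hot spell in a series of daily maximum temperatures for one grid cell. The 35 °C threshold, the function names, and the sample values are illustrative assumptions, not part of the dataset or of any official definition.

```python
from typing import Sequence


def count_extreme_heat_days(tasmax_c: Sequence[float], threshold_c: float = 35.0) -> int:
    """Count days whose daily maximum temperature meets or exceeds the threshold."""
    return sum(1 for t in tasmax_c if t >= threshold_c)


def longest_heat_spell(tasmax_c: Sequence[float], threshold_c: float = 35.0) -> int:
    """Length (in days) of the longest run of consecutive days at or above the threshold."""
    longest = current = 0
    for t in tasmax_c:
        current = current + 1 if t >= threshold_c else 0
        longest = max(longest, current)
    return longest


# Hypothetical week of daily maxima (degrees Celsius) for a single grid cell.
week = [31.2, 35.4, 36.1, 34.9, 37.0, 38.2, 33.0]
heat_days = count_extreme_heat_days(week)
spell = longest_heat_spell(week)
```

In practice the same reduction would have to run over every grid cell, model, and scenario, which is where the storage and compute barriers described above come in.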
Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)
Details (click to expand)

Wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.

Machine learning can potentially improve spatio-temporal forecasts at different horizons, provided that high-quality training data is available.

Current data gaps include coverage gaps, noisy data, and difficulties in accessing data.

Efforts to get more of such data out of silos, mainly from energy companies, may help alleviate this gap.

Dataset | Data Gap Summary
NREL NOW23 (wind data)

This is numerically modeled data.

Give feedback
NREL WIND toolkit (wind and weather)

The data is outdated and is only a proxy for actual meteorological conditions.

Give feedback
Ocean observations from floating infrastructure (FINO3)

Because of their offshore location, the measurement sensors on the FINO platform are prone to failure under adverse conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, since it requires weather conditions that allow human intervention. The resulting degradation in data quality can last from several weeks to a full season. Coverage is also constrained to the platform and the associated wind farm location.

Give feedback
Offshore wind data from masts and LiDAR

The spatiotemporal coverage of the offshore wind speed mast data is restricted to the dimensions of the platform/tower itself and to the period since its construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.

Give feedback
WHOI Martha’s Vineyard Coastal Observatory (wind speed and direction)

The data only contains measurements close to the coastline, constraining its applicability to offshore wind applications in deep water.

Give feedback
Wind Forecast Improvement Project 3 (wind data)

This data is not yet available.

Give feedback
Improving offshore wind power nowcasting (10 min)
Details (click to expand)

Wind nowcasting can enable estimations of the active power generated by wind farms in the absence of curtailment and facilitate operations, potentially making them more efficient.

Machine learning can potentially improve such very short-term spatio-temporal forecasts, given the availability of high-quality training data.

High-resolution wind data measured at wind farms, including wind power measurements (i.e. SCADA data from turbines) and wind field measurements (e.g. wind velocity, pressure, and temperature), currently remains limited to a few datasets.

Efforts to get such data out of silos, mainly from energy companies, may help alleviate this gap.

Dataset | Data Gap Summary
Offshore wind farm operation data (Orsted)

Data can be accessed by submitting a request via the Orsted form. The dataset's sufficiency is constrained by volume, since short-term data is available for only a limited number of offshore wind farms. Expanding the coverage area, the data volume, and the time granularity to under 10 minutes may enable the detection of transients in generated active power.

Give feedback
Improving power grid optimization
Details (click to expand)

Optimal Power Flow (OPF) is used to find the cheapest way to generate electricity while meeting demand and staying within system limits like voltage and line capacity. Traditionally, OPF is a complex math problem solved separately for AC and DC systems. As more renewable energy is added, the grid is shifting toward hybrid AC/DC systems to better handle long-distance power flow and new challenges like two-way power movement. 

Changes in the grid due to renewable sources make OPF harder to solve. ML can be used to approximate OPF problems in order to allow them to be solved at greater speed, scale, and fidelity.

Data gaps for this use case are numerous and mainly concern usability, reliability, and sufficiency.

Closing these gaps requires an array of gap-specific actions; further industry engagement may have a significant impact on many of the gaps.
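To make the OPF objective described above concrete, here is a deliberately simplified sketch: a single-bus "merit-order" dispatch that meets demand at minimum cost under generator capacity limits. It ignores network constraints such as voltage and line capacity, so it is not an AC or DC OPF solver; the generator names, capacities, and costs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Generator:
    name: str
    capacity_mw: float   # maximum output
    cost_per_mwh: float  # constant marginal cost (an assumption)


def merit_order_dispatch(generators: List[Generator], demand_mw: float) -> Tuple[Dict[str, float], float]:
    """Dispatch the cheapest generators first until demand is met.

    Returns the per-generator output (MW) and the total hourly cost.
    Raises ValueError if demand exceeds total capacity.
    """
    remaining = demand_mw
    dispatch: Dict[str, float] = {}
    total_cost = 0.0
    for gen in sorted(generators, key=lambda g: g.cost_per_mwh):
        output = min(gen.capacity_mw, remaining)
        dispatch[gen.name] = output
        total_cost += output * gen.cost_per_mwh
        remaining -= output
    if remaining > 1e-9:
        raise ValueError("demand exceeds total generation capacity")
    return dispatch, total_cost


# Hypothetical three-generator system serving 150 MW of demand.
fleet = [
    Generator("gas", 100.0, 60.0),
    Generator("wind", 50.0, 0.0),
    Generator("coal", 80.0, 40.0),
]
dispatch, cost = merit_order_dispatch(fleet, 150.0)
```

A full OPF adds power-flow equations and limits for every bus and line, which is what makes the problem expensive to solve and motivates ML-based approximations.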

Dataset | Data Gap Summary
Grid2Op and PandaPower (power systems simulation outputs)

Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.

Give feedback
Optimal power flow simulation outputs

Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.

Give feedback
Power Grid Lib (optimal power flow benchmark library)

While network datasets are open source, maintaining the repository requires continuous curation and the collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can help in developing more realistic data, which may be hard to find without such cooperation.

Give feedback
Improving short-term electricity load forecasting
Details (click to expand)

Short-term load forecasting is critical for utilities to balance power demand with supply. Utilities need accurate forecasts (e.g. on the scale of hours, days, weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions. 

ML is well suited to handle large amounts of data such as historical electricity load data, weather forecasts, and continuous streams of advanced metering infrastructure (AMI) data, from which it may capture non-linearities which traditional linear models often struggle with.

Several data gaps for this use case revolve around the difficulty of accessing varied data, due among other things to privacy concerns and the unwillingness of private actors to share data for research.

ML can help the development of synthetic, privacy-preserving datasets that can accelerate research in this space.

Dataset | Data Gap Summary
Advanced metering infrastructure data

AMI data is challenging to obtain without pilot study partnerships with utilities, since collecting data on the consumption behavior of individual buildings can infringe upon customer privacy, especially at the residential level. The granularity of the time series can also vary with the level of access granted (for example, aggregated and anonymized data versus individual readings) and with the resolution of the metering system. Additionally, coverage is limited to the service areas of utility pilot tests, restricting the scope and scale of demand studies.

Give feedback
Building data genome project (hourly building-level metered data)

While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.

Give feedback
Faraday (synthetic smart meter data)

Although the model is trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and the presence of low-carbon technologies. Furthermore, since the model is trained on UK building data, the AMI time series it generates may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification against real data or data aggregated at the substation level to assess its effectiveness.

Give feedback
Improving solar power forecasting: long-term (>24 hours)
Details (click to expand)

Accurately forecasting solar power generation beyond 24 hours is critical for energy market pricing, investment decisions, and coordinating renewable energy sources in an increasingly decarbonized grid. 

Machine learning approaches can improve longer-term solar forecasting by combining weather predictions, historical generation data, and other relevant variables to create more accurate models than traditional methods. 

The primary data gaps include limited geographic coverage of existing datasets, reliance on simulated rather than measured data, and quality concerns when adapting models to specific regions. 

Expanding data collection networks, validating simulated data with real measurements, and creating standardized datasets for diverse regions would enable more reliable ML-based solar forecasting systems that could significantly improve grid stability and accelerate renewable energy adoption.

Dataset | Data Gap Summary
NREL solar power data for integration studies

While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.

Give feedback
Improving solar power forecasting: medium-term (6-24 hours)
Details (click to expand)

Medium-term solar forecasting (6-24 hours ahead) is essential for efficient grid management, especially as solar power integration increases, impacting energy markets, demand response, and microgrid operations.

Machine learning techniques can significantly improve these forecasts by integrating satellite data with weather predictions and historical patterns to provide more accurate solar irradiance estimates.

A key data gap is the inconsistency in satellite data resolutions and coverage, alongside challenges in processing multispectral data and accurately modeling how different cloud types affect ground irradiance.

Combining satellite observations with ground-based measurements and developing standardized preprocessing approaches would substantially improve forecast accuracy, enabling better grid management and renewable energy integration.

Dataset | Data Gap Summary
Satellite imagery

Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.

Give feedback
Improving solar power forecasting: nowcasting/very-short-term (0-30min)
Details (click to expand)

Very-short-term solar power forecasting is critical for grid stability and efficiency as sudden changes in solar irradiance (ramp events) can cause abrupt fluctuations in power generation. 

AI techniques can analyze cloud dynamics through segmentation and classification to predict solar irradiance attenuation, enabling more accurate forecasting for real-time electricity markets, dispatch of other generating sources, and energy storage control.

Key data gaps include limited spatial coverage of ground monitoring stations, insufficient time resolution for sub-5-minute forecasting, challenges with large data volumes from sensor networks, and data quality issues related to sensor calibration.

Expanding sensor networks to diverse environments, implementing AI-based data compression and quality control, and integrating multi-source data can close these gaps, ultimately enabling more reliable integration of solar power into electricity grids.
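The ramp events mentioned above can be illustrated with a minimal sketch that flags large changes in an irradiance series over a fixed look-ahead window. The threshold, window length, sample values, and function name are illustrative assumptions rather than an established definition of a ramp event.

```python
from typing import List, Sequence, Tuple


def find_ramp_events(irradiance: Sequence[float], window: int, threshold: float) -> List[Tuple[int, float]]:
    """Return (start_index, change) pairs where the change over `window` samples
    is at least `threshold` in absolute value."""
    events = []
    for i in range(len(irradiance) - window):
        change = irradiance[i + window] - irradiance[i]
        if abs(change) >= threshold:
            events.append((i, change))
    return events


# Hypothetical one-minute global horizontal irradiance readings (W/m^2):
# a passing cloud causes a sharp drop, followed by a sharp recovery.
ghi = [800, 790, 795, 400, 380, 390, 780, 800]
events = find_ramp_events(ghi, window=1, threshold=300)
```

With coarser sampling (for example 5- or 10-minute averages), both the drop and the recovery could fall inside a single interval and go undetected, which is why temporal resolution is listed as a data gap.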

Dataset | Data Gap Summary
DOE Atmospheric Radiation Measurement research facility data products

ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.

Give feedback
NIST campus photovoltaic arrays and weather station data

The dataset has limited spatial coverage (Gaithersburg, MD only) and is no longer maintained after July 2017, limiting its usefulness for current applications.

Give feedback
Solcast (global solar forecasting and historical solar irradiance data)

Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, insufficient for very-short-term forecasting.

Give feedback
SRRL TSI-880 (sky imagery)

Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and image resolution is limited to 352x288 24-bit JPEG images.

Give feedback
SWINySEG (sky imagery)

The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.

Give feedback
Improving solar power forecasting: short-term (30 min-6 hours)
Details (click to expand)

Solar irradiance forecasting at hourly intervals is critical for managing intermittent solar energy resources and ensuring grid stability and reliability. 

Machine learning approaches can enhance forecasting accuracy by leveraging multiple data sources, including measured irradiance, PV inverter outputs, and meteorological variables. 

Important data gaps include limited spatial coverage, with most high-quality data concentrated in specific regions, and inconsistent temporal resolution that affects forecasting precision. 

By expanding sensor networks globally and harmonizing data collection standards, forecasting models can better support real-time energy management, demand response, and grid stability across diverse geographical areas.

Dataset | Data Gap Summary
NOAA's SOLRAD Network Solar Radiation Data

While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include lower-quality hourly averages compared to native-resolution data, and limited geographic coverage with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.

Give feedback
NREL Physical Solar Model Solar Radiation Database

While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.

Give feedback
NREL SRRL Baseline Measurement System for Multi-Variable Solar Research

While NREL’s SRRL BMS provides real-time joint variable data from ground-based sensors, its coverage is limited to a single location in Golden, CO, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.

Give feedback
SMA Solar Technology PV System Performance database

The SMA PV monitoring system requires user profile creation and specific system access requests, and its documentation is primarily in German, creating potential language barriers. Data representation is geographically unbalanced, with stronger coverage in Germany, the Netherlands, and Australia despite its global presence. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.

Give feedback
SOLETE Hybrid Solar-Wind Generation dataset

While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several sufficiency gaps limit its application. The dataset’s 15-month temporal coverage doesn’t capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analysis of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing; including them would enable more realistic modeling of system dynamics. Supplementing with external data sources or simulation could address these limitations.

Give feedback
Improving terrestrial wildlife detection and species classification
Details (click to expand)

Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems. 

ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.

The primary data gaps include insufficient or imbalanced publicly available annotated datasets across all modalities and, in the case of bioacoustic data, the difficulty of sharing large data volumes due to storage limitations and high costs.

Solutions include developing affordable data hosting platforms, incentivizing data sharing through recognition and funding, and establishing standardized protocols for data integration.

Dataset | Data Gap Summary
Biodiversity images and recordings – community science data

One challenge with community science data is bias in geographic and taxonomic representativity: while community science data can provide broader coverage than formal survey data, it is highly biased, and the biases are not documented. Data tends to be concentrated in accessible areas and often focuses on charismatic or commonly encountered species. This limits the generalizability of ML models trained on this data.

Give feedback
Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Environmental DNA (eDNA)

eDNA is an emerging technique in biodiversity monitoring, and many issues still impede the application of eDNA-based tools. One important data gap is the incompleteness of barcoding reference databases.

Give feedback
Museum specimens

The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.

Give feedback
Passive acoustic monitoring for biodiversity assessment

The first and foremost challenge of bioacoustic data is its sheer volume, which makes data sharing especially challenging. Solutions are needed for cheaper and more reliable data hosting and sharing platforms. 

Additionally, there’s a significant shortage of large and diverse annotated datasets, a shortage even more severe than for image data such as camera trap, drone, and crowd-sourced images.

Give feedback
Satellite imagery

Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.

Give feedback
Interpolating city-wide bicycle volumes from limited count data
Details (click to expand)

The usage of bicycles as a commuting mode in cities is important both for climate change mitigation (modal shift from emitting modes like cars to active mobility) and for public health reasons (activity from biking improves health).

ML can interpolate patterns from limited bicycle count data to provide city-wide volume estimates, utilizing available count data and combining it with additional data sources such as infrastructure data, thereby offering an alternative to costly, widespread sensor deployment.

Bicycle count and infrastructure data can suffer from limited access, poor coverage, and inconsistent quality, causing hurdles when training machine learning models.   

Creating centralized platforms and standardizing cycling data collection and sharing would improve data quality and accessibility, enabling more robust analyses in support of sustainable urban planning.
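As a point of reference for what "interpolating" from limited counters means, the sketch below uses inverse-distance weighting (IDW), a simple spatial baseline rather than the ML approach described above, which would additionally draw on infrastructure and other features. The coordinates, counts, and function name are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def idw_estimate(target: Point, counters: List[Tuple[Point, float]], power: float = 2.0) -> float:
    """Estimate the bicycle volume at `target` from nearby counters,
    weighting each counter by 1 / distance**power."""
    if not counters:
        raise ValueError("at least one counter is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), count in counters:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            # The target coincides with a counter: return its measured value.
            return float(count)
        weight = 1.0 / distance ** power
        weighted_sum += weight * count
        weight_total += weight
    return weighted_sum / weight_total


# Two hypothetical counters (coordinates in km, daily bicycle counts).
counters = [((0.0, 0.0), 1000.0), ((2.0, 0.0), 200.0)]
midpoint_estimate = idw_estimate((1.0, 0.0), counters)
```

With only a handful of counters per city, an estimate like this is dominated by where the counters happen to be placed, which illustrates why the coverage gaps listed below matter.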

Dataset | Data Gap Summary
Bicycle count data – permanent sensing

Several data gaps limit the effectiveness of bicycle count data from permanent sensing. Events like construction can compromise reliability, while obtainability is hindered by limited accessibility outside major cities and the lack of centralized platforms for finding and aggregating data. Additionally, coverage is often insufficient to provide a clear picture across the city, with data collected at only a few locations within a city.

Give feedback
Bicycle count data – temporary sensing

Bicycle counts from temporary sensing face several obtainability and sufficiency challenges. Accessibility is often limited outside major cities due to a lack of capacity to publish data, and users must contact individual cities to access it, as no central platforms exist. The data itself is often insufficient (collected at only a few locations and for short periods), limiting its usefulness. Additionally, limited documentation about why a count was conducted at a specific time and location reduces the usability of the data by omitting important contextual information.

Give feedback
Bike infrastructure data – from city administrations

City-provided bike infrastructure data is often hard to find, with no central platforms and limited sharing beyond major cities. Even when available, it may lack clear documentation on what has been implemented or omit detailed features of the infrastructure.

Give feedback
Historical climate observations

It is worth noting that there are no major data gaps for this use case for cities where the other necessary data sources are available.

Give feedback
OpenStreetMap (land use map)

Bike infrastructure data in OpenStreetMap faces reliability and usability issues due to a lack of validation and inconsistent naming conventions, requiring extensive pre-processing. Elements like bike parking are often missing, reducing data completeness. Cities and shared tools can help address these gaps.

Give feedback
Strava GPS-based cycling data

Strava GPS cycling data offers both the highest temporal and spatial resolution and the most comparable source across cities. However, it is accessible to cities but less so to academics and mainly represents specific user groups, limiting its coverage of the overall cycling population.

Give feedback
Mapping existing solar photovoltaic systems
Details (click to expand)

Mapping existing solar photovoltaic (PV) systems involves identifying and geolocating installed solar panels using data sources like satellite imagery, aerial photography, utility records, or crowd-sourced databases. This information helps track solar adoption, estimate local renewable energy capacity, and identify gaps or opportunities for further deployment. Accurate maps of solar PV systems are essential for climate change mitigation, as they inform policy, grid planning, and progress monitoring toward clean energy goals.

ML can segment remote sensing imagery towards building comprehensive databases of PV systems by identifying their locations, estimating their size, inferring their capacity, and approximating their installation age.

Solar PV data suffers from uneven global coverage, inconsistent formats, and limited attribute detail, compounded by data quality issues and a lack of historical or timely updates. While some datasets such as MaStR and USPVDB offer authoritative information, they remain limited in geography or scope, and satellite imagery faces challenges with large data volumes and access to historical high-resolution data.

To close these gaps, efforts could focus on expanding and standardizing data collection globally by encouraging local contributions and harmonizing tagging schemes in OSM; on further integrating multiple data sources (official registries, satellite imagery, crowdsourced data) to improve completeness and accuracy; and on regularly updating datasets with clear versioning to track changes over time.
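The step from a segmentation result to size and capacity estimates can be sketched as simple arithmetic: count the pixels labeled as PV, multiply by the ground area of one pixel, and apply an assumed power density. The mask, the ground sample distance, and the 200 W/m² power density below are hypothetical placeholders, and the sketch ignores panel tilt and mixed pixels.

```python
from typing import Sequence


def estimate_pv_area_m2(mask: Sequence[Sequence[int]], ground_sample_distance_m: float) -> float:
    """Projected PV area: number of PV pixels times the ground area of one pixel."""
    panel_pixels = sum(1 for row in mask for value in row if value)
    return panel_pixels * ground_sample_distance_m ** 2


def estimate_capacity_kw(area_m2: float, power_density_w_per_m2: float = 200.0) -> float:
    """Rough capacity estimate; the default power density is an assumption."""
    return area_m2 * power_density_w_per_m2 / 1000.0


# Hypothetical binary mask from a segmentation model (1 = PV pixel),
# for imagery with 0.5 m ground sample distance.
mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
area_m2 = estimate_pv_area_m2(mask, ground_sample_distance_m=0.5)
capacity_kw = estimate_capacity_kw(area_m2)
```

Because the area scales with the square of the ground sample distance, coarse imagery makes small rooftop systems hard to size reliably, which ties back to the need for high-resolution data.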

Dataset | Data Gap Summary
Marktstammdatenregister (solar photovoltaic data)

Although MaStR is one of the best datasets for solar PVs, substantial errors exist in the data, e.g. in terms of temporal or position accuracy.

Give feedback
NREL Solar panel PV system dataset

The solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Since the data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap, as some of the data is historical and may not reflect current pricing and costs of PV systems.

Give feedback
OpenStreetMap (land use map)

OpenStreetMap’s solar PV data suffers from uneven global coverage and missing critical attributes, limiting its utility for comprehensive energy assessments. Additionally, inconsistent tagging and lack of quality control hinder data usability, reliability, and integration.

Give feedback
Satellite imagery

Data gaps for this use case may stem from the need for large data volumes and high-resolution historical imagery.

Give feedback
US large-scale solar photovoltaic database (USPVDB)

This dataset covers only the US, and even its US coverage is incomplete.

Give feedback
Modeling effects of soil processes on soil organic carbon
Details (click to expand)

Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies. 

ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses. 

The insufficient data coverage and granularity of soil organic carbon measurements severely limit the development of well-generalized ML models for accurately predicting soil carbon dynamics. 

Expanding monitoring networks and developing cost-effective measurement technologies, combined with better data standardization across different collection efforts, would enable more effective ML applications for soil carbon management and climate-smart agriculture.

Dataset | Data Gap Summary
Emission dataset compiled from FAO statistics

Data is extrapolated from national-level statistics. It is unknown how accurate this data is at the local level.

Give feedback
Simulated variables from process-based models of soil organic carbon dynamics

Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.

Give feedback
Soil measurements – NorthWyke Farms platform

The most common and significant challenges for use cases involving soil organic carbon are the insufficiency of data and the lack of high-granularity data.

Give feedback
Soil Survey Geographic Database (SSURGO)

In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).

Give feedback
Optimizing electrified bus fleet in urban vehicle-to-grid systems
Details (click to expand)

Diesel-powered school buses contribute significant carbon emissions and air pollution in urban areas, while electric bus adoption faces high upfront costs that challenge school district budgets.

AI-powered optimization systems can manage electric school bus charging and discharging schedules to create virtual power plants, offsetting electrification costs through grid services revenue.

Key data gaps include inconsistent bus fleet reporting across states, limited access to proprietary charging profiles, and fragmented charge station data that prevent comprehensive fleet optimization modeling.

Standardizing state-level fleet reporting, fostering manufacturer partnerships for charging data access, and creating centralized charge station databases can enable scalable AI solutions for urban transit electrification.

Dataset | Data Gap Summary
Electric vehicle charge station data

Critical gaps include limited findability of station-specific usage data due to proprietary restrictions, and scattered data sources that require aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.

Give feedback
US school bus fleet dataset

The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.

Give feedback
Optimizing smart inverter management for distributed energy resources
Details (click to expand)

Solar panels and batteries are part of new power systems that don’t use traditional spinning generators. They use inverters to convert DC to AC power. Smart inverters can do more than just convert power—they help manage changes in energy supply and keep the grid stable by adjusting voltage and power levels. This prevents issues like sudden drops or spikes in voltage when solar and other sources are added to the grid.

Machine learning can help better monitor and control smart inverters, with the potential to make efficiency gains.

One key data gap holding back this use case is limited access to relevant data.

Partnerships between research labs, utilities, and smart inverter manufacturers may help alleviate this bottleneck.
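One common way a smart inverter supports voltage is a Volt-VAR curve: it injects reactive power when the local voltage is low, absorbs it when the voltage is high, and does nothing inside a deadband around nominal. The sketch below is a minimal piecewise-linear version of that idea; the breakpoints and the reactive power limit are illustrative placeholders, not settings taken from the source or asserted to match any standard or device.

```python
def volt_var_setpoint(
    voltage_pu: float,
    v_low: float = 0.92,        # at or below this voltage: full injection (assumed)
    v_dead_low: float = 0.98,   # lower edge of the deadband (assumed)
    v_dead_high: float = 1.02,  # upper edge of the deadband (assumed)
    v_high: float = 1.08,       # at or above this voltage: full absorption (assumed)
    q_max: float = 0.44,        # reactive power limit, per unit of rated power (assumed)
) -> float:
    """Return the reactive power setpoint in per unit (positive = injection)."""
    if voltage_pu <= v_low:
        return q_max
    if voltage_pu < v_dead_low:
        # Linear ramp from full injection down to zero at the deadband edge.
        return q_max * (v_dead_low - voltage_pu) / (v_dead_low - v_low)
    if voltage_pu <= v_dead_high:
        return 0.0
    if voltage_pu < v_high:
        # Linear ramp from zero down to full absorption.
        return -q_max * (voltage_pu - v_dead_high) / (v_high - v_dead_high)
    return -q_max


nominal = volt_var_setpoint(1.00)       # inside the deadband
undervoltage = volt_var_setpoint(0.95)  # partial injection
overvoltage = volt_var_setpoint(1.10)   # full absorption
```

Fixed curves like this are typically configured once. The ML opportunity described above lies in monitoring and adapting such behavior across many inverters, which in turn depends on the operational data that is currently hard to obtain.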

Dataset | Data Gap Summary
Outputs from distribution connected inverter systems simulations

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters and to integrate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulation and hardware-in-the-loop facilities is restricted: using NREL’s Energy Systems Integration Facility requires submitting a user access proposal, and similar testing laboratories may require access requests and funding.

Give feedback
Smart inverter devices database

Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant smart inverter models and their manufacturers, who can then be contacted for research partnerships. In terms of coverage area, while California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, regions outside the United States may locate similar manufacturers through partnerships and collaborations.

Give feedback
Scaling earth system monitoring

Many climate-related applications suffer from a lack of real-time and on-the-ground data for monitoring Earth systems, limiting effective climate action and adaptation strategies.

ML can analyze satellite imagery at scale to fill these gaps through applications such as land cover classification, deforestation detection, emissions monitoring, and disaster management.

The massive volume of satellite data creates storage and processing challenges, while the lack of annotated datasets and limited access to high-resolution imagery particularly affects Global South applications.

Coordinated annotation efforts across sectors, development of foundation models for remote sensing, and expanded access to high-resolution imagery can enable more effective ML-driven Earth monitoring.

Dataset | Data Gap Summary
Satellite imagery

Satellite images face major challenges from massive data volumes that impede downloading and processing, lack of annotated data for training ML models, and limited access to high-resolution imagery, particularly affecting Global South applications.

Give feedback
Scaling identification and mapping of climate policy

Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions. 

ML can be employed to identify climate-related policies and categorize them according to different focus areas.

Law corpora are published in various languages and formats by a variety of actors, including cities, national governments and other agencies. They are not all digitized, may be hard to access and require ample harmonization work.

These data gaps may be addressed through aggregation initiatives, and ML may be a key component of these by automating lengthy processes such as translation or screening for relevance.

Dataset | Data Gap Summary
Climate-related laws and regulations

Laws and regulations for climate action are published in various formats through national and subnational governments, and most are not labeled as a “climate policy”. There are a number of initiatives that take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and continual updating, and the resulting datasets remain incomplete.

Give feedback
Scaling methane emission detection

Methane is a potent greenhouse gas and the second-largest contributor to climate change, with emissions from the oil and gas industry accounting for 20% of global methane emissions.

Advanced machine learning techniques applied to satellite imagery enable the detection, quantification, and monitoring of methane emissions at scale, supporting more effective mitigation efforts across global oil and gas operations.

The primary data gap for methane detection is insufficient spatial resolution in widely available satellite data, making it difficult to pinpoint smaller or localized emission sources and accurately quantify their contribution.

Developing higher-resolution satellite systems like MethaneSAT and creating benchmark datasets with synthetic methane plume data can significantly improve detection capabilities, enabling more targeted mitigation efforts and potentially reducing a substantial portion of global methane emissions.

Dataset | Data Gap Summary
Satellite imagery – Hyperspectral

Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.

Give feedback
Satellite imagery – Multispectral

Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.

Give feedback
Scaling truck count inference from remote sensing data

Truck counts are crucial for climate change mitigation because they help quantify freight activity, identify high-traffic corridors, and prioritize locations where electrification would have the greatest emissions impact. Typically, those numbers are obtained by vehicle counters installed by public or private entities on streets that distinguish between vehicle types.

As low- and middle-income countries have limited ground-based traffic monitoring and freight surveying activities, ML can be used to predict truck traffic from remote sensing imagery.

Truck count data suffers from limited coverage, especially in middle- and low-income countries, and often lacks sufficient granularity. Additionally, fragmented collection methods, inconsistent documentation, and limited data sharing hinder usability, while satellite imagery faces challenges related to cloud cover, resolution, and cost.

Data gaps for these use cases could be reduced by increased data sharing for research, such as counts and high-resolution raw satellite imagery, and expanding the number of vehicle counters to collect more data. 

Dataset | Data Gap Summary
Satellite imagery

Satellite imagery for monitoring rest area capacity and usage has the typical data gaps for use cases requiring high-resolution imagery (here, both high temporal and spatial resolution matter) that is cloud-free over several kilometers.

Give feedback
Truck count data

Truck count data suffers from limited coverage, especially in middle- and low-income countries, and often lacks sufficient granularity. Additionally, fragmented collection methods, inconsistent documentation, and limited data sharing hinder usability.

Give feedback
Understanding fleet overturning and international second-hand vehicle markets

Fleet overturning can lower emissions through more efficient vehicles and is central to shifting to electric vehicles, but used vehicles are often sold internationally. Understanding international second-hand vehicle markets is needed for estimating how fast fleets will overturn globally. For example, the European Union (EU) targets all new vehicles to be electric by 2035; this will lead to used combustion-engine cars and, increasingly, electric cars being sold outside the EU. The number of second-hand vehicles sold internationally and their types are only known for some countries.

ML has the potential to infer more data if sufficient samples are provided, and can also assist in analyzing the data.

Data on second-hand vehicle trade and electric vehicle infrastructure is fragmented – limited by poor coverage, low volume, and missing details.

There is a need for efforts aiming to collect more data and aggregate them into comprehensive, transnational datasets that can support effective policy and global transition monitoring.

Dataset | Data Gap Summary
Electric vehicle infrastructure transnational data

While certain countries have good electric vehicle infrastructure data, there is a need to create transnational datasets, for example to analyze infrastructure readiness in case of increasing international trade of EVs.

Give feedback
Second-hand vehicle international trade data

Second-hand vehicle trade data is limited by poor country coverage, low volume based on few case studies, and missing key details like vehicle type, age, fuel type, and mileage, making it insufficient for understanding global trade patterns and technology shifts.

Give feedback
Understanding the impact of urban planning on travel emissions

Creating sustainable, healthy, and equitable urban transportation is crucial because urban design influences and restricts population travel behaviors, limiting the potential for more sustainable mobility.

Travel behavior and its relation to the built environment in cities is a complex and multifaceted phenomenon that is challenging to comprehend and model using traditional statistical methods. ML aids in understanding these complexities, including threshold effects and nonlinearities.

A significant challenge is the scarcity of data in the Global South, where cities are rapidly expanding, and the impact of urban planning changes could be substantial. In contrast, changes in established cities like Berlin are more limited.

The level of digitalization of city administrations in both the Global North and South represents a bottleneck in data availability. Important changes in city infrastructure are not digitalized, hindering impact analysis. Any released data is highly fragmented and not standardized. Furthermore, much of the key data on human mobility is commercial and remains practically inaccessible for scientific research. There is a need to open and harmonize this data, as the Overture Maps Foundation has done with Points of Interest data, for example.

Dataset | Data Gap Summary
Building stock – satellite-derived

Datasets of the evolution of the building stock derived from different vintages of satellite imagery provide valuable information on city expansion and on new construction in general. These include the World Settlement Footprint Evolution dataset and the Global Human Settlement Layer - AGE. These datasets, however, are provided as raster data, and the resolution is insufficient for analyzing micro urban planning interventions.

Give feedback
GPS travel trajectories and Origin-Destination data

Critical issues in using GPS and OD data include a lack of provenance details from commercial providers, causing analysis uncertainties. Accessibility is limited due to high costs and data silos from insufficient privacy-preserving sharing mechanisms. Essential trip details are often missing, and data from the Global South is less accessible despite global collection. Additionally, inadequate documentation undermines the data’s scientific value, as key information like sample representativeness is frequently absent, challenging accurate interpretation.

Give feedback
OpenStreetMap (land use map)

OpenStreetMap data for mobility infrastructure is often incomplete outside of main roads and biased towards high-income countries. The data’s reliability is uncertain due to its crowdsourced nature, requiring quality checks. Its permissive data model leads to inconsistencies, necessitating thorough pre-processing, and it often lacks proper documentation, despite the benefits of documenting data provenance.

Give feedback
Points of Interest

POI data coverage is uneven, often biased towards high-income countries, with even leading datasets like Google Maps facing gaps, especially in the Global South. Timeliness is an issue as datasets may not reflect current business statuses. Reliability suffers from locational inaccuracies, affecting data matching and analysis. Usability is hindered by varied categorizations and languages in assembled datasets, necessitating standardization.

Give feedback
Street infrastructure data – LiDAR-derived

Very few such datasets exist. One of the only examples is the Berlin road survey (Straßenbefahrung) 2014, available at https://fbinter.stadt-berlin.de/fb/gisbroker.do;jsessionid=680EFD768EDCC386FBDF72B1637E71D7?cmd=navigationShowResult&mid=K.k_StraDa%40senstadt

Give feedback
Street view imagery

Street view imagery generates massive data volumes, complicating usability, and access is restricted, often favoring larger cities and wealthier countries. Preprocessing for tasks like computer vision is intensive, and coverage can be incomplete or biased. Additionally, images may lack full 360-degree views or contextual details like weather conditions, impacting their treatment by computer vision algorithms.

Give feedback
Travel surveys

Travel surveys often overlook smaller cities and rural areas, lack sufficient local data points, and provide data at ZIP code levels, limiting detailed urban planning. Modern technologies like GPS and privacy-preserving methods could address these gaps.

Give feedback
Urban planning projects data

The development of urban planning project datasets faces significant hurdles, including a lack of machine-readable formats and a typical focus on large projects in existing publicly available data. These issues are compounded by geographical biases, where data availability and detail vary based on regional digitalization levels.

Give feedback
Weather forecasting: Short-to-medium term (1-14 days)

Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.

Dataset | Data Gap Summary
ECMWF ENS (global 9-km 15-day ahead weather model ensemble)

The biggest challenge of ENS is that only a portion of it is available to the public for free.

Give feedback
ECMWF ERA5 Atmospheric Reanalysis

ERA5 is now the most widely used reanalysis data due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

Give feedback
ECMWF HRES (global 9-km 10-day ahead weather model)

The biggest challenge of HRES is that only a portion of it is available to the public for free.

Give feedback
WeatherBench 2

WeatherBench 2 is based on ERA5, so it inherits ERA5’s issues; in particular, the data is biased over regions where there are no observations.

Give feedback
Weather forecasting: Subseasonal horizon

High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Dataset | Data Gap Summary
ECMWF ERA5 Atmospheric Reanalysis

ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.

Give feedback
SubX

More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on this data.

Give feedback
Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Dataset | Data Gap Summary
CPC Precipitation (global unified daily precipitation)

CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

Give feedback
S2S forecast data

More data is needed to take advantage of large ML models.

Give feedback
Dataset Gap Types Modalities Sectors
Camera trap wildlife image collections

Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. This medium offers close-range monitoring over long time scales. Images and image sequences can be used to not only classify species but to identify specifics about an individual, e.g. sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density. 

Use Case | Data Gap Summary
Automating individual re-identification for wildlife

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Enabling 2D to 3D shape recovery and pose estimation of animals

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Facilitating forest restoration monitoring

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Give feedback
Facilitating the detection of climate-induced ecosystem changes

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Improving terrestrial wildlife detection and species classification

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Academic literature databases

Academic literature databases, such as OpenAlex, Web of Science, and Scopus.

Use Case | Data Gap Summary
Active fire data – satellite-derived

Active fire data derived from images taken by satellites such as MODIS, VIIRS, and LANDSAT at different spatial resolutions and temporal frequencies. These datasets provide near real-time detection of active fires globally and can be downloaded from https://firms.modaps.eosdis.nasa.gov/active_fire.

Use Case | Data Gap Summary
Advanced metering infrastructure data

Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter device systems that collect, store, and analyze per building energy consumption.

AMI data can be retrieved through public data portals, individual data collection, or research partnerships with local utilities. Some examples of utility research partnerships include the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) and the smart meter pilot test from the Sacramento Municipal Utility District (SMUD). An example of publicly available data that is aggregated and anonymized is the Commission for Energy Regulation (CER) Smart Metering Project hosted by the Irish Social Science Data Archive (ISSDA).

Use Case | Data Gap Summary
Improving short-term electricity load forecasting

AMI data is challenging to obtain without pilot study partnerships with utilities since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. Granularity of time series data can also vary depending on the level of access to the data, whether it be aggregated and anonymized or based on the resolution of readings and system. Additionally, the coverage of data will be limited to utility pilot test service areas, thereby restricting the scope and scale of demand studies.

Give feedback
Aerial power line corridor inspection data

LiDAR and image data collected from unmanned aerial vehicles (UAVs) for power line right-of-way (RoW) inspection can be accessed from private providers such as LUMA Energy and COR3, as well as sources like China Southern Power Grid, with datasets from Yunnan RoW-1, Yunnan RoW-2, and Hubei RoW 4. Open source EPRI distribution inspection imagery is also available and labeled with information regarding conductors, poles, crossarms, insulators, and other infrastructure components. These datasets pair images with geolocated GIS data to identify priority vegetation management areas near transmission lines.

Use Case | Data Gap Summary
Enhancing power grid-vegetation management for wildfire risk mitigation

UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.

Give feedback
Automated surface observation system (ASOS)

This dataset contains one- and five-minute observations from automated surface observation system stations in the US. The ASOS network provides near real-time surface weather measurements including wind speed and direction, dew point, air temperature, station pressure, precipitation, visibility, and cloud characteristics. See https://madis.ncep.noaa.gov/madis_OMO.shtml

Use Case | Data Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

Data volume is large and only data specific to the US is available.

Give feedback
Benchmark datasets for building energy modeling

Building energy modeling datasets provide measurements of energy demand profiles for a sample of buildings, as well as relevant input variables for traditional and ML-based models, enabling us to benchmark the performance of different models for energy prediction tasks. For example, the US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states covering 7 climate zones and 11 different building types (https://bbd.labworks.org/dataset-search). The data covers energy, indoor air quality, occupancy, environment, HVAC, lighting, and energy consumption to name a few. Datasets are organized by name and points of contact.

All data featured on the platform is open access, with standardization on metadata format to allow for ease of use and information specific to buildings based on type, location, and climate zone. Data quality and guidance on curation and cleaning, in addition to access restrictions, are specified in the metadata of each hosted dataset. Licensing information for each individual featured dataset is provided.

Use Case | Data Gap Summary
Accelerating building energy models

Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.

Give feedback
Bicycle count data – permanent sensing

Permanent bicycle sensing involves installing fixed sensors at key locations, such as bike lanes or intersections, to continuously record the number of cyclists passing by. These sensors, often using technologies like inductive loops, infrared beams, or pneumatic tubes, collect data at high temporal resolution (15 min intervals) over time, allowing cities to monitor usage patterns, track trends, and evaluate the impact of infrastructure or policy changes.
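To make the 15-minute resolution concrete, here is a minimal sketch of how such counts might be aggregated into hourly totals before analysis. The record layout (ISO timestamp, count), the function name `hourly_totals`, and the sample values are hypothetical and do not come from any real counter feed.

```python
from collections import defaultdict
from datetime import datetime


def hourly_totals(records):
    """Sum (ISO timestamp, count) pairs into per-hour totals."""
    totals = defaultdict(int)
    for stamp, count in records:
        # Truncate each 15-minute timestamp to the start of its hour.
        hour = datetime.fromisoformat(stamp).replace(
            minute=0, second=0, microsecond=0
        )
        totals[hour] += count
    # Return the hours in chronological order.
    return dict(sorted(totals.items()))


# Hypothetical readings from a single counter.
sample = [
    ("2024-05-01T08:00:00", 12),
    ("2024-05-01T08:15:00", 18),
    ("2024-05-01T08:30:00", 25),
    ("2024-05-01T08:45:00", 20),
    ("2024-05-01T09:00:00", 9),
]

result = hourly_totals(sample)
```

Real feeds would also need handling for sensor outages and missing intervals, which this sketch leaves out.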

Use Case | Data Gap Summary
Interpolating city-wide bicycle volumes from limited count data

Several data gaps limit the effectiveness of bicycle count data from permanent sensing. Events like construction can compromise reliability, while obtainability is hindered by limited accessibility outside major cities and the lack of centralized platforms for finding and aggregating data. Additionally, coverage is often insufficient to provide a clear picture across the city, with data collected at only a few locations within a city.

Give feedback
Bicycle count data – temporary sensing

Temporary bicycle sensing involves deploying portable counters at selected locations for a limited period, typically ranging from a few days to a few weeks. Using technologies like pneumatic tubes or infrared sensors, these short-term counts provide snapshots of cycling activity, often used to supplement permanent data, capture seasonal variations, assess areas without permanent infrastructure, or inform planning decisions when infrastructure changes are being considered.

Use Case | Data Gap Summary
Interpolating city-wide bicycle volumes from limited count data

Bicycle counts from temporary sensing face several obtainability and sufficiency challenges. Accessibility is often limited outside major cities due to a lack of capacity to publish data, and users must contact individual cities to access it, as no central platforms exist. The data itself is often insufficient (collected at only a few locations and for short periods), limiting its usefulness. Additionally, limited documentation about why a count was conducted at a specific time and location reduces the usability of the data by omitting important contextual information.

Give feedback
Bike infrastructure data – from city administrations

Bike infrastructure data from city administrations is usually generated through planning and transportation departments using GIS tools, engineering surveys, and infrastructure project records. It reflects officially planned and built cycling facilities, including bike lanes, shared paths, and intersections. The quality of this data is higher than crowdsourced alternatives, with a tolerance down to a few centimeters.

Use Case | Data Gap Summary
Interpolating city-wide bicycle volumes from limited count data

City-provided bike infrastructure data is often hard to find, with no central platforms and limited sharing beyond major cities. Even when available, it may lack clear documentation on what has been implemented or omit detailed features of the infrastructure.

Give feedback
Biodiversity images and recordings – community science data

Images and recordings contributed by volunteers represent another significant source of data on biodiversity and ecosystems. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.

Use Case | Data Gap Summary
Improving terrestrial wildlife detection and species classification

One challenge with community science data is bias in geographic and taxonomic representativity: while community science data can provide broader coverage than formal survey data, it is highly biased, and the biases are not documented. Data tends to be concentrated in accessible areas and often focuses on charismatic or commonly encountered species. This limits the generalizability of ML models trained on this data.

Give feedback
Building data genome project (hourly building-level metered data)

The Building Data Genome Project 2 dataset contains hourly building-level data from 3,053 energy meters across 1,636 non-residential buildings, covering two years’ worth of metered data on electricity, water, and solar, in addition to metadata on area, primary building use category, floor area, time zone, weather, and smart meter type. The goal of the dataset is to allow for the development of generalizable building models for energy efficiency analysis studies. The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to university and higher education institutions.

Use Case | Data Gap Summary
Improving short-term electricity load forecasting

While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.

Give feedback
Building energy performance certificates

Energy Performance Certificates (EPCs) are official documents that rate the energy efficiency of a building on a scale from A (most efficient) to G (least efficient). They provide information on the property’s current energy use, typical energy costs, and recommendations for improving efficiency. EPCs are required when a property is built, sold, or rented, and help buyers or tenants understand potential energy expenses and environmental impact. While EPCs are specific to Europe, similar energy efficiency rating systems exist in countries like the United States, Australia, and Canada under different names and regulations.

Use Case | Data Gap Summary
Enhancing the scalability and robustness of building stock assessments

Energy Performance Certificate (EPC) datasets face major gaps in aggregation, provenance, documentation, missing components, structure, and timeliness. Differences in formats and methodologies across countries, limited metadata, outdated records, and missing key attributes can affect their usability.

Give feedback
Building stock – from cadaster and aerial imagery

Building stock maps enable a geolocalized understanding of where and which kind of buildings stand, relevant both to climate change mitigation and adaptation. Building stock data from cadasters and aerial imagery provide the most precise available data. In addition to precise building footprints, the 3D geometry of walls and roofs may be available thanks to LiDAR aerial surveys. Further high-quality information from the cadaster may be available as attributes, such as the current usage or the construction year of the building.

Use Case | Data Gap Summary
Assessing rooftop solar photovoltaic potential

This use case requires 3D models of buildings that include roof geometries (surfaces and angles), which only a few datasets, mostly in Europe, currently provide.

Give feedback
Enhancing the scalability and robustness of building stock assessments

Datasets tend to face gaps in obtainability, reliability, usability, and sufficiency. These include challenges in finding and interpreting data due to inconsistent naming, poor documentation, variable quality, limited geographic and temporal coverage, and inconsistent data models requiring manual aggregation.

Give feedback
Facilitating disaster risk assessments

These datasets are mainly available in high-income countries in Europe, North America, and Asia. Large parts of the world that face pressing challenges involving their building stock therefore lack appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, provided that sufficient training data is available. Research efforts, particularly in Europe, are addressing several of the existing data gaps, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382).

Give feedback
Building stock – satellite-derived
Details (click to expand)

Building stock maps enable a geolocalized understanding of where and which kind of buildings stand, relevant both to climate change mitigation and adaptation. Satellite-derived datasets, which often use ML for processing satellite imagery, can provide such maps on a global scale. Coarser-resolution maps come as raster data at resolutions varying from 10 to more than 100 m, while the maps with the highest resolution provide details on building footprint geometries as vector data. Some of these datasets may have a temporal resolution and some inferred attributes describing the building characteristics.

Use Case | Data Gap Summary
Enhancing the scalability and robustness of building stock assessments

Building datasets generated through large-scale ML extraction, such as Microsoft ML buildings, face reliability and sufficiency issues due to limited validation, positional inaccuracies, and inferred heights with low accuracy. Usability is also hindered by missing documentation on methodologies and input imagery, while datasets with coarse raster resolution or missing key attributes like usage type or age reduce the data’s applicability for detailed energy analyses.

Give feedback
Facilitating disaster risk assessments

These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.

Give feedback
Understanding the impact of urban planning on travel emissions

Datasets that track the evolution of the building stock, derived from different vintages of satellite imagery, provide valuable information on urban expansion and on new construction in general. Examples include the World Settlement Footprint Evolution dataset and the Global Human Settlement Layer - AGE. These datasets, however, are provided as raster data, and their resolution is insufficient for analyzing micro-scale urban planning interventions.

Give feedback
CMIP6 (earth system model intercomparison data)
Details (click to expand)

CMIP6 (Coupled Model Intercomparison Project Phase 6) provides climate simulations from a consortium of state-of-the-art global climate models, covering historical periods and future scenarios through 2100. The dataset includes multiple climate variables at various spatial and temporal resolutions from modeling centers worldwide. Data can be found at https://pcmdi.llnl.gov/CMIP6/.

Use Case | Data Gap Summary
Accelerating data-driven generation of climate simulations

The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.

Give feedback
Enhancing bias-correction of climate projections

Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.

Give feedback
CPC Precipitation (global unified daily precipitation)
Details (click to expand)

The CPC Global Unified Gauge-Based Analysis of Daily Precipitation is available at https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

Use Case | Data Gap Summary
Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Give feedback
City-level transportation mode share data
Details (click to expand)

City-level modal share data represents the distribution of transportation modes—such as walking, cycling, public transit, and driving—used by residents within a city. This data is typically gathered through travel surveys, censuses, or mobility studies and provides insights into how people commute and travel locally.
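As a minimal sketch of what a modal share is, the snippet below computes the fraction of trips per mode from a list of trip records. The record layout and the `mode_shares` helper are hypothetical and used only for illustration. Real travel surveys typically apply sampling weights and may count distance or time rather than trips, which this sketch ignores.

```python
from collections import Counter


def mode_shares(trips):
    """Return the fraction of trips made with each transportation mode."""
    counts = Counter(trip["mode"] for trip in trips)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: n / total for mode, n in counts.items()}


# Hypothetical survey records: one entry per reported trip.
survey_trips = [
    {"mode": "walking"},
    {"mode": "cycling"},
    {"mode": "public transit"},
    {"mode": "driving"},
    {"mode": "driving"},
]

shares = mode_shares(survey_trips)
for mode, share in sorted(shares.items()):
    print(f"{mode}: {share:.0%}")
```

Two cities are only comparable under such a calculation if their surveys define a trip, a mode, and the sampled population in the same way.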

Use Case | Data Gap Summary
Enabling inference of city-level transportation mode shares

Data quality and time-series consistency are the main issues: no dataset covers multiple cities with a single methodology while also being directly usable and highly trustworthy for scientific research.

Give feedback
ClimSim (benchmark data for hybrid ML-physics research)
Details (click to expand)

ClimSim is an ML-ready benchmark dataset designed for hybrid ML-physics research, for example, for emulating subgrid clouds and convection processes in climate models.

Use Case | Data Gap Summary
Hybrid ML-physics climate models for enhanced simulations

ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.

Give feedback
ClimateBench v1.0 (benchmark dataset for earth system models)
Details (click to expand)

ClimateBench v1.0 is a benchmark dataset derived from the NorESM2 Earth System Model (a participant in CMIP6) designed specifically for evaluating machine learning methods that emulate key climate variables. The dataset is publicly available at https://zenodo.org/records/7064308

Use Case | Data Gap Summary
Accelerating data-driven generation of climate simulations

The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.

Give feedback
ClimateSet (ML-ready earth system model inputs/outputs)
Details (click to expand)

ClimateSet is an ML-ready benchmark dataset compiled from inputs and outputs of the Input4MIPs and CMIP6 archives, structured for various machine learning tasks including climate model emulation, downscaling, and prediction. More information is available at https://arxiv.org/pdf/2311.03721.pdf

Use Case | Data Gap Summary
Accelerating data-driven generation of climate simulations

No significant data gap identified yet.

Give feedback
Computational fluid dynamics simulation for building energy models
Details (click to expand)

Computational fluid dynamics (CFD) simulation output from building energy models is a means of precisely assessing the thermal (e.g., wall insulation) and ventilation (e.g., natural ventilation or HVAC) properties of a building. Given the building geometry, terrain, presence of neighboring buildings, and boundary conditions, the Navier-Stokes equations are typically solved numerically. Datasets that include precise building inputs and CFD outputs would help build ML surrogate models. Surrogate models, such as GANs or physics-constrained deep neural network architectures, have shown promising results, though further research on turbulence representation is needed.

Use Case | Data Gap Summary
Accelerating building energy models

Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive. This makes them difficult to include in the early phase of the design process, when building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties, which may not be available for traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data generated for different wind directions become challenging to manage. Future data collection for verifying simulation output can benefit surrogate or proxy approaches to the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern building approaches, which excludes passive building techniques from indigenous communities, known as vernacular architecture, from design consideration.

Give feedback
DOE Atmospheric Radiation Measurement research facility data products
Details (click to expand)

The DOE Atmospheric Radiation Measurement (ARM) dataset comprises ground-based measurements from various field programs sponsored by the US Department of Energy, including sun-tracking photometers, radiometers, and spectrometer data useful for solar radiation time series forecasting and solar potential assessment.

Use Case | Data Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.

Give feedback
DYAMOND (global atmospheric circulation model intercomparison data)
Details (click to expand)

DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains) is an intercomparison of global storm-resolving model simulations at resolutions of 5 km or finer, used as targets for climate model emulators.

Use Case | Data Gap Summary
Hybrid ML-physics climate models for enhanced simulations

DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.

Give feedback
Digital elevation model
Details (click to expand)

Surface elevation data, often called digital elevation model or terrain surface model, provide a 3D representation of the bare surface of the Earth. These topographic inputs are important for disaster risk assessments and modeling to assess risks due to floods, sea level rise, or landslides, where the elevation of a given location determines whether it is at risk. These digital models are typically estimated from remote sensing data, for example, the Shuttle Radar Topography Mission. They are often provided as raster but may also be provided as points (vector).

Use Case | Data Gap Summary
Facilitating disaster risk assessments

Very high-resolution reference data is currently not freely open to the public.

Give feedback
Direct measurement of methane emission of rice paddies
Details (click to expand)

With sampling systems placed in rice paddies, methane concentrations can be directly measured in the air above the fields or in the soil. 

Use Case | Data Gap Summary
Enhancing estimations of methane emissions from rice paddies

There is a lack of direct observation of methane emissions from rice paddies.

Give feedback
Distribution system simulators
Details (click to expand)

Distribution system simulators such as OpenDSS and GridLAB-D enable analysis of hosting capacity for distribution-level substation feeders by simulating how various factors affect grid stability and reliability. These open-source tools allow researchers to model voltage limits, thermal capabilities, control parameters, and fault currents under different scenarios, providing insights into how distribution grids can safely accommodate distributed energy resources like solar panels. These simulators serve as critical alternatives when real circuit feeder data from utilities is unavailable.

Use Case | Data Gap Summary
Accelerating distribution-side hosting capacity estimations

While OpenDSS and GridLAB-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.

Give feedback
Drone imagery
Details (click to expand)

Drone imagery provides high-resolution, close-range visual data for species identification, individual tracking, and environmental reconstruction. These images offer detailed insights into habitats and sometimes direct observation of species populations (e.g. trees and large mammals), similar to camera traps but with greater flexibility in coverage. Currently, most drone imagery data is scattered across disparate sources, though some collections are hosted on platforms like www.lila.science.

Use Case | Data Gap Summary
Automating individual re-identification for wildlife

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Enhancing digital reconstructions of the environment

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Facilitating forest restoration monitoring

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
Improving terrestrial wildlife detection and species classification

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Give feedback
ECMWF ENS (global 9-km 15-day ahead weather model ensemble)
Details (click to expand)

Ensemble forecast system providing 50 probabilistic forecasts up to 15 days ahead at 9-km resolution, generated twice daily by ECMWF’s numerical weather prediction model. This ensemble approach quantifies forecast uncertainty and serves as a benchmark for evaluating ML-based probabilistic weather forecasting. Data can be found here.

Use Case | Data Gap Summary
Enhancing bias-correction of weather forecasts

As with HRES, the biggest challenge with ENS is that only a portion of the data is freely available to the public.

Give feedback
Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge with ENS is that only a portion of the data is freely available to the public.

Give feedback
ECMWF ERA5 Atmospheric Reanalysis
Details (click to expand)

ERA5 is a comprehensive atmospheric reanalysis dataset covering 1940 to present that integrates in-situ and remote sensing observations from weather stations, satellites, and radar into a global, hourly gridded product at 31 km resolution. The dataset is continuously updated and available for download through the Copernicus Climate Data Store.

Use Case | Data Gap Summary
Enhancing bias-correction of climate projections

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

Give feedback
Hybrid ML-physics climate models for enhanced simulations

While ERA5 is widely used due to its good structure and global coverage, users face significant challenges with downloading times that can take days to months, and the sheer data volume presents processing difficulties for many users. 

Give feedback
Weather forecasting: Subseasonal horizon

ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.

Give feedback
ECMWF HRES (global 9-km 10-day ahead weather model)
Details (click to expand)

Single high-resolution deterministic forecast up to 10 days ahead generated by ECMWF’s Integrated Forecasting System (IFS), providing global weather predictions at 9-km resolution updated twice daily. This dataset serves as a benchmark for evaluating ML-based weather forecasting approaches. Data can be found here.

Use Case | Data Gap Summary
Enhancing bias-correction of weather forecasts

Limited public access to real-time high-resolution forecasts and computational challenges from large data volumes restrict ML model development and validation for operational weather bias correction.

Give feedback
Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge with HRES is that only a portion of the data is freely available to the public.

Give feedback
EPRI10 (transmission control center alarm and operational data set)
Details (click to expand)

Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format, which includes semi-structured text descriptions of individual alarm events. 

Often, the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.

Use Case | Data Gap Summary
Facilitating grid reliability events analysis

Access to EPRI10 grid alarm data is currently limited to within EPRI. Usability gaps result from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to determine whether a fault actually occurred.

Give feedback
EUROSTAT city-level socio-economic data
Details (click to expand)

In order to increase the availability and quality of data at a more disaggregated level, Eurostat has promoted and coordinated the efforts of national statistical offices in delivering harmonised city statistics, and disseminates the data on its website. 

The data offers statistics on a wide range of indicators, including population demographics, employment rates, education levels, income, and housing conditions across European cities. This data is collected and standardized with the goal of providing consistent comparisons and analysis of urban development, economic performance, and well-being.

Use Case | Data Gap Summary
Enabling inference of city-level transportation mode shares

EUROSTAT city-level socio-economic data faces challenges with inconsistent time series due to changing boundaries, incomplete validation and aggregation issues, and missing values that limit its reliability and usability.

Give feedback
Electric vehicle charge station data
Details (click to expand)

Electric vehicle charging station datasets typically include location, charger specifications, energy delivery amounts, charge duration, costs, and usage patterns for both AC slow charging (depot-based) and DC fast charging (en-route) stations, though specific datasets vary by provider and region.

Use Case | Data Gap Summary
Optimizing electrified bus fleet in urban vehicle-to-grid systems

Critical gaps include limited findability of station-specific usage data due to proprietary restrictions, and scattered data sources that require aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.

Give feedback
Electric vehicle infrastructure transnational data
Details (click to expand)

Electric vehicle (EV) infrastructure data includes information shared across countries on charging station availability, grid capacity, energy sources, interoperability of charging systems, deployment rates, and usage patterns. Transnational datasets would help track how EV support systems are developing globally and understand if infrastructures across countries can sustain the likely increased number of cars coming from new and second-hand international markets.

Use Case | Data Gap Summary
Understanding fleet overturning and international second-hand vehicle markets

While certain countries have good electric vehicle infrastructure data, there is a need to create transnational datasets, for example to analyze infrastructure readiness in case of increasing international trade of EVs.

Give feedback
Emission dataset compiled from FAO statistics
Details (click to expand)

This dataset comprises agricultural emissions data compiled from Food and Agriculture Organization (FAO) statistics and spatially extrapolated to provide geospatial coverage. It includes estimates of greenhouse gas emissions related to agricultural practices across different regions worldwide and is periodically updated as new FAO statistics become available.

Use Case | Data Gap Summary
Modeling effects of soil processes on soil organic carbon

Data is extrapolated from national-level statistics. It is unknown how accurate this data is at the local level.

Give feedback
Environmental DNA (eDNA)
Details (click to expand)

Environmental DNA (eDNA) datasets consist of genetic material obtained from environmental samples, like soil and water, after being shed by living or dead organisms. By analyzing this genetic material, researchers can detect and monitor species present in a non-invasive and efficient manner, aiding biodiversity studies, conservation efforts, and environmental monitoring. Some eDNA data can be found via GBIF (the Global Biodiversity Information Facility). 

Use Case | Data Gap Summary
Automating individual re-identification for wildlife

A significant challenge for eDNA-based monitoring is the incompleteness of barcoding reference databases, which limits the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.

Give feedback
Enhancing digital reconstructions of the environment

One data gap is the incompleteness of barcoding reference databases.

Give feedback
Improving terrestrial wildlife detection and species classification

eDNA is an emerging technique in biodiversity monitoring, and many issues still impede the application of eDNA-based tools. One important data gap is the incompleteness of barcoding reference databases.

Give feedback
Equivalent circuit models
Details (click to expand)

Equivalent circuit models are simplified representations of batteries that use networks of resistors and capacitors to model battery behavior arising from electrochemical reactions. Due to their ease of use, they can be integrated easily into battery management control systems and customized to model a variety of battery chemistries and conditions. Different types of equivalent circuit models include the Rint model, hysteresis models, Randles models, and Thevenin models. These models differ in complexity with respect to the extent to which battery behavior is captured. For example, the simplest model, the Rint model, is static, while other models vary in their representation of dynamic properties such as state of charge and battery lifetime.
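To make the difference between a static and a dynamic equivalent circuit model concrete, here is a minimal sketch of the Rint model and a first-order Thevenin model (one resistor-capacitor pair). It assumes a constant open-circuit voltage, a positive current for discharge, and purely illustrative parameter values. None of these numbers come from a real cell, and practical models make the open-circuit voltage depend on state of charge.

```python
import math


def rint_voltage(ocv, current, r0):
    """Static Rint model: terminal voltage is OCV minus the ohmic drop."""
    return ocv - current * r0


def thevenin_voltages(ocv, currents, dt, r0, r1, c1):
    """First-order Thevenin model: Rint plus one RC pair with memory.

    The RC voltage follows dV/dt = -V / (r1 * c1) + I / c1, integrated
    exactly over each step of length dt with the current held constant.
    """
    decay = math.exp(-dt / (r1 * c1))
    v_rc = 0.0  # assume the RC pair starts fully relaxed
    voltages = []
    for current in currents:
        voltages.append(ocv - current * r0 - v_rc)
        v_rc = v_rc * decay + r1 * (1.0 - decay) * current
    return voltages


# Illustrative (made-up) parameters: 2 A discharge for 300 one-second steps.
profile = [2.0] * 300
v = thevenin_voltages(ocv=3.7, currents=profile, dt=1.0, r0=0.05, r1=0.02, c1=2000.0)
print(f"first step: {v[0]:.3f} V, last step: {v[-1]:.3f} V")
```

The Rint model responds instantly to a current step, whereas the Thevenin model's voltage keeps sagging as the RC pair charges. This is the kind of dynamic behavior the more complex models add.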

Use Case | Data Gap Summary
Improving battery management systems

While equivalent circuit models (ECMs) enable real-time predictions of battery state of charge (SoC) due to their computational efficiency, they often oversimplify real-life operating conditions, which limits the accuracy of state of health (SoH) and remaining useful life (RUL) estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.

Give feedback
Faraday (Synthetic smart meter data)
Details (click to expand)

Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open up smart meter data, Octopus Energy’s Centre for Net Zero has generated a synthetic dataset conditioned on the presence of low-carbon technologies, energy efficiency, and property type, using a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier. Faraday is currently accessible through the Centre for Net Zero’s API.

Use Case | Data Gap Summary
Improving short-term electricity load forecasting

Although the model was trained on actual AMI data, generating synthetic data requires a priori knowledge of building type, efficiency, and presence of low-carbon technologies. Furthermore, since the model was trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or data aggregated at the substation level to assess its effectiveness.

Give feedback
FathomNet (marine wildlife annotated imagery)
Details (click to expand)

FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. The data can be used to train, test, and validate ML algorithms to help us understand our ocean and its inhabitants.

Use Case | Data Gap Summary
Enhancing marine wildlife detection and species classification

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Give feedback
GPS travel trajectories and Origin-Destination data
Details (click to expand)

GPS travel routes refer to the detailed pathways captured by Global Positioning System (GPS) technology, which track the movement of individuals or vehicles from one location to another, providing precise data on the paths taken, speeds, and stops made during a journey. Origin-Destination (OD) data, on the other hand, specifically records the starting points (origins) and ending points (destinations) of trips, offering insights into travel patterns and demand without necessarily detailing the route taken. Together, these data types are essential for transportation planning, traffic management, and understanding mobility behaviors.
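The relationship between the two data types can be sketched as follows: a GPS trajectory is reduced to an Origin-Destination pair by keeping only its first and last points, snapped to a coarse grid, and the pairs are then counted. The coordinates, the 0.01-degree cell size, and the helper names are hypothetical. Real OD datasets usually use administrative zones and also apply trip segmentation and privacy safeguards.

```python
import math
from collections import Counter


def to_cell(point, cell_size=0.01):
    """Snap a (lat, lon) point to the index of a square grid cell."""
    lat, lon = point
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def od_pair(trajectory, cell_size=0.01):
    """Keep only the origin and destination cells, discarding the route."""
    return to_cell(trajectory[0], cell_size), to_cell(trajectory[-1], cell_size)


# Hypothetical trajectories: ordered lists of (lat, lon) GPS fixes.
trajectories = [
    [(52.5205, 13.4095), (52.5251, 13.4003), (52.5305, 13.3855)],
    [(52.5207, 13.4091), (52.5302, 13.3851)],
    [(52.5305, 13.3855), (52.5205, 13.4095)],
]

od_matrix = Counter(od_pair(t) for t in trajectories)
for (origin, destination), count in od_matrix.items():
    print(origin, "->", destination, ":", count)
```

The first two trips follow different routes but collapse into the same OD pair. This illustrates how OD data captures demand between areas without detailing the path taken.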

Use Case | Data Gap Summary
Understanding the impact of urban planning on travel emissions

Critical issues in using GPS and OD data include a lack of provenance details from commercial providers, causing analysis uncertainties. Accessibility is limited due to high costs and data silos from insufficient privacy-preserving sharing mechanisms. Essential trip details are often missing, and data from the Global South is less accessible despite global collection. Additionally, inadequate documentation undermines the data’s scientific value, as key information like sample representativeness is frequently absent, challenging accurate interpretation.

Give feedback
Grid2Op and PandaPower (power systems simulation outputs)
Details (click to expand)

Grid2Op is a power systems simulation framework to perform reinforcement learning for electricity network operation that focuses on the use of topology to control the flows of the grid. 

Grid2Op allows users to control voltages by manipulating shunts or changing setpoint values of generators, influence active generation through redispatching, and manipulate storage units such as batteries or pumped storage to produce or absorb energy from the grid when needed. The grid is represented as a graph with nodes being buses and edges corresponding to power lines and transformers. Grid2Op has several available environments with different network topologies as well as variables that can be monitored as observations. The environment is designed for reinforcement learning agents to act upon with a variety of actions, some binary and some continuous. These include changes in topology such as changing bus, changing line status, setting storage, curtailment, redispatching, setting bus values, and setting line status. Multiple reward functions are also available in the platform for experimentation with different agents. It is important to note that Grid2Op itself does not model the grid equations or prescribe a particular solver. Data on how the power grid evolves over time is represented by the “Chronics.” The solver that computes the state of the grid is represented by the “Backend,” which uses PandaPower to compute power flows.
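The graph view described above can be illustrated with a small, library-free sketch. Buses are nodes, power lines are edges with an in-service flag, and a "change line status" action is modeled by flipping that flag. This does not use the Grid2Op API; the 4-bus layout and the `connected_buses` helper are hypothetical and only show why topology actions matter.

```python
from collections import deque


def connected_buses(lines, start):
    """Return the buses reachable from `start` over in-service lines (BFS)."""
    neighbours = {}
    for (bus_a, bus_b), in_service in lines.items():
        if in_service:
            neighbours.setdefault(bus_a, set()).add(bus_b)
            neighbours.setdefault(bus_b, set()).add(bus_a)
    seen = {start}
    queue = deque([start])
    while queue:
        bus = queue.popleft()
        for nxt in neighbours.get(bus, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical 4-bus ring: keys are (bus, bus) pairs, values are line status.
lines = {(0, 1): True, (1, 2): True, (2, 3): True, (0, 3): True}
before = connected_buses(lines, 0)

# Two line-status actions disconnect both lines attached to bus 3.
lines[(2, 3)] = False
lines[(0, 3)] = False
after = connected_buses(lines, 0)

print("reachable before:", sorted(before))
print("reachable after:", sorted(after))
```

In an actual simulation, a backend power flow solver would additionally compute how flows redistribute after such an action, which this connectivity check leaves out.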

Use Case | Data Gap Summary
Improving power grid optimization

Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.

Ground survey of land use and land management

 Ground surveys collect direct field observations on land use practices and management approaches, providing critical ground-truth data that complements remote sensing. This information is essential for understanding human impacts on ecosystems and validating satellite-derived land cover classifications.

Use Case | Data Gap Summary
Facilitating the detection of climate-induced ecosystem changes

Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.

Ground-Based Weather Station Observations

Ground-based weather station data provides point measurements of atmospheric variables including temperature, precipitation, and humidity from meteorological networks worldwide. These observations serve as ground truth for validating and bias-correcting climate model outputs, though spatial coverage varies significantly by region and is particularly sparse in developing countries.

Use Case | Data Gap Summary
Enhancing bias-correction of climate projections

Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.

Enhancing bias-correction of weather forecasts

Sparse spatial coverage, restricted data access in many regions, and the need for gridding point measurements limit the effectiveness of station observations for training and validating ML bias-correction models.
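The gridding step mentioned above can be sketched with inverse distance weighting (IDW), one common interpolation choice; the source does not prescribe a method. The function name `idw_grid`, the `(lat, lon, value)` station tuples, and the use of planar distance in degrees are simplifying assumptions for illustration.

```python
import math


def idw_grid(stations, lats, lons, power=2.0):
    """Interpolate (lat, lon, value) station readings onto a regular grid.

    Distances are planar in degrees, a simplification that is only
    reasonable for small regions away from the poles.
    """
    grid = []
    for lat in lats:
        row = []
        for lon in lons:
            weighted_sum = weight_total = 0.0
            exact = None
            for s_lat, s_lon, value in stations:
                dist = math.hypot(lat - s_lat, lon - s_lon)
                if dist == 0.0:
                    # Grid node coincides with a station: use its value directly.
                    exact = value
                    break
                weight = 1.0 / dist ** power
                weighted_sum += weight * value
                weight_total += weight
            row.append(exact if exact is not None else weighted_sum / weight_total)
        grid.append(row)
    return grid


# Two hypothetical stations reporting temperature in degrees Celsius.
stations = [(0.0, 0.0, 10.0), (0.0, 2.0, 20.0)]
grid = idw_grid(stations, lats=[0.0], lons=[0.0, 1.0, 2.0])
```

With sparse networks, every grid node far from any station is still assigned a value, so the interpolated field can look complete while carrying large uncertainty. This is one reason station density limits bias-correction accuracy.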

Ground-survey based forest inventory data

Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models. One example is the USDA Forest Service Forest Inventory and Analysis (FIA) program: https://research.fs.usda.gov/programs/fia#data-and-tools

Use Case | Data Gap Summary
Improving estimations of forest carbon stock

Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.

High-Resolution Rapid Refresh (HRRR) weather forecast

The High-Resolution Rapid Refresh (HRRR) dataset contains near-term weather forecasts produced at 3-km resolution with hourly updates. It is a cloud-resolving, convection-allowing atmospheric model that assimilates radar data every 15 minutes over a 1-hour period. See https://rapidrefresh.noaa.gov/hrrr/

Use Case | Data Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

Data volume is large, and only data covering the US is available.

Historical climate observations

Historical climate observations encompass both global reanalysis datasets like ERA5, which provide comprehensive atmospheric data at coarse resolution worldwide, and local weather station records that offer higher temporal and spatial granularity for specific regions. These datasets typically cover multiple decades and include variables such as temperature, precipitation, humidity, and wind patterns.

Use Case | Data Gap Summary
Facilitating the detection of climate-induced ecosystem changes

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

Improving assessments of climate impacts on public health

Climate data accessibility and integration challenges limit ML applications in climate-health research. Data exists in diverse formats that require significant preprocessing, and researchers without climate expertise struggle to identify appropriate datasets for their specific health applications.

Interpolating city-wide bicycle volumes from limited count data

There are no major data gaps for this use case in cities where the other necessary data sources are available.

JRC PVGIS (solar radiation data)

PVGIS (Photovoltaic Geographical Information System), developed by the European Commission’s Joint Research Centre (JRC), is a comprehensive online tool designed to assess the solar energy potential of any location. It combines satellite-derived solar radiation data, weather information, and models that estimate photovoltaic system performance, including expected energy output and losses based on local climate and system parameters.

Use Case | Data Gap Summary
Assessing rooftop solar photovoltaic potential

This dataset does not have major data gaps for this use case, but there are some approximations and other errors in the data to be considered.

Lab measurements of material property and carbon absorption

This data comprises lab measurements of material properties (such as chemical composition and physical properties) and of carbon absorption performance (such as absorption capacity).

Use Case | Data Gap Summary
Accelerating the design of new carbon-absorbing materials

The major challenge is that data is not shared with the public.

Large-eddy simulations (atmospheric processes)

Large-eddy simulations are very high-resolution atmospheric simulations (finer than 150 m) where atmospheric turbulence is explicitly resolved in the model, providing detailed insights into small-scale atmospheric processes.

Use Case | Data Gap Summary
Hybrid ML-physics climate models for enhanced simulations

These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.

LiDAR point cloud – airborne

Airborne LiDAR (Light Detection and Ranging) collects high-resolution, three-dimensional point clouds of forest structure using sensors mounted on aircraft or drones. This technology captures precise data about forest canopies, enabling detailed assessment of biomass and carbon stocks at local to regional scales.

Use Case | Data Gap Summary
Improving estimations of forest carbon stock

Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.

Marktstammdatenregister (solar photovoltaic data)

The Marktstammdatenregister (MaStR) is Germany’s official registry for all energy-producing units, providing detailed and authoritative data on solar photovoltaic (PV) systems. It includes mandatory registration of all grid-connected solar PV installations, covering system type (e.g., rooftop or ground-mounted), installed capacity, commissioning date, geolocation, operator type, and grid connection details. Updated regularly and publicly accessible, MaStR enables comprehensive analysis of solar deployment patterns, supports grid planning, and helps track progress toward Germany’s energy transition and climate goals.

Use Case | Data Gap Summary
Mapping existing solar photovoltaic systems

Although MaStR is one of the best datasets for solar PV, it contains substantial errors, for example in temporal or positional accuracy.

Material intensity data

Material intensity coefficients are numerical values that represent the amount of material used per unit of a reference measure, such as floor area, weight, or volume of a building or product. Typically expressed in units like kilograms per square meter (kg/m²), they are used to quantify how much of a specific material (e.g., concrete, steel, insulation) is needed for construction or manufacturing. These coefficients are essential in life cycle assessments (LCA), material flow analysis, and environmental impact studies, helping estimate resource use, waste generation, and embodied carbon.
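To make the arithmetic concrete, the sketch below multiplies floor area by material intensity coefficients to estimate a building's material stock, then applies emission factors to estimate embodied carbon. All coefficient and factor values are hypothetical placeholders, not taken from any dataset, and the function names are illustrative.

```python
# Hypothetical material intensity coefficients in kg per m² of floor area.
INTENSITY_KG_PER_M2 = {"concrete": 1200.0, "steel": 60.0, "insulation": 5.0}

# Hypothetical embodied carbon factors in kg CO2e per kg of material.
CARBON_KGCO2E_PER_KG = {"concrete": 0.1, "steel": 2.0, "insulation": 3.0}


def material_stock_kg(floor_area_m2, intensities):
    """Material mass per material: coefficient (kg/m²) times floor area (m²)."""
    return {material: k * floor_area_m2 for material, k in intensities.items()}


def embodied_carbon_kgco2e(stock_kg, carbon_factors):
    """Total embodied carbon: sum over materials of mass times emission factor."""
    return sum(mass * carbon_factors[material] for material, mass in stock_kg.items())


stock = material_stock_kg(250.0, INTENSITY_KG_PER_M2)
carbon = embodied_carbon_kgco2e(stock, CARBON_KGCO2E_PER_KG)
```

Because the result scales linearly with each coefficient, any error or ambiguity in a coefficient's provenance or reference unit carries straight through to the stock and carbon estimates.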

Use Case | Data Gap Summary
Enhancing the scalability and robustness of building stock assessments

Material intensity coefficient datasets face key gaps in aggregation, provenance, documentation, granularity, and timeliness. These issues stem from inconsistent formats, missing metadata, outdated or high-level data, and limited transparency on how values are derived, all of which can hinder reliable, comparative use in material and emissions modeling.

Micro-synchrophasors (µPMU data)

Micro-phasor measurement units (µPMUs) provide synchronized voltage and current measurements with higher accuracy, precision, and sampling rate than classic PMUs, making them well suited to distribution network monitoring.

For example, µPMUs have an angle accuracy allowance of 0.01 degrees and a total vector error allowance of 0.05%, in contrast to 1 degree and 1% for classic PMUs. With sampling rates of 10-120 samples per second, µPMUs can capture dynamic and transient states within the low voltage distribution network, allowing for improved event and fault detection and localization. Today, most µPMU datasets are obtained through manual field deployments in test-beds, through collaborative research studies, or from publicly available datasets.
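For readers unfamiliar with total vector error (TVE), the sketch below computes it as the magnitude of the difference between a measured and a reference phasor, normalized by the reference magnitude. The helper names and the example phasors are illustrative assumptions; the example isolates a pure 0.01-degree angle error with no magnitude error.

```python
import cmath
import math


def phasor(magnitude, angle_deg):
    """Build a complex phasor from a magnitude and an angle in degrees."""
    return cmath.rect(magnitude, math.radians(angle_deg))


def total_vector_error(measured, reference):
    """TVE as a fraction: |measured - reference| / |reference|."""
    return abs(measured - reference) / abs(reference)


reference = phasor(1.0, 30.0)   # hypothetical true phasor (per unit)
measured = phasor(1.0, 30.01)   # same magnitude, 0.01 degree angle error
tve_percent = 100.0 * total_vector_error(measured, reference)
```

In this example the angle error alone contributes a TVE of roughly 0.017%, which leaves room for some magnitude error within a 0.05% allowance.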


Use Case | Data Gap Summary
Facilitating fault detection in low voltage distribution grids

For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.

Museum specimens

Museum specimens contain detailed biological records documenting species’ characteristics, including morphological traits. Data on where and when they were collected is also often recorded. This offers documentation on the occurrence of species in both space and time. Museum specimens are valuable resources for various applications, such as species classification and species distribution modeling. 

Use Case | Data Gap Summary
Improving terrestrial wildlife detection and species classification

The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.

NEX-GDDP-CMIP6 (Global daily downscaled long-term climate projections)

The NEX-GDDP-CMIP6 dataset provides high-resolution, bias-corrected global climate projections derived from Coupled Model Intercomparison Project Phase 6 (CMIP6) across four greenhouse gas emissions scenarios (Shared Socioeconomic Pathways). It includes daily climate variables such as temperature, precipitation, humidity, and radiation from 2015 to 2100 at approximately 25km resolution, enabling detailed analysis of climate change impacts sensitive to local topography and fine-scale climate gradients. For more information, see https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6.

Use Case | Data Gap Summary
Improving long-term extreme heat prediction

The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.

NIST campus photovoltaic arrays and weather station data

This dataset contains measurements from PV arrays at the National Institute of Standards and Technology campus from August 2014 to July 2017, including electrical, temperature, meteorological, and radiation data sampled at high frequency and reported as one-minute averages.

Use Case | Data Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

The dataset has limited spatial coverage (Gaithersburg, MD only) and is no longer maintained after July 2017, limiting its usefulness for current applications.

NOAA's SOLRAD Network Solar Radiation Data

The National Oceanic and Atmospheric Administration’s SOLRAD Network monitors surface radiation at nine locations across the United States. The data includes high-precision measurements from various instruments, including pyrheliometers, pyranometers, and UV radiometers that collect minute-interval measurements of incoming solar radiation. These measurements characterize the Earth’s surface radiation budget and can be used to accurately forecast solar energy generation for grid planning and management.

Use Case | Data Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include hourly averages of lower quality than the native-resolution data, and limited geographic coverage with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.

NREL NOW23 (wind data)

The National Renewable Energy Laboratory’s NOW-23 data set is the latest wind resource data set for offshore regions in the United States.

The NOW-23 data set was produced using the Weather Research and Forecasting (WRF) model version 4.2.1. A regional approach was used: for each offshore region, the WRF setup was selected based on validation against available observations. The WRF model was initialized with the European Centre for Medium-Range Weather Forecasts Reanalysis v5 (ERA5) data set, using a 6-hour refresh rate. The model was configured with an initial horizontal grid spacing of 6 km and an internal nested domain that refined the spatial resolution to 2 km. The model was run with 61 vertical levels, with 12 levels in the lower 300 m of the atmosphere, stretching from 5 m to 45 m in height.

It is accessible here: https://data.openei.org/submissions/4500 

Use Case | Data Gap Summary
Improving offshore wind power forecasting: short-to long-term (3 hours–1 year)

The data is numerically modeled rather than directly observed.

NREL Physical Solar Model Solar Radiation Database

The National Renewable Energy Laboratory (NREL)’s National Solar Radiation Database (NSRDB) provides hourly and half-hourly solar radiation data modeled using NREL’s Physical Solar Model (PSM). The data is derived from multiple satellite sources including NOAA’s Geostationary Operational Environmental Satellites (GOES), the Interactive Multisensor Snow and Ice Mapping System (IMS), MODIS, and MERRA-2 reanalysis. The PSM derives cloud and aerosol properties as inputs for the Fast All-sky Radiation Model for Solar applications (FARMS), enabling users to access spectral irradiance data based on time, location, and PV orientation.

Use Case | Data Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.

NREL SRRL Baseline Measurement System for Multi-Variable Solar Research

The NREL Solar Radiation Research Laboratory’s Baseline Measurement System (SRRL BMS) provides 130 variables at 60-second intervals for site-specific environmental factors at its Golden, Colorado facility. This comprehensive dataset includes co-located measurements of temperature, pressure, precipitation, wind parameters, humidity, UV index, aerosol optical depth, albedo, and cloud cover categorized as opaque, thin, and clear. This multi-variable dataset supports photovoltaic potential studies and renewable resource climatology research.

Use Case | Data Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While NREL’s SRRL BMS provides real-time joint variable data from ground-based sensors, its coverage is limited to a single location in Golden, Colorado, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.

NREL Solar panel PV system dataset

The Solar Panel PV System Dataset (https://www.kaggle.com/datasets/arnavsharmaas/solar-panel-pv-system-dataset/data) is a tabular dataset from the National Renewable Energy Laboratory that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery-rated capacity in the US. 

The solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes. 

Use Case | Data Gap Summary
Mapping existing solar photovoltaic systems

The solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Because the data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap: some of the data is historical and may not reflect current pricing and costs of PV systems.

NREL WIND toolkit (wind and weather)

The National Renewable Energy Laboratory WIND Toolkit includes instantaneous meteorological conditions from computer model output and calculated turbine power for more than 126,000 sites in the continental United States for the years 2007–2013. While the dataset mostly covers onshore areas, it also has an offshore component.

It features three datasets:

The meteorological dataset includes basic information on the weather conditions in each 2-km x 2-km grid cell. The meteorological dataset also includes parameters such as wind profiles, atmospheric stability, and solar radiation data in those cells.

The power dataset was created using wind data at 100-meter hub height and site-appropriate turbine power curves to estimate the power produced at each of the turbine sites.

The forecast dataset includes forecasts for 1-hour, 4-hour, 6-hour, and 24-hour forecast horizons.

The data features 2–4-km spatial resolution and 5-minute to hourly temporal resolution for offshore and land-based wind in the continental United States, Hawaii, and Alaska. It uses ensemble-based modeling for uncertainty quantification. Accessible here: https://www.nrel.gov/grid/wind-toolkit 
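The power dataset described above converts hub-height wind speed into power using turbine power curves. A minimal sketch of that conversion follows, using piecewise-linear interpolation over a hypothetical power curve. The curve points and the function name are assumptions for illustration and are not the site-appropriate curves used in the WIND Toolkit.

```python
from bisect import bisect_right

# Hypothetical power curve as (wind speed in m/s, power in kW) points.
# First point is the cut-in speed, last point is the cut-out speed.
POWER_CURVE = [(3.0, 0.0), (5.0, 200.0), (8.0, 900.0), (12.0, 2000.0), (25.0, 2000.0)]


def turbine_power_kw(wind_speed, curve=POWER_CURVE):
    """Estimate turbine power by linear interpolation along a power curve.

    Returns 0 below cut-in and above cut-out wind speed.
    """
    speeds = [speed for speed, _ in curve]
    if wind_speed < speeds[0] or wind_speed > speeds[-1]:
        return 0.0
    i = bisect_right(speeds, wind_speed)
    if i == len(speeds):
        # Exactly at the cut-out speed: use the last tabulated value.
        return curve[-1][1]
    (s0, p0), (s1, p1) = curve[i - 1], curve[i]
    return p0 + (p1 - p0) * (wind_speed - s0) / (s1 - s0)


# Convert a short series of hypothetical 100 m wind speeds (m/s) to power (kW).
power_series = [turbine_power_kw(v) for v in (2.0, 6.5, 12.0, 30.0)]
```

Because the curve is steep between cut-in and rated speed, small errors in modeled wind speed translate into comparatively large errors in estimated power. This matters when modeled wind data stands in for observations.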

Use Case | Data Gap Summary
Improving offshore wind power forecasting: short-to long-term (3 hours–1 year)

The data is outdated (it covers 2007–2013) and, as model output, is only a proxy for actual meteorological conditions.

NREL Wind Active Power Control Simulation Tools

NREL has developed simulation tools to understand the effects of wind power on interconnection system frequency, including the Flexible Energy Scheduling Tool for Integrating Variable Generation (FESTIV) and Multi-Area Frequency Response Integration Tool (MAFRIT). These tools use traditional commercial software and custom-developed models to perform dynamic simulations and wind generation studies for active power control of the grid.

These simulation tools include:

NREL Flexible Energy Scheduling Tool for Integrating Variable Generation (FESTIV)

NREL Multi-Area Frequency Response Integration Tool (MAFRIT)

Use Case | Data Gap Summary
Enhancing wind power grid integration and stability

Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.

NREL solar power data for integration studies

The NREL Solar Power Data for Integration Studies provides one year (2006) of 5-minute solar power data and hourly day-ahead forecasts for 6,000 simulated PV plants across the United States. The dataset was created using sub-hour irradiance algorithms and numerical weather prediction simulations, covering both utility-scale (with single-axis tracking) and distributed-scale (fixed-tilt) PV systems.

Use Case | Data Gap Summary
Improving solar power forecasting: long-term (>24 hours)

While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.

Natural hazards forecasts

Natural hazard data used for risk assessments can usually be modeled with characteristics derived from, and statistically consistent with, the observational record. Some hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse and in the Risk Data Library of the World Bank.

Use Case | Data Gap Summary
Facilitating disaster risk assessments

The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.

Negative experimental synthesis data

Experimental data typically include a combination of structural, performance, and synthesis information. This may involve crystallographic data (e.g., X-ray diffraction patterns), adsorption isotherms, gas or liquid permeability measurements, catalytic conversion rates, and thermal or chemical stability profiles. Additionally, metadata about synthesis conditions—such as temperature, pressure, reactant concentrations, and time—are recorded alongside characterization data like spectroscopy (e.g., NMR, IR, UV-Vis) and microscopy (e.g., SEM, TEM) to evaluate material morphology and composition. This data is often heterogeneous, high-dimensional, and collected under varying experimental conditions.

Use Case | Data Gap Summary
Enabling predictions of materials optimised for filtration, catalysis, electrics & magnetics

Negative – not only positive – synthesis experimental data is necessary to train algorithms, but such data is not publicly available.

Occupancy data from rest areas from cameras

Occupancy data from rest areas using cameras refers to information collected about how many vehicles—especially trucks—are parked at highway rest stops over time. Cameras installed at these sites capture images or video, which can be analyzed (manually or with computer vision algorithms) to determine how full the parking area is at different times of day or year. This data helps understand peak usage patterns, identify underused or overcrowded locations, and support planning for infrastructure like electric truck charging stations, making it a valuable tool for both transportation planning and climate policy.

Use Case | Data Gap Summary
Enabling assessments of rest area capacity and use for electric truck charging

This data is generally not shared and is accessible for only a few rest areas.

Ocean observations from floating infrastructure (FINO3)

FINO3 is an offshore research platform with a wind mast that measures wind speed and wind direction. Its datasets also include time series of temperature, air pressure, relative humidity, global radiation, and precipitation. Images taken from the platform provide direct snapshots of environmental conditions.

The platform is located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Wind speed is measured at heights between 32 and 102 meters above sea level, at 10-meter intervals. Data has been collected since August 2009.

Use Case | Data Gap Summary
Improving offshore wind power forecasting: short-to long-term (3 hours–1 year)

Because of the platform’s location, FINO measurement sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is often nontrivial, since weather conditions must allow human intervention. The resulting loss of data quality can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.

Offshore wind data from masts and LiDAR

The offshore operation data from the Danish energy company Orsted provides two years’ worth of 10-minute Supervisory Control and Data Acquisition (SCADA) information on nacelle wind speed, electrical power, rotor speed, yaw position, and pitch angle for turbines, along with on-site wave buoy data and ground-based LiDAR from different offshore wind farm sites.

For one site, the Anholt Westermost Rough offshore wind farm, data is collected from 111 Siemens SWT-120-3.6 MW wind turbines arranged in a layout of 20 km by 8 km, with internal spacing between turbines of 5-7 rotor diameters and a water depth of 15-19 m. Another site, northeast of Withernsea off the Holderness coast in the North Sea, England, has a wind farm with a spatial coverage area of 35 km by 35 km.

Use Case | Data Gap Summary
Improving offshore wind power forecasting: short-to long-term (3 hours–1 year)

The spatiotemporal coverage of the offshore wind speed mast data is restricted to the dimensions of the platform/tower itself and to the period since its construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.

Offshore wind farm operation data (Orsted)

The offshore operation data from the Danish energy company Orsted provides two years’ worth of 10-minute Supervisory Control and Data Acquisition (SCADA) information on nacelle wind speed, electrical power, rotor speed, yaw position, and pitch angle for turbines, along with on-site wave buoy data and ground-based LiDAR from different offshore wind farm sites.

For one site, the Anholt Westermost Rough offshore wind farm, data is collected from 111 Siemens SWT-120-3.6 MW wind turbines arranged in a layout of 20 km by 8 km, with internal spacing between turbines of 5-7 rotor diameters and a water depth of 15-19 m. Another site, northeast of Withernsea off the Holderness coast in the North Sea, England, has a wind farm with a spatial coverage area of 35 km by 35 km.

Use Case | Data Gap Summary
Improving offshore wind power nowcasting (10 min)

Data can be accessed by submitting a request via the Orsted form. The sufficiency of the dataset is constrained by volume, since only a finite number of offshore wind farms provide such short-term data. Expanding the coverage area and data volume, and refining the time granularity to under 10 minutes, may enable transient detection from generated active power.

OpenStreetMap (land use map)

OpenStreetMap is an open-source map database providing worldwide geographic features such as buildings, roads, and land uses, maintained by a community of mappers who add objects manually or trace them from remote sensing imagery.

Use Case | Data Gap Summary
Facilitating disaster risk assessments

The quality of OpenStreetMap varies widely in its coverage of geometries (e.g., buildings) and attributes. In general, roads are better mapped than buildings. OpenStreetMap’s very permissive data model lets users provide a wide variety of information, but that information is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.

Interpolating city-wide bicycle volumes from limited count data

Bike infrastructure data in OpenStreetMap faces reliability and usability issues due to a lack of validation and inconsistent naming conventions, requiring extensive pre-processing. Elements like bike parking are often missing, reducing data completeness. Cities and shared tools can help address these gaps.

Mapping existing solar photovoltaic systems

OpenStreetMap’s solar PV data suffers from uneven global coverage and missing critical attributes, limiting its utility for comprehensive energy assessments. Additionally, inconsistent tagging and lack of quality control hinder data usability, reliability, and integration.

Understanding the impact of urban planning on travel emissions

OpenStreetMap data for mobility infrastructure is often incomplete outside of main roads and biased towards high-income countries. The data’s reliability is uncertain due to its crowdsourced nature, requiring quality checks. Its permissive data model leads to inconsistencies, necessitating thorough pre-processing, and it often lacks proper documentation, despite the benefits of documenting data provenance.

Optimal power flow simulation outputs

PowerWorld Simulator and MATPOWER are software tools used for optimizing power systems and include representation of both alternating current (AC) and direct current (DC) systems. PowerWorld Simulator models, analyzes, and optimizes power systems for a wide range of configurations and scenarios with the ability to model small distribution networks as well as transmission systems. 

MATPOWER is an open-source alternative that also solves both the AC and DC versions of optimal power flow (OPF). The DC OPF is simplified into a quadratic program by applying DC modeling assumptions, reducing polynomial costs to second order, and expressing real power flows as a function of voltage angles (thereby eliminating voltage magnitude and reactive power). PowerWorld Simulator uses iterative algorithms (Newton-Raphson) with traditional power flow equations.
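For reference, a textbook form of the DC OPF quadratic program described above is sketched below. The notation is an assumption introduced here rather than MATPOWER's own: $P_{g,i}$ and $P_{d,i}$ are real power generation and demand at bus $i$, $\theta_i$ is the voltage angle, $x_{ij}$ is the reactance of line $(i,j)$, $F_{ij}^{\max}$ is its flow limit, and $a_i$, $b_i$, $c_i$ are the second-order cost coefficients.

```latex
\begin{aligned}
\min_{P_g,\,\theta} \quad & \sum_{i \in \mathcal{G}} \left( a_i P_{g,i}^2 + b_i P_{g,i} + c_i \right) \\
\text{s.t.} \quad & P_{g,i} - P_{d,i} = \sum_{j \in \mathcal{N}(i)} \frac{\theta_i - \theta_j}{x_{ij}}
  && \forall i \in \mathcal{B} \\
& \left| \frac{\theta_i - \theta_j}{x_{ij}} \right| \le F_{ij}^{\max}
  && \forall (i,j) \in \mathcal{L} \\
& P_{g,i}^{\min} \le P_{g,i} \le P_{g,i}^{\max}
  && \forall i \in \mathcal{G}
\end{aligned}
```

Here $\mathcal{B}$, $\mathcal{G}$, and $\mathcal{L}$ denote the sets of buses, generators, and lines, and $\mathcal{N}(i)$ the buses adjacent to bus $i$. The objective is quadratic and all constraints are linear in $P_g$ and $\theta$, which is what makes the problem a quadratic program. Voltage magnitudes and reactive power do not appear.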

MATPOWER is open source, while PowerWorld Simulator offers several options for industry practitioners as well as for academic users. Its demo software, licensed for educational use, includes simulator features such as available transfer capability, optimal power flow, security-constrained OPF, OPF reserves, the PV/QV curve tool, transient stability, and geomagnetically induced current. In terms of topology, the free version handles up to 13 buses, while the full version of the simulator can handle 250,000 buses.

Use CaseData Gap Summary
Improving power grid optimization

Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.

Give feedback
Outputs from distribution connected inverter systems simulations
Details (click to expand)

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing.

NREL’s PREconfiguring and Controlling Inverter SEt-points (PRECISE) tool can identify the interconnection location on the network from a PV customer’s address, model the distribution feeder, and preconfigure advanced inverter modes to provide grid support and minimize energy curtailment. The tool allows utilities to perform power flow analysis and analyze inverter modes.

Furthermore, NREL’s Energy Systems Integration Facility (ESIF) has real-time simulation connected with power hardware that allows for smart inverter manufacturers to test operational control with simulated dynamics and scenarios.

Use CaseData Gap Summary
Optimizing smart inverter management for distributed energy resources

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulation and hardware-in-the-loop facilities is restricted: using NREL’s Energy Systems Integration Facility requires submitting a user access proposal, and similar testing laboratories may require access requests and funding.

Give feedback
Passive acoustic monitoring for biodiversity assessment
Details (click to expand)

Passive acoustic recording provides continuous monitoring of both environment and species vocalizations. While some annotated datasets are available through repositories like ARBIMON (https://arbimon.org/) or Macaulay Library (www.macaulaylibrary.org), there remains a general lack of robust, large, and diverse annotated bioacoustic datasets for machine learning applications.

Use CaseData Gap Summary
Facilitating forest restoration monitoring

There is a lack of a standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Give feedback
Facilitating the detection of climate-induced ecosystem changes

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Give feedback
Improving terrestrial wildlife detection and species classification

The first and foremost challenge of bioacoustic data is its sheer volume, which makes data sharing especially challenging. Solutions are needed for cheaper and more reliable data hosting and sharing platforms. 

Additionally, there is a significant shortage of large and diverse annotated datasets, a shortage even more severe than for image data such as camera trap, drone, and crowd-sourced images.

Give feedback
Pecan Street (appliance-level consumption data)
Details (click to expand)

Pecan Street DataPort began as a Smart Grid Demonstration program through the Pecan Street energy research nonprofit organization which worked closely with the University of Texas at Austin. Funded by the DOE in 2014, the project signed up 1000 research participants from the Mueller community in Austin, Texas to share green button, smart meter, and home energy management system (HEMS) data in 750 homes and 25 commercial properties. Financial incentivization of plug-in electric vehicle use and rooftop solar installation by Austin Energy encouraged residential lifestyle shifts. In addition to providing access to sub-metered appliance level consumption data, Pecan Street includes electric vehicle charging, rooftop solar, heating, cooling, and water usage data. Data coverage has expanded to volunteer households from California, New York and Colorado. Previously open for use, Pecan Street has been privatized and now data access and products are available for commercial and academic purchase depending on the level of access requested.

View dataset

Use CaseData Gap Summary
Enabling non-intrusive electricity load monitoring

Pecan Street DataPort requires both non-academic and academic users to purchase access through licensing, which varies depending on the building data features requested. Data coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may be in need of energy-efficient upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual socio-demographic survey data may be too coarse to provide insight into how the behavior of household members affects consumption profiles over time.

Give feedback
Points of Interest
Details (click to expand)

Points of Interest (POIs) refer to specific geographic locations or sites that hold particular interest or usefulness to individuals, businesses, or communities. These areas often include landmarks, tourist attractions, parks, historical sites, restaurants, retail stores, and other significant locations that people might want to visit or know about. 

POIs are commonly used in mapping and navigation services to provide users with relevant information about notable locations within a given area. POIs can be either crowdsourced (e.g. OpenStreetMap) or from commercial providers (e.g. Google Maps). 

POIs are relevant for analyzing the sustainability and equity of cities, e.g. via analyses of the accessibility of important services to the overall population. From a climate perspective, it is relevant to understand the need to travel (and the associated travel emissions) to access services, as captured for example in the 15-minute city concept.
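As an illustration of such an accessibility analysis, the sketch below checks which service categories lie within a 15-minute walk of a home location. It is a simplified example under stated assumptions: the coordinates and POI categories are hypothetical, walking speed is assumed to be 5 km/h, and straight-line (haversine) distance stands in for the street network distance, which it will generally underestimate.

```python
import math

WALK_SPEED_KMH = 5.0    # assumed average walking speed
TIME_BUDGET_MIN = 15.0  # the "15-minute" threshold

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def reachable_categories(home, pois):
    """Return the set of POI categories within the walking time budget."""
    max_km = WALK_SPEED_KMH * TIME_BUDGET_MIN / 60.0
    return {
        category
        for category, lat, lon in pois
        if haversine_km(home[0], home[1], lat, lon) <= max_km
    }

# Hypothetical home location and POIs (category, latitude, longitude).
home = (52.5200, 13.4050)
pois = [
    ("grocery", 52.5230, 13.4100),
    ("school", 52.5260, 13.4150),
    ("pharmacy", 52.6000, 13.5000),
]
required = {"grocery", "school", "pharmacy"}

reached = reachable_categories(home, pois)
missing = required - reached
print("reachable:", sorted(reached))
print("missing:", sorted(missing))
```

Repeating this check over many home locations gives a share of the population with access to each service, which is where gaps in POI coverage and locational accuracy directly affect the result.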

Use CaseData Gap Summary
Understanding the impact of urban planning on travel emissions

POI data coverage is uneven, often biased towards high-income countries, with even leading datasets like Google Maps facing gaps, especially in the Global South. Timeliness is an issue as datasets may not reflect current business statuses. Reliability suffers from locational inaccuracies, affecting data matching and analysis. Usability is hindered by varied categorizations and languages in assembled datasets, necessitating standardization.

Give feedback
Population and asset exposure to natural hazards
Details (click to expand)

Exposure is defined as the representative value of populations and assets potentially exposed to a natural hazard occurrence. It can be expressed through population, physical assets (e.g. buildings), economic output (e.g. measured by GDP), or agricultural output, depending on the risk considered.

There are open datasets with global coverage, for example the Global Exposure Model, as well as proprietary data with more detailed information from well-established insurance companies.

Use CaseData Gap Summary
Facilitating disaster risk assessments

Accessibility and reliability are the most significant challenges with exposure data.

Give feedback
Power Grid Lib (optimal power flow benchmark library)
Details (click to expand)

The Power Grid Library (PGLib-OPF) is a collection of git repositories that house benchmark data for validating power system simulations. 

It contains 36 networks with 3 to 13,659 buses, sourced from the IEEE Power Flow Test Cases, IEEE Dynamic Test Cases, IEEE Reliability Test System, Polish Test Cases, PEGASE Test Cases, and RTE Test Cases. These networks have been modified to raise optimality gaps to values between 1% and 10%, thereby creating more challenging AC-OPF problems.
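The source does not state how the optimality gap is measured. One common convention, assumed here, is the relative difference between the cost of the best known feasible AC-OPF solution and a lower bound obtained from a convex relaxation. The function name and the cost values below are hypothetical.

```python
def optimality_gap_percent(best_feasible_cost, lower_bound):
    """Relative gap (in %) between a feasible solution cost and a lower bound.

    Assumes a minimization problem with a positive feasible cost.
    """
    if best_feasible_cost <= 0:
        raise ValueError("best_feasible_cost must be positive")
    return 100.0 * (best_feasible_cost - lower_bound) / best_feasible_cost

# Hypothetical costs: a feasible AC solution and a relaxation lower bound.
gap = optimality_gap_percent(1000.0, 950.0)
print(f"optimality gap: {gap:.1f}%")
```

A larger gap means the relaxation says less about how close a feasible solution is to the global optimum, which is why such cases are useful as harder benchmarks.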

By curating and collecting this data, the library lets users who want to study more realistic AC-OPF simulation scenarios directly retrieve compiled bus IDs, branch IDs, generator IDs, power demand, shunt admittance, voltage magnitude ranges for buses, power injection ranges for generators, quadratic active power cost function coefficients for generators, and branch parameters such as series admittance, line charge, transformer parameters, thermal limits, and branch voltage angle difference ranges. All parameters are standardized to the MATPOWER data file format for direct use. PGLib-OPF is open source.

Use CaseData Gap Summary
Improving power grid optimization

While the network datasets are open source, maintaining the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, which may be hard to find without such a cooperative effort.

Give feedback
Power line robot inspection imagery
Details (click to expand)

Cable inspection robot data includes LiDAR and image captures of Specific Power Line (SPL) components such as dampers, insulators, broken strands, and attachments that may have degraded due to exposure to natural elements. The data also focuses on assessing risk at the lowest part of power lines near trees, roofs, and other crossing power lines. Since the robots physically traverse the lines, this data is particularly valuable for degradation detection of high voltage transmission lines and for maintenance scheduling.

Use CaseData Gap Summary
Enhancing power grid-vegetation management for wildfire risk mitigation

Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration. It may also be limited by camera resolution when detecting subtle degradation patterns.

Give feedback
Public health data
Details (click to expand)

Health surveillance data includes medical records, epidemiological surveys, disease registries, healthcare utilization statistics, and population health indicators. These datasets vary in geographic coverage, temporal frequency, and demographic scope, with some maintained by government health agencies and others by healthcare institutions or research organizations.

Use CaseData Gap Summary
Improving assessments of climate impacts on public health

Limited accessibility and poor documentation of health datasets restrict their use in climate-health ML applications. Privacy concerns and institutional barriers prevent broader data sharing, while inconsistent documentation makes existing datasets difficult to use effectively.

Give feedback
Regularly gridded high-resolution atmospheric observations
Details (click to expand)

Though a lot of data is available, a set of regularly gridded 3D high-resolution observations of the atmospheric state (like a higher-resolution version of ERA5) is still needed. This is essential both for an improved understanding of atmospheric processes and for the development of ML-based weather forecast models and climate models.

Use CaseData Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.

Give feedback
Hybrid ML-physics climate models for enhanced simulations

While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation. 

Give feedback
Residential daylight performance metric (DPM) data
Details (click to expand)

The amount of daylight that buildings are exposed to through windows is an important parameter for heating demand (via heat gains from solar radiation) and electricity demand for lighting (via the illumination of indoor spaces by natural light). Architects can optimize these dimensions by adjusting window placement and window-to-wall ratios.

Daylight performance metrics (DPMs) have been developed by building researchers and architects based on daylight access simulation output to quantify the illumination of indoor spaces by natural light.

Residential daylight performance metric data (DPM) with respect to daylight autonomy (DA), continuous daylight autonomy (cDA), spatial daylight autonomy (sDA), and useful daylight illuminance (UDI) can be generated using physics-based ray tracing simulations that calculate illuminances over a prototype building layout. Some simulation software available to calculate DPMs include IES virtual environment (IESVE), DesignBuilder, VELUX daylight visualizer, and the open source RADIANCE 5.0. To generate synthetic data from these simulation frameworks, the user must provide a geometric model of the building, climate data with respect to the building location, reflectance and transmittance values for materials, desired radiance parameters, occupancy schedule, and a virtual sensor grid over which the incident illuminance is to be calculated. Strategies based on the output of the simulations can assist architects in optimizing window placement and size, incorporation of shading devices, and the design of floor plans to control building direct and diffuse natural light.
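To illustrate how these metrics are derived from simulation output, the sketch below computes DA, cDA, UDI, and sDA from illuminance series at two virtual sensors. It is a simplified illustration, not the output format of any of the tools named above: the illuminance values are invented, each series is assumed to contain occupied hours only, and the thresholds (300 lux for DA/cDA/sDA, 100 to 2000 lux for UDI, 50% of occupied hours for sDA) are commonly used defaults adopted here as assumptions.

```python
def daylight_autonomy(lux, threshold=300.0):
    """Percent of occupied hours with illuminance at or above the threshold."""
    return 100.0 * sum(1 for e in lux if e >= threshold) / len(lux)

def continuous_daylight_autonomy(lux, threshold=300.0):
    """Like DA, but hours below the threshold earn partial credit e / threshold."""
    return 100.0 * sum(min(e / threshold, 1.0) for e in lux) / len(lux)

def useful_daylight_illuminance(lux, low=100.0, high=2000.0):
    """Percent of occupied hours with illuminance inside the useful band."""
    return 100.0 * sum(1 for e in lux if low <= e <= high) / len(lux)

def spatial_daylight_autonomy(grid, threshold=300.0, min_da=50.0):
    """Percent of sensor points whose DA reaches at least min_da percent."""
    ok = sum(1 for lux in grid if daylight_autonomy(lux, threshold) >= min_da)
    return 100.0 * ok / len(grid)

# Hypothetical illuminance values (lux) for occupied hours at two sensor points.
sensor_a = [50, 150, 300, 600, 2500, 900, 320, 80]
sensor_b = [20, 60, 120, 280, 400, 310, 90, 10]

print("DA  (a):", daylight_autonomy(sensor_a))
print("cDA (a):", round(continuous_daylight_autonomy(sensor_a), 2))
print("UDI (a):", useful_daylight_illuminance(sensor_a))
print("sDA    :", spatial_daylight_autonomy([sensor_a, sensor_b]))
```

Because every metric depends on the chosen thresholds and occupancy schedule, assumptions tuned for commercial spaces carry over directly into the resulting values.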

Use CaseData Gap Summary
Accelerating building energy models

While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings do not receive a similar focus, which is unusual given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics associated with direct sunlight access, rely on annual averages for seasons, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyles. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in a space and its effects on the resulting thermal comfort and operational consumption of traditional urban residential spaces. Vernacular architecture, which is specific to a local region and culture, may not share similar objectives, preferring more indoor-outdoor transitional spaces, earthen materials, and less focus on windows and incident natural sunlight.

Give feedback
S2S forecast data
Details (click to expand)

Numerical weather prediction (NWP) model output from the subseasonal-to-seasonal (S2S) experiment. See https://confluence.ecmwf.int/display/S2S/Models

Use CaseData Gap Summary
Weather forecasting: Subseasonal-to-seasonal horizon

More data is needed to take advantage of the large ML models.

Give feedback
SMA Solar Technology PV System Performance database
Details (click to expand)

PV Anlage-Reinhart System provides hourly photovoltaic power, energy production, CO2 emissions avoided, and system configuration information for publicly available PV installations worldwide. SMA, a leading German manufacturer of solar inverters, has compiled data from their international deployments across multiple countries including Germany, the US, Chile, Brazil, Mexico, Canada, Spain, Italy, France, China, Australia, Belgium, India, Poland, Japan, UK, South Africa, Türkiye, and the UAE. This dataset, which includes inverter specifications, module information, and sometimes battery data, supports microgrid studies and distributed energy resource forecasting.

Use CaseData Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

The SMA PV monitoring system requires user profile creation and specific system access requests, and its documentation is primarily in German, creating potential language barriers. Data representation is geographically unbalanced, with stronger coverage in Germany, the Netherlands, and Australia despite its global presence. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.

Give feedback
SOLETE Hybrid Solar-Wind Generation dataset
Details (click to expand)

SOLETE, developed by the Energy System Integration Lab (SYSLAB) at the Technical University of Denmark, provides 15 months of measurements at multiple resolutions (seconds to hours) from June 2018 to September 2019. The dataset includes timestamps, meteorological data (temperature, humidity, pressure, wind speed and direction), solar irradiance measurements (global horizontal and plane of array), and active power generated by an 11 kW Gaia wind turbine and a 10 kW PV inverter. This comprehensive dataset supports time-series forecasting for hybrid solar-wind distributed energy resource systems.

Use CaseData Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several gaps limit its application. The dataset’s 15-month temporal coverage does not capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analysis of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing; including them would enable more realistic modeling of system dynamics. Supplementing with external data sources or simulation could address these limitations.

Give feedback
SRRL TSI-880 (sky imagery)
Details (click to expand)

The SRRL TSI-880 contains data from ground-based sky imagers that provide high temporal and spatial resolution (<1 km) information at single locations to support cloud detection and solar forecasting.

Use CaseData Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and image resolution is limited to 352x288 pixels (24-bit JPEG images).

Give feedback
SWINySEG (sky imagery)
Details (click to expand)

SWINySEG (Singapore whole sky Nychthemeron image SEGmentation database) contains 6,768 daytime and nighttime sky/cloud images with corresponding binary ground truth maps taken in Singapore over 12 months in 2016, with annotations by the Singapore Meteorological Services.

Use CaseData Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.

Give feedback
Satellite imagery
Details (click to expand)

Satellite imagery datasets consist of Earth observation data captured from space-based sensors with varying spatial (size of the pixels), spectral (number and type of channels), and temporal (amount of time between collections) resolutions. Satellite imagery can have a global coverage, which enables global mapping applications. 

This data is relevant to many ML use cases, but different applications require different spatial, spectral, and temporal resolutions and different kinds of labels. 

Some of the most widely used satellite imagery sources include Sentinel-1 and Sentinel-2, MODIS, VIIRS, and Landsat, which are open to the public and offer resolutions down to 5 m. Commercial satellites can provide much higher-resolution images (e.g. 30 cm from Maxar), but they are not open to the public. It is worth noting that Planet NICFI provides free high-resolution, analysis-ready mosaics of the world’s tropics for non-commercial use.

Use CaseData Gap Summary
Accelerating post-disaster damage assessments

Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.

Give feedback
Enabling assessments of rest area capacity and use for electric truck charging

Satellite imagery for obtaining truck counts requires high-resolution imagery (here, both high temporal and spatial resolution matter) that is cloud-free over several kilometers. Usual cloud-free products are not suitable, because the time stamp attached to the image is important, and one image should cover several kilometers of a street or highway.

Give feedback
Enhancing digital reconstructions of the environment

Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.

Give feedback
Improving solar power forecasting: medium-term (6-24 hours)

Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.

Give feedback
Improving terrestrial wildlife detection and species classification

Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.

Give feedback
Mapping existing solar photovoltaic systems

Data gaps for this use case may stem from the need for large data volumes and high-resolution historical imagery.

Give feedback
Scaling earth system monitoring

Satellite images face major challenges from massive data volumes that impede downloading and processing, lack of annotated data for training ML models, and limited access to high-resolution imagery, particularly affecting Global South applications.

Give feedback
Scaling truck count inference from remote sensing data

Satellite imagery for monitoring rest area capacity and usage has the typical data gaps for use cases requiring high-resolution imagery (here, both high temporal and spatial resolution matter) that is cloud-free over several kilometers.

Give feedback
Satellite imagery – GEDI LiDAR
Details (click to expand)

The Global Ecosystem Dynamics Investigation (GEDI) is a NASA/University of Maryland mission that uses LiDAR to create detailed 3D maps of forest canopy height and structure. By measuring forests in 3D, GEDI data enables accurate estimation of forest biomass and carbon storage across global scales.

Use CaseData Gap Summary
Improving estimations of forest carbon stock

Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.

Give feedback
Satellite imagery – Hyperspectral
Details (click to expand)

This dataset consists of hyperspectral satellite imagery from platforms such as PRISMA and EnMAP, which capture hundreds of narrow spectral bands across the electromagnetic spectrum, providing detailed spectral information for detecting methane plumes with greater sensitivity than multispectral systems.

Use CaseData Gap Summary
Scaling methane emission detection

Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.

Give feedback
Satellite imagery – Multi-Radar/Multi-Sensor System
Details (click to expand)

The Multi-Radar Multi-Sensor (MRMS) system combines data from multiple radars, satellites, surface observations, lightning reports, rain gauges, and numerical weather prediction models to produce decision-support products every two minutes. It provides detailed depictions of high-impact weather events such as heavy rain, snow, hail, and tornadoes, enabling forecasters to issue more accurate and earlier warnings. See https://www.nssl.noaa.gov/projects/mrms/

Use CaseData Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.

Give feedback
Satellite imagery – Multispectral
Details (click to expand)

This dataset contains images captured by spectrometer-equipped satellites that record data at specific wavelengths to detect the spectral signatures associated with methane. Notable missions include the Sentinel-5P TROPOMI instrument and the upcoming MethaneSAT, which provide global coverage of methane concentrations in the atmosphere.

Use CaseData Gap Summary
Scaling methane emission detection

Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.

Give feedback
Satellite imagery – PALSAR radar images
Details (click to expand)

PALSAR (Phased Array type L-band Synthetic Aperture Radar) provides radar imagery that can capture the 3D structure of forests by penetrating cloud cover and forest canopies. This technology enables consistent monitoring regardless of weather conditions or time of day, making it valuable for continuous forest carbon stock estimation.

Use CaseData Gap Summary
Improving estimations of forest carbon stock

Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.

Give feedback
Second-hand vehicle international trade data
Details (click to expand)

Second-hand vehicle international trade data refers to information on the cross-border movement of used vehicles, including details such as origin and destination countries, and volumes traded, but also ideally information such as vehicle types, age, fuel type, or mileage. This data can help track how used vehicles flow between regions, often from high-income to lower-income countries. It is critical for understanding environmental and economic impacts, such as the spread of older, higher-emission vehicles to areas with weaker regulations.

Use CaseData Gap Summary
Understanding fleet overturning and international second-hand vehicle markets

Second-hand vehicle trade data is limited by poor country coverage, low volume based on few case studies, and missing key details like vehicle type, age, fuel type, and mileage, making it insufficient for understanding global trade patterns and technology shifts.

Give feedback
Simulated variables from process-based models of soil organic carbon dynamics
Details (click to expand)

This dataset contains soil data generated by physics-based or process-based soil models that simulate soil organic carbon dynamics based on environmental and management inputs. These simulations provide alternatives to direct measurements where field data collection is prohibitively expensive or impractical.

Use CaseData Gap Summary
Modeling effects of soil processes on soil organic carbon

Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.

Give feedback
Smart inverter devices database
Details (click to expand)

The California Energy Commission keeps a list of smart inverters that meet strict standards for safety and communication. These inverters must pass extra tests to show they can handle things like voltage, frequency, timing, and how they connect or disconnect from the grid, along with other technical functions to keep the power system safe and stable.

Those include: CEC Grid Support Solar Inverters, CEC Grid Support Battery Inverters, CEC Grid Support Solar/Battery Inverters, CEC Inverters with Power Control Systems functionality.

Additional vendors can also be contacted for smart inverter information:

SMA-America Solar Inverters.

Use CaseData Gap Summary
Optimizing smart inverter management for distributed energy resources

Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant smart inverter models and their manufacturers, who can then be contacted for research partnerships. In terms of coverage area, California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, while researchers in regions outside of the United States may locate similar manufacturers through partnerships and collaborations.

Give feedback
Soil Survey Geographic Database (SSURGO)
Details (click to expand)

The Soil Survey Geographic Database (SSURGO) contains soil organic carbon data collected through field observations and laboratory analysis of soil samples. It provides comprehensive soil information for the United States, including physical and chemical soil properties.

View dataset

Use CaseData Gap Summary
Modeling effects of soil processes on soil organic carbon

In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).

Give feedback
Soil measurements – NorthWyke Farms platform
Details (click to expand)

The NorthWyke Farms platform data is a collection of soil measurements from the UK’s North Wyke Farm Platform, providing quarterly soil organic carbon values along with other environmental parameters. The dataset covers experimental farm plots under different management practices and is continuously updated with new measurements.

View dataset

Use CaseData Gap Summary
Modeling effects of soil processes on soil organic carbon

The most common and biggest challenges for use cases involving soil organic carbon are the insufficiency of data and the lack of high-granularity data.

Give feedback
Solcast (global solar forecasting and historical solar irradiance data)
Details (click to expand)

Solcast is a global solar forecasting and historical solar irradiance data provider that combines satellite imagery from Himawari 8, GOES-16, and GOES-17 with numerical weather prediction models to deliver solar irradiance data products at a 10-15 minute scale.

Use Case: Improving solar power forecasting: nowcasting/very-short-term (0-30 min)

Data Gap Summary: Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, which is insufficient for very-short-term forecasting.

Strava GPS-based cycling data

Strava GPS-based cycling data is collected from cyclists who use the Strava app to track their rides via GPS. This data captures detailed information such as route choice, speed, time of day, and basic user characteristics (age, gender, etc.), providing rich insights into cycling behavior and network usage.

Use Case: Interpolating city-wide bicycle volumes from limited count data

Data Gap Summary: Strava GPS cycling data offers the highest temporal and spatial resolution and is the most comparable source across cities. However, it is more readily accessible to cities than to academics, and it mainly represents specific user groups, limiting its coverage of the overall cycling population.

Street infrastructure data – LiDAR-derived

LiDAR-derived datasets of street infrastructure can provide a vectorized representation of how street space is allocated across multiple uses, e.g., road space, special lanes such as bike or bus lanes, sidewalks, greenery, and parking lots.

Such data enables micro-level analyses of street designs and their impact on sustainability and equity of cities, for example, to understand the walkability of an area or traffic patterns in relation to the built environment. These datasets can be used for analyses of the status quo or prospective modeling.

Use Case: Understanding the impact of urban planning on travel emissions

Data Gap Summary: Very few such datasets exist. One of the few examples is the Berlin road survey (Straßenbefahrung) 2014, available at https://fbinter.stadt-berlin.de/fb/gisbroker.do;jsessionid=680EFD768EDCC386FBDF72B1637E71D7?cmd=navigationShowResult&mid=K.k_StraDa%40senstadt

Street view imagery

Street View imagery is a feature offered by mapping services like Google Maps, providing panoramic 360-degree views of streets and various locations worldwide. Crowdsourced alternatives such as Mapillary have also emerged.

Captured using high-resolution cameras mounted on vehicles or drones, these images are stitched together to create seamless panoramic scenes, allowing users to virtually explore and navigate environments. 

This data can serve climate-relevant use cases, including the monitoring of urban infrastructure. It can be used to estimate walkability, perceived greenness, parking space usage, etc.

Use Case: Understanding the impact of urban planning on travel emissions

Data Gap Summary: Street view imagery generates massive data volumes, complicating usability, and access is restricted, with coverage often favoring larger cities and wealthier countries. Preprocessing for tasks like computer vision is intensive, and coverage can be incomplete or biased. Additionally, images may lack full 360-degree views or contextual details like weather conditions, which affects how computer vision algorithms process them.

Sub-metered appliance-level data

This collection includes multiple international datasets of sub-metered building electricity consumption, primarily from residential buildings across North America, Europe, and Asia, collected between 2011 and 2020. These datasets provide granular appliance-level energy consumption data at varying sampling frequencies (1 Hz to 15 kHz) along with aggregate building-level measurements. Some datasets include additional measurements such as occupancy information, environmental conditions, and utility billing data. The datasets vary in coverage from single households to hundreds of homes, with monitoring periods ranging from two months to several years.

- Almanac of Minutely Power dataset (AMPds2): A single-building dataset of electricity, water, and natural gas consumption from a home in Burnaby, British Columbia, Canada, from 2012 to 2014, which also includes environmental and utility billing data.

- Commercial building energy dataset (COMBED): A dataset of 6 commercial buildings on the Indraprastha Institute of Information Technology (IIIT-Delhi) campus from August 2013 to the present, containing total power consumption as well as sub-metered data for elevators, air handling units (AHUs), uninterruptible power supplies (UPS), and central campus heating, ventilation, and air conditioning (HVAC) pumps and chillers at a 30-second cadence.

- DEDDIAG: A dataset of aggregate and disaggregated power consumption from 15 homes in southern Germany, monitored at 1 Hz over a span of 3.5 years (2016-2020) and covering 50 appliances, including dishwashers, washing machines, refrigerators, and dryers. Aggregated data includes three-phase measurements. This dataset also contains event start and stop timestamps for 14 appliances.

- Dutch Residential Energy Dataset (DRED): Data collected from a single household in the Netherlands, containing appliance-level and total energy consumption over two months. The appliances and circuits measured were a refrigerator, washing machine, central heating, microwave, oven, cooker, blender, toaster, television, fan, living room outlets, and a laptop, recorded at a sampling frequency of 1 Hz. DRED additionally has data on human occupancy based on WiFi and Bluetooth signals received from occupant smartphones and wearable devices, which allows occupants to be located without equipping the home with more intrusive monitoring devices. DRED can be accessed by request.

- Electricity Consumption and Occupation (ECO): A dataset collected from June 2012 to January 2013 covering 6 homes in Switzerland, where 6-10 smart plugs were deployed in each household. Aggregate consumption at the building level was measured in three phases to capture voltage, current, and phase shifts. Occupancy data was recorded manually by residents and via a passive infrared entry door sensor.

- GREEND: A dataset of 9 households in Austria and Italy for one year covering December 2013-April 2014. The data includes aggregated and sub-metered appliance-level active power measurements taken at a frequency of 1 Hz; the monitored appliances varied with the appliance inventory of each household. GREEND can be requested by form.

- HIPE: A dataset from October 2017 to December 2017 recording smart meter measurements from 10 machines and the main terminal of an electronics production site operated by the Institute of Data Processing and Electronics (IPE) at Karlsruhe Institute of Technology (KIT) in Germany. Measurements were taken at a cadence of 5 seconds and cover active power, reactive power, voltage, frequency, and distortion.

- Indian data for Ambient Water and Electricity Sensing (iAWE): Total, appliance-level, and circuit-panel-level consumption data collected over 73 days in the summer of 2013 from a single-family home in New Delhi, India. Additional quantities, such as water usage from an overhead tank and network strength based on packet loss, were measured jointly.

- IDEAL: A joint electricity, gas, temperature, humidity, and light dataset for 255 homes in the UK from August 2016 to June 2018. Aggregate and sub-metered consumption was measured at 1-second intervals, while temperature, humidity, and light were measured at 12-second intervals. Household occupancy was captured through initial socio-demographic surveys and self-reported updates whenever occupancy changed.

- Reference Energy Disaggregation Dataset (REDD): Contains 119 days' worth of aggregate consumption data taken in 2011 from 10 residential buildings located in the greater Boston area. The data includes meter-level phases of power and voltage recorded at 15 kHz, as well as sub-meter-level data for 24 circuits labeled by appliance category and measured at a cadence of 0.5 Hz and 1 Hz for large and small plug-level appliances, respectively.

- REFIT: A dataset containing aggregate and individual appliance monitor sub-meter data taken every 8 seconds from 20 UK households from September 2013 to September 2015. Six households had rooftop solar panels; however, 3 were rewired to remove the effect of generation.

- UMass Smart Home data set: Metered and sub-metered data from three homes in western Massachusetts taken over a period of three years. Measurements include average household load, circuit-level load, and plug load per second. Accompanying generation data from solar panels and wind turbines is available for one of the three homes. Environmental data on outdoor weather and indoor temperature and humidity is provided, as well as occupancy information from wall switches, doors, and motion sensors. HVAC trigger events and the corresponding temperature settings and operational status are also provided.

- UK Domestic Appliance-Level Electricity data set (UK-DALE): A dataset of aggregate and individual appliance-level consumption from 5 UK homes, collected by researchers at Imperial College. The continuous coverage varied per house, ranging from 39 to 786 days between 2012 and 2015. Aggregate data, recorded every 6 seconds, included whole-house active power, apparent power, and RMS voltage. Appliance-level measurements were also taken every 6 seconds using individual appliance monitors for up to 54 appliances per residence.

Use Case: Enabling non-intrusive electricity load monitoring

Data Gap Summary: For accurate NILM studies, benchmark datasets should include not only consumption but also local power generation (e.g., from rooftop solar), as generation can affect the aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are not included. The majority of buildings are single-family housing units, limiting the diversity of representation. Furthermore, most datasets are no longer maintained once the study closes.
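To illustrate why missing generation data matters, the following minimal sketch computes the net load a building-level meter would observe when rooftop solar offsets appliance consumption. The function name, the appliance series, and all wattage values are hypothetical and chosen only for illustration; they are not taken from any of the datasets listed above.

```python
from typing import List, Sequence


def net_aggregate_load(
    appliance_loads_w: Sequence[Sequence[float]],
    generation_w: Sequence[float],
) -> List[float]:
    """Return the net power (W) seen at the building meter per time step.

    Net load is the sum of all appliance loads minus on-site generation.
    A negative value means the building is exporting power.
    """
    n_steps = len(generation_w)
    if any(len(series) != n_steps for series in appliance_loads_w):
        raise ValueError("all series must have the same number of time steps")
    # Sum the appliance loads at each time step.
    totals = [sum(step) for step in zip(*appliance_loads_w)]
    # Subtract generation to get what the meter actually records.
    return [load - gen for load, gen in zip(totals, generation_w)]


# Hypothetical three-step example (values in watts).
fridge = [100.0, 100.0, 0.0]
kettle = [0.0, 2000.0, 0.0]
solar = [0.0, 1500.0, 300.0]

net = net_aggregate_load([fridge, kettle], solar)
print(net)
```

In the second time step the kettle draws 2000 W, but the meter sees far less because solar generation masks most of it. A disaggregation model trained only on consumption data has no way to account for this offset.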

TABULA building typology

The TABULA project (short for Typology Approach for Building Stock Energy Assessment) was an EU initiative (2009–2012) to create a harmonized structure of residential building typologies across Europe. Countries define grids by construction era and building size, and for each “cell” select or model a representative building whose envelope, heating system, and energy use are characterized. These archetypes are used in a web tool to estimate heating demand, primary energy, and CO₂ emissions, and to evaluate potential energy-saving measures, facilitating benchmarking, policy analysis, and refurbishment planning at national and regional scales.

Use Case: Enhancing the scalability and robustness of building stock assessments

Data Gap Summary: TABULA suffers from limited granularity, insufficient volume, and outdated information. Its archetypes often provide only one representative value each, lack typological diversity across countries, and include parameters of questionable accuracy.

The Public Utility Data Liberation (PUDL)

The Public Utility Data Liberation (PUDL) project, maintained by Catalyst Cooperative, integrates and standardizes energy sector data from US government agencies including EIA, FERC, EPA, and system operators into analysis-ready formats. This continuously updated database covers power generation, fuel consumption, emissions, and financial data from 2009 to present across the United States. 

Use Case: Enhancing energy policy and market analysis

Data Gap Summary: Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.

Travel surveys

Travel surveys are essential tools for gathering detailed information on travel patterns, behaviors, modal shares (i.e., which percentage of the population uses which transportation mode), and preferences of individuals, crucial for transportation planning, policy development, and infrastructure improvements. 

These surveys are collected through various methods, including household interviews, travel diaries, and online or telephone surveys. The effectiveness of travel surveys depends on high participation rates and accurate data collection to ensure the sample’s representativeness. Travel surveys may be conducted at the city or national level.
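As a small illustration of the modal share concept, the sketch below computes percentage shares from a list of reported trip modes. It counts trips rather than people, which is a simplifying assumption; the function name and the sample records are hypothetical and do not come from any real survey.

```python
from collections import Counter
from typing import Dict, Iterable


def modal_shares(trip_modes: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of trips made with each transportation mode."""
    counts = Counter(trip_modes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: 100.0 * n / total for mode, n in counts.items()}


# Hypothetical survey responses, one entry per reported trip.
trips = ["car", "bike", "car", "walk", "transit", "car", "bike", "car"]

shares = modal_shares(trips)
for mode, share in sorted(shares.items()):
    print(f"{mode}: {share:.1f}%")
```

With few responses per area, shares like these become unstable, which is one reason sample size and representativeness matter for travel surveys.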

Use Case: Understanding the impact of urban planning on travel emissions

Data Gap Summary: Travel surveys often overlook smaller cities and rural areas, lack sufficient local data points, and provide data only at the ZIP code level, which limits their use for detailed urban planning. Modern technologies like GPS and privacy-preserving methods could address these gaps.

Truck count data

Truck count data is typically collected using roadside sensors such as inductive loops, radar, or pneumatic tubes that detect and classify passing vehicles based on size and axle configuration. Cameras with computer vision algorithms can also identify and count trucks in real time, sometimes distinguishing between vehicle types. In some cases, weigh-in-motion systems or tolling infrastructure provide additional data on truck flows. This data helps transportation planners understand freight patterns and prioritize locations for interventions like electric vehicle charging to reduce emissions.

Use Case: Scaling truck count inference from remote sensing data

Data Gap Summary: Truck count data suffers from limited coverage, especially in middle- and low-income countries, and often lacks sufficient granularity. Additionally, fragmented collection methods, inconsistent documentation, and limited data sharing hinder usability.

US large-scale solar photovoltaic database (USPVDB)

The US Large-scale Solar Photovoltaic Database (USPVDB) contains polygon representations of large-scale photovoltaic installations, each associated with facility-specific data attributes.

The data were mined from the US Energy Information Administration (EIA) Form 860 and facility type designations by the US Environmental Protection Agency (EPA). The dataset also indicates whether each large-scale PV installation is used for agrivoltaic purposes. Overall, 3,699 US ground-mounted facilities with a capacity of at least 1 MWdc are represented. The USPVDB data can be accessed through the United States Geological Survey (USGS) mapper browser application or downloaded as GIS data in the form of shapefiles or GeoJSON files. Tabular data and metadata are provided in CSV and XML format.

Use Case: Mapping existing solar photovoltaic systems

Data Gap Summary: Only the US is covered in this dataset, and coverage within the US is not complete.

US school bus fleet dataset

The US school bus fleet dataset compiled by the World Resources Institute contains information on school district, model year, fuel type, manufacturer, seating capacity, and ownership mode for over 450,000 buses from 46 states and the District of Columbia, covering data collected from March to November 2022.

Use Case: Optimizing electrified bus fleet in urban vehicle-to-grid systems

Data Gap Summary: The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.

Urban planning projects data

Urban planning projects datasets would track urban infrastructure changes initiated by authorities, such as modifications to parking spaces, modal filters, low-traffic neighborhoods, new housing developments, and street space allocations. 

It would aim to provide a reliable and temporally accurate record that is critical for causal inference analyses of urban planning interventions, where a more authoritative source than platforms like OpenStreetMap is required.

Use Case: Understanding the impact of urban planning on travel emissions

Data Gap Summary: The development of urban planning projects datasets faces significant hurdles, including a lack of machine-readable formats and a tendency for existing publicly available data to focus on large projects. These issues are compounded by geographical biases, as data availability and detail vary with regional digitalization levels.

WHOI Martha’s Vineyard Coastal Observatory (wind speed and direction)

The Woods Hole Oceanographic Institution (WHOI) Martha’s Vineyard Coastal Observatory data is a three-year measurement dataset of wind speed and direction from 60 to 200 meters. It is accessible at https://mvco.whoi.edu/.

Use Case: Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)

Data Gap Summary: The data only contains measurements close to the coastline, constraining its applicability to offshore wind applications in the deep sea.

WeatherBench 2

WeatherBench 2 is a benchmark for global, medium-range (1-14 day) data-driven weather forecasting. Its data guide is available at https://weatherbench2.readthedocs.io/en/latest/data-guide.html

Use Case: Weather forecasting: Short-to-medium term (1-14 days)

Data Gap Summary: WeatherBench 2 is based on ERA5, so it inherits the issues of ERA5: the data has biases over regions where there are no observations.

Wind Forecast Improvement Project 3 (wind data)

The Wind Forecast Improvement Project 3 is a multi-seasonal offshore field measurement campaign in 2024-2025, linked to intensive numerical modeling development and validation efforts. 

This data will include not only wind speed and direction measurements, but also detailed observations of air-sea fluxes and atmospheric soundings, which could inform physics-based nowcasting. More information: https://www2.whoi.edu/site/wfip3/

Use Case: Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)

Data Gap Summary: This data is not yet available.

subX

NWP model output from the Subseasonal Experiment (SubX): https://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.

Use Case: Weather forecasting: Subseasonal horizon

Data Gap Summary: More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on this data.

xBD Dataset (pre- and post-disaster satellite imagery)

xBD is an annotated benchmark dataset containing pre- and post-disaster satellite imagery used for training and evaluating ML models in disaster damage assessment. The dataset is publicly available at https://paperswithcode.com/dataset/xbd

Use Case: Accelerating post-disaster damage assessments

Data Gap Summary: The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.
