Data Gaps (Beta)

Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.

In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.

This project is currently in its beta phase, with ongoing improvements to content and usability. The information provided is not exhaustive, and may contain errors. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.

Accelerating and improving weather forecasting: Near-term (< 24 hours)

Accurate near-term (< 24 hours ahead) weather forecasting is critical for climate change mitigation (e.g., solar panel deployment) and adaptation (e.g., crisis management during disasters), with applications requiring high spatial and temporal resolution of temperature, precipitation, wind, and cloud coverage.

Machine learning can help make these forecasts more computationally efficient and accurate while maintaining or improving the high resolution needed for climate applications.

The main data gaps include limited geographic coverage (primarily US-centric data), extremely large data volumes that are difficult to transfer and process, and inconsistent data formats from different sources.

Addressing these gaps requires expanding coverage to global regions (especially the Global South), providing cloud-based computational resources alongside the data, and developing standardized formats for multi-source data integration.

Dataset | Data Gap Summary
Automated Surface Observing System (ASOS)

Data volume is large and only data specific to the US is available.

High-Resolution Rapid Refresh (HRRR) weather forecast

Data volume is large, and only data covering the US is available.

Regularly gridded high-resolution atmospheric observations

An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.

Satellite imagery – Multi-Radar/Multi-Sensor System

Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.

Accelerating building energy models

Building energy modeling (also called building performance simulation) is key across an array of use cases that can help reduce energy demand in buildings, including architectural design; heating, ventilation, and air conditioning (HVAC) design and control; building performance rating; and building stock analysis.

Traditional building energy modeling software, such as EnergyPlus, relies on detailed physics models with significant computational complexity and processing time. Machine learning models can significantly speed up evaluation by providing fast emulators for these models based on synthetic and real-world data, enabling faster prototyping and optimization of building design and operations along multiple comfort, consumption, and environmental objectives.

Traditional models and ML-based emulators both require precise inputs about the building design, its usage, and the physical and environmental conditions surrounding it. However, information about building usage and design is often kept in silos, while information about the surroundings is, when available, dispersed across various datasets. Very few benchmarks gather all of this information for a given building.

Closing these gaps involves releasing anonymized usage data, working on building bridges between relevant datasets, and developing benchmark datasets. This may enable testing models across more geographies and building types to reduce existing biases and uncertainties attached to building energy models.

Dataset | Data Gap Summary
Benchmark datasets for building energy modeling

Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.

Computational fluid dynamics simulation for building energy models

Although computational fluid dynamics (CFD) simulations are useful in ventilation studies for new construction, they are computationally expensive, which makes them difficult to include in the early phase of the design process, when building form can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information about material properties, which may not be available for traditional urban building types. Interpreting model outputs requires domain knowledge, and the large volumes of synthetic data generated for different wind directions are challenging to manage. Future data collection for verifying simulation outputs could benefit surrogate or proxy approaches to solving the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern building approaches, which leaves passive building techniques from indigenous communities, known as vernacular architecture, out of design consideration.

Residential daylight performance metric (DPM) data

While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings do not receive a similar focus, which is surprising given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics associated with direct sunlight access, rely on annual averages rather than seasonal values, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyles. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in a space and its effects on the thermal comfort and operational consumption of traditional urban residential spaces. Vernacular architecture, which is specific to a local region and culture, may not share similar objectives, preferring more indoor-outdoor transitional spaces, earthen materials, and less focus on windows and incident natural sunlight.

Accelerating data-driven generation of climate simulations

Climate simulation using physics-based Earth system models is computationally intensive and time-consuming, limiting the exploration of different climate scenarios. 

ML can accelerate this process by creating surrogate models that approximate complex Earth system model simulations, enabling rapid generation of climate projections under various greenhouse gas emission scenarios.

Current ML approaches are limited by the availability of diverse training data from multiple climate models, with most datasets featuring only single-model simulations or inconsistent data structures across models.

Addressing these gaps requires standardizing data formats across climate models, making high-volume data more accessible through cloud-based solutions, and improving model quality to reduce biases and uncertainties in simulations. Closing these data gaps would enable more robust ML emulators capable of producing reliable climate projections at a fraction of the computational cost, accelerating climate research and supporting more informed policy decisions.

Dataset | Data Gap Summary
ClimateBench v1.0 (benchmark dataset for earth system models)

The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.

ClimateSet (ML-ready earth system model inputs/outputs)

No significant data gap identified yet.

CMIP6 (earth system model intercomparison data)

The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.

Accelerating distribution-side hosting capacity estimations

Transitioning power grids from carbon-based generation to renewable sources requires restructuring from unidirectional to bidirectional energy networks, which stresses existing systems—especially at the low-voltage distribution level. The hosting capacity of distribution feeders determines how much distributed renewable generation can be safely integrated without triggering safety equipment or compromising power quality.

Traditional methods for assessing distribution network hosting capacity rely on computationally expensive power flow simulations that are difficult to perform in real-time. Machine learning models can serve as surrogate models by capturing spatio-temporal patterns across multiple data streams, enabling real-time hosting capacity estimation and accelerated scenario evaluation through reinforcement learning.

A significant data gap is the limited availability of real distribution feeder data, which requires researchers to rely on simulations that may not accurately reflect actual grid conditions due to differences in load patterns, environmental factors, and distributed energy resource (DER) penetration levels.

Distribution system operators, utilities, and researchers can collaborate to improve data sharing while protecting sensitive information, thereby enabling more accurate hosting capacity assessments and facilitating higher renewable energy integration in distribution networks.

Dataset | Data Gap Summary
Distribution system simulators

While OpenDSS and GridLAB-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.

Accelerating post-disaster damage assessments

Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies.

ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.

Current datasets for ML-based damage assessment face significant geographic bias and granularity issues, limiting their effectiveness in global contexts and for detailed damage classification.

Expanding geographic coverage beyond North America and enhancing damage severity classifications would enable more accurate and globally applicable ML damage assessment models, improving disaster response worldwide.

Dataset | Data Gap Summary
Financial loss datasets related to the impacts of disasters

Financial loss data for disasters is primarily proprietary and inaccessible to researchers, limiting the development of comprehensive disaster impact assessment models.

Satellite imagery

Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.

xBD Dataset (pre- and post-disaster satellite imagery)

The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.

Accelerating the design of new carbon-absorbing materials

Carbon sequestration through absorption methods can effectively reduce CO2 levels in the atmosphere. Engineered molecules known as carbon sorbents can be designed to selectively bind to CO2. Traditionally, characterizing these molecules requires in-lab experimentation, which can be time- and resource-intensive because experiments must be replicated to identify adsorbent characteristics. Additionally, the search space of possible molecules can be very large and non-trivial to explore directly through experiment.

Machine learning can significantly accelerate materials discovery by systematically generating and evaluating candidate molecule properties based on structure, thereby facilitating rapid iteration.

There is a lack of openly accessible lab measurements to train ML simulation models.

Multiple initiatives could be taken to close this gap, including creating industry-research data sharing initiatives or establishing mandatory data sharing requirements for scientific publications.

Dataset | Data Gap Summary
Lab measurements of material property and carbon absorption

The major challenge is that data is not shared with the public.

Assessment of climate impacts on public health

Climate change has major implications for public health. ML can help analyze the relationships between climate variables and health outcomes to assess how changes in climate conditions affect public health.

Dataset | Data Gap Summary
Health data

The biggest issue for health data is its limited and restricted access.

Historical climate observations

Processing climate data and integrating it with health data are major challenges.

Automating individual re-identification for wildlife

Identifying individual animals within wildlife populations is critical for monitoring endangered species, understanding their behaviors, and developing effective conservation strategies for biodiversity preservation. 

Computer vision and machine learning techniques enable automatic individual identification at scale, helping researchers track specific animals over time without invasive tagging methods.

The scarcity of publicly available and well-annotated datasets poses a significant challenge for applying ML in wildlife identification, with the most valuable data scattered across individual research labs or organizations rather than centralized repositories.

Addressing this requires fostering a culture of data sharing in the ecological community through incentives like financial rewards and recognition for data collectors, while establishing standardized pipelines and infrastructures to aggregate existing annotated data for model training.

Dataset | Data Gap Summary
Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Environmental DNA (eDNA)

A significant challenge for eDNA-based monitoring is the incomplete barcoding reference databases, limiting the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.

Bias-correction of climate projections

Climate projections provide essential information about future climate conditions, guiding critical mitigation and adaptation efforts such as disaster risk assessments and power grid optimization. 

ML enhances the accuracy of these projections by bias-correcting forecasts from physics-based climate models like CMIP6, learning relationships between historical simulations and observed ground truth data. 

Large uncertainties in climate projections and inconsistent data formats across models create significant barriers for developing robust ML bias-correction methods. 

Improved model ensemble techniques and standardized data formats can enhance projection reliability and enable more effective climate risk planning.
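
To make the idea of bias correction concrete, the sketch below applies a simple additive mean-bias correction: the average gap between historical simulations and observations is subtracted from future simulated values. This is a deliberately basic baseline, not the ML approach described above. The function name `mean_bias_correct` is hypothetical, and all numbers are invented toy values.

```python
from statistics import mean


def mean_bias_correct(hist_sim, hist_obs, future_sim):
    """Subtract the mean historical (simulation - observation) bias from future values."""
    bias = mean(hist_sim) - mean(hist_obs)
    return [value - bias for value in future_sim]


# Toy temperatures in degrees Celsius, invented for illustration only.
hist_sim = [16.0, 17.0, 18.0]  # model output over a historical period
hist_obs = [15.0, 16.0, 17.0]  # observations over the same period
future_sim = [19.0, 20.0]      # raw model projection

corrected = mean_bias_correct(hist_sim, hist_obs, future_sim)
print(corrected)  # [18.0, 19.0]
```

An ML-based method would replace the single constant offset with a learned mapping, but it needs the same paired historical simulations and observations as input.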

Dataset | Data Gap Summary
CMIP6 (earth system model intercomparison data)

Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.
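
The kind of harmonization step implied by inconsistent naming and units can be sketched as follows. The records, key names, and the `ALIASES` table are hypothetical stand-ins for per-model conventions and are not taken from actual CMIP6 files; the sketch only shows renaming fields to a shared schema and converting temperature to kelvin.

```python
# Hypothetical mapping from model-specific names to a shared schema.
ALIASES = {"latitude": "lat", "longitude": "lon", "temperature": "tas"}


def harmonize(record):
    """Return a copy of `record` with canonical keys and temperature in kelvin."""
    out = {}
    for key, value in record.items():
        if key == "units":
            continue  # units are rewritten after conversion
        out[ALIASES.get(key, key)] = value
    if record.get("units") == "degC":
        out["tas"] = [t + 273.15 for t in out["tas"]]
    out["units"] = "K"
    return out


# Two toy records describing the same point with different conventions.
model_a = {"lat": 10.0, "lon": 20.0, "tas": [288.15], "units": "K"}
model_b = {"latitude": 10.0, "longitude": 20.0, "temperature": [15.0], "units": "degC"}

harmonized = [harmonize(model_a), harmonize(model_b)]
print(harmonized)
```

In practice this work is usually done with array libraries and metadata conventions rather than plain dictionaries. The point here is only that every model needs its own mapping before multi-model analysis can begin.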

ECMWF ERA5 Atmospheric Reanalysis

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

Ground-Based Weather Station Observations

Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.

Bias-correction of weather forecasts

ML can be used to improve the fidelity of high-impact weather forecasts by post-processing the outputs of physics-based numerical forecast models and learning to correct their systematic biases.

Dataset | Data Gap Summary
ECMWF ENS (global 9-km 15-day ahead weather model ensemble)

As with HRES, the biggest challenge with ENS is that only a portion of it is available to the public for free.

ECMWF HRES (global 9-km 10-day ahead weather model)

The biggest challenge with using HRES data is that only a portion of it is available to the public for free.

Ground-Based Weather Station Observations

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
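
One common preprocessing option is to interpolate station values onto a regular grid. The sketch below uses inverse distance weighting (IDW) as an illustrative choice; the source does not prescribe a method. The function name `idw_grid`, the station tuples, and the planar distance in degrees are simplifying assumptions.

```python
import math


def idw_grid(stations, grid_lats, grid_lons, power=2.0):
    """Interpolate (lat, lon, value) station tuples onto a regular lat/lon grid."""
    grid = []
    for glat in grid_lats:
        row = []
        for glon in grid_lons:
            weighted_sum = weight_total = 0.0
            exact = None
            for slat, slon, value in stations:
                # Planar distance in degrees; a real pipeline would use geodesic distance.
                dist = math.hypot(glat - slat, glon - slon)
                if dist == 0.0:
                    exact = value  # grid point coincides with a station
                    break
                weight = 1.0 / dist ** power
                weighted_sum += weight * value
                weight_total += weight
            row.append(exact if exact is not None else weighted_sum / weight_total)
        grid.append(row)
    return grid


# Two toy stations (lat, lon, temperature); values are invented.
stations = [(0.0, 0.0, 10.0), (0.0, 2.0, 20.0)]
grid = idw_grid(stations, grid_lats=[0.0], grid_lons=[0.0, 1.0, 2.0])
print(grid)  # [[10.0, 15.0, 20.0]]
```

The midpoint between the two stations receives their average because both are equally distant. Where stations are sparse, the gridded values rest on very few measurements, which is one reason limited station density is itself a data gap.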

Enabling 2D to 3D shape recovery and pose estimation of animals

3D shape recovery and pose estimation refer to the reconstruction of the 3D shapes and poses of animals from 2D images. This information can provide non-invasive insights into animals’ health, age, or reproductive status in their natural environment, which are important for biodiversity monitoring. 

ML-based computer vision techniques have been used to construct more accurate estimations of 3D animal shapes and poses. 

However, there is a lack of open annotated datasets to train models.

Greater effort in curating and releasing such datasets could be pivotal to unlocking this use case.

Dataset | Data Gap Summary
Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Enabling non-intrusive electricity load monitoring

Non-intrusive load monitoring (NILM) is critical for disaggregating building electricity consumption into individual appliance profiles, enabling targeted energy efficiency strategies, demand response, and better supply/demand matching to reduce carbon emissions and maintain grid stability. 

AI techniques can analyze patterns in aggregate electricity data to identify individual appliance signatures without requiring separate meters for each device, providing cost-effective insights for both consumers and utilities.

The effectiveness of AI-based NILM is hindered by insufficient training data that represents diverse appliance types, usage patterns, and building characteristics across different regions, limiting model accuracy and generalizability in real-world settings.

Utilities, researchers, and manufacturers can collaborate to create standardized, privacy-preserving datasets through controlled data collection campaigns and by developing synthetic data generation techniques that capture the diversity of appliance signatures and usage patterns.

Dataset | Data Gap Summary
Pecan Street (appliance-level consumption data)

Pecan Street DataPort requires both academic and non-academic users to purchase access through licensing, which varies depending on the building data features requested. Data coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older, historical buildings that may need energy-efficiency upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual socio-demographic survey data may be too coarse to provide insight into how the behavior of household members affects consumption profiles over time.

Sub-metered appliance-level data

For accurate NILM studies, benchmark datasets need to include not only consumption but also local power generation (e.g., from rooftop solar), as it can affect the overall aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are typically not included. Most buildings covered are single-family housing units, which limits the diversity of representation. Furthermore, most datasets are no longer maintained after the study closes.
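
The effect of local generation on the aggregate signal can be illustrated with a toy calculation. This is a minimal sketch: the function name `net_load` and all values are invented for illustration, and the meter is assumed to report consumption minus generation.

```python
def net_load(consumption_kw, generation_kw):
    """Aggregate load seen at the meter: consumption minus on-site generation."""
    return [c - g for c, g in zip(consumption_kw, generation_kw)]


# Toy profiles in kW for morning, midday, and afternoon.
consumption = [1.0, 3.0, 2.0]  # true appliance consumption
solar = [0.0, 2.5, 2.0]        # rooftop solar output

meter = net_load(consumption, solar)
print(meter)  # [1.0, 0.5, 0.0]
```

In the afternoon the meter reads 0.0 kW even though appliances draw 2.0 kW. A disaggregation model trained on meter readings alone would miss that consumption unless the dataset also records generation.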

Enhancing digital reconstructions of the environment

Digital reconstruction of the environment using remote sensing data is crucial for understanding habitat conditions and their impacts on wildlife, enabling more effective conservation strategies in the face of climate change.

ML enhances this process by efficiently analyzing large volumes of data from multiple sources, producing more detailed and accurate environmental reconstructions.

A key data gap is the limited availability of high-resolution imagery, with most high-quality data being commercial and not freely accessible, particularly affecting studies that require detailed environmental monitoring.

Fostering a data-sharing culture through incentives for collectors, creating standardized annotation pipelines, and making commercial high-resolution satellite imagery more accessible would significantly advance ML-enabled environmental monitoring for biodiversity conservation.

Dataset | Data Gap Summary
Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Environmental DNA (eDNA)

One data gap is that barcoding reference databases are incomplete.

Satellite imagery

Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g., bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed, but most of them are not freely available to the public.

Enhancing energy policy and market analysis

Energy transition policies require comprehensive data on generation, emissions, and financial performance across power systems, but fragmented government datasets make evidence-based policymaking challenging.

AI and data fusion techniques can integrate scattered regulatory data from utilities and energy companies to create analysis-ready datasets that inform carbon pricing, renewable incentives, and grid modernization policies.

Inconsistent data formats, missing identifiers, and poor documentation across government agencies create significant barriers for automated data processing and analysis.

Standardized reporting formats, improved documentation, and centralized data platforms could enable more effective AI-driven policy analysis and accelerate evidence-based energy transitions.

Dataset | Data Gap Summary
The Public Utility Data Liberation (PUDL) project

Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.

Enhancing estimations of methane emissions from rice paddies

Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially how they vary with different agricultural practices, is crucial for addressing climate change. 

ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.

Currently, there is a lack of direct observation of methane emissions from rice paddies that could be used to train ML models.

Real-world data collection is needed to unlock this use case.

Dataset | Data Gap Summary
Direct measurement of methane emission of rice paddies

There is a lack of direct observation of methane emissions from rice paddies.

Enhancing marine wildlife detection and species classification

Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These tasks involve identifying and categorizing different marine species. 

ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.

Current bottlenecks due to data availability include the lack of sufficient labeled data and the lack of open data. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Although a lot of ocean data is collected, there are massive gaps in coverage, with heavy biases towards coastal regions. Collecting data from the deep ocean is technologically challenging, and financial incentives are lacking. High seas fall outside national jurisdictions, so data collection often occurs only through mining companies, military operations, or ad hoc research expeditions. The absence of marine protected areas on high seas and the migratory nature of species like phytoplankton further complicate data collection.

Open-source databases containing labeled data and label editors, such as FathomNet, can increase the amount of relevant data for training ML models. Initiatives like the Ocean Biodiversity Information System (OBIS) and the Integrated Ocean Observing System (IOOS) contribute to data availability more broadly. Data collection efforts may strategically target places where biodiversity is high but currently available data is sparse. Financial tools or regulations could incentivize data collection.

Dataset | Data Gap Summary
FathomNet (marine wildlife annotated imagery)

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Enhancing power grid-vegetation management for wildfire risk mitigation

Vegetation encroachment near high-voltage transmission lines can lead to outages and pose major fire risks, compromising the safety and reliability of the power grid and potentially igniting dangerous wildfires that release stored carbon and endanger wildlife.

Machine learning, especially computer vision applied to remote sensing imagery and historic management records, can accelerate vegetation management by identifying overgrowth areas and tracking dynamic seasonal vegetation growth near grid infrastructure.

Key data gaps include limited access to proprietary utility data, sparse LiDAR captures leading to incomplete scans, insufficient temporal and spatial coverage, and preprocessing requirements for imagery from multiple sensor platforms.

Solutions include establishing partnerships with utilities for data sharing, coordinating multiple robot/UAV inspection trips for improved coverage, developing preprocessing pipelines for diverse sensor data, and implementing regular monitoring schedules to capture seasonal vegetation changes.

Dataset | Data Gap Summary
Aerial power line corridor inspection data

UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.

Power line robot inspection imagery

Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration. It may also be limited by camera resolution for detecting subtle degradation patterns.

Enhancing wind power grid integration and stability

The integration of low-inertia distributed energy resources like wind power into the grid creates critical stability and reliability challenges, particularly for maintaining system frequency at nominal levels to prevent damage and blackouts.
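
To illustrate why low inertia matters for frequency stability, the sketch below applies the textbook aggregated swing equation, RoCoF = f0 · ΔP / (2 · H · S), to estimate the initial rate of change of frequency after a sudden loss of generation. All numbers (system rating, inertia constants, size of the outage) are hypothetical assumptions chosen for illustration; they do not come from any dataset listed here.

```python
def initial_rocof(f0_hz, delta_p_mw, inertia_h_s, system_mva):
    """Initial rate of change of frequency (Hz/s) from the aggregated swing equation.

    delta_p_mw is generation minus load right after the disturbance
    (negative when generation is lost).
    """
    if inertia_h_s <= 0 or system_mva <= 0:
        raise ValueError("inertia constant and system rating must be positive")
    return f0_hz * delta_p_mw / (2.0 * inertia_h_s * system_mva)


# Hypothetical 50 Hz system rated at 10,000 MVA losing 500 MW of generation.
high_inertia = initial_rocof(50.0, -500.0, inertia_h_s=5.0, system_mva=10_000.0)
low_inertia = initial_rocof(50.0, -500.0, inertia_h_s=2.0, system_mva=10_000.0)

print(f"H = 5 s: {high_inertia:.3f} Hz/s")  # -0.250 Hz/s
print(f"H = 2 s: {low_inertia:.3f} Hz/s")   # -0.625 Hz/s
```

With the lower inertia constant, the same outage makes frequency fall 2.5 times faster, leaving less time for primary frequency response to act.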

AI and machine learning can enhance wind power’s contribution to grid stability by optimizing synthetic inertial and primary frequency response capabilities through advanced modeling and control strategies.

Key data gaps include limited accessibility to simulation tools, insufficient temporal granularity in models that operate on hourly rather than sub-hourly scales, and reliability concerns due to the lack of real-world validation data for model outputs.

Grid operators and research institutions can collaborate to improve model accessibility, increase temporal resolution to capture sub-hourly dynamics, and validate simulations with operational data, enabling more effective AI-driven solutions for grid stability as renewable penetration increases.

Dataset Data Gap Summary
NREL Wind Active Power Control Simulation Tools

Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.

Facilitating grid reliability events analysis

Due to rapid fluctuations in power generation, renewables introduce variability into the grid. This variability can trigger safety monitoring systems related to grid stability. Power grid control centers receive multiple streams of semi-structured data from these systems (e.g., alarms, sensors, and field reports) that arrive at high volume. For operators, these alarm triggers and the associated data can be overwhelming to rationalize, reduce, and contextualize when diagnosing grid conditions.

ML can assist in interpreting these data to better understand the sequence of events leading up to an incident as well as to identify and detect the causes behind system disturbances affecting grid reliability.

Access to grid reliability data remains limited, the amount of preprocessing needed is a hurdle, and not all alarm triggers have been validated, which can introduce noise.

More open data releases and open community work regarding data preprocessing can help further advance this use case.

Dataset Data Gap Summary
EPRI10 (transmission control center alarm and operational data set)

Access to EPRI10 grid alarm data is currently limited within EPRI. Data gaps with respect to usability are the result of redundancies in grid alarm codes, requiring significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary based on sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to determine whether they correspond to fault or non-fault events.

Facilitating disaster risk assessments

As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. This involves better mapping of which populations and assets are subject to given risks.

ML can be used to facilitate disaster risk assessments by helping analyze satellite imagery and geographic data in order to pinpoint vulnerable areas and produce more detailed risk maps. In this way, ML can overcome some limitations of traditional ground surveys, which are time- and cost-intensive.

There is a general lack of data from the Global South, where collection capabilities are lower in many regions while climate impacts are forecast to be disproportionately high. Existing data are typically incomplete, even in most high-income countries, limiting the depth of potential analyses and generating uncertainties in assessments, for example, about monetary losses due to disasters.

Closing these data gaps involves inter alia deploying ML techniques that perform well in the Global South, collecting high-quality data involving local knowledge in a variety of contexts, and making the best remote sensing and cadaster data available to these efforts.   

Dataset Data Gap Summary
Building stock – from cadaster and aerial imagery

These datasets are mainly available in rich countries in Europe, North America, and Asia, leaving large parts of the world that face pressing challenges involving their building stock without appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data. Research efforts, particularly in Europe, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), are addressing several of the existing data gaps.

Building stock – satellite-derived

These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.

Digital elevation model

Very high-resolution reference data is currently not freely open to the public.

Financial loss datasets related to the impacts of disasters

Financial loss data is typically proprietary and not publicly accessible.

Natural hazards forecasts

The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.

OpenStreetMap (land use map)

The quality of OpenStreetMap is highly variable in terms of coverage of geometries (e.g., buildings) and attributes. Roads are generally better mapped than buildings. The very permissive data model of OpenStreetMap enables users to provide a variety of information, but it is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.

Population and asset exposure to natural hazards

Accessibility and reliability are the most significant challenges with exposure data.

Facilitating fault detection in low voltage distribution grids

The low-voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources and dynamic loads (such as electric vehicles), low-voltage distribution systems are susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them. 

Machine learning methods can recognize patterns to automate fault diagnoses agnostic to specific line thresholds and topologies. If integrated into advanced monitoring systems, detecting and localizing faults can accelerate adaptive protection and network reconfiguration efforts to ensure reliability and stability.

Data gaps for this use case include lack of coverage (spatial and temporal), noise in the data and high data volume.

New data collection and further analyses of existing data to better understand its pitfalls have the potential to help mitigate the existing gaps for this use case.

Dataset Data Gap Summary
Micro-synchrophasors (µPMU data)

For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.

Facilitating forest restoration monitoring

Efforts are being made to restore ecosystems like forests and mangroves. 

ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.

A significant data gap is the lack of standardized protocols to guide data collection for restoration projects, making it difficult to consistently assess biodiversity outcomes using ML across different restoration initiatives.

Developing standardized data collection protocols, fostering a culture of data sharing, and implementing incentives for data collectors would enable more effective ML applications, leading to better assessment of restoration successes and failures on a global scale.

Dataset Data Gap Summary
Camera trap wildlife image collections

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Passive acoustic monitoring for biodiversity assessment

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Facilitating the detection of climate-induced ecosystem changes

Climate change is causing significant alterations in ecosystems worldwide, threatening biodiversity and ecosystem services that are critical for both nature and human well-being. 

Machine learning can analyze complex ecological data from multiple sources to detect climate change impacts, identify vulnerable regions, and inform targeted conservation efforts. 

Key data gaps include insufficient high-resolution climate and biodiversity data, restricted access to ground survey data, and limited institutional capacity to process collected data efficiently. 

Addressing these gaps requires establishing decentralized monitoring networks, improving data accessibility through legislative reforms, and developing sustainable funding models for long-term ecosystem monitoring initiatives.

Dataset Data Gap Summary
Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Ground survey of land use and land management

Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.

Historical climate observations

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

Passive acoustic monitoring for biodiversity assessment

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Hybrid ML-physics climate models for enhanced simulations

Physics-based climate models incorporate numerous complex components that are computationally intensive, which limits the spatial resolution achievable in climate simulations. 

ML models can emulate these physical processes, providing a more efficient alternative to traditional methods, enabling faster simulations and enhanced model performance.

The most significant data gaps are the enormous volume of climate data, which creates challenges for storage, transfer, and processing, and insufficient granularity in existing datasets to resolve fine-scale physical processes like turbulence.

Developing improved computational infrastructure for handling large datasets and creating ultra-high-resolution benchmark simulations would significantly enhance hybrid climate modeling capabilities.

Dataset Data Gap Summary
ClimSim (benchmark data for hybrid ML-physics research)

ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.

DYAMOND (global atmospheric circulation model intercomparison data)

DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.

ECMWF ERA5 Atmospheric Reanalysis

While ERA5 is widely used due to its good structure and global coverage, download times can range from days to months, and the sheer data volume presents processing difficulties for many users.

Large-eddy simulations (atmospheric processes)

These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.

Regularly gridded high-resolution atmospheric observations

While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation. 

Improving battery management systems

Battery storage is crucial for transitioning to renewable energy and electrifying transportation, with efficiency and lifetime directly impacting these sustainability efforts.

Machine learning can improve battery management systems by accurately estimating state of charge (SoC), state of health (SoH), and remaining useful life (RUL), and optimizing charging and discharging strategies.

Key data gaps include oversimplified battery models that don’t account for real-world operating conditions and insufficient validation data from physical battery systems in diverse operational environments.

Enhancing model complexity and collecting comprehensive real-world performance data can significantly improve battery management predictions, leading to extended battery lifetimes, more efficient energy use, and accelerated adoption of electric vehicles and renewable energy storage.

Dataset Data Gap Summary
Equivalent circuit models

While ECMs enable real-time battery SoC predictions due to their computational efficiency, they often oversimplify real-life operating conditions which limits the accuracy of SoH and RUL estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.
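
As a rough illustration of why ECMs are cheap to evaluate and where they oversimplify, the sketch below simulates the simplest ECM variant (a voltage source plus one series resistance) with coulomb counting for SoC. The linear open-circuit voltage curve, the fixed resistance, and the cell parameters are all hypothetical assumptions; real cells depend on temperature, aging, and hysteresis, which this sketch ignores.

```python
def simulate_rint_cell(current_a, dt_s, capacity_ah, soc0=1.0, r0_ohm=0.05):
    """Simulate SoC and terminal voltage for a simple internal-resistance ECM.

    current_a: iterable of currents in amperes (positive = discharge).
    Returns a list of (soc, terminal_voltage) tuples, one per time step.
    """
    capacity_as = capacity_ah * 3600.0  # convert Ah to ampere-seconds
    soc = soc0
    trace = []
    for i in current_a:
        # Coulomb counting: integrate current to update the state of charge.
        soc = min(1.0, max(0.0, soc - i * dt_s / capacity_as))
        # Hypothetical linear open-circuit voltage: 3.0 V empty, 4.2 V full.
        ocv = 3.0 + 1.2 * soc
        # Terminal voltage drops by I * R0 under load.
        trace.append((soc, ocv - i * r0_ohm))
    return trace


# Discharge a hypothetical 2 Ah cell at 2 A for one hour in 1-second steps.
trace = simulate_rint_cell([2.0] * 3600, dt_s=1.0, capacity_ah=2.0)
half_soc, half_v = trace[1799]
final_soc, final_v = trace[-1]
print(f"after 30 min: SoC={half_soc:.2f}, V={half_v:.2f}")
print(f"after 60 min: SoC={final_soc:.2f}, V={final_v:.2f}")
```

Each step costs only a few arithmetic operations, which is what makes real-time use feasible. The fixed parameters are the kind of simplification that measurements from physical battery systems would need to check.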

Improving estimations of forest carbon stock

Forests are one of Earth’s major carbon sinks, making accurate estimation of forest carbon stocks essential for climate change mitigation efforts and carbon accounting. 

ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery, LiDAR data, and ground surveys.

Ground truth data for forest carbon stock estimation is often limited in geographical coverage and temporal frequency due to the high costs and labor-intensive nature of manual data collection. Additionally, remotely sensed data (satellite, airborne LiDAR) requires significant domain expertise for proper preprocessing and interpretation.

Governments and research institutions can address these gaps by investing in more comprehensive ground survey programs, making airborne LiDAR data more widely available, and developing standardized preprocessing tools for non-experts to utilize remote sensing data effectively.

Dataset Data Gap Summary
Ground-survey based forest inventory data

Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.

LiDAR point cloud – airborne

Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.

Satellite imagery – GEDI LiDAR

Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.

Satellite imagery – PALSAR radar images

Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.

Improving long-term extreme heat prediction

Extreme heat events are becoming more frequent and intense due to climate change, posing serious risks to human health, infrastructure, and ecosystems worldwide.

Machine learning can improve long-term extreme heat prediction by identifying complex patterns in climate data and enhancing the accuracy and resolution of projections beyond what traditional physics-based models can achieve.

Working with climate projection datasets presents significant challenges due to their massive size, which requires substantial computational resources for storage, transfer, and processing, limiting accessibility for many researchers and stakeholders.

Cloud computing providers, research institutions, and funding agencies can collaborate to develop accessible platforms and tools for efficiently managing large climate datasets, enabling broader use of AI for extreme heat prediction and adaptation planning.

Dataset Data Gap Summary
NEX-GDDP-CMIP6 (Global daily downscaled long-term climate projections)

The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.

Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)

Wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.

Machine learning can improve spatio-temporal forecasts at different horizons, given the availability of high-quality training data.

Current data gaps include coverage gaps, noisy data, and difficulty accessing data.

Efforts to get more of such data out of silos, mainly from energy companies, may help alleviate this gap.

Dataset Data Gap Summary
Ocean observations from floating infrastructure (FINO3)

Due to their offshore location, FINO platform measurement sensors are prone to failure under adverse conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is often nontrivial, requiring weather conditions amenable to human intervention. The resulting degradation in data quality can last from several weeks to a season. Coverage is also constrained to the platform and associated wind farm location.

Offshore wind data from masts and LiDAR

The spatiotemporal coverage of the offshore wind speed mast data is restricted to the dimensions of the platform/tower itself as well as the time of construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.

Improving offshore wind power nowcasting (10 min)

Wind nowcasting can enable estimations of the active power generated by wind farms in the absence of curtailment and facilitate operations, potentially making them more efficient.

Machine learning can improve such very short-term spatio-temporal forecasts, given the availability of high-quality training data.

High-resolution wind data measured at wind farms currently remains limited to a few datasets. 

Efforts to get such data out of silos, mainly from energy companies, may help alleviate this gap.

Dataset Data Gap Summary
Offshore wind farm operation data (Orsted)

Data can be accessed by submitting a request via the Orsted form. The sufficiency of the dataset is constrained by volume, as data are available for only a limited number of offshore wind farms over short periods. Expanding the coverage area, the data volume, and the time granularity to under 10 minutes may enable transient detection from generated active power.

Improving power grid optimization

Optimal Power Flow (OPF) is used to find the cheapest way to generate electricity while meeting demand and staying within system limits like voltage and line capacity. Traditionally, OPF is a complex math problem solved separately for AC and DC systems. As more renewable energy is added, the grid is shifting toward hybrid AC/DC systems to better handle long-distance power flow and new challenges like two-way power movement. 
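
As a minimal sketch of the cost-minimization idea, the example below solves a single-bus economic dispatch by merit order. It is a deliberately simplified stand-in and not an OPF solver: it assumes linear generation costs and ignores the voltage and line-capacity limits that make OPF hard. The generator names, capacities, and costs are hypothetical.

```python
def merit_order_dispatch(generators, demand_mw):
    """Dispatch generators cheapest-first until demand is met.

    generators: list of (name, capacity_mw, cost_per_mwh) tuples.
    Returns (dispatch dict in MW, total cost per hour).
    """
    total_capacity = sum(cap for _, cap, _ in generators)
    if demand_mw > total_capacity:
        raise ValueError("demand exceeds total generation capacity")
    dispatch, remaining, total_cost = {}, demand_mw, 0.0
    # Sort by marginal cost so the cheapest units are used first.
    for name, cap, cost in sorted(generators, key=lambda g: g[2]):
        output = min(cap, remaining)
        dispatch[name] = output
        total_cost += output * cost
        remaining -= output
    return dispatch, total_cost


# Hypothetical fleet: (name, capacity in MW, cost per MWh).
fleet = [("gas", 200.0, 60.0), ("wind", 100.0, 0.0), ("coal", 150.0, 40.0)]
dispatch, cost = merit_order_dispatch(fleet, demand_mw=300.0)
print(dispatch)  # {'wind': 100.0, 'coal': 150.0, 'gas': 50.0}
print(cost)      # 9000.0
```

Once network constraints are added, this greedy rule no longer guarantees the cheapest feasible solution, which is why full OPF requires numerical optimization.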

Changes in the grid due to renewable sources make OPF harder to solve. ML can be used to approximate OPF problems in order to allow them to be solved at greater speed, scale, and fidelity.

Data gaps for this use case are numerous and mainly concern usability, reliability, and sufficiency.

Closing these gaps requires an array of gap-specific actions; further industry engagement may have a significant impact on many of the gaps.

Dataset Data Gap Summary
Grid2Op and PandaPower (power systems simulation outputs)

Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.

Optimal power flow simulation outputs

Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.

Power Grid Lib (optimal power flow benchmark library)

While network datasets are open source, maintenance of the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though such data may be hard to obtain without a cooperative effort.

Improving short-term electricity load forecasting

Short-term load forecasting is critical for utilities to balance power demand with supply. Utilities need accurate forecasts (e.g. on the scale of hours, days, weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions. 

ML is well suited to handle large amounts of data such as historical electricity load data, weather forecasts, and continuous streams of advanced metering infrastructure (AMI) data, from which it may capture non-linearities which traditional linear models often struggle with.

Several data gaps for this use case revolve around the difficulty of accessing varied data, due in part to privacy concerns and a lack of willingness from private actors to share data for research.

ML can help the development of synthetic, privacy-preserving datasets that can accelerate research in this space.

Dataset Data Gap Summary
Advanced metering infrastructure data

AMI data is challenging to obtain without pilot study partnerships with utilities since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. Granularity of time series data can also vary depending on the level of access to the data, whether it be aggregated and anonymized or based on the resolution of readings and system. Additionally, the coverage of data will be limited to utility pilot test service areas, thereby restricting the scope and scale of demand studies.

Building data genome project (hourly building-level metered data)

While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.

Faraday (Synthetic smart meter data)

Despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and the presence of low-carbon technologies. Furthermore, since the model is trained on UK building data, the AMI time series it generates may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or data aggregated at the substation level to assess its effectiveness.

Improving solar power forecasting: long-term (>24 hours)

Accurately forecasting solar power generation beyond 24 hours is critical for energy market pricing, investment decisions, and coordinating renewable energy sources in an increasingly decarbonized grid. 

Machine learning approaches can improve longer-term solar forecasting by combining weather predictions, historical generation data, and other relevant variables to create more accurate models than traditional methods. 

The primary data gaps include limited geographic coverage of existing datasets, reliance on simulated rather than measured data, and quality concerns when adapting models to specific regions. 

Expanding data collection networks, validating simulated data with real measurements, and creating standardized datasets for diverse regions would enable more reliable ML-based solar forecasting systems that could significantly improve grid stability and accelerate renewable energy adoption.

Dataset Data Gap Summary
NREL solar power data for integration studies

While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.

Improving solar power forecasting: medium-term (6-24 hours)

Medium-term solar forecasting (6-24 hours ahead) is essential for efficient grid management, especially as solar power integration increases, impacting energy markets, demand response, and microgrid operations.

Machine learning techniques can significantly improve these forecasts by integrating satellite data with weather predictions and historical patterns to provide more accurate solar irradiance estimates.

A key data gap is the inconsistency in satellite data resolutions and coverage, alongside challenges in processing multispectral data and accurately modeling how different cloud types affect ground irradiance.

Combining satellite observations with ground-based measurements and developing standardized preprocessing approaches would substantially improve forecast accuracy, enabling better grid management and renewable energy integration.

Dataset Data Gap Summary
Satellite imagery

Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.

Improving solar power forecasting: nowcasting/very-short-term (0-30min)

Very-short-term solar power forecasting is critical for grid stability and efficiency as sudden changes in solar irradiance (ramp events) can cause abrupt fluctuations in power generation. 

AI techniques can analyze cloud dynamics through segmentation and classification to predict solar irradiance attenuation, enabling more accurate forecasting for real-time electricity markets, dispatch of other generating sources, and energy storage control.

Key data gaps include limited spatial coverage of ground monitoring stations, insufficient time resolution for sub-5-minute forecasting, challenges with large data volumes from sensor networks, and data quality issues related to sensor calibration.

Expanding sensor networks to diverse environments, implementing AI-based data compression and quality control, and integrating multi-source data can close these gaps, ultimately enabling more reliable integration of solar power into electricity grids.

Dataset | Data Gap Summary
DOE Atmospheric Radiation Measurement research facility data products

ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.

NIST campus photovoltaic arrays and weather station data

The dataset has limited spatial coverage (Gaithersburg, MD only) and has not been maintained since July 2017, limiting its usefulness for current applications.

Solcast (global solar forecasting and historical solar irradiance data)

Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, which is insufficient for very-short-term forecasting.

SRRL TSI-880 (sky imagery)

Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and images are limited to 352x288 24-bit JPEGs.

SWINySEG (sky imagery)

The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.

Improving solar power forecasting: short-term (30 min-6 hours)

Solar irradiance forecasting at hourly intervals is critical for managing intermittent solar energy resources and ensuring grid stability and reliability. 

Machine learning approaches can enhance forecasting accuracy by leveraging multiple data sources, including measured irradiance, PV inverter outputs, and meteorological variables. 

Important data gaps include limited spatial coverage, with most high-quality data concentrated in specific regions, and inconsistent temporal resolution that affects forecasting precision. 

By expanding sensor networks globally and harmonizing data collection standards, forecasting models can better support real-time energy management, demand response, and grid stability across diverse geographical areas.

Dataset | Data Gap Summary
NOAA's SOLRAD Network Solar Radiation Data

While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include lower-quality hourly averages compared to native-resolution data, and limited geographic coverage with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.
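
To illustrate why hourly averages can be less useful than native-resolution data for short-term forecasting, the sketch below averages a toy series into hourly blocks. The helper name and all sample values are hypothetical and are not drawn from SOLRAD.

```python
from statistics import mean
from typing import List


def block_average(values: List[float], block: int) -> List[float]:
    """Average consecutive non-overlapping blocks of `block` samples.
    A trailing partial block is dropped."""
    n_blocks = len(values) // block
    return [mean(values[i * block:(i + 1) * block]) for i in range(n_blocks)]


# Hypothetical 10-minute irradiance samples over two hours (W/m^2).
native = [600.0, 620.0, 200.0, 180.0, 640.0, 660.0,
          500.0, 500.0, 500.0, 500.0, 500.0, 500.0]
hourly = block_average(native, block=6)
print(hourly)
```

In this made-up series the two hourly means are close to each other even though the first hour contains a deep, short dip, so a model trained only on hourly averages would not see that variability.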

NREL Physical Solar Model Solar Radiation Database

While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.

NREL SRRL Baseline Measurement System for Multi-Variable Solar Research

While NREL’s SRRL BMS provides real-time joint variable data from ground-based sensors, its coverage is limited to a single location in Golden, CO, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.

SMA Solar Technology PV System Performance database

The SMA PV monitoring system requires user profile creation and specific system access requests, and its documentation is primarily in German, creating potential language barriers. Data representation is geographically unbalanced, with stronger coverage in Germany, the Netherlands, and Australia despite its global presence. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.

SOLETE Hybrid Solar-Wind Generation dataset

While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several sufficiency gaps limit its application. The dataset’s 15-month temporal coverage doesn’t capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analysis of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing; including them would enable more realistic modeling of system dynamics. Supplementing with external data sources or simulation could address these limitations.

Improving terrestrial wildlife detection and species classification

Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems. 

ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.

The primary data gaps include insufficient publicly available annotated datasets and challenges with sharing large-volume bioacoustic data due to storage limitations and high costs.

Solutions include developing affordable data hosting platforms, incentivizing data sharing through recognition and funding, and establishing standardized protocols for data integration.

Dataset | Data Gap Summary
Biodiversity images and recordings – community science data

The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.

Camera trap wildlife image collections

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Drone imagery

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Environmental DNA (eDNA)

One data gap is the incompleteness of barcoding reference databases.

Museum specimens

The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.

Passive acoustic monitoring for biodiversity assessment

The foremost challenge of bioacoustic data is its sheer volume, which makes sharing especially difficult due to limited storage options and high costs. Cheaper and more reliable data hosting and sharing platforms are urgently needed.

Additionally, there’s a significant shortage of large and diverse annotated datasets, one much more severe than for image data such as camera trap, drone, and crowd-sourced images.

Satellite imagery

Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.

Modeling effects of soil processes on soil organic carbon

Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies. 

ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses. 

The insufficient data coverage and granularity of soil organic carbon measurements severely limit the development of well-generalized ML models for accurately predicting soil carbon dynamics. 

Expanding monitoring networks and developing cost-effective measurement technologies, combined with better data standardization across different collection efforts, would enable more effective ML applications for soil carbon management and climate-smart agriculture.

Dataset | Data Gap Summary
Emission dataset compiled from FAO statistics

Data is extrapolated from national-level statistics. It is unknown how accurate this data is at the local level.

Simulated variables from process-based models of soil organic carbon dynamics

Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.

Soil measurements – NorthWyke Farms platform

The most common and significant challenges for use cases involving soil organic carbon are the insufficiency of data and the lack of high-granularity data.

Soil Survey Geographic Database (SSURGO)

In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).

Optimizing electrified bus fleet in urban vehicle-to-grid systems

Diesel-powered school buses contribute significant carbon emissions and air pollution in urban areas, while electric bus adoption faces high upfront costs that challenge school district budgets.

AI-powered optimization systems can manage electric school bus charging and discharging schedules to create virtual power plants, offsetting electrification costs through grid services revenue.
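
As a rough illustration of the scheduling problem, the sketch below uses a simple rule-based policy, not an AI or optimization method: charge when prices are low, discharge when prices are high, and always keep enough time to reach the energy needed at departure. The function name, all parameters, and the price values are hypothetical.

```python
from typing import List


def plan_v2g(
    prices: List[float],
    capacity_kwh: float,
    soc_kwh: float,
    required_kwh: float,
    rate_kwh: float,
    low_price: float,
    high_price: float,
) -> List[float]:
    """Return per-hour energy flows in kWh while the bus is parked.
    Positive values charge the battery, negative values discharge to the grid."""
    plan = []
    for hour, price in enumerate(prices):
        hours_left = len(prices) - hour - 1
        headroom = capacity_kwh - soc_kwh
        # Energy that could still be added by charging in every later hour.
        recoverable = hours_left * rate_kwh
        must_charge = soc_kwh + recoverable < required_kwh
        if must_charge or price <= low_price:
            flow = min(rate_kwh, headroom)
        elif price >= high_price:
            # Discharge only what can be recharged before departure.
            spare = soc_kwh + recoverable - required_kwh
            flow = -max(0.0, min(rate_kwh, soc_kwh, spare))
        else:
            flow = 0.0
        soc_kwh += flow
        plan.append(flow)
    return plan


# Hypothetical hourly prices ($/kWh) for a six-hour parked window.
prices = [0.30, 0.35, 0.10, 0.08, 0.12, 0.25]
plan = plan_v2g(prices, capacity_kwh=100.0, soc_kwh=60.0,
                required_kwh=90.0, rate_kwh=20.0,
                low_price=0.12, high_price=0.30)
print(plan)
```

A real fleet optimizer would also need the charging profiles and station data discussed in this section, which is where the data gaps become binding.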

Key data gaps include inconsistent bus fleet reporting across states, limited access to proprietary charging profiles, and fragmented charge station data that prevent comprehensive fleet optimization modeling.

Standardizing state-level fleet reporting, fostering manufacturer partnerships for charging data access, and creating centralized charge station databases can enable scalable AI solutions for urban transit electrification.

Dataset | Data Gap Summary
Electric vehicle charge station data

Critical gaps include limited findability of station-specific usage data due to proprietary restrictions and scattered data sources requiring aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.

US school bus fleet dataset

The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.

Optimizing smart inverter management for distributed energy resources

Solar panels and batteries are part of new power systems that don’t use traditional spinning generators. They use inverters to convert DC to AC power. Smart inverters can do more than just convert power—they help manage changes in energy supply and keep the grid stable by adjusting voltage and power levels. This prevents issues like sudden drops or spikes in voltage when solar and other sources are added to the grid.
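
One common way a smart inverter supports voltage is a volt-var curve, where reactive power output depends on the measured voltage. The sketch below is a generic piecewise-linear version; the function name, breakpoints, and the 0.44 limit are illustrative assumptions, not settings taken from this document or from any specific device.

```python
def volt_var(
    v_pu: float,
    v1: float = 0.92,
    v2: float = 0.98,
    v3: float = 1.02,
    v4: float = 1.08,
    q_max: float = 0.44,
) -> float:
    """Return the reactive power setpoint as a fraction of rated power.
    Positive values inject reactive power (raising voltage),
    negative values absorb it (lowering voltage)."""
    if v_pu <= v1:
        return q_max
    if v_pu < v2:
        # Linear ramp from q_max at v1 down to zero at v2.
        return q_max * (v2 - v_pu) / (v2 - v1)
    if v_pu <= v3:
        # Deadband: no reactive power response near nominal voltage.
        return 0.0
    if v_pu < v4:
        # Linear ramp from zero at v3 down to -q_max at v4.
        return -q_max * (v_pu - v3) / (v4 - v3)
    return -q_max


# Example: response to a slightly low and a slightly high voltage.
q_low = volt_var(0.95)
q_high = volt_var(1.05)
print(q_low, q_high)
```

Machine learning approaches would typically aim to tune or replace fixed curves like this one, which requires the operational data discussed in this section.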

Machine learning can help better monitor and control smart inverters, with the potential to make efficiency gains.

A key data gap for this use case is limited access to relevant data.

Partnerships between research labs, utilities, and smart inverter manufacturers may help alleviate this bottleneck.

Dataset | Data Gap Summary
Outputs from distribution connected inverter systems simulations

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulation and hardware-in-the-loop facilities is restricted: NREL’s Energy Systems Integration Facility requires submission of a user access proposal, and similar testing laboratories may require access requests and funding.

Smart inverter devices database

Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant manufacturers and smart inverter models that can then be contacted for research partnerships. In terms of coverage area, while California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, other regions outside of the United States may locate similar manufacturers through partnerships and collaborations.

Scaling identification and mapping of climate policy

Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions. 

ML can be employed to identify climate-related policies and categorize them according to different focus areas.

Law corpora are published in various languages and formats by a variety of actors, including cities, national governments and other agencies. They are not all digitized, may be hard to access and require ample harmonization work.

These data gaps may be addressed through aggregation initiatives, and ML may be a key component by automating lengthy processes such as translation or screening for relevance.

Dataset | Data Gap Summary
Climate-related laws and regulations

Laws and regulations for climate action are published in various formats through national and subnational governments, and most are not labeled as “climate policy”. There are a number of initiatives that take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and continual updating, and the resulting datasets are not complete.

Scaling methane emission detection

Methane is a potent greenhouse gas and the second-largest contributor to climate change, with emissions from the oil and gas industry accounting for 20% of global methane emissions.

Advanced machine learning techniques applied to satellite imagery enable the detection, quantification, and monitoring of methane emissions at scale, supporting more effective mitigation efforts across global oil and gas operations.

The primary data gap for methane detection is insufficient spatial resolution in widely available satellite data, making it difficult to pinpoint smaller or localized emission sources and accurately quantify their contribution.

Developing higher-resolution satellite systems like MethaneSAT and creating benchmark datasets with synthetic methane plume data can significantly improve detection capabilities, enabling more targeted mitigation efforts and potentially reducing a substantial portion of global methane emissions.

Dataset | Data Gap Summary
Satellite imagery – Hyperspectral

Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.

Satellite imagery – Multispectral

Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.

Scaling solar photovoltaics site assessments

Statistical analysis on solar photovoltaic (PV) systems with respect to pricing, logistics, planning, and site capacity studies is an important part of the process for siting solar PV systems. 

Spatiotemporal generation forecasting using pre-existing site data can be used to inform future site recommendations, policy, and decision-making with respect to new developments.

Existing data has gaps in coverage that limit the applicability and generalization capacity of ML models across regions.

The availability of open datasets in more regions would help alleviate these gaps.

Dataset | Data Gap Summary
Solar panel PV system dataset

The solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap, as some of the data is historical and may not reflect current pricing and costs of PV systems.

US large-scale solar photovoltaic database

Only the US is covered in this dataset. Supplementing it with international large-scale photovoltaic satellite imagery can expand the coverage area of the dataset.

Weather forecasting: Short-to-medium term (1-14 days)

Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.

Dataset | Data Gap Summary
ECMWF ENS (global 9-km 15-day ahead weather model ensemble)

The biggest challenge of ENS is that only a portion of it is available to the public for free.

ECMWF ERA5 Atmospheric Reanalysis

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

ECMWF HRES (global 9-km 10-day ahead weather model)

The biggest challenge of HRES is that only a portion of it is available to the public for free.

WeatherBench 2

WeatherBench 2 is based on ERA5, so it inherits ERA5’s issues; in particular, the data is biased over regions where there are no observations.

Weather forecasting: Subseasonal horizon

High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Dataset | Data Gap Summary
ECMWF ERA5 Atmospheric Reanalysis

ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.

SubX

More data is needed to develop a more accurate and robust ML model. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained with this data.

Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Dataset | Data Gap Summary
CPC Precipitation (global unified daily precipitation)

CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

S2S forecast data

More data is needed to take advantage of large ML models.

Dataset Gap Types Modalities Sectors
Camera trap wildlife image collections

Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. This medium offers close-range monitoring over long time scales. The image sequences can be used not only to classify species but also to identify specifics about the individual, e.g. sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density.

In general, the raw images from camera traps need to be annotated before they can be used to train ML models. Some of the available annotated camera trap images are shared via Wildlife Insights (www.wildlifeinsights.org) and LILA BC (www.lila.science), while others are listed on GBIF (https://www.gbif.org/dataset/search?q=). However, the majority of camera trap data is likely scattered across individual research labs or organizations and not publicly available. Sharing such images could provide significant progress towards filling the gaps associated with the lack of annotated data that currently hinders the progress of efficiently using ML in biodiversity studies. This is what initiatives like Wildlife Insights are looking to do. 

Use Case | Data Gap Summary
Automating individual re-identification for wildlife

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Enabling 2D to 3D shape recovery and pose estimation of animals

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Facilitating forest restoration monitoring

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Facilitating the detection of climate-induced ecosystem changes

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Improving terrestrial wildlife detection and species classification

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Academic literature databases

Academic literature databases, such as OpenAlex, Web of Science, and Scopus.

Use Case | Data Gap Summary
Active fire data – satellite-derived

Active fire data is derived from images taken by satellites such as MODIS, VIIRS, and Landsat at different spatial resolutions and temporal frequencies. These datasets provide near real-time detection of active fires globally and can be downloaded from https://firms.modaps.eosdis.nasa.gov/active_fire.

Use Case | Data Gap Summary
Advanced metering infrastructure data

Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter device systems that collect, store, and analyze per-building energy consumption.

AMI data can be retrieved through public data portals, individual data collection, or research partnerships with local utilities. Some examples of utility research partnerships include the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) and the smart meter pilot test from the Sacramento Municipal Utility District (SMUD). An example of publicly available data that is aggregated and anonymized is the Commission for Energy Regulation (CER) Smart Metering Project hosted by the Irish Social Science Data Archive (ISSDA).

Use Case | Data Gap Summary
Improving short-term electricity load forecasting

AMI data is challenging to obtain without pilot study partnerships with utilities, since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of time series data can also vary with the level of access, depending on whether the data is aggregated and anonymized and on the resolution of the meter readings and system. Additionally, the coverage of data will be limited to utility pilot test service areas, thereby restricting the scope and scale of demand studies.
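
To make the trade-off between privacy and granularity concrete, here is a minimal sketch of one way meter readings might be aggregated before release: sum consumption per hour and publish an hour only if enough distinct meters contributed. The record layout, function name, and the minimum-meter rule are assumptions for illustration, not a description of how any utility or archive listed here prepares its data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record layout: (meter_id, hour, kWh consumed in that hour).
Reading = Tuple[str, int, float]


def aggregate_by_hour(
    readings: List[Reading], min_meters: int
) -> Dict[int, float]:
    """Sum consumption per hour, dropping meter IDs, and keep only hours
    with at least `min_meters` distinct contributing meters."""
    totals: Dict[int, float] = defaultdict(float)
    meters: Dict[int, set] = defaultdict(set)
    for meter_id, hour, kwh in readings:
        totals[hour] += kwh
        meters[hour].add(meter_id)
    return {
        hour: totals[hour]
        for hour in sorted(totals)
        if len(meters[hour]) >= min_meters
    }


# Made-up readings from three meters.
readings = [
    ("a", 0, 1.0), ("b", 0, 2.0), ("c", 0, 0.5),
    ("a", 1, 1.5), ("b", 1, 2.5),
]
released = aggregate_by_hour(readings, min_meters=3)
print(released)
```

The aggregated output no longer supports building-level load forecasting, which is the granularity loss described above.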

Aerial power line corridor inspection data

LiDAR and image data collected from unmanned aerial vehicles (UAVs) for power line right-of-way (RoW) inspection can be accessed from private providers such as LUMA Energy and COR3, as well as sources like China Southern Power Grid with datasets from Yunnan RoW-1, Yunnan RoW-2, and Hubei RoW 4. Open-source EPRI distribution inspection imagery is also available and labeled with information regarding conductors, poles, crossarms, insulators, and other infrastructure components. These datasets pair images with geolocated GIS data to identify priority vegetation management areas near transmission lines.

Use Case | Data Gap Summary
Enhancing power grid-vegetation management for wildfire risk mitigation

UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.

Automated surface observation system (ASOS)

This dataset contains one- and five-minute observations from automated surface observation system stations in the US. The ASOS network provides near real-time surface weather measurements including wind speed and direction, dew point, air temperature, station pressure, precipitation, visibility, and cloud characteristics. See https://madis.ncep.noaa.gov/madis_OMO.shtml

Use Case | Data Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

Data volume is large and only data specific to the US is available.

Benchmark datasets for building energy modeling

Building energy modeling datasets provide measurements of energy demand profiles for a sample of buildings, as well as relevant input variables for traditional and ML-based models, enabling us to benchmark the performance of different models for energy prediction tasks. For example, the US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states covering 7 climate zones and 11 different building types (https://bbd.labworks.org/dataset-search). The data covers energy, indoor air quality, occupancy, environment, HVAC, lighting, and energy consumption to name a few. Datasets are organized by name and points of contact.

All data featured on the platform is open access, with standardization on metadata format to allow for ease of use and information specific to buildings based on type, location, and climate zone. Data quality and guidance on curation and cleaning, in addition to access restrictions, are specified in the metadata of each hosted dataset. Licensing information for each individual featured dataset is provided.

Use Case | Data Gap Summary
Accelerating building energy models

Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.

Biodiversity images and recordings – community science data

Images and recordings contributed by volunteers represent another significant source of data on biodiversity and ecosystems. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.

Use Case | Data Gap Summary
Improving terrestrial wildlife detection and species classification

The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.

Building data genome project (hourly building-level metered data)

The Building Data Genome Project 2 dataset contains hourly building-level data from 3,053 energy meters in 1,636 non-residential buildings, covering two years' worth of metered data for electricity, water, and solar, in addition to logistical metadata on area, primary building use category, floor area, time zone, weather, and smart meter type. The goal of the dataset is to allow for the development of generalizable building models for energy efficiency analysis studies. The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to university and higher education institutions.

Use Case | Data Gap Summary
Improving short-term electricity load forecasting

While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.

Building stock – from cadaster and aerial imagery

Building stock maps provide a geolocated view of where buildings stand and what kind they are, which is relevant to both climate change mitigation and adaptation. Building stock data from cadasters and aerial imagery provide the most precise available data. In addition to precise building footprints, the 3D geometry of walls and roofs may be available thanks to LiDAR aerial surveys. Further high-quality information from the cadaster may be available as attributes, such as the current usage or the construction year of the building.

Use Case | Data Gap Summary
Facilitating disaster risk assessments

These datasets are mainly available in rich countries in Europe, North America, and Asia, leaving large parts of the world that face pressing challenges involving their building stock without appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data. Research efforts, particularly in Europe, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), are addressing several of the existing data gaps.

Building stock – satellite-derived

Building stock maps provide a geolocated view of where buildings stand and what kind they are, which is relevant to both climate change mitigation and adaptation. Satellite-derived datasets, which often use ML for processing satellite imagery, can provide such maps on a global scale. Coarser-resolution maps come as raster data at resolutions varying from 10 to more than 100 m, while the maps with the highest resolution provide details on building footprint geometries as vector data. Some of these datasets also include a temporal dimension and inferred attributes describing the building characteristics.

Use Case | Data Gap Summary
Facilitating disaster risk assessments

These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.

CMIP6 (earth system model intercomparison data)

CMIP6 (Coupled Model Intercomparison Project Phase 6) provides climate simulations from a consortium of state-of-the-art global climate models, covering historical periods and future scenarios through 2100. The dataset includes multiple climate variables at various spatial and temporal resolutions from modeling centers worldwide. Data can be found at https://pcmdi.llnl.gov/CMIP6/.

Use Case | Data Gap Summary
Accelerating data-driven generation of climate simulations

The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.

Bias-correction of climate projections

Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.
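
The naming and unit inconsistencies mentioned above can be illustrated with a minimal sketch. The alias table, the record layout, and the function names are hypothetical and chosen only for illustration; they are not taken from any CMIP6 tooling. The sketch maps model-specific variable names to one canonical name and converts temperatures to kelvin.

```python
# Hypothetical alias table: source-specific names -> canonical name.
ALIASES = {
    "tas": "tas",
    "t2m": "tas",
    "air_temperature_2m": "tas",
    "pr": "pr",
    "precip": "pr",
}


def to_kelvin(value, unit):
    """Convert a temperature to kelvin; only K and degC are handled here."""
    if unit == "K":
        return value
    if unit == "degC":
        return value + 273.15
    raise ValueError(f"unsupported temperature unit: {unit}")


def harmonize(record):
    """Return a copy of one record with a canonical name and, for temperature, kelvin units."""
    name = ALIASES.get(record["name"])
    if name is None:
        raise KeyError(f"unknown variable name: {record['name']}")
    value, unit = record["value"], record["unit"]
    if name == "tas":
        value, unit = to_kelvin(value, unit), "K"
    return {"name": name, "value": value, "unit": unit}


# Two made-up records describing the same quantity in different conventions.
records = [
    {"name": "t2m", "value": 15.0, "unit": "degC"},
    {"name": "tas", "value": 288.15, "unit": "K"},
]
harmonized = [harmonize(r) for r in records]
```

After harmonization both records share one name and one unit, which is the precondition for any multi-model comparison. Real pipelines would also need to reconcile grids, calendars, and file structures, which this sketch does not attempt.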

CPC Precipitation (global unified daily precipitation)

The CPC Global Unified Gauge-Based Analysis of Daily Precipitation is available at https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

Use Case | Data Gap Summary
Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

ClimSim (benchmark data for hybrid ML-physics research)

ClimSim is an ML-ready benchmark dataset designed for hybrid ML-physics research, for example, for emulating subgrid clouds and convection processes in climate models.

Use Case | Data Gap Summary
Hybrid ML-physics climate models for enhanced simulations

ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.

ClimateBench v1.0 (benchmark dataset for earth system models)

ClimateBench v1.0 is a benchmark dataset derived from the NorESM2 Earth System Model (a participant in CMIP6) designed specifically for evaluating machine learning methods that emulate key climate variables. The dataset is publicly available at https://zenodo.org/records/7064308

Use Case | Data Gap Summary
Accelerating data-driven generation of climate simulations

The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.

ClimateSet (ML-ready earth system model inputs/outputs)

ClimateSet is an ML-ready benchmark dataset compiled from inputs and outputs of the Input4MIPs and CMIP6 archives, structured for various machine learning tasks including climate model emulation, downscaling, and prediction. More information is available at https://arxiv.org/pdf/2311.03721.pdf

Use Case | Data Gap Summary
Accelerating data-driven generation of climate simulations

No significant data gap identified yet.

Computational fluid dynamics simulation for building energy models

Computational fluid dynamics (CFD) simulation output from building energy models is a means of precisely assessing the thermal (e.g., insulation of the walls) and ventilation (e.g., natural ventilation or HVAC) properties of a building. Given the building geometry, terrain, presence of neighboring buildings, and boundary conditions, the Navier-Stokes equations are typically solved. Datasets that include precise building inputs and CFD outputs would help build ML surrogate models. Surrogate models such as GANs or physics-constrained deep neural network architectures have shown promising results, though further research on turbulence representation is needed.

Use Case | Data Gap Summary
Accelerating building energy models

Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties, which may not be available for traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data generated for different wind directions are challenging to manage. Future data collection for verifying simulation output can benefit surrogate or proxy approaches to solving the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern building approaches, leaving passive building techniques, such as the vernacular architecture of indigenous communities, out of design consideration.

DOE Atmospheric Radiation Measurement research facility data products

The DOE Atmospheric Radiation Measurement (ARM) dataset comprises ground-based measurements from various field programs sponsored by the US Department of Energy, including data from sun-tracking photometers, radiometers, and spectrometers that are useful for solar radiation time series forecasting and solar potential assessment.

Use Case | Data Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.

DYAMOND (global atmospheric circulation model intercomparison data)

DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains) is an intercomparison of global storm-resolving model simulations at 5 km resolution or less, used as targets for climate model emulators.

Use Case | Data Gap Summary
Hybrid ML-physics climate models for enhanced simulations

DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.

Digital elevation model

Surface elevation data, often called digital elevation model or terrain surface model, provide a 3D representation of the bare surface of the Earth. These topographic inputs are important for disaster risk assessments and modeling to assess risks due to floods, sea level rise, or landslides, where the elevation of a given location determines whether it is at risk. These digital models are typically estimated from remote sensing data, for example, the Shuttle Radar Topography Mission. They are often provided as raster but may also be provided as points (vector).
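
To make the role of elevation concrete, here is a minimal sketch of a simple "bathtub" style exposure check on a raster elevation grid. The elevation values, the water level, and the function name are made up for illustration. Real assessments also account for hydrological connectivity, flow dynamics, and vertical error in the elevation data, none of which are modeled here.

```python
def flood_mask(dem, water_level_m):
    """Mark each raster cell whose elevation is at or below the water level."""
    return [[elev <= water_level_m for elev in row] for row in dem]


# Toy 3x4 elevation raster in metres (illustrative values only).
dem = [
    [0.5, 1.2, 2.8, 4.0],
    [0.9, 1.6, 3.1, 4.4],
    [1.4, 2.2, 3.5, 5.0],
]

mask = flood_mask(dem, water_level_m=1.5)
exposed_cells = sum(cell for row in mask for cell in row)
total_cells = len(dem) * len(dem[0])
print(f"{exposed_cells} of {total_cells} cells lie at or below 1.5 m")
```

Because the result depends on a per-cell threshold, vertical errors of even a fraction of a metre in the input raster can flip cells between exposed and not exposed, which is why high-resolution reference data matters for this use case.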

Use Case | Data Gap Summary
Facilitating disaster risk assessments

Very high-resolution reference data is currently not freely open to the public.

Direct measurement of methane emission of rice paddies

With sampling systems placed in rice paddies, methane concentrations can be directly measured in the air above the fields or in the soil. 

Use Case | Data Gap Summary
Enhancing estimations of methane emissions from rice paddies

There is a lack of direct observation of methane emissions from rice paddies.

Distribution system simulators

Distribution system simulators such as OpenDSS and GridLAB-D enable analysis of hosting capacity for distribution-level substation feeders by simulating how various factors affect grid stability and reliability. These open-source tools allow researchers to model voltage limits, thermal capabilities, control parameters, and fault currents under different scenarios, providing insights into how distribution grids can safely accommodate distributed energy resources like solar panels. These simulators serve as critical alternatives when real circuit feeder data from utilities is unavailable.

Use Case | Data Gap Summary
Accelerating distribution-side hosting capacity estimations

While OpenDSS and GridLAB-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.

Drone imagery

Drone imagery provides high-resolution, close-range visual data for species identification, individual tracking, and environmental reconstruction. These images offer detailed insights into habitats and wildlife populations, similar to camera traps but with greater flexibility in coverage. Currently, most drone imagery data is scattered across disparate sources, with some collections hosted on platforms like www.lila.science.

Use Case | Data Gap Summary
Automating individual re-identification for wildlife

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Enhancing digital reconstructions of the environment

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Facilitating forest restoration monitoring

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Improving terrestrial wildlife detection and species classification

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

ECMWF ENS (global 9-km 15-day ahead weather model ensemble)

Ensemble forecast up to 15 days ahead, generated by the ECMWF numerical weather prediction model; used as a benchmark/baseline for evaluating ML-based weather forecasts.

Use Case | Data Gap Summary
Bias-correction of weather forecasts

As with HRES, the biggest challenge with ENS is that only a portion of it is available to the public for free.

Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge of ENS is that only a portion of it is available to the public for free.

ECMWF ERA5 Atmospheric Reanalysis

ERA5 is a comprehensive atmospheric reanalysis dataset covering 1940 to present that integrates in-situ and remote sensing observations from weather stations, satellites, and radar into a global, hourly gridded product at 31 km resolution. The dataset is continuously updated and available for download through the Copernicus Climate Data Store.

Use Case | Data Gap Summary
Bias-correction of climate projections

ERA5 is now the most widely used reanalysis data due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

Hybrid ML-physics climate models for enhanced simulations

While ERA5 is widely used due to its good structure and global coverage, users face significant challenges with downloading times that can take days to months, and the sheer data volume presents processing difficulties for many users. 

Weather forecasting: Subseasonal horizon

ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.

ECMWF HRES (global 9-km 10-day ahead weather model)

Single high-resolution forecast up to 10 days ahead, generated by the ECMWF numerical weather prediction model, the Integrated Forecasting System (IFS). It is usually used as a benchmark/baseline for evaluating ML-based weather forecasts.

Use Case | Data Gap Summary
Bias-correction of weather forecasts

The biggest challenge with using HRES data is that only a portion of it is available to the public for free.

Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge of HRES is that only a portion of it is available to the public for free.

EPRI10 (transmission control center alarm and operational data set)

Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format, which includes semi-structured text descriptions of individual alarm events. 

Often, the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.
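
The fields listed above suggest a simple record structure. The sketch below assumes a comma-separated layout, epoch-millisecond timestamps, and made-up station names, signal identifiers, and messages. The actual EPRI10 schema is not given in this catalog and may differ, so the column names and the `AlarmEvent` class should be read as hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AlarmEvent:
    timestamp: datetime
    station: str
    signal_id: str
    location: str
    description: str
    action: str


def parse_alarms(text):
    """Parse CSV rows with an epoch-milliseconds timestamp into AlarmEvent objects."""
    events = []
    for row in csv.DictReader(io.StringIO(text)):
        ts = datetime.fromtimestamp(int(row["timestamp_ms"]) / 1000, tz=timezone.utc)
        events.append(AlarmEvent(ts, row["station"], row["signal_id"],
                                 row["location"], row["description"], row["action"]))
    return events


# Made-up sample rows, for illustration only.
SAMPLE = """timestamp_ms,station,signal_id,location,description,action
1700000000500,STN_A,BRK_101,Bay 1,Breaker 101 trip,OPEN
1700000000750,STN_A,BRK_101,Bay 1,Breaker 101 trip,OPEN
1700000002000,STN_B,XFMR_7,Yard,Transformer 7 over-temperature,ALARM
"""

events = parse_alarms(SAMPLE)

# Count events per (station, signal) pair as a first look at repeated alarms.
counts = {}
for event in events:
    key = (event.station, event.signal_id)
    counts[key] = counts.get(key, 0) + 1
```

Grouping by station and signal is only a starting point. Deciding which repeated alarms are truly redundant would need the alarm priority and field verification described in the data gap summary for this dataset.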

Use Case | Data Gap Summary
Facilitating grid reliability events analysis

Access to EPRI10 grid alarm data is currently limited to within EPRI. Usability gaps result from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions resulting from alarm trigger events need field verification to determine whether a fault actually occurred.

Electric vehicle charge station data

Electric vehicle charging station datasets typically include location, charger specifications, energy delivery amounts, charge duration, costs, and usage patterns for both AC slow charging (depot-based) and DC fast charging (en-route) stations, though specific datasets vary by provider and region.

Use Case | Data Gap Summary
Optimizing electrified bus fleet in urban vehicle-to-grid systems

Critical gaps include limited findability of station-specific usage data due to proprietary restrictions and scattered data sources requiring aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.

Emission dataset compiled from FAO statistics

This dataset comprises agricultural emissions data compiled from Food and Agriculture Organization (FAO) statistics and spatially extrapolated to provide geospatial coverage. It includes estimates of greenhouse gas emissions related to agricultural practices across different regions worldwide and is periodically updated as new FAO statistics become available.

Use Case | Data Gap Summary
Modeling effects of soil processes on soil organic carbon

Data is extrapolated from national-level statistics. It is unknown how accurate this data is at the local level.

Environmental DNA (eDNA)

Environmental DNA (eDNA) datasets consist of genetic material obtained from environmental samples, like soil and water, after being shed by living or dead organisms. By analyzing this genetic material, researchers can detect and monitor species present in a non-invasive and efficient manner, aiding biodiversity studies, conservation efforts, and environmental monitoring. Some eDNA data can be found in GBIF (the Global Biodiversity Information Facility). BIOSCAN-5M is another relevant, comprehensive dataset containing multi-modal information, including DNA barcode sequences and taxonomic labels for over 5 million insect specimens, and serves as a large reference library for species- and genus-level classification tasks.

Use Case | Data Gap Summary
Automating individual re-identification for wildlife

A significant challenge for eDNA-based monitoring is the incompleteness of barcoding reference databases, which limits the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.

Enhancing digital reconstructions of the environment

One data gap is the incompleteness of barcoding reference databases.

Improving terrestrial wildlife detection and species classification

One data gap is the incompleteness of barcoding reference databases.

Equivalent circuit models

Equivalent circuit models are simplified representations of batteries that use networks of resistors and capacitors to model battery behavior arising from electrochemical reactions. Due to their ease of use, they can be integrated easily into battery management control systems and customized to model a variety of battery chemistries and conditions. Different types of equivalent circuit models include the Rint model, hysteresis models, Randles models, and Thevenin models. These models differ in complexity with respect to the extent to which battery behavior is captured. For example, the simplest model, the Rint model, is static, while other models vary in their representation of dynamic properties such as state of charge and battery lifetime.
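
As a rough illustration of the difference between a static and a dynamic model, the sketch below compares the Rint model with a first-order Thevenin model (Rint plus one resistor-capacitor pair) under a constant discharge current. All parameter values are invented for illustration, the open-circuit voltage is held constant (no dependence on state of charge), and discharge current is taken as positive.

```python
import math


def rint_voltage(ocv, current, r0):
    """Rint model: terminal voltage is open-circuit voltage minus the ohmic drop."""
    return ocv - current * r0


def thevenin_voltages(ocv, current, r0, r1, c1, dt, steps):
    """First-order Thevenin model stepped forward under a constant current."""
    tau = r1 * c1                      # time constant of the RC pair, in seconds
    decay = math.exp(-dt / tau)
    v_rc = 0.0                         # voltage across the RC pair, starting from rest
    voltages = []
    for _ in range(steps):
        # Exact update of the RC voltage for a current held constant over dt.
        v_rc = v_rc * decay + r1 * (1.0 - decay) * current
        voltages.append(ocv - current * r0 - v_rc)
    return voltages


# Illustrative parameters only (not from any datasheet).
OCV_V, CURRENT_A = 3.7, 2.0
R0_OHM, R1_OHM, C1_F = 0.05, 0.03, 1000.0

v_static = rint_voltage(OCV_V, CURRENT_A, R0_OHM)
v_dynamic = thevenin_voltages(OCV_V, CURRENT_A, R0_OHM, R1_OHM, C1_F, dt=1.0, steps=300)
```

The Rint voltage is the same at every time step, while the Thevenin voltage relaxes over several time constants toward the open-circuit voltage minus the current times the sum of both resistances. That transient is the kind of dynamic behavior a static model cannot represent.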

Use Case | Data Gap Summary
Improving battery management systems

While equivalent circuit models (ECMs) enable real-time battery state of charge (SoC) predictions due to their computational efficiency, they often oversimplify real-life operating conditions, which limits the accuracy of state of health (SoH) and remaining useful life (RUL) estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.

Faraday (Synthetic smart meter data)

Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open up smart meter data, Octopus Energy’s Centre for Net Zero has generated a synthetic dataset conditioned on the presence of low carbon technologies, energy efficiency, and property type, using a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier. Faraday is currently accessible through the Centre for Net Zero’s API.

Use Case | Data Gap Summary
Improving short-term electricity load forecasting

Although the model is trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and presence of low-carbon technologies. Furthermore, since the model is trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or data aggregated at the substation level to assess its effectiveness.

FathomNet (marine wildlife annotated imagery)

FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. The data can be used to train, test, and validate ML algorithms to help us understand our ocean and its inhabitants.

Use Case | Data Gap Summary
Enhancing marine wildlife detection and species classification

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Grid2Op and PandaPower (power systems simulation outputs)

Grid2Op is a power systems simulation framework to perform reinforcement learning for electricity network operation that focuses on the use of topology to control the flows of the grid. 

Grid2Op allows users to control voltages by manipulating shunts or changing setpoint values of generators, influence active generation through redispatching, and manipulate storage units such as batteries or pumped storage to produce or absorb energy from the grid when needed. The grid is represented as a graph with nodes being buses and edges corresponding to power lines and transformers. Grid2Op has several available environments with different network topologies as well as variables that can be monitored as observations. The environment is designed for reinforcement learning agents to act upon with a variety of actions, some of which are binary and others continuous. The actions include changing bus, changing line status, setting storage, curtailment, redispatching, setting bus values, and setting line status. Multiple reward functions are also available in the platform for experimentation with different agents. Note that Grid2Op itself does not model the grid equations or prescribe a particular solver. Data on how the power grid evolves over time is represented by the “Chronics.” The solver that computes the state of the grid is represented by the “Backend,” which uses PandaPower to compute power flows.
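
The graph view described above can be sketched without any grid library. The following toy example is not Grid2Op code and computes no power flows; the bus names, line table, and helper functions are hypothetical. It only shows how a line-status action changes the connectivity of a graph whose nodes are buses and whose edges are lines.

```python
from collections import deque

# Toy network: each line connects two buses and carries an in-service flag.
lines = {
    "L1": ("bus1", "bus2", True),
    "L2": ("bus2", "bus3", True),
    "L3": ("bus1", "bus3", True),
    "L4": ("bus3", "bus4", True),
}


def set_line_status(lines, line_id, in_service):
    """Return a new line table with one line switched in or out of service."""
    bus_a, bus_b, _ = lines[line_id]
    updated = dict(lines)
    updated[line_id] = (bus_a, bus_b, in_service)
    return updated


def reachable(lines, start):
    """Breadth-first search from one bus over in-service lines only."""
    neighbours = {}
    for bus_a, bus_b, in_service in lines.values():
        if in_service:
            neighbours.setdefault(bus_a, set()).add(bus_b)
            neighbours.setdefault(bus_b, set()).add(bus_a)
    seen, queue = {start}, deque([start])
    while queue:
        bus = queue.popleft()
        for nxt in neighbours.get(bus, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


after_l1 = set_line_status(lines, "L1", False)  # bus2 stays connected via L3 and L2
after_l4 = set_line_status(lines, "L4", False)  # bus4 is cut off from the rest
```

In this toy network, opening L1 leaves every bus reachable because of the parallel path, while opening L4 isolates bus4. A simulation framework would additionally recompute power flows after each such action, which is the part delegated to the backend solver.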

Use Case | Data Gap Summary
Improving power grid optimization

Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.

Ground survey of land use and land management

Ground surveys collect direct field observations on land use practices and management approaches, providing critical ground-truth data that complements remote sensing. This information is essential for understanding human impacts on ecosystems and validating satellite-derived land cover classifications.

Use Case | Data Gap Summary
Facilitating the detection of climate-induced ecosystem changes

Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.

Ground-Based Weather Station Observations

Ground-based weather station data provides point measurements of atmospheric variables including temperature, precipitation, and humidity from meteorological networks worldwide. These observations serve as ground truth for validating and bias-correcting climate model outputs, though spatial coverage varies significantly by region and is particularly sparse in developing countries.

Use Case | Data Gap Summary
Bias-correction of climate projections

Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.
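
As one simple example of the preprocessing involved, the sketch below interpolates scattered station values onto a regular grid with inverse distance weighting. This is only one of many possible interpolation choices and is not prescribed by the source. The station coordinates and values are made up, and the sketch uses planar distances, ignoring elevation and Earth curvature.

```python
import math


def idw(stations, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (sx, sy, value) stations."""
    numerator = denominator = 0.0
    for sx, sy, value in stations:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # the grid point coincides with a station
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def to_grid(stations, xs, ys):
    """Evaluate the estimate on a regular grid; rows follow ys, columns follow xs."""
    return [[idw(stations, x, y) for x in xs] for y in ys]


# Made-up station temperatures: (x, y) in arbitrary planar units, value in deg C.
stations = [(0.0, 0.0, 10.0), (2.0, 0.0, 14.0), (0.0, 2.0, 12.0)]
grid = to_grid(stations, xs=[0.0, 1.0, 2.0], ys=[0.0, 1.0, 2.0])
```

Grid points that coincide with a station reproduce the station value, and every other point is a weighted average of all stations. With sparse networks the estimate far from any station is dominated by a few distant values, which is one way limited station density constrains downstream accuracy.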

Bias-correction of weather forecasts

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.

Ground-survey based forest inventory data

Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models. One source is the USDA Forest Service Forest Inventory and Analysis (FIA) program: https://research.fs.usda.gov/programs/fia#data-and-tools

Use Case | Data Gap Summary
Improving estimations of forest carbon stock

Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.

Health data

Health data refers to information related to individuals’ physical and mental well-being. This can include a wide range of data, such as medical records, health surveys, healthcare utilization, and epidemiological data.

Use Case | Data Gap Summary
Assessment of climate impacts on public health

The biggest issue for health data is its limited and restricted access.

High-Resolution Rapid Refresh (HRRR) weather forecast

The High-Resolution Rapid Refresh (HRRR) dataset contains near-term weather forecasts produced at 3-km resolution with hourly updates. It is a cloud-resolving, convection-allowing atmospheric model that assimilates radar data every 15 minutes over a 1-hour period. See https://rapidrefresh.noaa.gov/hrrr/

Use Case | Data Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

Data volume is large, and only data covering the US is available.

Historical climate observations

Historical climate observations provide essential baseline data for tracking ecosystem changes over time. This dataset includes both global reanalysis products like ERA5 that offer comprehensive but coarse-resolution data, and more granular observations aggregated from local weather stations that provide detailed climate information at specific locations.

Use CaseData Gap Summary
Assessment of climate impacts on public health

Processing climate data and integrating it with health data are major challenges.

Facilitating the detection of climate-induced ecosystem changes

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

Lab measurements of material property and carbon absorption

This dataset comprises lab measurements of material properties (such as chemical composition and physical properties) and of their carbon absorption performance (such as absorption capacity).

Use CaseData Gap Summary
Accelerating the design of new carbon-absorbing materials

The major challenge is that data is not shared with the public.

Large-eddy simulations (atmospheric processes)

Large-eddy simulations are very high-resolution atmospheric simulations (finer than 150 m) where atmospheric turbulence is explicitly resolved in the model, providing detailed insights into small-scale atmospheric processes.

Use CaseData Gap Summary
Hybrid ML-physics climate models for enhanced simulations

These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.

LiDAR point cloud – airborne

Airborne LiDAR (Light Detection and Ranging) collects high-resolution, three-dimensional point clouds of forest structure using sensors mounted on aircraft or drones. This technology captures precise data about forest canopies, enabling detailed assessment of biomass and carbon stocks at local to regional scales.

Use CaseData Gap Summary
Improving estimations of forest carbon stock

Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.

Micro-synchrophasors (µPMU data)

Micro-phasor measurement units (µPMUs) provide synchronized voltage and current measurements with higher accuracy, precision, and sampling rates than classic PMUs, making them well suited to distribution network monitoring.

For example, µPMUs have an angle accuracy allowance of 0.01 degrees and a total vector error allowance of 0.05%, in contrast to 1 degree and 1% for classic PMUs. With sampling rates of 10-120 samples per second, µPMUs can capture dynamic and transient states within the low voltage distribution network, allowing for improved event and fault detection and localization. Today, most µPMU datasets can be accessed through manual field deployments in test-beds, collaborative research studies, or publicly available datasets.
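
To make the total vector error (TVE) figure more concrete, the sketch below computes TVE using its usual definition, the magnitude of the difference between the measured and reference phasors divided by the magnitude of the reference phasor. The phasor values and the helper names `phasor` and `total_vector_error` are illustrative assumptions and are not taken from any µPMU dataset.

```python
import cmath
import math


def phasor(magnitude, angle_deg):
    """Build a complex phasor from a magnitude and an angle in degrees."""
    return cmath.rect(magnitude, math.radians(angle_deg))


def total_vector_error(measured, reference):
    """Return TVE as a fraction: |measured - reference| / |reference|."""
    return abs(measured - reference) / abs(reference)


reference = phasor(1.0, 30.0)

# Illustrative measurement with a pure 0.01 degree angle error.
tve_angle_only = total_vector_error(phasor(1.0, 30.01), reference)

# Illustrative measurement with the same angle error plus a 0.01% magnitude error.
tve_combined = total_vector_error(phasor(1.0001, 30.01), reference)

print(f"angle error only:        TVE = {100 * tve_angle_only:.4f}%")
print(f"angle + magnitude error: TVE = {100 * tve_combined:.4f}%")
```

Because TVE combines magnitude and angle errors into one number, an angle error alone already consumes part of the TVE allowance.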

Use CaseData Gap Summary
Facilitating fault detection in low voltage distribution grids

For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.

Museum specimens

Museum specimens contain detailed biological records documenting species’ characteristics, including morphological traits. Data on where and when they were collected is also often recorded. This offers documentation on the occurrence of species in both space and time. Museum specimens are valuable resources for various applications, such as species classification and species distribution modeling. 

Use CaseData Gap Summary
Improving terrestrial wildlife detection and species classification

The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.

NEX-GDDP-CMIP6 (Global daily downscaled long-term climate projections)

The NEX-GDDP-CMIP6 dataset provides high-resolution, bias-corrected global climate projections derived from Coupled Model Intercomparison Project Phase 6 (CMIP6) across four greenhouse gas emissions scenarios (Shared Socioeconomic Pathways). It includes daily climate variables such as temperature, precipitation, humidity, and radiation from 2015 to 2100 at approximately 25km resolution, enabling detailed analysis of climate change impacts sensitive to local topography and fine-scale climate gradients. For more information, see https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6.

Use CaseData Gap Summary
Improving long-term extreme heat prediction

The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.
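
One common way to keep such an analysis tractable is to reduce the data one slice at a time rather than loading it all at once. The sketch below counts days above a temperature threshold year by year for a single grid cell. It does not read NEX-GDDP-CMIP6 files: `synthetic_tasmax` is a hypothetical stand-in that generates made-up daily maximum temperatures, and the 34 degree threshold and the trend are arbitrary assumptions for illustration only.

```python
import math


def synthetic_tasmax(year, days=365):
    """Hypothetical stand-in for one year of daily max temperature (deg C) at one grid cell."""
    warming = 0.03 * (year - 2015)  # illustrative trend, not a projection
    return [
        25.0 + warming + 10.0 * math.sin(2 * math.pi * (d - 100) / days)
        for d in range(days)
    ]


def hot_days_per_year(years, threshold_c, loader=synthetic_tasmax):
    """Count days above the threshold, loading only one year of data at a time."""
    return {
        year: sum(1 for t in loader(year) if t > threshold_c)
        for year in years
    }


counts = hot_days_per_year(range(2015, 2101, 5), threshold_c=34.0)
for year, n in counts.items():
    print(year, n)
```

In practice the loader would read one year (or one spatial tile) from the real archive, and the same reduction pattern would apply.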

NIST campus photovoltaic arrays and weather station data

This dataset contains measurements from PV arrays at the National Institute of Standards and Technology campus from August 2014 to July 2017, including electrical, temperature, meteorological, and radiation data sampled at high frequency with one-minute averages.

Use CaseData Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

The dataset has limited spatial coverage (Gaithersburg, MD only) and has not been maintained since July 2017, limiting its usefulness for current applications.

NOAA's SOLRAD Network Solar Radiation Data

The National Oceanic and Atmospheric Administration’s SOLRAD Network monitors surface radiation at nine locations across the United States. The data includes high-precision measurements from various instruments, including pyrheliometers, pyranometers, and UV radiometers that collect minute-interval measurements of incoming solar radiation. These measurements characterize the Earth’s surface radiation budget and can be used to accurately forecast solar energy generation for grid planning and management.

Use CaseData Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include lower-quality hourly averages compared to native-resolution data, and limited geographic coverage with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.

NREL Physical Solar Model Solar Radiation Database

The National Renewable Energy Laboratory (NREL)’s National Solar Radiation Database (NSRDB) provides hourly and half-hourly solar radiation data modeled using NREL’s Physical Solar Model (PSM). The data is derived from multiple satellite sources including NOAA’s Geostationary Operational Environmental Satellites (GOES), the Interactive Multisensor Snow and Ice Mapping System (IMS), MODIS, and MERRA-2 reanalysis. The PSM derives cloud and aerosol properties as inputs for the Fast All-sky Radiation Model for Solar applications (FARMS), enabling users to access spectral irradiance data based on time, location, and PV orientation.

Use CaseData Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.

NREL SRRL Baseline Measurement System for Multi-Variable Solar Research

The NREL Solar Radiation Research Laboratory’s Baseline Measurement System (SRRL BMS) provides 130 variables at 60-second intervals for site-specific environmental factors at its Golden, Colorado facility. This comprehensive dataset includes co-located measurements of temperature, pressure, precipitation, wind parameters, humidity, UV index, aerosol optical depth, albedo, and cloud cover categorized as opaque, thin, and clear. This multi-variable dataset supports photovoltaic potential studies and renewable resource climatology research.

Use CaseData Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While NREL’s SRRL BMS provides real-time joint variable data from ground-based sensors, its coverage is limited to a single location in Golden, CO, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.

NREL Wind Active Power Control Simulation Tools

NREL has developed simulation tools to understand the effects of wind power on interconnection system frequency, including the Flexible Energy Scheduling Tool for Integrating Variable Generation (FESTIV) and Multi-Area Frequency Response Integration Tool (MAFRIT). These tools use traditional commercial software and custom-developed models to perform dynamic simulations and wind generation studies for active power control of the grid.

These simulation tools include:

NREL Flexible Energy Scheduling Tool for Integrating Variable Generation (FESTIV)

NREL Multi-Area Frequency Response Integration Tool (MAFRIT)

Use CaseData Gap Summary
Enhancing wind power grid integration and stability

Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.

NREL solar power data for integration studies

The NREL Solar Power Data for Integration Studies provides one year (2006) of 5-minute solar power data and hourly day-ahead forecasts for 6,000 simulated PV plants across the United States. The dataset was created using sub-hour irradiance algorithms and numerical weather prediction simulations, covering both utility-scale (with single-axis tracking) and distributed-scale (fixed-tilt) PV systems.

Use CaseData Gap Summary
Improving solar power forecasting: long-term (>24 hours)

While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.

Natural hazards forecasts

Natural hazard data used for risk assessments can usually be modeled with characteristics derived from, and statistically consistent with, the observational record. Some hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse and in the Risk Data Library of the World Bank.

Use CaseData Gap Summary
Facilitating disaster risk assessments

The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.

Ocean observations from floating infrastructure (FINO3)

FINO3 is an offshore research platform with a wind mast that measures wind speed and wind direction. Its datasets also include time series of temperature, air pressure, relative humidity, global radiation, and precipitation. Images taken from the platform provide a direct snapshot of environmental conditions.

The platform is located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Wind measurements are taken between 32 and 102 meters above sea level, with wind speed measured every 10 meters. Data has been collected since August 2009.

Use CaseData Gap Summary
Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)

Because of the platform's location, FINO measurement sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is often nontrivial because it requires weather conditions that allow human intervention. The resulting loss of data quality can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.

Offshore wind data from masts and LiDAR

Offshore wind data from mast measurements and LiDAR is available from several providers.

LiDAR-based wind mapping has advantages over traditional wind mast tower measurements, namely higher resolution, larger coverage, and improved data quality. This is because LiDAR can measure wind speeds at various heights from the ground, reducing the impact of turbulence that would typically affect mast measurements. Furthermore, LiDAR-based wind mapping can provide near-real-time wind data suitable for control optimization and load forecasting applications.

Datasets include:

TNO offshore wind measurements: Lichteiland Goeree (LEG), Europlatform (EPL), K13A, L2-FA-1, Meetmast IJmuiden (MMIJ), Offshore Wind Egmond aan Zee (OWEZ)

Orsted offshore meteorological data: Anholt offshore wind farm (ANH), Westermost Rough offshore wind farm (WMR)

FINO2 offshore meteorological data

Access to the above datasets can be requested from TNO and Orsted, both of which are based in Europe.

Use CaseData Gap Summary
Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)

The spatiotemporal coverage of the offshore wind speed mast data is restricted to the dimensions of the platform or tower itself and to the period since its construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.

Offshore wind farm operation data (Orsted)

The offshore operation data from the Danish energy company Orsted provides two years' worth of 10-minute Supervisory Control and Data Acquisition (SCADA) information on nacelle wind speed, electrical power, rotor speed, yaw position, and pitch angle for turbines, together with on-site wave buoy data and ground-based LiDAR data from different offshore wind farm sites.

For one site, the Anholt offshore wind farm, data is collected from 111 Siemens SWT-120-3.6 MW wind turbines arranged in a layout of 20 km by 8 km, with internal spacing between turbines of 5-7 rotor diameters and a water depth of 15-19 m. Another site, the Westermost Rough offshore wind farm, located northeast of Withernsea off the Holderness coast in the North Sea, England, has a 35 km by 35 km spatial coverage area.

Use CaseData Gap Summary
Improving offshore wind power nowcasting (10 min)

Data can be accessed by submitting a request via the Orsted form. The sufficiency of the dataset is constrained by volume, since data is available for only a limited number of offshore wind farms over short periods. Expanding the coverage area, the data volume, and the time granularity to under 10 minutes may enable transient detection from generated active power.

OpenStreetMap (land use map)

OpenStreetMap is an open map database providing worldwide geographic features such as buildings, roads, and land uses, maintained by a community of mappers who add objects manually or trace them from remote sensing imagery.

Use CaseData Gap Summary
Facilitating disaster risk assessments

The quality of OpenStreetMap varies widely in its coverage of both geometries (e.g., buildings) and attributes. Roads are generally better mapped than buildings. OpenStreetMap's very permissive data model lets users provide a wide variety of information, but that information is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.

Optimal power flow simulation outputs

PowerWorld Simulator and MATPOWER are software tools used for optimizing power systems and include representation of both alternating current (AC) and direct current (DC) systems. PowerWorld Simulator models, analyzes, and optimizes power systems for a wide range of configurations and scenarios with the ability to model small distribution networks as well as transmission systems. 

MATPOWER is an open source alternative and also solves both the AC and DC versions of optimal power flow (OPF) with DC OPF simplified into a quadratic program using DC modeling assumptions and reducing polynomial costs to second order using real power flows as a function of voltage angles (thereby eliminating voltage magnitude and reactive power). PowerWorld Simulator utilizes a combination of iterative algorithms (Newton-Raphson) with traditional power flow equations.
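
The DC modeling assumptions mentioned above reduce real power flow on a line to the voltage angle difference across it divided by the line reactance. The sketch below solves a DC power flow (not a full DC OPF, which would add generation costs and limits on top) for a made-up 3-bus network in per-unit. It does not use MATPOWER or PowerWorld; the network data and the function names `solve_linear` and `dc_power_flow` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def dc_power_flow(n_bus, lines, injections, slack=0):
    """lines: (from_bus, to_bus, reactance); injections: net real power per bus."""
    b = [[0.0] * n_bus for _ in range(n_bus)]
    for i, j, x in lines:
        b[i][i] += 1.0 / x
        b[j][j] += 1.0 / x
        b[i][j] -= 1.0 / x
        b[j][i] -= 1.0 / x
    # Drop the slack bus (angle fixed at zero) and solve for the other angles.
    keep = [k for k in range(n_bus) if k != slack]
    reduced = [[b[r][c] for c in keep] for r in keep]
    solved = solve_linear(reduced, [injections[k] for k in keep])
    theta = [0.0] * n_bus
    for k, value in zip(keep, solved):
        theta[k] = value
    # Real power flow depends only on angle differences and reactances.
    flows = {(i, j): (theta[i] - theta[j]) / x for i, j, x in lines}
    return theta, flows


lines = [(0, 1, 0.1), (0, 2, 0.2), (1, 2, 0.25)]
injections = [0.5, -1.0, 0.5]  # bus 1 is a load, buses 0 and 2 generate
theta, flows = dc_power_flow(3, lines, injections)
for line, flow in flows.items():
    print(line, round(flow, 4))
```

Voltage magnitudes and reactive power do not appear anywhere in the calculation, which is what makes the DC formulation linear in the angles.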

MATPOWER is open source, while PowerWorld Simulator offers several options for industry practitioners as well as for those who would like to use it for academic purposes. Its demo software is licensed for educational use and includes simulator features such as available transfer capability, optimal power flow, security-constrained OPF, OPF reserves, PV/QV curve tool, transient stability, and geomagnetically induced current. In terms of topology, the free version handles up to 13 buses, while the full version of the simulator can handle 250,000 buses.

Use CaseData Gap Summary
Improving power grid optimization

Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.

Outputs from distribution connected inverter systems simulations

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based systems. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing.

NREL’s PREconfiguring and Controlling Inverter SEt-points (PRECISE) tool can identify the interconnection location on the network from a PV customer’s address, model the distribution feeder, and preconfigure advanced inverter modes to provide grid support and minimize energy curtailment. The tool allows utilities to perform power flow analysis and analyze inverter modes.

Furthermore, NREL’s Energy Systems Integration Facility (ESIF) has real-time simulation connected with power hardware that allows smart inverter manufacturers to test operational control with simulated dynamics and scenarios.

Use CaseData Gap Summary
Optimizing smart inverter management for distributed energy resources

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based systems. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulations and hardware-in-the-loop facilities at NREL’s Energy Systems Integration Facility requires submitting a user access proposal. Similar testing laboratories may require access requests and funding.

Passive acoustic monitoring for biodiversity assessment

Passive acoustic recording provides continuous monitoring of both the environment and species vocalizations. While some annotated datasets are available through repositories like ARBIMON (https://arbimon.org/), Macaulay Library (www.macaulaylibrary.org), and Xeno-canto (www.xeno-canto.org), there remains a general lack of robust, large, and diverse annotated bioacoustic datasets for machine learning applications.

Use CaseData Gap Summary
Facilitating forest restoration monitoring

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Facilitating the detection of climate-induced ecosystem changes

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Improving terrestrial wildlife detection and species classification

The foremost challenge of bioacoustic data is its sheer volume, which makes sharing especially difficult given limited storage options and high costs. Cheaper and more reliable data hosting and sharing platforms are urgently needed.

Additionally, there is a significant shortage of large and diverse annotated datasets, a shortage far more severe than for image data such as camera trap, drone, and crowd-sourced images.

Pecan Street (appliance-level consumption data)

Pecan Street DataPort began as a Smart Grid Demonstration program through the Pecan Street energy research nonprofit organization which worked closely with the University of Texas at Austin. Funded by the DOE in 2014, the project signed up 1000 research participants from the Mueller community in Austin, Texas to share green button, smart meter, and home energy management system (HEMS) data in 750 homes and 25 commercial properties. Financial incentivization of plug-in electric vehicle use and rooftop solar installation by Austin Energy encouraged residential lifestyle shifts. In addition to providing access to sub-metered appliance level consumption data, Pecan Street includes electric vehicle charging, rooftop solar, heating, cooling, and water usage data. Data coverage has expanded to volunteer households from California, New York and Colorado. Previously open for use, Pecan Street has been privatized and now data access and products are available for commercial and academic purchase depending on the level of access requested.

Use CaseData Gap Summary
Enabling non-intrusive electricity load monitoring

Pecan Street DataPort requires both academic and non-academic users to purchase access via licensing, which varies depending on the building data features requested. The coverage area is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may be in need of energy-efficient upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual socio-demographic survey data may be too coarse to provide insight into the behavioral effects of household members on consumption profiles over time.

Population and asset exposure to natural hazards

Exposure is defined as the representative value of the populations and assets potentially exposed to a natural hazard occurrence, such as population, physical assets (e.g., buildings), economic output (e.g., measured by GDP), or agricultural output, depending on the risk in question.

There are open datasets with global coverage, for example the Global Exposure Model, as well as proprietary data with more detailed information from well-established insurance companies.

Use CaseData Gap Summary
Facilitating disaster risk assessments

Accessibility and reliability are the most significant challenges with exposure data.

Power Grid Lib (optimal power flow benchmark library)

The Power Grid Library (PGLib-OPF) is a collection of git repositories that house benchmark data for validating power system simulations. 

It contains 36 networks with 3-13,659 buses sourced from IEEE Power Flow Test Cases, IEEE Dynamic Test Cases, IEEE Reliability Test System, Polish Test Cases, PEGASE Test Cases, and RTE Test Cases, which have been modified to raise optimality gaps to values between 1% and 10%, thereby creating more challenging AC-OPF problems.

Because this data is curated and collected in one place, users who want to study more realistic AC-OPF simulation scenarios can directly retrieve compiled bus IDs, branch IDs, generator IDs, power demand, shunt admittance, voltage magnitude ranges for buses, power injection ranges for generators, quadratic active power cost function coefficients for generators, and branch parameters such as series admittance, line charge, transformer parameters, thermal limits, and branch voltage angle difference ranges. All parameters are standardized to the MATPOWER data file format for direct use. PGLib-OPF is open source.

Use CaseData Gap Summary
Improving power grid optimization

While network datasets are open source, maintenance of the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data though such data without cooperative effort may be hard to find.

Power line robot inspection imagery

Cable inspection robot data includes LiDAR and image captures of Specific Power Line (SPL) components such as dampers, insulators, broken strands, and attachments that may have degraded due to exposure to natural elements. The data also focuses on assessing risk at the lowest part of power lines near trees, roofs, and other crossing power lines. Since the robots physically traverse the lines, this data is particularly valuable for degradation detection of high voltage transmission lines and for maintenance scheduling.

Use CaseData Gap Summary
Enhancing power grid-vegetation management for wildfire risk mitigation

Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration, and it may be limited by camera resolution when detecting subtle degradation patterns.

Regularly gridded high-resolution atmospheric observations

Though a lot of data is available, a set of regularly gridded 3D high-resolution observations of the atmospheric state (like a higher-resolution version of ERA5) is still needed. This is essential both for an improved understanding of atmospheric processes and for the development of ML-based weather forecast models and climate models.

Use CaseData Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.

Hybrid ML-physics climate models for enhanced simulations

While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation. 

Residential daylight performance metric (DPM) data

The amount of daylight that buildings are exposed to through windows is an important parameter for heating demand (via heat gains from solar radiation) and electricity demand for lighting (via the illumination of indoor spaces by natural light). Architects can optimize these dimensions by adjusting window placement and window-to-wall ratios.

Daylight performance metrics (DPMs) have been developed by building researchers and architects based on daylight access simulation output to quantify the illumination of indoor spaces by natural light.

Residential daylight performance metric (DPM) data for daylight autonomy (DA), continuous daylight autonomy (cDA), spatial daylight autonomy (sDA), and useful daylight illuminance (UDI) can be generated using physics-based ray tracing simulations that calculate illuminances over a prototype building layout. Simulation software available to calculate DPMs includes IES Virtual Environment (IESVE), DesignBuilder, VELUX Daylight Visualizer, and the open source RADIANCE 5.0. To generate synthetic data from these simulation frameworks, the user must provide a geometric model of the building, climate data for the building location, reflectance and transmittance values for materials, desired radiance parameters, an occupancy schedule, and a virtual sensor grid over which the incident illuminance is to be calculated. Strategies based on the output of the simulations can assist architects in optimizing window placement and size, incorporating shading devices, and designing floor plans to control direct and diffuse natural light in buildings.
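
As a minimal sketch of how such metrics are derived from simulated illuminance, the code below computes DA, cDA, and UDI for a single virtual sensor over one day. The illuminance values and the occupancy schedule are made up, and the 300 lux target and the 100-2000 lux useful range are commonly used values that are assumed here rather than taken from any particular standard or from the simulation tools named above.

```python
def occupied_values(illuminance, occupied):
    """Keep only the illuminance readings (lux) from occupied hours."""
    return [e for e, occ in zip(illuminance, occupied) if occ]


def daylight_autonomy(illuminance, occupied, threshold=300.0):
    """Fraction of occupied hours with illuminance at or above the threshold."""
    hours = occupied_values(illuminance, occupied)
    return sum(1 for e in hours if e >= threshold) / len(hours)


def continuous_daylight_autonomy(illuminance, occupied, threshold=300.0):
    """Like DA, but hours below the threshold earn partial credit e / threshold."""
    hours = occupied_values(illuminance, occupied)
    return sum(min(e / threshold, 1.0) for e in hours) / len(hours)


def useful_daylight_illuminance(illuminance, occupied, low=100.0, high=2000.0):
    """Fraction of occupied hours with illuminance inside the useful range."""
    hours = occupied_values(illuminance, occupied)
    return sum(1 for e in hours if low <= e <= high) / len(hours)


# Illustrative hourly illuminance (lux) at one sensor for a single day.
illuminance = [0, 0, 0, 0, 0, 0, 50, 150, 300, 600, 1200, 2500,
               3000, 2400, 1500, 800, 350, 120, 30, 0, 0, 0, 0, 0]
# Fixed occupancy from 07:00 to 19:00, the kind of simplified schedule
# that the gap summary below describes as too coarse for homes.
occupied = [7 <= hour < 19 for hour in range(24)]

da = daylight_autonomy(illuminance, occupied)
cda = continuous_daylight_autonomy(illuminance, occupied)
udi = useful_daylight_illuminance(illuminance, occupied)
print(f"DA = {da:.2f}, cDA = {cda:.2f}, UDI = {udi:.2f}")
```

A full analysis would repeat this for every point on the sensor grid and every hour of the year; sDA would then report the share of sensor points whose DA exceeds a chosen fraction of occupied hours.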

Use CaseData Gap Summary
Accelerating building energy models

While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings do not receive a similar focus, which is unusual given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics associated with direct sunlight access, rely on annual averages for seasons, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyles. Lastly, DPM optimization is based on operational metrics and on assumptions about illumination in a space and its effects on the resulting thermal comfort and operational consumption of traditional urban residential spaces. Vernacular architecture, which is specific to a local region and culture, may not share similar objectives, preferring more indoor-outdoor transitional spaces, earthen materials, and less focus on windows and incident natural sunlight.

S2S forecast data

Numerical weather prediction (NWP) model output from the subseasonal-to-seasonal (S2S) experiment: https://confluence.ecmwf.int/display/S2S/Models

Use Case | Data Gap Summary
Weather forecasting: Subseasonal-to-seasonal horizon

More data is needed to take full advantage of large ML models.

SMA Solar Technology PV System Performance database

PV Anlage-Reinhart System provides hourly photovoltaic power, energy production, CO2 emissions avoided, and system configuration information for publicly available PV installations worldwide. SMA, a leading German manufacturer of solar inverters, has compiled data from their international deployments across multiple countries including Germany, the US, Chile, Brazil, Mexico, Canada, Spain, Italy, France, China, Australia, Belgium, India, Poland, Japan, UK, South Africa, Türkiye, and the UAE. This dataset, which includes inverter specifications, module information, and sometimes battery data, supports microgrid studies and distributed energy resource forecasting.

Use Case | Data Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

The SMA PV monitoring system requires user profile creation and specific system access requests, and its documentation is primarily in German, creating potential language barriers. Data representation is geographically unbalanced, with stronger coverage in Germany, the Netherlands, and Australia despite the system's global presence. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.

SOLETE Hybrid Solar-Wind Generation dataset

SOLETE, developed by the Energy System Integration Lab (SYSLAB) at the Technical University of Denmark, provides 15 months of measurements at multiple resolutions (seconds to hours) from June 2018 to September 2019. The dataset includes timestamps, meteorological data (temperature, humidity, pressure, wind speed and direction), solar irradiance measurements (global horizontal and plane of array), and active power generated by an 11 kW Gaia wind turbine and a 10 kW PV inverter. This comprehensive dataset supports time-series forecasting for hybrid solar-wind distributed energy resource systems.
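When a dataset is offered at several resolutions, the coarser series can typically be reproduced from the finer one by windowed averaging. The sketch below shows that idea with plain Python; the function name `resample_mean`, the sample values, and the choice of a 60-second window are assumptions for illustration and do not reflect how the SOLETE files themselves were produced.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def resample_mean(samples, window_seconds):
    """Average (timestamp, value) pairs into fixed windows aligned to the Unix epoch."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        epoch = int(timestamp.timestamp())
        window_start = epoch - epoch % window_seconds
        buckets[window_start].append(value)
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical 1-second active power samples (kW) covering two minutes.
start = datetime(2018, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
samples = [(start + timedelta(seconds=i), float(i)) for i in range(120)]

one_minute = resample_mean(samples, 60)
for window_start, mean_kw in one_minute:
    print(window_start.isoformat(), mean_kw)
```

Using timezone-aware timestamps avoids the windows shifting with the local time zone of the machine running the code.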

Use Case | Data Gap Summary
Improving solar power forecasting: short-term (30 min-6 hours)

While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several sufficiency gaps limit its application. The dataset's 15-month temporal coverage doesn't capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analysis of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing; including them would make system dynamics modeling more realistic. Supplementing with external data sources or simulation could address these limitations.

SRRL TSI-880 (sky imagery)

The SRRL TSI-880 contains data from ground-based sky imagers that provide high temporal and spatial resolution (<1 km) information at single locations to support cloud detection and solar forecasting.

Use Case | Data Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and image resolution is limited to 352x288 24-bit JPEG images.

SWINySEG (sky imagery)

SWINySEG (Singapore whole sky Nychthemeron image SEGmentation database) contains 6,768 daytime and nighttime sky/cloud images with corresponding binary ground truth maps taken in Singapore over 12 months in 2016, with annotations by the Singapore Meteorological Services.

Use Case | Data Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.

Satellite imagery

This category encompasses satellite imagery of various spatial and spectral resolutions with global coverage captured at different time intervals. Open-access options include Sentinel-1/2, MODIS, VIIRS, and Landsat (resolution down to 5m), while commercial providers like Maxar offer higher resolutions (down to 30cm). Planet NICFI provides free high-resolution mosaics of the world’s tropics for non-commercial use.

Use Case | Data Gap Summary
Accelerating post-disaster damage assessments

Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.

Earth observation for climate-related applications

Satellite images are intensively used for Earth system monitoring. One of the two biggest challenges of using satellite images is the sheer volume of data which makes downloading, transferring, and processing data all difficult. The other one is the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.

Enhancing digital reconstructions of the environment

Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.

Improving solar power forecasting: medium-term (6-24 hours)

Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.

Improving terrestrial wildlife detection and species classification

Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.

Satellite imagery – GEDI LiDAR

The Global Ecosystem Dynamics Investigation (GEDI) is a NASA/University of Maryland mission that uses LiDAR to create detailed 3D maps of forest canopy height and structure. By measuring forests in 3D, GEDI data enables accurate estimation of forest biomass and carbon storage across global scales.

Use Case | Data Gap Summary
Improving estimations of forest carbon stock

Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.

Satellite imagery – Hyperspectral

This dataset consists of hyperspectral satellite imagery from platforms such as PRISMA and EnMAP, which capture hundreds of narrow spectral bands across the electromagnetic spectrum, providing detailed spectral information for detecting methane plumes with greater sensitivity than multispectral systems.

Use Case | Data Gap Summary
Scaling methane emission detection

Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.

Satellite imagery – Multi-Radar/Multi-Sensor System

The Multi-Radar Multi-Sensor (MRMS) system combines data from multiple radars, satellites, surface observations, lightning reports, rain gauges, and numerical weather prediction models to produce decision-support products every two minutes. It provides detailed depictions of high-impact weather events such as heavy rain, snow, hail, and tornadoes, enabling forecasters to issue more accurate and earlier warnings. See https://www.nssl.noaa.gov/projects/mrms/

Use Case | Data Gap Summary
Accelerating and improving weather forecasting: Near-term (< 24 hours)

Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.

Satellite imagery – Multispectral

This dataset contains images captured by spectrometer-equipped satellites that record data at specific wavelengths to detect the spectral signatures associated with methane. Notable missions include the Sentinel-5P TROPOMI instrument and the upcoming MethaneSAT, which provide global coverage of methane concentrations in the atmosphere.

Use Case | Data Gap Summary
Scaling methane emission detection

Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.

Satellite imagery – PALSAR radar images

PALSAR (Phased Array type L-band Synthetic Aperture Radar) provides radar imagery that can capture the 3D structure of forests by penetrating cloud cover and forest canopies. This technology enables consistent monitoring regardless of weather conditions or time of day, making it valuable for continuous forest carbon stock estimation.

Use Case | Data Gap Summary
Improving estimations of forest carbon stock

Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.

Simulated variables from process-based models of soil organic carbon dynamics

This dataset contains soil data generated by physics-based or process-based soil models that simulate soil organic carbon dynamics based on environmental and management inputs. These simulations provide alternatives to direct measurements where field data collection is prohibitively expensive or impractical.

Use Case | Data Gap Summary
Modeling effects of soil processes on soil organic carbon

Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.

Smart inverter devices database

The California Energy Commission keeps a list of smart inverters that meet strict standards for safety and communication. These inverters must pass extra tests to show they can handle things like voltage, frequency, timing, and how they connect or disconnect from the grid, along with other technical functions to keep the power system safe and stable.

The lists include: CEC Grid Support Solar Inverters, CEC Grid Support Battery Inverters, CEC Grid Support Solar/Battery Inverters, and CEC Inverters with Power Control Systems functionality.

Additional vendors can also be contacted for smart inverter information:

SMA-America Solar Inverters.

Use Case | Data Gap Summary
Optimizing smart inverter management for distributed energy resources

Smart inverter operational data is not publicly available, and obtaining it requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant smart inverter models and their manufacturers, who can then be contacted for research partnerships. In terms of coverage area, while California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, regions outside of the United States may locate similar manufacturers through partnerships and collaborations.

Soil Survey Geographic Database (SSURGO)

The Soil Survey Geographic Database (SSURGO) contains soil organic carbon data collected through field observations and laboratory analysis of soil samples. It provides comprehensive soil information for the United States, including physical and chemical soil properties.

Use Case | Data Gap Summary
Modeling effects of soil processes on soil organic carbon

In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).

Soil measurements – NorthWyke Farms platform

The NorthWyke Farms platform data is a collection of soil measurements from the UK’s North Wyke Farm Platform, providing quarterly soil organic carbon values along with other environmental parameters. The dataset covers experimental farm plots under different management practices and is continuously updated with new measurements.

Use Case | Data Gap Summary
Modeling effects of soil processes on soil organic carbon

The most common and significant challenges for use cases involving soil organic carbon are insufficient data and a lack of high-granularity data.

Solar panel PV system dataset

The Solar Panel PV System Dataset (https://www.kaggle.com/datasets/arnavsharmaas/solar-panel-pv-system-dataset/data) is a tabular dataset from the National Renewable Energy Laboratory that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery-rated capacity. 

The solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes. 

Use Case | Data Gap Summary
Scaling solar photovoltaics site assessments

The solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Since the data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap: some of the data is historical and may not reflect current pricing and costs of PV systems.

Solcast (global solar forecasting and historical solar irradiance data)

Solcast is a global solar forecasting and historical solar irradiance data provider that combines satellite imagery from Himawari 8, GOES-16, and GOES-17 with numerical weather prediction models to deliver solar irradiance data products at a 10-15 minute scale.

Use Case | Data Gap Summary
Improving solar power forecasting: nowcasting/very-short-term (0-30min)

Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, insufficient for very-short-term forecasting.

Sub-metered appliance-level data

This collection includes multiple international datasets of sub-metered building electricity consumption, primarily from residential buildings across North America, Europe, and Asia collected between 2011-2020. These datasets provide granular appliance-level energy consumption data at varying sampling frequencies (1Hz-15kHz) along with aggregate building-level measurements. Some datasets include additional measurements such as occupancy information, environmental conditions, and utility billing data. The datasets vary in coverage from single households to hundreds of homes, with monitoring periods ranging from two months to several years.

- Almanac of Minutely Power dataset (AMPds2): A single building electricity, water, and natural gas consumption dataset from a home in Burnaby, British Columbia, Canada from 2012-2014 which includes environment and utility billing data as well. 

- Commercial building energy dataset (COMBED): A dataset of 6 commercial buildings on the Indraprastha Institute of Information Technology (IIIT-Delhi) campus, covering August 2013 to the present. It contains total power consumption and sub-metered data for elevators, air handling units (AHUs), uninterruptible power supplies (UPS), and central campus heating, ventilation, and air conditioning (HVAC) pumps and chillers at a 30-second cadence.

- DEDDIAG: A dataset comprised of aggregate and disaggregated power consumption from 15 southern German homes monitored at 1Hz containing 50 appliances including dishwashers, washing machines, refrigerators and dryers over a span of 3.5 years (2016-2020). Aggregated data includes three-phase measurements. This dataset also contains event start and stop timestamps for 14 appliances.

- Dutch Residential Energy Dataset (DRED): Consists of data collected from a single household in the Netherlands, containing appliance-level and total energy consumption over two months. The monitored appliances were a refrigerator, washing machine, central heating, microwave, oven, cooker, blender, toaster, television, fan, living room outlets, and a laptop, recorded with a sampling frequency of 1 Hz. DRED additionally has data on human occupancy based on WiFi and Bluetooth signals received from occupant smartphones and wearable devices, which allows the consumer to be located without setting up the home with more intrusive monitoring devices. DRED can be accessed by request.

- Electricity Consumption and Occupation (ECO): A dataset collected from June 2012-January 2013 covering 6 homes in Switzerland, where 6-10 smart plugs were deployed in each household. Aggregate consumption at the building level was measured in three phases to capture voltage, current, and phase shifts. Occupancy data was tracked by residents manually and via a passive infrared entry door sensor.

- GREEND: A dataset of 9 households in Austria and Italy for one year covering December 2013-April 2014. Data includes aggregated and sub-metered appliance-level data, which varies depending on the appliance inventory of the household, covering active power measurements taken at a frequency of 1 Hz. GREEND can be requested by form.

- HIPE: A dataset from October 2017-December 2017 recording smart meter measurements from 10 machines and the main terminal of an electronics production site operated by the Institute of Data Processing and Electronics (IPE) at Karlsruhe Institute of Technology (KIT) in Germany at a cadence of 5 seconds with measurements with respect to active power, reactive power, voltage, frequency, and distortion.

- Indian data for Ambient Water and Electricity Sensing (iAWE): Total, appliance-level, and circuit panel-level consumption in a single-family home in New Delhi, India, was collected in the summer of 2013 over the course of 73 days. Additional quantities, such as water usage from an overhead tank and network strength based on packet loss, were also jointly measured.

- IDEAL: A joint electricity, gas, temperature, humidity, and light dataset for 255 homes in the UK from August 2016 to June 2018. Aggregate and sub-metered consumption was measured at 1 second intervals, while temperature, humidity and light were measured at 12 second intervals. Household occupancy was measured through initial surveys with respect to socio-demographic data and self-reported updates to the data in the event that there was a change in occupancy.

- Reference Energy Disaggregation Dataset (REDD): Contains 119 days' worth of aggregate consumption taken in 2011 from 10 residential buildings located in the greater Boston area. The data includes meter-level phases of power and voltage recorded at 15 kHz, as well as 24 sub-metered circuits labeled by appliance category and measured at a cadence of 0.5 Hz and 1 Hz for large and small plug-level appliances respectively.

- REFIT: A dataset containing aggregate and individual appliance monitor sub-meter data taken every 8 seconds from 20 UK households from September 2013 to September 2015. Of the 8 households, 6 had rooftop solar panels; however, 3 were rewired to remove the effect of generation.

- UMass Smart Home data set: This dataset comprises metered and sub-metered data from three homes in western Massachusetts taken over a period of three years. Measurements include average household load, circuit-level load, and plug load per second. Accompanying generation data from solar panels and wind turbines is available for one of the three homes. Environmental data on the outdoor weather and on indoor temperature and humidity is provided, as well as occupancy information from wall switch data, doors, and motion sensors. HVAC trigger events and corresponding temperature settings and operational status are also provided.

- UK Domestic Appliance-Level Electricity data set (UK-DALE): A dataset comprising measurements of aggregated as well as individual appliance-level consumption recorded every 6 seconds from 5 UK homes, collected by researchers at Imperial College. The continuous coverage varies per house, ranging from 39 to 786 days and spanning dates from 2012 to 2015. Data includes whole-house active power, apparent power, and RMS voltage. Appliance-level measurements were taken every 6 seconds using individual appliance monitors for up to 54 appliances per residence.

Use Case | Data Gap Summary
Enabling non-intrusive electricity load monitoring

For accurate NILM studies, benchmark datasets need to include not only consumption but also local power generation (e.g., from rooftop solar), as it can affect the overall aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are not included. The majority of buildings are single-family housing units, which limits the diversity of representation. Furthermore, most datasets are no longer maintained after the study closes.
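To make the effect of on-site generation concrete, the following minimal sketch treats the building meter as measuring consumption minus generation. The appliance names, wattages, and function names are hypothetical, and the model ignores losses and reactive power; it only illustrates why a disaggregation that ignores rooftop solar is left with a residual equal to the unaccounted generation.

```python
def net_meter_reading(appliance_watts, pv_watts=0.0):
    """Net power at the building meter: total consumption minus on-site generation."""
    return sum(appliance_watts.values()) - pv_watts


def unexplained_power(meter_watts, appliance_watts, pv_watts=0.0):
    """Residual left after accounting for known loads and any known generation."""
    return meter_watts + pv_watts - sum(appliance_watts.values())


# Hypothetical instantaneous loads (W) and rooftop solar output (W).
appliances = {"refrigerator": 150.0, "dishwasher": 1200.0, "television": 100.0}
pv_output = 900.0

meter = net_meter_reading(appliances, pv_output)

# Ignoring generation leaves a negative residual equal to the PV output.
residual_without_pv = unexplained_power(meter, appliances)
# Accounting for generation closes the balance.
residual_with_pv = unexplained_power(meter, appliances, pv_output)

print(meter, residual_without_pv, residual_with_pv)
```

The same bookkeeping applies to devices such as electric vehicles or stationary batteries, whose contribution changes sign depending on whether they are charging or discharging.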

The Public Utility Data Liberation (PUDL)

The Public Utility Data Liberation (PUDL) project, maintained by Catalyst Cooperative, integrates and standardizes energy sector data from US government agencies including EIA, FERC, EPA, and system operators into analysis-ready formats. This continuously updated database covers power generation, fuel consumption, emissions, and financial data from 2009 to present across the United States. 

Use Case | Data Gap Summary
Enhancing energy policy and market analysis

Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.

US large-scale solar photovoltaic database

The US Large-scale Solar Photovoltaic Database (USPVDB) contains polygon representations of large-scale photovoltaic installations, associated with facility-specific data attributes.

The data were mined from US Energy Information Administration (EIA) Form 860 and from the facility type designation by the US Environmental Protection Agency (EPA). The dataset also indicates whether each large-scale PV installation serves agrivoltaic purposes. Overall, 3,699 US ground-mounted facilities with a capacity of at least 1 MWdc are represented. The USPVDB data can be accessed through the United States Geological Survey (USGS) mapper browser application or downloaded as GIS data in the form of shapefiles or GeoJSON files. Tabular data and metadata are provided in CSV and XML format.
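For readers working with the GeoJSON download, a facility filter might look like the sketch below. The embedded sample is invented, and the property names `name`, `capacity_mwdc`, and `agrivoltaic` are placeholders chosen for illustration; the actual attribute names should be taken from the USPVDB metadata.

```python
import json

# Hypothetical excerpt. The property names are illustrative placeholders and
# are not taken from the USPVDB schema.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Site A", "capacity_mwdc": 5.2, "agrivoltaic": true},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[0, 0], [0, 1], [1, 1], [1, 0], [0, 0]]]}},
    {"type": "Feature",
     "properties": {"name": "Site B", "capacity_mwdc": 0.8, "agrivoltaic": false},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[2, 2], [2, 3], [3, 3], [3, 2], [2, 2]]]}},
    {"type": "Feature",
     "properties": {"name": "Site C", "capacity_mwdc": 120.0, "agrivoltaic": false},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[4, 4], [4, 5], [5, 5], [5, 4], [4, 4]]]}}
  ]
}
"""


def select_facilities(collection, min_capacity_mwdc=1.0, agrivoltaic_only=False):
    """Return names of facilities meeting the capacity and agrivoltaic criteria."""
    selected = []
    for feature in collection["features"]:
        properties = feature["properties"]
        if properties["capacity_mwdc"] < min_capacity_mwdc:
            continue
        if agrivoltaic_only and not properties["agrivoltaic"]:
            continue
        selected.append(properties["name"])
    return selected


collection = json.loads(SAMPLE_GEOJSON)
print(select_facilities(collection))
print(select_facilities(collection, agrivoltaic_only=True))
```

Only the standard library is used here; a GIS library would be the more natural choice once the polygon geometries themselves are needed.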

Use Case | Data Gap Summary
Scaling solar photovoltaics site assessments

This dataset covers only the US. Supplementing it with international satellite imagery of large-scale photovoltaic installations could expand its coverage area.

US school bus fleet dataset

The US school bus fleet dataset compiled by the World Resources Institute contains information on school district, model year, fuel type, manufacturer, seating capacity, and ownership mode for over 450,000 buses from 46 states and the District of Columbia, covering data collected from March to November 2022.

Use Case | Data Gap Summary
Optimizing electrified bus fleet in urban vehicle-to-grid systems

The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.
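One simple way to picture the standardization step is a mapping from each state's column labels to a shared schema. The sketch below is an assumption-laden illustration: the alias table, the canonical field names, and the two sample records are all hypothetical and are not drawn from the World Resources Institute dataset.

```python
# Hypothetical alias table mapping state-specific column labels to a shared schema.
ALIASES = {
    "fuel": "fuel_type",
    "fuel type": "fuel_type",
    "year": "model_year",
    "model year": "model_year",
    "seats": "seating_capacity",
    "capacity": "seating_capacity",
}


def normalize_record(record):
    """Rename keys to canonical names; unknown keys are lowercased and snake_cased."""
    normalized = {}
    for key, value in record.items():
        cleaned = key.strip().lower()
        canonical = ALIASES.get(cleaned, cleaned.replace(" ", "_"))
        normalized[canonical] = value
    return normalized


# Two invented records that report the same fields under different labels.
state_a = {"Fuel Type": "Diesel", "Model Year": 2015, "Seats": 72}
state_b = {"fuel": "Electric", "year": 2021, "capacity": 66, "School District": "Example USD"}

print(normalize_record(state_a))
print(normalize_record(state_b))
```

A real pipeline would also need to reconcile units and category values (for example, differing fuel type labels), which this sketch leaves out.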

WeatherBench 2

Benchmark for global, medium-range (1-14 day) data-driven weather forecasting: https://weatherbench2.readthedocs.io/en/latest/data-guide.html

Use Case | Data Gap Summary
Weather forecasting: Short-to-medium term (1-14 days)

WeatherBench 2 is based on ERA5, so it inherits ERA5's issues: the data has biases over regions where there are no observations.

subX

NWP model output from subseasonal forecast experiment https://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.

Use Case | Data Gap Summary
Weather forecasting: Subseasonal horizon

More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on this data.

xBD Dataset (pre- and post-disaster satellite imagery)

xBD is an annotated benchmark dataset containing pre- and post-disaster satellite imagery used for training and evaluating ML models in disaster damage assessment. The dataset is publicly available at https://paperswithcode.com/dataset/xbd

Use Case | Data Gap Summary
Accelerating post-disaster damage assessments

The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.
