Data Gaps (Beta)

Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.

In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.

This project is currently in its beta phase, with ongoing improvements to content and usability. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.

Analysis of grid reliability events

Because their output can fluctuate rapidly, renewables introduce variability into the grid. These fluctuations can trigger the safety monitoring systems that guard grid stability. Power grid control centers receive multiple streams of data from these systems (e.g., alarms, sensors, and field reports) that are semi-structured and arrive at high volume. For operators, rationalizing, reducing, and contextualizing these alarm triggers and their associated data to diagnose grid conditions can be overwhelming. ML can assist in interpreting these data to better understand the sequence of events leading up to an incident, as well as to detect and identify the causes of system disturbances that affect grid reliability.

Dataset | Data Gap Summary
EPRI10: Transmission control center alarm and operational data set

Access to EPRI10 grid alarm data is currently restricted to within EPRI. Usability gaps result from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to determine whether the event was a fault or a non-fault.
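The kind of preprocessing described above can be illustrated with a minimal sketch that collapses repeated alarms. The EPRI10 schema is not public, so the field names (`code_id`, `location`, `priority`, `timestamp`) and the 60-second window are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def deduplicate_alarms(alarms, window_seconds=60):
    """Drop repeats of the same (code_id, location) that arrive within a time window.

    Each alarm is a dict with hypothetical keys 'code_id', 'location',
    'priority', and 'timestamp' (a datetime). An alarm is kept only if the
    previous alarm with the same key is older than the window.
    """
    window = timedelta(seconds=window_seconds)
    last_seen = {}  # (code_id, location) -> timestamp of the most recent alarm
    kept = []
    for alarm in sorted(alarms, key=lambda a: a["timestamp"]):
        key = (alarm["code_id"], alarm["location"])
        previous = last_seen.get(key)
        if previous is None or alarm["timestamp"] - previous > window:
            kept.append(alarm)
        # Update on every alarm, so a continuously chattering alarm stays suppressed.
        last_seen[key] = alarm["timestamp"]
    return kept
```

A real pipeline would likely also reconcile codes that differ by sensor, asset, or line, which this sketch does not attempt.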

Assessing forest restoration outcomes

Efforts are being made to restore ecosystems like forests and mangroves. ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.

Dataset | Data Gap Summary
Bioacoustic recordings

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Camera trap images

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Drone images for biodiversity

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Assessment of climate impacts on public health

Climate change has major implications for public health. ML can help analyze the relationships between climate variables and health outcomes to assess how changes in climate conditions affect public health.

Dataset | Data Gap Summary
Health data

The biggest issue for health data is its limited and restricted access.

Historical climate observations

Processing climate data and integrating it with health data are major challenges.

Automatic individual re-identification for wildlife

Individual re-identification in wildlife is the process of recognizing and confirming the identity of an individual animal during subsequent encounters. It is crucial for identifying and monitoring endangered species to better understand their needs and threats, and to aid conservation efforts. Computer-vision-based ML techniques are widely used for automatic individual identification.

Dataset | Data Gap Summary
Camera trap images

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML to biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Drone images for biodiversity

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML to biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

eDNA

One data gap is the incompleteness of barcoding reference databases.

Bias-correction of climate projections

Climate projections provide essential information about future climate conditions, guiding efforts in mitigation and adaptation, such as disaster risk assessments and power grid optimization. ML can enhance the accuracy of these projections by bias-correcting the output of physics-based climate models (e.g., CMIP6). It does so by learning the relationship between historical climate simulations (e.g., CMIP6 data) and observed ground truth data (such as ERA5 or weather station observations).
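As a minimal sketch of what "learning the relationship" between simulations and observations can mean, the example below fits a linear correction by least squares over a historical period and applies it to new model output. This is the simplest possible stand-in for an ML model, not a method taken from any particular study; the function names and the sample values are hypothetical.

```python
def fit_linear_correction(simulated, observed):
    """Least-squares fit of: observed ~ slope * simulated + intercept."""
    n = len(simulated)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_s = sum(simulated) / n
    mean_o = sum(observed) / n
    covariance = sum((s - mean_s) * (o - mean_o) for s, o in zip(simulated, observed))
    variance = sum((s - mean_s) ** 2 for s in simulated)
    if variance == 0:
        raise ValueError("simulated series is constant; slope is undefined")
    slope = covariance / variance
    intercept = mean_o - slope * mean_s
    return slope, intercept


def apply_correction(values, slope, intercept):
    """Apply the fitted correction to new (e.g., future) simulated values."""
    return [slope * v + intercept for v in values]


# Hypothetical historical period: the model runs too warm and too variable.
historical_sim = [10.0, 12.0, 14.0, 16.0]
historical_obs = [8.0, 9.0, 10.0, 11.0]
slope, intercept = fit_linear_correction(historical_sim, historical_obs)
corrected_future = apply_correction([18.0, 20.0], slope, intercept)
```

In practice, nonlinear models or quantile-based approaches are often used in place of a single linear fit, and the correction is assumed to remain valid under future conditions, which is itself a source of uncertainty.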

Dataset | Data Gap Summary
CMIP6

Large uncertainties in future climate projections are a major problem with CMIP6. In addition, the large volume of data and the lack of uniform structure (such as inconsistent variable names, data formats, and resolutions across different CMIP6 models) make it challenging to use data from multiple models effectively.

ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

Weather station data in general

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
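One common way to put scattered station values onto a regular grid is inverse-distance weighting (IDW). The sketch below is a minimal pure-Python version; it treats latitude and longitude as planar coordinates, which is an assumption that only holds roughly over small regions, and the station tuple format is hypothetical.

```python
import math


def idw_grid(stations, grid_lats, grid_lons, power=2.0):
    """Interpolate station values onto a regular lat/lon grid with IDW.

    stations: list of (lat, lon, value) tuples (hypothetical format).
    Returns a nested list where grid[i][j] is the estimate at
    (grid_lats[i], grid_lons[j]).
    """
    if not stations:
        raise ValueError("at least one station is required")
    grid = []
    for lat in grid_lats:
        row = []
        for lon in grid_lons:
            weighted_sum = 0.0
            weight_total = 0.0
            exact_value = None
            for s_lat, s_lon, value in stations:
                distance = math.hypot(lat - s_lat, lon - s_lon)
                if distance == 0.0:
                    # Grid point coincides with a station: use its value directly.
                    exact_value = value
                    break
                weight = 1.0 / distance ** power
                weighted_sum += weight * value
                weight_total += weight
            if exact_value is not None:
                row.append(exact_value)
            else:
                row.append(weighted_sum / weight_total)
        grid.append(row)
    return grid
```

For example, with stations at (0, 0) and (0, 2) reporting 10.0 and 20.0, the grid point halfway between them receives 15.0. Kriging or learned interpolators are alternatives when station density is uneven.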

Bias-correction of weather forecasts

ML can be used to improve the fidelity of high-impact weather forecasts by post-processing the outputs of physics-based numerical forecast models and learning to correct their systematic biases.

Dataset | Data Gap Summary
ENS

As with HRES, the biggest challenge with ENS is that only a portion of it is available to the public for free.

HRES

The biggest challenge with using HRES data is that only a portion of it is available to the public for free.

Weather station data in general

Data is not regularly gridded and needs to be preprocessed before being used in an ML model.

Data-driven generation of climate simulations

Generating climate simulations by running physics-based climate models is time-consuming. ML can be used to generate climate simulations for different greenhouse gas emissions scenarios more quickly. Specifically, ML can be used to learn a surrogate model that approximates the computationally intensive simulations produced by Earth system models.

Dataset | Data Gap Summary
ClimateBench v1.0

The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.

CMIP6

The large data volume and lack of uniform structure (no consistent variable names, data structure, or data resolution across all models) make it difficult to use data from more than one CMIP6 model.
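A first step toward combining output from several models is to map differing variable names onto one canonical set. The sketch below shows the idea with plain dictionaries standing in for loaded datasets. The alias table is illustrative only and is not a list of names actually used by any CMIP6 model; `tas` and `pr` are used here as the canonical names for near-surface air temperature and precipitation.

```python
# Illustrative alias table: canonical name -> names that should map to it.
# The aliases are hypothetical examples, not a CMIP6 inventory.
ALIASES = {
    "tas": {"tas", "t2m", "air_temperature_2m"},
    "pr": {"pr", "precip", "precipitation"},
}


def harmonize_variables(record, aliases=ALIASES):
    """Return a copy of `record` with known variable names made canonical.

    `record` maps variable names to data. Names are matched
    case-insensitively; unknown variables are kept under their original name.
    """
    lookup = {
        alias: canonical
        for canonical, names in aliases.items()
        for alias in names
    }
    harmonized = {}
    for name, values in record.items():
        canonical = lookup.get(name.lower(), name)
        if canonical in harmonized:
            raise ValueError(f"two variables map to {canonical!r}")
        harmonized[canonical] = values
    return harmonized
```

Harmonizing names does not address differences in units, calendars, or grid resolution, which need separate handling (for example, regridding to a common resolution).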

Detection of climate-induced ecosystem changes

Climate change is inducing significant changes in ecosystems. ML can be used to assess the impact of climate change on biodiversity and identify critical areas for conservation.

Dataset | Data Gap Summary
Bioacoustic recordings

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Camera trap images

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Ground survey of land use and land management

Data access is restricted due to institutional barriers and other restrictions.

Historical climate observations

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

Development of hybrid-climate models

Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.

Dataset | Data Gap Summary
ClimSim

An ML-ready benchmark dataset designed for hybrid ML-physics research, e.g. emulation of subgrid clouds and convection processes.

DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains)

Intercomparison of global storm-resolving (5 km or finer) model simulations, used as the target of the emulator. Data can be found here.

ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

Large-eddy simulations

Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.

Regularly gridded high-resolution atmospheric observations

An enhanced version of ERA5 with higher resolution and fidelity is needed. 

Digital reconstruction of the environment

Modeling digital representations of environmental conditions and habitats using remote sensing data, such as satellite images, is crucial for understanding how environmental factors impact animal behavior and conservation efforts. This approach provides valuable insights into habitat conditions and changes, which are essential for effective wildlife conservation and management. ML can enhance this process by efficiently processing large volumes of data from various sources, leading to more detailed and accurate environmental reconstructions.

Dataset | Data Gap Summary
Drone images for biodiversity

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML to biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

eDNA

One data gap is the incompleteness of barcoding reference databases.

Satellite Images

Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g., bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed, but most are not freely available to the public.

Disaster risk assessment

As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. ML can be used within these efforts to analyze satellite imagery and geographic data in order to pinpoint vulnerable areas and produce comprehensive risk maps.

Dataset | Data Gap Summary
Building footprint

More information, such as age of the building, should be included in the dataset.

Exposure data

Accessibility and reliability are both major issues.

Financial loss datasets related to the impacts of disasters

The financial loss data is usually proprietary and not open to the public.

Hazard data

The resolution of current hazard data is not sufficient for effective physical risk assessment.

Open Street Map

The dataset lacks metadata on when infrastructure (e.g., a building) was built, even though this information is needed to determine building age, which ultimately characterizes exposure to hazards.

Socioeconomic data

The availability, usability, and reliability of socioeconomic data are all problematic. In general, there is a notable scarcity of data from the Global South. For the Global North, data at more granular scales are usually missing. When data do exist, they lack consistency across sources.

Surface elevation data

Very high-resolution reference data, such as digital elevation models (DEMs), are currently not freely available to the public.

Distribution-side hosting capacity estimation

Historically, the power grid has been designed for unidirectional flow from carbon-based generating sources to consumers. However, in the effort to lower greenhouse gas emissions, the integration of renewable generation has become increasingly important in all parts of the grid (e.g., transmission and distribution), from large-scale generation farms to consumer-level rooftop solar and community wind turbine installations. This transition requires restructuring the grid from a unidirectional to a bidirectional energy network, which stresses pre-existing systems, especially at the low-voltage distribution level. Because renewable generation is intermittent, its integration at the low-voltage consumer level depends on the hosting capacity of the nearest substation feeder circuit. The hosting capacity is the amount of generation from distributed energy resources (DERs) that a circuit can safely accommodate without setting off safety equipment. Safety equipment can be triggered when generation exceeds consumption, leading to overvoltage conditions, or when sudden peaks in demand draw high current, leading to voltage sags. Faults may also lead to voltage sags. Operationally, distribution-level substation feeders must withstand these conditions to ensure power quality. Traditional methods of assessing the hosting capacity of low-voltage distribution networks rely on power flow analysis simulations, which can be computationally expensive and difficult to perform under real-time operating conditions for large distribution circuits. For example, to analyze a particular feeder circuit, scenarios must be built by varying loads, DER generation, environmental conditions, power equipment availability, and human activity. Violations must then be identified with respect to voltage limits, thermal loads, and protection equipment to estimate hosting capacity.
Machine learning models can serve as surrogates for traditional models by capturing the spatio-temporal patterns in multiple streams of data for each node in the distribution network, enabling real-time estimation. Additionally, reinforcement learning can enable accelerated scenario building and online evaluation of control strategies. One such strategy, for example, may use inverter technology to modulate generation to match the larger power system’s needs and protect it from faults and overloads.
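The scenario-based procedure described above can be sketched as a simple violation check over precomputed power flow results. The example assumes each scenario is a pair of DER injection level (kW) and resulting per-unit bus voltages, and uses 0.95 to 1.05 per unit as the acceptable voltage band; both the data format and the limits are assumptions for illustration. Thermal and protection limits are omitted for brevity.

```python
def estimate_hosting_capacity(scenarios, v_min=0.95, v_max=1.05):
    """Return the largest DER level (kW) reachable without a voltage violation.

    scenarios: list of (der_kw, bus_voltages_pu) pairs, e.g. taken from
    power flow runs (hypothetical format). Scenarios are scanned in order of
    increasing DER level and the scan stops at the first violation, so the
    result is the highest level for which it and all lower levels are safe.
    Returns 0.0 if even the smallest level violates the limits.
    """
    capacity = 0.0
    for der_kw, voltages in sorted(scenarios, key=lambda s: s[0]):
        if any(v < v_min or v > v_max for v in voltages):
            break
        capacity = der_kw
    return capacity


# Hypothetical sweep: voltages rise as DER injection increases.
sweep = [
    (100, [1.00, 1.01]),
    (200, [1.02, 1.04]),
    (300, [1.03, 1.06]),  # overvoltage at the second bus
]
capacity_kw = estimate_hosting_capacity(sweep)
```

An ML surrogate would replace the expensive step that produces the voltages for each scenario, while a check of this kind would still be applied to its predictions.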

Dataset | Data Gap Summary
Distribution system simulators

OpenDSS and GridLAB-D are free to use as alternatives when real distribution substation feeder circuit data is unavailable. However, site-specific or scenario studies may need data from other sources to verify simulation results. Actual hosting capacity may differ from simulation results due to differences in load, environmental conditions, and the level of DER penetration.

Early detection of fire

Climate change is expected to increase both the frequency and intensity of wildfires, as well as lengthen the fire season due to rising temperatures and shifting precipitation patterns. ML can play a crucial role in wildfire detection and monitoring by synthesizing data from various sources in order to provide more timely and precise information. For instance, ML algorithms can analyze satellite imagery from different regions to detect early signs of fires and track their progression. Additionally, ML can enhance automatic fire detection systems, improving their accuracy and responsiveness.

Dataset | Data Gap Summary
Drone images

Thermal images captured by drones are highly valuable, but high-resolution sensors are expensive.

Energy data fusion for policy and market analysis in energy systems

Data collected from public utilities, energy companies, and government agencies by energy regulatory committees can provide detailed information on generation, fuel consumption, emissions, and finances. Such information can better inform domestic policies that enforce and promote emissions reductions through carbon pricing and renewable incentives, grid modernization and resilience planning for severe weather events, and equitable energy transitions. With continuously updated, well-curated, analysis-ready energy system data, climate advocates will have better quantitative tools to influence political and administrative processes, thereby encouraging the energy transition.

Dataset | Data Gap Summary
The Public Utility Data Liberation (PUDL)

Public datasets from government agencies such as the EIA, EPA, FERC, and PHMSA are not ready for use in analysis-ready data products. The data is often tabular and distributed as zip files in different file formats that may not share the common identifiers or schemas needed to join them readily. Collecting, collating, and merging these datasets can provide greater context on the state of the energy system and the effectiveness of policy measures. Data can also be missing because of reporting gaps and the redaction of per-plant pricing information. While PUDL seeks to overcome these gaps by merging datasets through entity matching and interpolation, maintenance challenges remain, as usability can be sensitive to format changes, updates, and new initiatives in the original source data. The data gaps experienced in maintaining this dataset will be highlighted with respect to the source data that PUDL mines.
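To illustrate why the lack of common identifiers makes joins hard, the sketch below matches records from two tables on a normalized plant name. It is a deliberately simple illustration of entity matching and does not reproduce PUDL's actual approach; the column names and sample records are hypothetical.

```python
import re


def normalize_name(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def join_on_name(left, right, key="plant_name"):
    """Inner-join two lists of dicts on the normalized value of `key`.

    On conflicting columns, values from `left` take precedence.
    """
    index = {normalize_name(row[key]): row for row in right}
    merged = []
    for row in left:
        match = index.get(normalize_name(row[key]))
        if match is not None:
            merged.append({**match, **row})
    return merged


# Hypothetical records from two agencies that spell the same plant differently.
generation = [{"plant_name": "Example Station (Unit 1)", "net_generation_mwh": 100}]
emissions = [{"plant_name": "EXAMPLE STATION  unit 1", "co2_tons": 5}]
combined = join_on_name(generation, emissions)
```

Exact matching on normalized names misses abbreviations and renamed facilities, which is one reason entity matching across agencies requires ongoing maintenance.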

Energy-efficient new building design

The built environment contributes significantly to global carbon dioxide emissions, both through the embodied carbon associated with building materials and through operational emissions associated with thermal comfort, ventilation, and lighting. Detailed analysis is often applied too late in the building design process, leaving significant energy-saving potential untapped. Integrating building performance simulation (BPS) in the initial phase can be critical to sustainable and energy-efficient design, influencing subsequent construction as well as the overall building lifecycle. However, traditional BPS relies on complex physics models of fluid dynamics, thermodynamics, sunlight, and acoustics, which increases the computational complexity and processing time of evaluating a candidate design. Machine learning models can significantly enhance evaluation by emulating BPS based on synthetic and real-world data, enabling rapid prototyping and optimization of building topology along multiple comfort, consumption, and environmental objectives. Machine learning can also be introduced at the prototyping phase in response to evaluation, with layouts refined by generative and genetic algorithms.

Dataset | Data Gap Summary
Benchmark datasets of building environmental conditions and occupancy

Datasets featured can vary in types of data gaps depending on the content, coverage area, location, building type, building spatial plan, quantity measured, ambient environment, or power consumption or metered data availability.

Computational fluid dynamics simulation

Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce the future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties that may not be present in traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data generated for different wind directions can become challenging to manage. Future data collection for verifying simulation output can benefit surrogate or proxy approaches to solving the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern building approaches, so the passive building techniques of indigenous communities, known as vernacular architecture, are left out of design consideration.

Residential daylight performance metric (DPM) data

Daylight performance metrics (DPMs) have been developed by building researchers and architects based on daylight access simulation output to quantify the illumination of indoor spaces by natural light. While DPM evaluation is an important step in the planning of commercial buildings, residential buildings do not receive similar focus, which is unusual given that most new building construction occurs within the residential sector. The data gaps concern residential DPMs, which lack metrics for direct sunlight access, rely on annual averages for seasons, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, the illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyles. Lastly, DPM optimization is based on operational metrics and on assumptions about illumination in a space and its effects on the thermal comfort and operational consumption of traditional urban residential spaces. Vernacular architecture, which is specific to a local region and culture, may not share these objectives, preferring more indoor-outdoor transitional spaces, earthen materials, and less focus on windows and incident natural sunlight.

Estimation of forest carbon stock

Forests are one of the Earth’s major carbon sinks, absorbing carbon dioxide (CO₂) from the atmosphere through photosynthesis and storing it in biomass (trees and vegetation) and soil. Accurate estimates of carbon stock help quantify the amount of CO₂ forests are sequestering, which is essential for climate change mitigation efforts. ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery. This approach can significantly improve upon traditional, labor-intensive forest inventory surveys, making carbon stock assessments more efficient and scalable.

Dataset | Data Gap Summary
GEDI lidar

There is uncertainty in the data.

Ground-survey based forest inventory data

The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.

Estimation of methane emissions from rice paddies

Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially of how they vary with different agricultural practices, is crucial for addressing climate change. ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.

Dataset | Data Gap Summary
Direct measurement of methane emission of rice paddies

There is a lack of direct observation of methane emissions from rice paddies.

Extreme heat prediction

Extreme heat is becoming more common in a changing climate, but predicting and accurately modeling extreme heat is difficult. ML can help by improving extreme heat prediction.

Dataset | Data Gap Summary
NEX-GDDP-CMIP6

The major challenge is handling the size of the data.

Fault detection in low voltage distribution grids

The low voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources (DERs) and dynamic loads (such as electric vehicles), low voltage distribution systems become susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them. Traditional fault detection and localization use impedance-based or traveling-wave methods. Both methods assess deviations between two points with respect to line-specific thresholds, and work well when faults have low fault resistance and networks have a limited number of branches. As low voltage distribution network topologies grow increasingly complex, line parameters can vary, making it increasingly difficult for traditional methods to accurately diagnose and isolate faults. Machine learning methods can overcome these limitations, as they can be trained on large amounts of data, extract relevant features, and recognize patterns to automate fault diagnosis independently of specific line thresholds and topologies. If integrated into advanced monitoring systems, fault detection and localization can accelerate adaptive protection and network reconfiguration efforts to ensure reliability and stability.

Dataset | Data Gap Summary
Micro-synchrophasors (µPMU data)

For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.

Identification and mapping of climate policy

Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions. ML can be employed to identify climate-related policies and categorize them according to different focus areas.

Dataset | Data Gap Summary
Academic literature databases

Data is not available in machine-readable formats and is limited to English-language literature from major journals.

Climate-related laws and regulations

Laws and regulations for climate action are published in various formats by national and subnational governments, and most are not labeled as “climate policy”. A number of initiatives take on the challenge of selecting, aggregating, and structuring these laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and continual updating, and the resulting datasets are incomplete.

Improving battery management systems

With the shift from carbon-based to renewable generation, energy storage becomes crucial to counter the intermittent nature of renewable energy availability. Battery efficiency and lifetime also have a direct impact on the effectiveness of transportation electrification. Machine learning can be a valuable tool for improving operational efficiency by estimating state of charge (SoC), state of health (SoH), and remaining useful life (RUL). Techniques such as reinforcement learning can optimize charge/discharge strategies for battery management systems (BMS). ML can also process large real-world datasets that may contain battery health parameters, charge/discharge measurements, and load demand. If the load is a vehicle, the type of vehicle and driving behavior may also be available.
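For context on what an ML-based SoC estimator is typically compared against, the sketch below implements Coulomb counting, a standard non-ML baseline that integrates measured current over time. The sign convention (positive current means discharge) and the function name are assumptions of this example.

```python
def coulomb_count_soc(initial_soc, capacity_ah, currents_a, dt_seconds):
    """Track state of charge by integrating current over fixed time steps.

    initial_soc: starting SoC as a fraction in [0, 1].
    capacity_ah: battery capacity in ampere-hours.
    currents_a:  measured current per step in amperes
                 (positive = discharge, an assumed convention).
    dt_seconds:  length of each step in seconds.
    Returns the SoC after each step, clipped to [0, 1].
    """
    soc = initial_soc
    history = []
    for current in currents_a:
        # Charge moved in this step, in ampere-hours, as a fraction of capacity.
        soc -= (current * dt_seconds / 3600.0) / capacity_ah
        soc = min(1.0, max(0.0, soc))
        history.append(soc)
    return history
```

For example, a 10 Ah battery starting full and discharged at 5 A for two one-hour steps goes to 0.5 and then 0.0. Coulomb counting drifts with sensor bias and does not capture capacity fade, which is part of the motivation for data-driven SoC and SoH estimators.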

Improving power grid optimization

Traditionally, optimal power flow (OPF) seeks to minimize the cost of power generation needed to meet a given load (economic dispatch), such that line limits (thermal, voltage, or stability) and generation limits are met while power balance is maintained at each bus in the transmission system. Traditional techniques formulate OPF as a non-linear, constrained, non-convex optimization problem, which can be solved for AC and DC systems separately. Traditional OPF solvers use a linear program to determine the generation needed to minimize cost and satisfy load demand while adhering to the physical constraints of the system. However, as the grid integrates more renewable generation sources, there is a trend towards hybrid AC/DC power grids that address the limitations of traditional AC transmission systems and provide access to remote renewables. Such hybrid systems present new challenges to traditional OPF by enabling bidirectional power flow, which requires adapting the OPF objective function and constraints to account for new losses, increased costs, and congestion. ML can be used to approximate OPF problems, allowing them to be solved at greater speed, scale, and fidelity.
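The economic dispatch objective at the core of OPF can be illustrated with a merit-order sketch: generators are loaded from cheapest to most expensive until the load is met. This is a strong simplification that assumes constant marginal costs and ignores line limits, losses, and per-bus power balance, so it is not an OPF solver. The generator names and numbers are hypothetical.

```python
def merit_order_dispatch(generators, load_mw):
    """Meet `load_mw` at minimum cost, ignoring all network constraints.

    generators: list of (name, capacity_mw, cost_per_mwh) tuples.
    Returns (dispatch, total_cost), where dispatch maps each generator
    name to its output in MW. Raises ValueError if capacity is insufficient.
    """
    remaining = load_mw
    dispatch = {}
    total_cost = 0.0
    # Load the cheapest units first (merit order).
    for name, capacity_mw, cost_per_mwh in sorted(generators, key=lambda g: g[2]):
        output = min(capacity_mw, remaining)
        dispatch[name] = output
        total_cost += output * cost_per_mwh
        remaining -= output
    if remaining > 1e-9:
        raise ValueError("load exceeds total generation capacity")
    return dispatch, total_cost


# Hypothetical three-generator system serving a 120 MW load.
fleet = [("wind", 50, 0.0), ("gas", 100, 40.0), ("coal", 100, 30.0)]
dispatch, cost = merit_order_dispatch(fleet, 120)
```

Here wind supplies 50 MW, coal 70 MW, and gas 0 MW, for a total cost of 2100. Full OPF adds network constraints on top of this objective, which is what makes the problem non-convex in the AC case and motivates ML approximations.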

Dataset | Data Gap Summary
Grid2Op and PandaPower

Grid2Op is a reinforcement learning framework that builds an environment from a grid topology, selected grid observations, a selected reward function, and a set of actions for an agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps of 5 minutes cannot capture complex transients and can limit the effectiveness of certain actions within the action space relative to others. Furthermore, customizing Grid2Op can be challenging: the platform does not allow for single- to multi-agent conversion, and its game-over rules make it unsuitable for cascading failure scenarios.

Optimal power flow simulators

Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.

Power Grid Lib: Optimal power flow benchmark library

While the network datasets are open source, maintaining the repository requires continuous curation and the collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can help in developing more realistic data, which may be hard to obtain without such a cooperative effort.

Marine wildlife detection and species classification

Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These processes involve identifying and categorizing different marine species. ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.

Dataset | Data Gap Summary
Copernicus Marine Data Store

Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.

FathomNet

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Ocean biodiversity data

As with terrestrial biodiversity data, the lack of well-annotated data is the biggest bottleneck. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Data collection efforts should also be strategic, targeting places where biodiversity is high but currently available data is sparse.

Sofar spotter archive

Data access is restricted.

Modeling effects of soil processes on soil organic carbon

Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies. ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses.

Dataset | Data Gap Summary
Emission dataset compiled from FAO statistics

Data is extrapolated from national-level statistics, so its accuracy at the local level is unknown.

Simulated variables from process-based models

Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.

Soil Survey Geographic Database (SSURGO)

In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).

Non-intrusive electricity load monitoring

Non-intrusive load monitoring (NILM) is a strategy to disaggregate the total electricity consumption profile of a building into individual appliance load profiles. This strategy can provide insight into individual consumer behavior for the purposes of real-time electricity pricing, can help target customers who may be due for an appliance upgrade, and can enable building energy management systems (EMS) to enact demand response strategies such as load shifting for sheddable or curtailable loads. These strategies can foster energy efficiency, reduce peaks in electricity demand, and help increase the utilization of low-carbon power by enabling better supply/demand matching, thereby fostering grid decarbonization and maintaining grid stability.
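
The disaggregation idea can be illustrated with a very simple event-based heuristic: look for step changes in the aggregate signal and match each step to the appliance whose rated power is closest. This is a toy sketch, not a method taken from the datasets below; the appliance signatures, thresholds, and readings are hypothetical, and practical NILM approaches (including ML-based ones) must cope with overlapping events, variable loads, and noise.

```python
def detect_appliance_events(aggregate_w, signatures_w, min_step_w=50.0, tolerance_w=30.0):
    """Label on/off events in a whole-building power series.

    aggregate_w: aggregate power samples in watts.
    signatures_w: dict mapping appliance name to rated power in watts (hypothetical).
    Returns a list of (sample_index, appliance, "on" or "off").
    """
    events = []
    for t in range(1, len(aggregate_w)):
        step = aggregate_w[t] - aggregate_w[t - 1]
        if abs(step) < min_step_w:
            continue  # ignore small fluctuations and noise
        # Pick the appliance whose rated power best matches the step size.
        name, rated = min(signatures_w.items(), key=lambda kv: abs(abs(step) - kv[1]))
        if abs(abs(step) - rated) <= tolerance_w:
            events.append((t, name, "on" if step > 0 else "off"))
    return events


signatures = {"fridge": 150.0, "kettle": 2000.0}
aggregate = [100, 100, 250, 250, 2260, 2250, 260, 110, 100]
events = detect_appliance_events(aggregate, signatures)
print(events)
```

Sub-metered appliance-level data is what allows such signatures to be learned and the resulting labels to be validated.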

Dataset | Data Gap Summary
Pecan Street

Pecan Street DataPort requires both academic and non-academic users to purchase access through licensing, which varies depending on the building data features requested. Coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older, historical buildings that may be in need of energy efficiency upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual socio-demographic survey data may be too coarse to provide insight into how the behavior of household members affects consumption profiles over time.

Sub-metered appliance-level data

For accurate NILM studies, benchmark datasets need to include not only consumption but also local power generation (e.g., from rooftop solar), as it can affect the aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are typically not included. The majority of buildings covered are single-family housing units, limiting the diversity of representation. Furthermore, most datasets are no longer maintained once the study closes.

Offshore wind power forecasting: Long-term (3 hours-1 year)

Long-term wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.

Dataset | Data Gap Summary
Floating INfrastructure for Ocean observations FINO3

Due to their location, FINO platform measurement sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is often nontrivial, as weather conditions must be amenable to human intervention. The resulting degradation in data quality can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.

Offshore wind meteorological data and LiDAR wind mapping

Coverage of the offshore meteorological and wind speed platform data is restricted spatially to the dimensions of the platform itself and temporally by the time of its construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.

Offshore wind power forecasting: Short-term (10 min)

Short-term wind forecasting can enable estimation of active power generated by wind farms in the absence of curtailment.

Dataset | Data Gap Summary
Orsted: Offshore wind SCADA operation data

Access to the data must be requested via the Orsted form. The sufficiency of the dataset is constrained by volume, as only a limited number of offshore wind farms provide short-term data. Expanding the coverage area and volume of the data, and refining its time granularity to under 10 minutes, may enable transient detection from generated active power.

Post-disaster damage assessment

Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies. ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.

Dataset | Data Gap Summary
Financial loss datasets related to the impacts of disasters

Data is proprietary and not open to the public.

Satellite Images

The resolution of publicly available datasets is insufficient for accurate damage assessments. To improve this, some commercial high-resolution images should be made accessible for research purposes.

xBD

Data is highly biased towards North America. Similar datasets focusing on other parts of the world are needed. Additionally, the dataset should include more detailed information on the severity of the damage.

Short-term electricity load forecasting

Short-term load forecasting (STLF) is critical for utilities to balance demand with supply. Utilities need accurate forecasts (e.g., on the scale of hours, days, weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions. Furthermore, for grids that may have portions privatized, utilities rely on forecasts to procure (e.g., source and purchase) energy to meet demand. In peak conditions where loads have been underestimated, utilities have limited options. One option is to use reserve capacity, i.e., additional electric supply, to ensure reliable power to customers. This usually entails calling on expensive, fossil fuel-dependent peaker plants in city centers to meet immediate demand over short distances. Another option is for the utility to initiate an outage to clip peaks. In the worst case, grid assets can be overloaded, resulting in system failure and unplanned blackouts. Because STLF relies on historical electricity load data, weather forecasts, calendar information (time of day, week, or month), and continuous streams of advanced metering infrastructure (AMI) data, machine learning models are well suited to the task: they can handle large amounts of data and capture non-linearities that traditional linear models may struggle with.
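
When evaluating an ML load forecaster, it helps to have a simple reference to compare against. The sketch below shows one such reference, a seasonal-naive forecast that repeats the load observed one day earlier, together with the mean absolute percentage error (MAPE). It is an illustrative baseline under assumed hourly data, not a method prescribed by the datasets below; the load values are synthetic.

```python
def seasonal_naive_forecast(history, horizon, season_length=24):
    """Forecast each future step with the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Synthetic hourly load (MW): two days of history, then the day to be forecast.
history = [100.0] * 24 + [110.0] * 24
actual_next_day = [121.0] * 24

forecast = seasonal_naive_forecast(history, horizon=24)
print(round(mape(actual_next_day, forecast), 2))
```

An ML model that uses weather and calendar features is only worth its added complexity if it improves on baselines of this kind.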

Dataset | Data Gap Summary
Advanced metering infrastructure data

AMI data is challenging to obtain without pilot study partnerships with utilities, since collecting data on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time series data can also vary depending on the level of access to the data (e.g., whether it is aggregated and anonymized) and on the resolution of the meter readings and system. Additionally, coverage is limited to utility pilot test service areas, restricting the scope and scale of demand studies.

Building data genome project

The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to universities and higher education institutions. While the dataset has clear documentation, its lack of diversity in building types necessitates further development to include other types of buildings, as well as expansion to coverage areas and time periods beyond those currently available.

Faraday: Synthetic smart meter data

Faraday synthetic AMI data is a response to the privacy-related bottlenecks faced in retrieving building-level demand data. However, although the model is trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and the presence of low-carbon technology. Furthermore, since the model is trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or data aggregated at the substation level to assess its effectiveness. Faraday is currently accessible through the Centre for Net Zero’s API.

Smart inverter management for distributed energy resources

Distributed energy resources (DERs) such as solar photovoltaics and energy storage systems are part of low-inertia power systems that do not rely on traditional rotating components. These DERs rely on distributed inverters to convert power from DC to AC, and these inverters are typically configured to operate at unity power factor. As an alternative to unity power factor operation, inverters can be “smart” by dynamically managing the effects of intermittency before feeding power back to feeder circuits at the distribution substation level. Smart inverters can perform Volt-VAR (Voltage-VAR) and Volt-Watt (Voltage-Watt) operations, which adjust the inverter’s reactive and active power output, respectively, in response to the measured voltage in order to maintain grid stability. For example, under Volt-VAR control the DER inverter dynamically adjusts its reactive power injection back into the grid. This is crucial for preventing voltage sags and swells that can occur due to the integration of DERs into the grid.
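
A Volt-VAR function is commonly expressed as a piecewise-linear curve that maps the measured voltage to a reactive power setpoint, with a deadband around nominal voltage. The sketch below shows the shape of such a curve. The breakpoints and the 0.44 per-unit reactive power limit are illustrative assumptions chosen for this example; actual settings are defined by the applicable interconnection standard and utility requirements.

```python
def volt_var(v_pu, v1=0.92, v2=0.98, v3=1.02, v4=1.08, q_max=0.44):
    """Reactive power setpoint (per unit of rated apparent power) for a voltage.

    Positive values mean injection (supports low voltage); negative values mean
    absorption (counteracts high voltage). All breakpoints are illustrative.
    """
    if v_pu <= v1:
        return q_max  # full injection below the lower limit
    if v_pu < v2:
        return q_max * (v2 - v_pu) / (v2 - v1)  # linear ramp down to zero
    if v_pu <= v3:
        return 0.0  # deadband around nominal voltage
    if v_pu < v4:
        return -q_max * (v_pu - v3) / (v4 - v3)  # linear ramp into absorption
    return -q_max  # full absorption above the upper limit


for voltage in (0.90, 0.95, 1.00, 1.05, 1.10):
    print(voltage, round(volt_var(voltage), 3))
```

Simulation tools and operational data are needed to study how many inverters following such curves interact on the same feeder.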

Dataset | Data Gap Summary
Simulation tools for distribution connected inverter systems

There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters and to integrate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulation and hardware-in-the-loop facilities is restricted: NREL’s Energy Systems Integration Facility, for example, requires submission of a user access proposal. Similar testing laboratories may require access requests and funding.

Smart inverter (UL1741-SB compliant) devices database

Smart inverter operational data is not publicly available, and obtaining it requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant smart inverter models and their manufacturers, who can be contacted for research partnerships. In terms of coverage, California and Hawaii are now moving towards standardizing smart inverter technology in their power systems; researchers in other regions, including outside the United States, may locate similar manufacturers through partnerships and collaborations.

Solar installation site assessment

Statistical analysis on solar PV system components for pricing, logistics, planning, and site capacity studies is an important part of the process for siting solar PV systems. Spatiotemporal generation forecasting using pre-existing site data can be used to inform future site recommendations, policy, and decision making with respect to new developments.

Dataset | Data Gap Summary
LBNL: Solar panel PV system dataset

The LBNL solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Since the data is self-reported, component cost data may be imprecise. The dataset also has a timeliness gap, as some of the data is historical and may not reflect the current pricing and costs of PV systems.

US large-scale Solar Photovoltaic Database (USPVDB)

The USPVDB data must be accessed through the United States Geological Survey (USGS) mapper browser application or downloaded as GIS data in the form of shapefiles or GeoJSON files. Tabular data and metadata are provided in CSV and XML format. Coverage is limited to the US, specifically to densely populated regions. Supplementing the data with satellite imagery of international large-scale photovoltaic installations can expand the coverage area of the dataset.

Solar power forecasting: Long-term (>24 hours)

Longer-term solar forecasts are beneficial for energy market pricing, investment decisions, and integration with other renewable energy sources such as hydroelectric plants to allow for larger scale coordination and grid operational studies. Additionally, inclusion of energy storage systems to harvest solar energy on longer time scales can be better aligned with longer term demand forecasting and predicted solar peaks.

Dataset | Data Gap Summary
NREL solar power data for integration studies

While the synthetic PV plant data is useful for forecasting and control simulation case studies when actual data is not available, there are limitations that must be taken into account when interpreting results: verification for site-specific projects, representation of coverage areas outside of the US, and modeling assumptions based on data proxies.

Solar power forecasting: Medium-term (6-24 hours)

Medium-term solar forecasts can be beneficial for simulation case studies in demand response, microgrid behavior, electricity markets, and solar site planning.

Dataset | Data Gap Summary
Satellite remote sensing data

Depending on the region of interest, data can be retrieved from different open-data satellites, both geostationary and swath-based, which may differ in spatial and temporal resolution and in coverage area. Additionally, multispectral data can be challenging to preprocess and prepare for analysis. For medium-term solar forecasting specifically, actual ground irradiance may differ from approximations made by models that use satellite-derived cloud cover products, because different cloud types can have different impacts on irradiance. Supplementing with ground-based measurements for verification and improving granularity are suggested solutions.

Solar power forecasting: Short-term (30 min-6 hours)

Hourly site-specific solar forecasting can assist with solar energy estimates based on measured irradiance, photovoltaic inverter output energy, and turbine level output. Forecasting at this level can prove beneficial for joint distributed energy resource and energy storage microgrid scheduling studies, and system reliability studies.

Dataset | Data Gap Summary
NOAA's SOLRAD network

While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, data gaps exist for the short-term solar forecasting use case (which requires hourly averages). The data quality of hourly averages is lower than that of native-resolution data, which hampers effective short-term forecasting for real-time energy management for grid stability, demand response, real-time market price prediction, and dispatch. Coverage is also constrained to certain parts of the United States, based on the SURFRAD network locations.

NREL Solar Radiation Database (NSRDB)

While data coverage is global and based on data derived from satellite imagery as input to the Fast All-sky Radiation Model (FARM), a radiative transfer model, the output is calculated over specific time frames and would need to be recalculated and updated to cover recent years. Furthermore, the data is unbalanced, as the region with the longest temporal coverage is the United States. Satellite-based estimates of solar resource information may be affected by cloud cover, snow, and bright surfaces, which would require additional verification against ground-based measurements and collation of outside data sources. Additionally, since the data is derived from satellites, it may require preprocessing to account for parallax effects, which depend on the field of view of the satellite and the region of interest and may not be reflected in the higher-level tabular FARM products.

NREL Solar Radiation Research Laboratory (SRRL): Baseline Measurement System (BMS)

While NREL’s SRRL BMS provides real-time joint variable data from ground-based sensors, coverage is limited to the sensor network in Golden, CO, in the United States. Since the measurement system comprises diverse sensors, individual sensors may malfunction or go out of calibration, requiring human intervention and maintenance. Detection of such faults may be delayed, leading to inaccuracies in the data.

PV Anlage-Reinhart system

The PV Anlage-Reinhart System provides information on PV systems, collated and compiled by SMA together with PV inverter data. Access requires creating a user profile and requesting access to specific systems. Instructions in languages other than German may be unclear, and systems located in Germany, the Netherlands, and Australia are more heavily represented despite the presence of data globally. A subset of the curated systems also contains joint energy storage data, which may be valuable for DER-specific load forecasting studies.

SOLETE

While SOLETE is well suited to joint wind-solar DER forecasting and inverter-level generation studies, the dataset can be improved by addressing several gaps in data sufficiency, namely by expanding the temporal coverage to include seasonal variations, which may be addressed with additional outside data or simulation. Outside data or simulation may also help scale studies to multiple generation sources (more than one PV array and wind turbine) and the coordination between them needed to maintain grid reliability and stability. Additionally, a data wish for SOLETE is the addition of maintenance schedules or system downtime data to more realistically model system dynamics with DERs.

Solar power forecasting: Very-short-term (0-30min)

Very-short-term solar power forecasting is critical for time series irradiance forecasting and solar ramp event identification. Solar irradiance ramp events can be defined as sudden changes in solar irradiance within a short time interval. These events are often caused by transient clouds that can lead to abrupt fluctuations in the incoming solar energy. Cloud analysis using cloud segmentation and classification as a proxy for determining solar irradiance attenuation can assist in determining solar generation for photovoltaics and concentrated solar power towers. Solar generation predictions are important for real-time electricity market and pricing studies, real-time dispatch of other generating sources, and energy storage control studies.
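
The ramp event definition above can be turned into a simple threshold rule: flag any interval over which irradiance changes by more than a chosen amount. The sketch below is one possible reading of that definition; the window length, the 250 W/m² threshold, and the sample readings are assumptions made for illustration, and studies differ in how they define and merge ramp events.

```python
def detect_ramp_events(irradiance_wm2, window=2, threshold_wm2=250.0):
    """Flag ramp events in an irradiance time series.

    A ramp is reported when the change over `window` samples is at least
    `threshold_wm2` in magnitude. Returns (start_index, end_index, change)
    tuples; overlapping windows are reported separately.
    """
    events = []
    for start in range(len(irradiance_wm2) - window):
        change = irradiance_wm2[start + window] - irradiance_wm2[start]
        if abs(change) >= threshold_wm2:
            events.append((start, start + window, change))
    return events


# Synthetic readings (W/m^2): a passing cloud causes a drop, then a recovery.
series = [800, 800, 790, 500, 300, 310, 600, 820, 810]
ramps = detect_ramp_events(series)
print(ramps)
```

Negative changes correspond to down-ramps (cloud arrival) and positive changes to up-ramps (cloud departure). Sky imagery is used to anticipate these events before they show up in the irradiance signal.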

Dataset | Data Gap Summary
DOE Atmospheric Radiation Measurement (ARM) research facility data products

The ARM dataset includes data from various DOE sites, with sensor information from sun-tracking photometers, radiometers, and spectrometers that is helpful in understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets that can be challenging to store, stream, analyze, and archive; the data may also be sensitive to sensor noise and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.

NIST campus photovoltaic (PV) arrays and weather station data sets

Data coverage is limited to the NIST campus in Gaithersburg, MD, and the dataset has not been maintained since July 2017.

Solcast

Data from Solcast is accessible through an academic or research institution. Solcast uses coarse surface elevation models aligned with reanalysis data, leading to significant elevation differences between ground data sites and cell height. Although the dataset is global, coverage is limited to 33 sites, with 18 in tropical/subtropical locations and 15 in temperate locations. Time granularity ranges from 5 to 60 minutes.

SRRL TSI-880 sky imager gallery

Data coverage and granularity are limited by the location of the cameras and constrained to 10-minute increments. Resolution is also limited to 352x288 24-bit JPEG images (see device specifications).

SWINySEG (Singapore whole sky Nychthemeron image SEGmentation database)

There is a need for annotated sky image data for cloud detection and segmentation to improve local and PV site-specific irradiance predictions. The data is constrained to the coverage area of Singapore, and commercial use is restricted.

Terrestrial wildlife detection and species classification

Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems. Similarly to marine wildlife studies, ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.

Dataset | Data Gap Summary
Bioacoustic recordings

The first and foremost challenge of bioacoustic data is its sheer volume, which makes data sharing especially difficult due to limited storage options and high costs. Cheaper and more reliable data hosting and sharing platforms are urgently needed.

Additionally, there is a significant shortage of large and diverse annotated datasets, a shortage much more severe than for image data such as camera trap, drone, and crowd-sourced images.

Camera trap images

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Community science data

The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.

Drone images for biodiversity

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

eDNA

One data gap is the incompleteness of barcoding reference databases.

GBIF

While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.

Satellite Images

Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.

Variability analysis of wind power generation

The shift from high-inertia generation sources such as thermal plants to low inertia distributed inverter-coupled generation from distributed energy resources introduces new stability and reliability issues. It is imperative to maintain the frequency of the system at a nominal level to prevent damage, instability, and blackouts. Wind generation from turbines can contribute some frequency response and inertia that may benefit the grid by providing a combination of synthetic inertial and primary frequency response to the grid system.
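
The role of inertia can be made concrete with the aggregated swing equation, which (neglecting load damping and control action) gives the initial rate of change of frequency after a sudden power imbalance as df/dt = f0 · ΔP / (2 · H · S). The sketch below evaluates that expression; the system size, inertia constants, and size of the generation loss are hypothetical values chosen only to show the effect of reduced inertia.

```python
def initial_rocof_hz_per_s(power_imbalance_mw, inertia_constant_s, system_base_mva, f_nominal_hz=50.0):
    """Initial rate of change of frequency (RoCoF) from the aggregated swing equation.

    power_imbalance_mw: generation minus load (negative for a loss of generation).
    inertia_constant_s: aggregate inertia constant H in seconds.
    system_base_mva: total rated apparent power S of the synchronized system.
    """
    return f_nominal_hz * power_imbalance_mw / (2.0 * inertia_constant_s * system_base_mva)


# Hypothetical 50 GVA system losing 500 MW of generation.
rocof_high_inertia = initial_rocof_hz_per_s(-500.0, inertia_constant_s=5.0, system_base_mva=50_000.0)
rocof_low_inertia = initial_rocof_hz_per_s(-500.0, inertia_constant_s=2.5, system_base_mva=50_000.0)
print(rocof_high_inertia, rocof_low_inertia)
```

Halving the inertia constant doubles the initial frequency decline for the same disturbance, which is why synthetic inertia and fast frequency response from wind turbines are of interest.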

Dataset | Data Gap Summary
Simulation tools for active power control by wind

Access to simulation tools, in particular NREL’s FESTIV model, requires requesting permission. Since FESTIV is a simulation model, it may not account for all real-time system dynamics and complexities, and therefore requires validation and verification against real-world data. Furthermore, since the granularity of the model is hourly, it may not be able to account for very short-term impacts, frequencies, and reactive power flows that can affect power system stability.

Weather forecasting: Near-term (< 24 hours)

Near-term weather forecasting (< 24 hours ahead) of temperature, precipitation, etc. at km-level spatial and minute-level temporal resolution, in an accurate and computationally-efficient manner, has implications for many climate change mitigation and adaptation applications. ML can help provide more accurate near-term weather forecasts.

Dataset | Data Gap Summary
Automatic surface observation (ASOS)

Data volume is large and only data specific to the US is available.

High-resolution weather forecast (HRRR)

Data volume is large, and only data covering the US is available.

Radar data (MRMS)

Obtaining and integrating radar data from various sources is challenging.

Regularly gridded high-resolution atmospheric observations

An enhanced version of ERA5 with higher granularity and fidelity is needed. In fact, a lot of surface observations and remote sensing data are in place for developing such a dataset.

Weather forecasting: Short-to-medium term (1-14 days)

Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.

Dataset | Data Gap Summary
ENS

The biggest challenge of ENS is that only a portion of it is available to the public for free.

ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

HRES

The biggest challenge of HRES is that only a portion of it is available to the public for free.

WeatherBench 2

WeatherBench 2 is based on ERA5, so the issues of ERA5 are inherited here; in particular, the data has biases over regions where there are no observations.

Weather forecasting: Subseasonal horizon

High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Dataset | Data Gap Summary
ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.

SubX

More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on this data.

Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Dataset | Data Gap Summary
CPC Precipitation

CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

S2S forecast data

More data is needed to take advantage of large ML models.

Wildfire prediction: Short-term (3-7 days)

Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.

Dataset | Data Gap Summary
Active fire data

A major data gap is the absence of active fire data for the afternoon (1-5 pm), when most fires ignite.

ERA5

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

ESA land cover map

Yearly land cover classification gridded map at 300-m resolution from 1992 to present produced by European Space Agency (ESA) Climate Change Initiative (CCI) https://catalogue.ceda.ac.uk/uuid/b382ebe6679d44b8b0e68ea4ef4b701c.

Higher resolution land cover maps (at 10-m resolution) are also available for years 2020 and 2021 (https://esa-worldcover.org/en).

For fire prediction, these maps provide fine-grained information on available fuel.

Socioeconomic data

Socioeconomic data, e.g., on human behavior, are significant predictors of fire. Beyond the inherent challenges and gaps of socioeconomic data, aggregating those datasets and harmonizing them spatially with other fire predictors is especially tricky.

Dataset Gap Types Modalities Sectors
Academic literature databases

Academic literature databases, such as OpenAlex, Web of Science, and Scopus.

Use Case | Data Gap Summary
Identification and mapping of climate policy

Data is not available in machine-readable formats and is limited to English-language literature from major journals.

Active fire data

Active fire data derived from images taken by satellite instruments such as MODIS, VIIRS, and Landsat. The products differ in spatial resolution and temporal coverage. Data can be downloaded here: https://firms.modaps.eosdis.nasa.gov/active_fire.

Use Case | Data Gap Summary
Wildfire prediction: Short-term (3-7 days)

A major data gap is the absence of active fire data for the afternoon (1-5 pm), when most fires ignite.
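
To check how much of a given active fire extract actually falls in the afternoon window, one could convert each detection's UTC acquisition time to an approximate local solar time. The sketch below is illustrative only: the field names `acq_time` (an HHMM string in UTC) and `longitude` are assumptions about the download format, and local solar time is approximated as UTC plus longitude/15 hours.

```python
def local_solar_hour(acq_time_utc: str, longitude: float) -> float:
    """Approximate local solar hour from an HHMM UTC string and a longitude in degrees east."""
    t = acq_time_utc.zfill(4)
    utc_hours = int(t[:2]) + int(t[2:]) / 60.0
    # 15 degrees of longitude correspond to one hour of solar time.
    return (utc_hours + longitude / 15.0) % 24.0


def afternoon_share(detections, start=13.0, end=17.0):
    """Fraction of detections whose local solar time falls in [start, end)."""
    if not detections:
        return 0.0
    hits = sum(
        1
        for d in detections
        if start <= local_solar_hour(d["acq_time"], d["longitude"]) < end
    )
    return hits / len(detections)


# Hypothetical detections, not real observations.
sample = [
    {"acq_time": "0130", "longitude": 150.0},   # about 11:30 local
    {"acq_time": "0405", "longitude": 150.0},   # about 14:05 local (afternoon)
    {"acq_time": "1842", "longitude": -120.0},  # about 10:42 local
    {"acq_time": "2110", "longitude": -120.0},  # about 13:10 local (afternoon)
]
share = afternoon_share(sample)
print(f"Share of detections between 1 pm and 5 pm local solar time: {share:.2f}")
```

A low share for a region of interest would be consistent with the gap described above, although it could also reflect genuinely low fire activity.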

Advanced metering infrastructure data

Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter systems that collect, store, and analyze per-building energy consumption.

AMI data can be retrieved through public data portals, individual data collection, or research partnerships with local utilities. Examples of utility research partnerships include the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) and the smart meter pilot test from the Sacramento Municipal Utility District. An example of publicly available, aggregated, and anonymized data is the Commission for Energy Regulation (CER) Smart Metering Project dataset hosted by the Irish Social Science Data Archive (ISSDA).

View dataset

Use Case | Data Gap Summary
Short-term electricity load forecasting

AMI data is challenging to obtain without pilot study partnerships with utilities, since collecting data on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time series also varies with the level of access, i.e., whether the data is aggregated and anonymized, and with the resolution of the meter readings and system. Additionally, coverage is limited to utility pilot test service areas, restricting the scope and scale of demand studies.
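
As an illustration of how access level changes granularity, the following minimal sketch aggregates per-meter interval readings into hourly feeder-level totals, dropping the meter identifier in the process. The record fields (`meter_id`, `feeder_id`, `timestamp`, `kwh`) are hypothetical and do not describe any particular utility's schema.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_hourly(readings):
    """Sum per-meter kWh readings into one hourly total per feeder."""
    totals = defaultdict(float)
    for r in readings:
        ts = datetime.fromisoformat(r["timestamp"])
        # Truncate to the start of the hour; the meter_id is deliberately dropped.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[(r["feeder_id"], hour.isoformat())] += r["kwh"]
    return dict(totals)


# Hypothetical readings at sub-hourly resolution.
readings = [
    {"meter_id": "m1", "feeder_id": "f1", "timestamp": "2024-01-01T00:00:00", "kwh": 0.25},
    {"meter_id": "m1", "feeder_id": "f1", "timestamp": "2024-01-01T00:30:00", "kwh": 0.5},
    {"meter_id": "m2", "feeder_id": "f1", "timestamp": "2024-01-01T00:15:00", "kwh": 0.25},
    {"meter_id": "m2", "feeder_id": "f1", "timestamp": "2024-01-01T01:15:00", "kwh": 1.0},
]
hourly = aggregate_hourly(readings)
for (feeder, hour), kwh in sorted(hourly.items()):
    print(feeder, hour, kwh)
```

Aggregation of this kind protects individual customers but removes exactly the per-building detail that building-level load forecasting needs, which is the trade-off described above.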

Automatic surface observation (ASOS)

One-minute observations from Automated Surface Observing System (ASOS) stations: https://madis.ncep.noaa.gov/madis_OMO.shtml

Use Case | Data Gap Summary
Weather forecasting: Near-term (< 24 hours)

The data volume is large, and only data for the US is available.

Benchmark datasets for short-term wildfire prediction

Benchmark datasets for wildfire prediction are standardized collections of data that include historical and real-time wildfire occurrences, remote sensing imagery, fuel information, and meteorological data. These datasets provide a common framework for training, validating, and testing machine learning models. By integrating various modalities and sources of data, benchmark datasets simplify the process of data collection, integration, and preprocessing, ensuring consistency and efficiency in developing and evaluating wildfire prediction models.

Use Case | Data Gap Summary
Benchmark datasets of building environmental conditions and occupancy

The US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states, covering 7 climate zones and 11 different building types. The data covers indoor air quality, occupancy, environment, HVAC, lighting, and energy consumption, to name a few. Datasets are organized by name and points of contact.

All data featured on the platform is open access, with a standardized metadata format for ease of use and building-specific information on type, location, and climate zone. Data quality, guidance on curation and cleaning, and access restrictions are specified in the metadata of each hosted dataset, and licensing information is provided for each featured dataset.

View dataset

Use Case | Data Gap Summary
Energy-efficient new building design

The types of data gaps vary across the featured datasets, depending on content, coverage area, location, building type, building spatial plan, quantity measured, ambient environment, and the availability of power consumption or metered data.

Bioacoustic recordings

Passive acoustic recording provides continuous monitoring of both the environment and the species.

In general, there is a lack of large, diverse, and robust annotated datasets. Some such datasets are hosted at https://arbimon.org/, www.macaulaylibrary.org, and www.xeno-canto.org.

Use Case | Data Gap Summary
Assessing forest restoration outcomes

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Detection of climate-induced ecosystem changes

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Terrestrial wildlife detection and species classification

The foremost challenge with bioacoustic data is its sheer volume, which makes data sharing especially difficult due to limited storage options and high costs. Cheaper and more reliable data hosting and sharing platforms are urgently needed.

Additionally, there is a significant shortage of large and diverse annotated datasets, one much more severe than for image data such as camera trap, drone, and crowd-sourced images.
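
To give a sense of scale for the volume problem, the back-of-the-envelope calculation below estimates raw storage for continuous recording. All parameters are assumptions chosen for illustration (uncompressed PCM audio at 48 kHz, 16-bit, mono, 10 recorders, 30 days); real deployments vary widely and often use duty cycling or compression.

```python
def recording_bytes(sample_rate_hz: int, bit_depth: int, channels: int, seconds: int) -> int:
    """Size in bytes of uncompressed PCM audio (headers ignored)."""
    return sample_rate_hz * (bit_depth // 8) * channels * seconds


SECONDS_PER_DAY = 24 * 3600

# One recorder, one day of continuous audio under the assumed settings.
per_recorder_day = recording_bytes(48_000, 16, 1, SECONDS_PER_DAY)

# A hypothetical deployment: 10 recorders running for 30 days.
deployment_total = per_recorder_day * 10 * 30

print(f"Per recorder per day: {per_recorder_day / 1e9:.2f} GB")
print(f"10 recorders, 30 days: {deployment_total / 1e12:.2f} TB")
```

Under these assumptions a single recorder produces roughly 8.3 GB per day, and a modest month-long deployment reaches about 2.5 TB before any annotation has taken place.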

Building data genome project

The Building Data Genome Project 2 dataset contains hourly whole-building data from 3,053 energy meters in 1,636 non-residential buildings, covering two years' worth of metered electricity, water, and solar data, along with metadata on primary building use category, floor area, time zone, weather, and smart meter type. The goal of the dataset is to enable the development of generalizable building models for energy efficiency analysis studies.

View dataset

Use Case | Data Gap Summary
Short-term electricity load forecasting

The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to universities and higher education institutions. While the dataset is clearly documented, its lack of diversity in building types calls for further development to include other types of buildings and to expand coverage to areas and time periods beyond those currently available.

Building footprint

The foundation for characterizing exposure involves understanding where people live and in what conditions. Building footprint data serves as a crucial layer in this context, offering detailed attributes of buildings, such as their age, materials, heights, rooftop material, and basement features. Notable sources of such data include OpenStreetMap, USGS Building Footprint, Google Open Buildings, Microsoft Building Footprints, and Meta open building footprint data.

Use Case | Data Gap Summary
Disaster risk assessment

More information, such as age of the building, should be included in the dataset.

CMIP6

Climate simulations from a consortium of state-of-the-art climate models. Data can be found here.

Use Case | Data Gap Summary
Bias-correction of climate projections

Large uncertainties in future climate projections are a major problem with CMIP6. In addition, the large volume of data and the lack of uniform structure (such as inconsistent variable names, data formats, and resolutions across different CMIP6 models) make it challenging to use data from multiple models effectively.

Data-driven generation of climate simulations

The large data volume and lack of uniform structure (no consistent variable names, data structure, or data resolution across models) make it difficult to use data from more than one CMIP6 model.
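
The kind of harmonization this implies can be sketched as a small mapping step applied before data from several models are combined. The example below is a minimal illustration under stated assumptions: the alias names (`t2m`, `temperature_2m`, `precip`), the model names, and the record layout are hypothetical, and the canonical names `tas` and `pr` with units of K and kg m-2 s-1 are used as the target convention.

```python
# Hypothetical aliases mapped onto one canonical variable name.
ALIASES = {
    "tas": "tas",
    "t2m": "tas",
    "temperature_2m": "tas",
    "pr": "pr",
    "precip": "pr",
}

CANONICAL_UNITS = {"tas": "K", "pr": "kg m-2 s-1"}

# (canonical variable, source units) -> conversion into canonical units.
TO_CANONICAL = {
    ("tas", "K"): lambda v: v,
    ("tas", "degC"): lambda v: v + 273.15,
    ("pr", "kg m-2 s-1"): lambda v: v,
    ("pr", "mm/day"): lambda v: v / 86400.0,  # 1 mm of water equals 1 kg per m2
}


def harmonize(record):
    """Return a copy of the record with a canonical variable name and units."""
    name = ALIASES.get(record["variable"])
    if name is None:
        raise KeyError(f"unknown variable name: {record['variable']}")
    convert = TO_CANONICAL[(name, record["units"])]
    return {
        "model": record["model"],
        "variable": name,
        "units": CANONICAL_UNITS[name],
        "values": [convert(v) for v in record["values"]],
    }


# Hypothetical records from two differently structured sources.
records = [
    {"model": "model_a", "variable": "t2m", "units": "degC", "values": [15.0, 16.5]},
    {"model": "model_b", "variable": "precip", "units": "mm/day", "values": [86.4]},
]
harmonized = [harmonize(r) for r in records]
for h in harmonized:
    print(h["model"], h["variable"], h["units"], h["values"])
```

Regridding to a common resolution, which the summary also mentions, is a separate and heavier step that this sketch does not attempt.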

CPC Precipitation

CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html

Use Case | Data Gap Summary
Weather forecasting: Subseasonal-to-seasonal horizon

High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.

Cable inspection robot data

Cable inspection robot LiDAR data is beneficial for Specific Power Line (SPL) partitions, which include dampers, insulators, broken strands, and attachments that may have degraded due to exposure to natural elements. Specific Fitting Detection partition data focuses on assessing risk at the lowest part of the power line, near trees, roofs, and other power lines that may cross. Since the robots physically crawl along the lines, their data is useful for detecting degradation of high-voltage transmission lines for maintenance scheduling and for detecting obstructions at the lower levels of the power line.

Use Case | Data Gap Summary
Grid asset management: Assessing vegetation-related wildfire risk

Grid inspection robot imagery may require coordination with local utilities to gain access over multiple robot trips, image preprocessing to remove ambient artifacts, and position and location calibration. Identification of degradation patterns is also limited by the resolution of the robot-mounted camera.

Camera trap images

Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. They offer close-range monitoring over long time scales. The image sequences can be used not only to classify species but also to identify specifics about individuals, e.g., sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density.

In general, raw camera trap images need to be annotated before they can be used to train ML models. Some annotated camera trap images are shared via Wildlife Insights (www.wildlifeinsights.org) and LILA BC (www.lila.science), while others are listed on GBIF (https://www.gbif.org/dataset/search?q=). However, the majority of camera trap data is likely scattered across individual research labs or organizations and not publicly available. Sharing such images could go a long way toward filling the gap in annotated data that currently hinders the efficient use of ML in biodiversity studies. This is what initiatives like Wildlife Insights aim to do.

Use Case | Data Gap Summary
Assessing forest restoration outcomes

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Automatic individual re-identification for wildlife

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Detection of climate-induced ecosystem changes

The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.

Terrestrial wildlife detection and species classification

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Carbon stock estimate

The ESA aboveground biomass (AGB) estimate is the most up-to-date public dataset on AGB.

Use Case | Data Gap Summary
Changes in marine ecosystems

Annual data on changes (e.g. extent) in marine ecosystems such as mangroves, seagrasses, salt marshes, and wetlands due to various factors including coastal erosion, aquaculture, and others.

Use Case | Data Gap Summary
ClimSim

An ML-ready benchmark dataset designed for hybrid ML-physics research, e.g. emulation of subgrid clouds and convection processes.

Use Case | Data Gap Summary
Development of hybrid-climate models

Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.

ClimateBench v1.0

A benchmark dataset derived from a full-complexity Earth System Model (NorESM2, a participant in CMIP6) for emulation of key climate variables: https://zenodo.org/records/7064308.

Use Case | Data Gap Summary
Data-driven generation of climate simulations

The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.

Community science data

Images and recordings contributed by citizen scientists and volunteers represent another significant source of biodiversity and ecosystem data. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.

Use Case | Data Gap Summary
Terrestrial wildlife detection and species classification

The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
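
One simple way to quantify this kind of concentration in a labeled collection is to look at how evenly observations are spread across species. The sketch below uses Shannon evenness and the share of the most common label; both are standard summary statistics chosen here for illustration, not measures prescribed by any of the platforms, and the labels are made up.

```python
import math
from collections import Counter


def shannon_evenness(labels):
    """Shannon entropy of the label distribution divided by its maximum (0 to 1)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


def top_share(labels, k=1):
    """Fraction of observations belonging to the k most common labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / n


# Hypothetical observation labels dominated by one common species.
skewed = ["deer"] * 9 + ["pangolin"]
balanced = ["deer", "pangolin", "deer", "pangolin"]

print(f"skewed:   evenness={shannon_evenness(skewed):.2f}, top share={top_share(skewed):.2f}")
print(f"balanced: evenness={shannon_evenness(balanced):.2f}, top share={top_share(balanced):.2f}")
```

The same idea extends to spatial concentration by replacing species labels with grid-cell or region identifiers.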

Computational fluid dynamics simulation

Computational fluid dynamics (CFD) simulation output is a means of assessing natural ventilation for new building construction in relation to layout geometry, terrain, the presence of neighboring buildings and infrastructure, and materials. Multi-directional CFD simulations are often run to account for different times of the year, since wind varies with season. Given the building geometry, terrain, presence of neighboring buildings, and boundary conditions, the Navier-Stokes or Reynolds-Averaged Navier-Stokes equations can be solved over a lattice or grid superimposed on the layout.

Use Case | Data Gap Summary
Energy-efficient new building design

Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with building lighting, heating, and cooling. Simulations require accurate input information on material properties, which may not be available for traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data generated for different wind directions become challenging to manage. Future data collection for verifying simulation output can benefit surrogate or proxy approaches to solving the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern building approaches, excluding passive building techniques from indigenous communities, known as vernacular architecture, from design consideration.

Copernicus Marine Data Store

Free-of-charge, state-of-the-art data on the state of the Blue (physical), White (sea ice), and Green (biogeochemical) ocean, at global and regional scales: https://data.marine.copernicus.eu/products

Use Case | Data Gap Summary
Marine wildlife detection and species classification

Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.

DOE Atmospheric Radiation Measurement (ARM) research facility data products

ARM represents data from various field measurement programs sponsored by the US Department of Energy with a focus on ground-based pyrheliometer and spectrometer data which is useful for solar radiation time series forecasting and solar potential assessment.

Use Case | Data Gap Summary
Solar power forecasting: Very-short-term (0-30min)

The ARM dataset includes data from various DOE sites, with sensor information from sun-tracking photometers, radiometers, and spectrometers that is helpful in understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets that can be challenging to store, stream, analyze, and archive; the data may be sensitive to sensor noise and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.

DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains)

Intercomparison of global storm-resolving (5 km or finer) model simulations; used as the target of the emulator. Data can be found here.

Use Case | Data Gap Summary
Development of hybrid-climate models

Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.

Direct measurement of methane emission of rice paddies

Direct measurement of methane emissions from rice paddies, using instruments and sampling systems placed in the paddies to measure methane concentrations in the air above the fields or in the soil.

Use Case | Data Gap Summary
Estimation of methane emissions from rice paddies

There is a lack of direct observations of methane emissions from rice paddies.

Distribution system simulators

Distribution system simulators such as OpenDSS and GridLAB-D are crucial for understanding the hosting capacity of distribution-level substation feeders because they allow for the analysis of various factors that can affect the stability and reliability of the power grid. These factors include voltage limits, thermal capability, control parameters, and fault current, among others. By simulating different scenarios and conditions, including the integration of distributed energy resources (DERs) such as photovoltaic (PV) solar panels, these tools can provide insights into how the grid can be optimized to accommodate these resources without compromising safety and reliability. OpenDSS is free to use as an alternative when real circuit feeder data from distribution utilities is unavailable.

Use Case | Data Gap Summary
Distribution-side hosting capacity estimation

While OpenDSS and GridLAB-D are free to use as an alternative when real distribution substation circuit feeder data is unavailable, site-specific or scenario studies may need data from other sources to verify simulation results. Actual hosting capacity may differ from simulation results due to differences in load, environmental conditions, and the level of DER penetration.

Drone images

Drone images have revolutionized various fields by providing high-resolution, aerial perspectives that were previously difficult to obtain. Equipped with advanced cameras and sensors, drones capture detailed visual data from above, offering insights into landscapes, infrastructure, and environmental changes.

Use Case | Data Gap Summary
Early detection of fire

Thermal images captured by drones are highly valuable, but high-resolution sensors are costly.

Drone images for biodiversity

Like camera traps, drones can offer high-resolution, relatively close-range images for species identification, individual identification, and environment reconstruction. As with camera trap images, most drone images are scattered across disparate sources. Some such data is hosted on www.lila.science.

Use Case | Data Gap Summary
Assessing forest restoration outcomes

There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.

Automatic individual re-identification for wildlife

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Digital reconstruction of the environment

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

Terrestrial wildlife detection and species classification

The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.

ENS

Ensemble forecasts up to 15 days ahead, generated by the ECMWF numerical weather prediction model; used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.

Use Case | Data Gap Summary
Bias-correction of weather forecasts

As with HRES, the biggest challenge of ENS is that only a portion of it is available to the public for free.

Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge of ENS is that only a portion of it is available to the public for free.

EPRI10: Transmission control center alarm and operational data set

Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format which includes semi-structured text descriptions of individual alarm events. Often the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.

View dataset

Use Case | Data Gap Summary
Analysis of grid reliability events

Access to EPRI10 grid alarm data is currently limited to within EPRI. Usability gaps result from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions resulting from alarm trigger events need field verification to determine whether a fault or non-fault event occurred.
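
As a small illustration of the preprocessing that redundant alarms call for, the sketch below suppresses repeated alarms from the same station and signal that arrive within a short time window. The record fields (`timestamp_ms`, `station`, `signal_id`) and the 5-second window are assumptions made for this example and are not taken from the EPRI10 schema.

```python
def suppress_repeats(events, window_ms=5000):
    """Keep an alarm only if the same (station, signal_id) was not kept within window_ms."""
    last_kept = {}
    kept = []
    for e in sorted(events, key=lambda e: e["timestamp_ms"]):
        key = (e["station"], e["signal_id"])
        prev = last_kept.get(key)
        # Compare against the last alarm that was kept, not the last one seen.
        if prev is None or e["timestamp_ms"] - prev > window_ms:
            kept.append(e)
            last_kept[key] = e["timestamp_ms"]
    return kept


# Hypothetical alarm stream with a burst of repeats for one signal.
events = [
    {"timestamp_ms": 0, "station": "S1", "signal_id": "BRK-OPEN"},
    {"timestamp_ms": 1000, "station": "S1", "signal_id": "BRK-OPEN"},
    {"timestamp_ms": 1200, "station": "S2", "signal_id": "UV-TRIP"},
    {"timestamp_ms": 7000, "station": "S1", "signal_id": "BRK-OPEN"},
]
reduced = suppress_repeats(events)
for e in reduced:
    print(e["timestamp_ms"], e["station"], e["signal_id"])
```

A fixed window is a crude heuristic; in practice the right grouping likely depends on alarm priority and on how codes differ between sensors, assets, and lines.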

ERA5

Atmospheric reanalysis data integrates both in-situ and remote sensing observations, including data from weather stations, satellites, and radar. This comprehensive dataset can be downloaded from the provided link.

View dataset

Use Case | Data Gap Summary
Bias-correction of climate projections

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

Development of hybrid-climate models

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

Weather forecasting: Subseasonal horizon

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.

Wildfire prediction: Short-term (3-7 days)

ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of data is another common challenge for users of this dataset.
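
One common way to cope with slow, very large downloads, offered here as a general sketch rather than guidance from the data provider, is to split a retrieval into many small monthly requests that can be queued, retried, and resumed independently. The code below only builds the request dictionaries; the key names and the variable name `2m_temperature` are assumptions about the request format and should be checked against the dataset's download form before use.

```python
from calendar import monthrange


def monthly_requests(variable, start_year, end_year):
    """Yield one request dictionary per calendar month in the given range."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            n_days = monthrange(year, month)[1]
            yield {
                "variable": [variable],
                "year": str(year),
                "month": f"{month:02d}",
                "day": [f"{d:02d}" for d in range(1, n_days + 1)],
                "time": [f"{h:02d}:00" for h in range(24)],
            }


requests = list(monthly_requests("2m_temperature", 2020, 2021))
print(f"{len(requests)} requests planned")
print(requests[1]["year"], requests[1]["month"], len(requests[1]["day"]), "days")
```

Each dictionary could then be submitted through a download client such as the `cdsapi` package and written to its own file, so that a failed or stalled month can be re-requested without restarting the whole retrieval.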

ESA land cover map

Yearly land cover classification gridded map at 300-m resolution from 1992 to present produced by European Space Agency (ESA) Climate Change Initiative (CCI) https://catalogue.ceda.ac.uk/uuid/b382ebe6679d44b8b0e68ea4ef4b701c.

Higher resolution land cover maps (at 10-m resolution) are also available for years 2020 and 2021 (https://esa-worldcover.org/en).

For fire prediction, these maps provide fine-grained information on available fuel.

Use Case | Data Gap Summary
Wildfire prediction: Short-term (3-7 days)

Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.

ESRI land cover map

Sentinel-2 10-m annual map of Earth’s land surface from 2017-2023.

There are also other land cover maps available: https://gisgeography.com/free-global-land-cover-land-use-data/.

Use Case | Data Gap Summary
Emission dataset compiled from FAO statistics

Dataset taken from FAO statistics and extrapolated spatially.

Use Case | Data Gap Summary
Modeling effects of soil processes on soil organic carbon

The data is extrapolated from national-level statistics, so its accuracy at the local level is unknown.

Exposure data

Exposure is defined as the representative value of assests potentially exposed to a natural hazard occurrence. It can be described by a wide range of features, such as GDP, population, buildings, agriculture, depending on the risk exposed to.

There are global open data as well as proprietary data with more detailed information coming from well estabilished insurance markets.

It can be socio-economic data or structural (building occupancy and construction class) data. Two open-source structural data are OpenStreetMap and OpenQuake GEM project.

Use Case | Data Gap Summary
Disaster risk assessment

Accessibility and reliability of the data are major issues.

Faraday: Synthetic smart meter data

Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open up smart meter data, Octopus Energy’s Centre for Net Zero has generated a synthetic dataset conditioned on the presence of low carbon technologies, energy efficiency, and property type, using a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier.

Use Case | Data Gap Summary
Short-term electricity load forecasting

Faraday synthetic AMI data is a response to the consumer privacy bottlenecks that limit access to building-level demand data. However, although the model is trained on actual AMI data, generating synthetic data requires a priori knowledge of building type, efficiency, and presence of low carbon technology. Furthermore, since the model is trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or data aggregated at the substation level to assess its effectiveness. Faraday is currently accessible through the Centre for Net Zero’s API.
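
As a rough illustration of the validation step mentioned above, the sketch below sums synthetic building-level profiles and compares the total against a measured substation-level series using mean absolute percentage error (MAPE). The function names, the choice of MAPE, and all numbers are assumptions made for illustration. The sketch does not use the Faraday API or any real data.

```python
def aggregate_profiles(profiles):
    """Sum per-building load profiles (equal-length lists, e.g. kWh per interval)."""
    lengths = {len(profile) for profile in profiles}
    if len(lengths) != 1:
        raise ValueError("all profiles must have the same number of intervals")
    return [sum(values) for values in zip(*profiles)]


def mean_absolute_percentage_error(actual, predicted):
    """MAPE in percent, skipping intervals where the actual value is zero."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to compare against")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Made-up synthetic profiles for three buildings over four intervals.
synthetic_profiles = [
    [0.5, 0.6, 0.9, 1.2],
    [0.3, 0.4, 0.8, 1.0],
    [0.2, 0.2, 0.5, 0.8],
]
# Made-up measured load at the substation feeding those buildings.
substation_measured = [1.0, 1.0, 2.0, 4.0]

synthetic_total = aggregate_profiles(synthetic_profiles)
error_pct = mean_absolute_percentage_error(substation_measured, synthetic_total)
print(f"MAPE against substation aggregate: {error_pct:.2f}%")
```

Comparing at the substation level avoids needing privacy-protected household readings, at the cost of only checking the aggregate shape rather than individual profiles.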

FathomNet

FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. It can be used to train, test, and validate state-of-the-art artificial intelligence algorithms to help us understand our ocean and its inhabitants.

Use Case | Data Gap Summary
Marine wildlife detection and species classification

The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.

Floating INfrastructure for Ocean observations FINO3

FINO3 is an offshore research platform with a wind mast that measures wind speed and wind direction. Its datasets also include time series of temperature, air pressure, relative humidity, global radiation, and precipitation. Images taken from the platform provide a direct snapshot of environmental conditions. The platform is located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Wind measurements are taken between 32 and 102 meters above sea level, with wind speed measured at 10-meter height intervals. Data has been collected from August 2009 to the present day.

Use Case | Data Gap Summary
Offshore wind power forecasting: Long-term (3 hours-1 year)

Due to their location, FINO platform sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is often nontrivial because weather conditions must be suitable for human intervention. This can directly degrade data quality for periods lasting from several weeks to a season. Coverage is also constrained to the platform and associated wind farm location.
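
Before training a forecasting model on such records, it can help to locate the outage periods explicitly. The following minimal sketch flags stretches with no valid readings that exceed a chosen tolerance. The 10-minute reporting interval, the one-hour tolerance, and the timestamps are assumptions for illustration and do not describe the actual FINO3 data format.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_allowed_gap):
    """Return (gap_start, gap_end, duration) for every pair of consecutive
    readings that are further apart than max_allowed_gap."""
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_allowed_gap:
            gaps.append((previous, current, current - previous))
    return gaps


# Made-up series: six readings, a three-week sensor outage, six more readings.
step = timedelta(minutes=10)  # assumed reporting interval
start = datetime(2020, 1, 1)
before_outage = [start + i * step for i in range(6)]
resume = before_outage[-1] + timedelta(days=21)
after_outage = [resume + i * step for i in range(6)]

gaps = find_gaps(before_outage + after_outage, max_allowed_gap=timedelta(hours=1))
for gap_start, gap_end, duration in gaps:
    print(f"No valid readings from {gap_start} to {gap_end} ({duration.days} days)")
```

Knowing where the gaps are makes it possible to exclude or impute those periods deliberately rather than letting them silently bias a model.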

GBIF

GBIF—the Global Biodiversity Information Facility—is an international network and data infrastructure funded by the world’s governments. It offers open access to global biodiversity data. It sets common standards for sharing species records collected from various sources, like museum specimens and modern technologies. Using standards like Darwin Core, GBIF.org indexes millions of species records, accessible under open licenses, supporting scientific research and policy-making.

Use Case | Data Gap Summary
Terrestrial wildlife detection and species classification

While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.

GEDI lidar

The Global Ecosystem Dynamics Investigation (GEDI) is a joint mission between NASA and the University of Maryland. It uses three lasers to capture and then construct detailed three-dimensional (3D) maps of forest canopy height and the distribution of branches and leaves. By accurately measuring forests in 3D, GEDI data play an important role in estimating the forest height as well as canopy height, and thus understanding the amounts of biomass and carbon forests store and how much they lose when disturbed.

Use Case | Data Gap Summary
Estimation of forest carbon stock

There is uncertainty in the data.

Grid event signature library

The Grid Event Signature Library

Grid2Op and PandaPower

Grid2Op is a power systems simulation framework for applying reinforcement learning to electricity network operation, with a focus on using topology to control flows on the grid. Grid2Op allows users to control voltages by manipulating shunts or changing generator setpoint values, influence active generation through redispatching, and manipulate storage units such as batteries or pumped storage to produce or absorb energy from the grid when needed. The grid is represented as a graph in which nodes are buses and edges correspond to power lines and transformers. Grid2Op has several available environments with different network topologies, as well as variables that can be monitored as observations. The environment is designed for reinforcement learning agents to act upon through a variety of actions, some binary and some continuous. The action space includes changing bus, changing line status, setting storage, curtailment, redispatching, setting bus values, and setting line status. Multiple reward functions are also available in the platform for experimentation with different agents. Note that Grid2Op itself does not model the grid equations or prescribe which solver to adopt. Data on how the power grid evolves is represented by the “Chronics.” The solver that computes the state of the grid is represented by the “Backend,” which utilizes PandaPower to compute power flows.
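
The graph view described above can be illustrated with a small toy model. The sketch below is plain Python and deliberately does not use the Grid2Op or PandaPower APIs. The class `ToyGrid`, its methods, and the three-bus example are hypothetical. It only shows how buses as nodes, lines as edges, and a "set line status" action fit together.

```python
from collections import deque


class ToyGrid:
    """Toy graph view of a grid: buses are nodes, lines are edges with an on/off status."""

    def __init__(self, buses, lines):
        self.buses = set(buses)
        # line name -> [bus_a, bus_b, in_service]
        self.lines = {name: [a, b, True] for name, (a, b) in lines.items()}

    def set_line_status(self, name, in_service):
        """Topology action: connect or disconnect a single line."""
        self.lines[name][2] = in_service

    def is_connected(self):
        """Breadth-first search over in-service lines to check that every bus is reachable."""
        adjacency = {bus: set() for bus in self.buses}
        for a, b, in_service in self.lines.values():
            if in_service:
                adjacency[a].add(b)
                adjacency[b].add(a)
        start = next(iter(self.buses))
        seen = {start}
        queue = deque([start])
        while queue:
            bus = queue.popleft()
            for neighbor in adjacency[bus] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        return seen == self.buses


# Hypothetical three-bus ring.
grid = ToyGrid(buses=[1, 2, 3], lines={"l12": (1, 2), "l23": (2, 3), "l13": (1, 3)})
grid.set_line_status("l12", False)
connected_after_one_outage = grid.is_connected()
grid.set_line_status("l13", False)
connected_after_two_outages = grid.is_connected()
print(connected_after_one_outage, connected_after_two_outages)  # True False
```

In the ring, losing one line keeps all buses connected, while losing a second line isolates bus 1. A real environment would additionally compute power flows on the remaining lines, which this toy model omits.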

Use Case | Data Gap Summary
Improving power grid optimization

Grid2Op is a reinforcement learning framework that builds an environment from topologies, selected grid observations, a selected reward function, and a set of actions for an agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps of 5 minutes cannot capture complex transients and can limit the effectiveness of some actions in the action space relative to others. Furthermore, customizing Grid2Op can be challenging, as the platform does not allow for single-agent to multi-agent conversion, and it is not a suitable environment for cascading failure scenarios due to its game-over rules.

Ground survey of building information

On-site collection of data to accurately map and measure the physical dimensions and boundaries of buildings. This survey is typically conducted using a variety of methods and tools to ensure precise and detailed mapping.

Ground survey of land use and land management

The direct collection of data through field observations to understand how land is utilized and managed.

Use Case | Data Gap Summary
Detection of climate-induced ecosystem changes

Data access is limited by institutional barriers and other restrictions.

Ground-survey based forest inventory data

Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models.

Use Case | Data Gap Summary
Estimation of forest carbon stock

The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.

HRES

A single high-resolution forecast up to 10 days ahead, generated by the ECMWF numerical weather prediction model, the Integrated Forecasting System (IFS). It is usually used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.

Use Case | Data Gap Summary
Bias-correction of weather forecasts

The biggest challenge with using HRES data is that only a portion of it is available to the public for free.

Weather forecasting: Short-to-medium term (1-14 days)

The biggest challenge of HRES is that only a portion of it is available to the public for free.

Hazard data

Hazard data used for risk assessments are usually presented in the form of a catalog of hypothetical events with characteristics derived from, and statistically consistent with, the observational record. Some hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse, as well as in the Risk Data Library of the World Bank.

Use Case | Data Gap Summary
Disaster risk assessment

The resolution of current hazard data is not sufficient for effective physical risk assessment.

Health data

Health data refers to information related to individuals’ physical and mental well-being. This can include a wide range of data, such as medical records, health surveys, healthcare utilization, and epidemiological data.

Use Case | Data Gap Summary
Assessment of climate impacts on public health

The biggest issue for health data is its limited and restricted access.

High-resolution weather forecast (HRRR)

Near-term weather forecasts from the High-Resolution Rapid Refresh (HRRR) model. HRRR is a real-time, 3-km resolution, hourly updated, cloud-resolving, convection-allowing atmospheric model. Radar data is assimilated into the HRRR every 15 min over a 1-h period.

Use Case | Data Gap Summary
Weather forecasting: Near-term (< 24 hours)

Data volume is large, and only data covering the US is available.

Historical climate observations

Climate observations of the past. Reanalysis datasets like ERA5 provide global-scale data at coarse resolution. Climate data aggregated from local weather station observations offer a more granular view.

Use Case | Data Gap Summary
Assessment of climate impacts on public health

Processing climate data and integrating it with health data are major challenges.

Detection of climate-induced ecosystem changes

For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.

LBNL: Solar panel PV system dataset

Lawrence Berkeley National Lab (LBNL) Solar Panel PV System Dataset is a small tabular dataset that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery rated capacity. The LBNL solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes.
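
The coverage figures quoted above imply a few derived quantities, worked out in the short sketch below. The input numbers come from the description. The variable names and the derived values are illustrative arithmetic only and are not figures published with the dataset.

```python
# Figures quoted in the dataset description.
systems_in_dataset = 1_600_000      # individual PV systems collected
coverage_share = 0.81               # share of all U.S. distributed PV systems through 2018
priced_subset = 680_000             # host-owned systems with installed price data
priced_installed_2018 = 127_000     # of those, installed in 2018

# Derived quantities (approximate, because the inputs are rounded).
implied_total_systems = systems_in_dataset / coverage_share
priced_share_of_dataset = priced_subset / systems_in_dataset
share_2018_of_priced = priced_installed_2018 / priced_subset

print(f"Implied total U.S. distributed PV systems: {implied_total_systems:,.0f}")
print(f"Share of dataset with price data: {priced_share_of_dataset:.1%}")
print(f"Share of priced systems installed in 2018: {share_2018_of_priced:.1%}")
```

In other words, the price analysis draws on well under half of the collected systems, which is worth keeping in mind when using the dataset for cost-related features.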

Use Case | Data Gap Summary
Solar installation site assessment

The LBNL solar panel PV system dataset excluded third-party-owned systems and systems with battery backup. Since the data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap, as some of the data is historical and may not reflect current pricing and costs of PV systems.

Large-eddy simulations

Very high resolution (finer than 150 m) atmospheric simulations where atmospheric turbulence is explicitly resolved in the model.

Use Case | Data Gap Summary
Development of hybrid-climate models

Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.
