Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.
In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.
This project is currently in its beta phase, with ongoing improvements to content and usability. The information provided is not exhaustive, and may contain errors. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.
Accelerating and improving weather forecasting: Near-term (< 24 hours)
Accurate near-term (< 24 hours ahead) weather forecasting is critical for climate change mitigation (e.g., solar panel deployment) and adaptation (e.g., crisis management during disasters), with applications requiring high spatial and temporal resolution of temperature, precipitation, wind, and cloud coverage.
Machine learning can help make these forecasts more computationally efficient and accurate while maintaining or improving the high resolution needed for climate applications.
The main data gaps include limited geographic coverage (primarily US-centric data), extremely large data volumes that are difficult to transfer and process, and inconsistent data formats from different sources.
Addressing these gaps requires expanding coverage to global regions (especially the Global South), providing cloud-based computational resources alongside the data, and developing standardized formats for multi-source data integration.
Data volume is large and only data specific to the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
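The remote-access pattern suggested above can be sketched in a few lines. This is a toy illustration using a local memory-mapped array as a stand-in for a cloud-hosted store (in practice, a lazily opened Zarr or NetCDF archive would play this role); all dimensions and values are assumptions, not a real dataset. The point is that only the requested spatio-temporal window is read, not the full high-volume archive.

```python
import numpy as np

# Hypothetical dimensions for a high-resolution gridded weather product
# (illustrative assumptions, not a real dataset).
n_times, n_lat, n_lon = 24, 200, 300

# Simulate a large on-disk array; a cloud-hosted Zarr/NetCDF store opened
# lazily would serve the same purpose without local transfer.
arr = np.memmap("weather.dat", dtype=np.float32, mode="w+",
                shape=(n_times, n_lat, n_lon))
arr[:] = 0.0
arr[5, 100:110, 200:210] = 1.5  # pretend precipitation signal
arr.flush()

# Reopen read-only and pull just a small spatio-temporal window --
# only the requested bytes are read, not the full array.
ro = np.memmap("weather.dat", dtype=np.float32, mode="r",
               shape=(n_times, n_lat, n_lon))
window = np.array(ro[5, 100:110, 200:210])
print(window.mean())  # 1.5
```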
Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasts and climate modeling because of its high resolution and ready-for-analysis characteristics. However, large volumes of observations from radiosondes, balloons, and weather stations are largely underutilized. Creating a well-structured dataset like ERA5 but with more observational data would be valuable.
Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Much radar data is restricted to academic and research purposes only.
O2: Obtainability > Accessibility
Radar data from many countries are not open to the public. They must be purchased or formally requested. Different agencies apply differing quality control protocols, making global-scale analysis challenging.
U1: Usability > Structure
Radar data from different sources vary in format, spatial resolution, and temporal resolution, making data assimilation difficult.
S2: Sufficiency > Coverage
There is insufficient data or no data available from the Global South.
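One piece of the structure gap above, differing spatial resolution, can be bridged with regridding. The sketch below uses simple nearest-neighbour regridding to bring two hypothetical radar products onto one common analysis grid; the grids, values, and resolutions are illustrative assumptions, and operational pipelines would typically use more careful interpolation.

```python
import numpy as np

def regrid_nearest(field, src_lat, src_lon, dst_lat, dst_lon):
    """Nearest-neighbour regridding of a 2-D field onto a target grid."""
    i = np.abs(src_lat[:, None] - dst_lat[None, :]).argmin(axis=0)
    j = np.abs(src_lon[:, None] - dst_lon[None, :]).argmin(axis=0)
    return field[np.ix_(i, j)]

# Two hypothetical radar products on different grids (illustrative values).
lat_a, lon_a = np.linspace(30, 40, 101), np.linspace(-100, -90, 101)
lat_b, lon_b = np.linspace(30, 40, 51), np.linspace(-100, -90, 51)
refl_a = np.random.default_rng(0).random((101, 101))
refl_b = np.random.default_rng(1).random((51, 51))

# Common analysis grid: both products become directly comparable.
lat_c, lon_c = np.linspace(30, 40, 81), np.linspace(-100, -90, 81)
a_on_c = regrid_nearest(refl_a, lat_a, lon_a, lat_c, lon_c)
b_on_c = regrid_nearest(refl_b, lat_b, lon_b, lat_c, lon_c)
print(a_on_c.shape, b_on_c.shape)  # (81, 81) (81, 81)
```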
Accelerating building energy models
Building energy modeling (also called building performance simulation) is key across an array of use cases that can help reduce energy demand in buildings, including architectural design, heating, ventilation, and air conditioning (typically abbreviated HVAC) design and control, building performance rating, and building stock analysis.
Traditional building energy modeling, such as with the software EnergyPlus, relies on detailed physics models with significant computational complexity and processing time. Machine learning models can significantly enhance evaluation by providing fast emulators for these models based on synthetic and real-world data, enabling faster prototyping and optimization of building design and operations along multiple comfort, consumption, and environmental objectives.
Traditional models and ML-based emulation both require precise inputs about the building design, its usage, and the physical and environmental conditions surrounding it. However, information about building usage and design is often kept in silos, while information about the surroundings is, when available, dispersed across various datasets. Very few benchmarks gather all of this information for given buildings.
Closing these gaps involves releasing anonymized usage data, working on building bridges between relevant datasets, and developing benchmark datasets. This may enable testing models across more geographies and building types to reduce existing biases and uncertainties attached to building energy models.
Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Most of the energy demand data is not freely available. Reasons include the reluctance of private companies to share the data and privacy concerns with respect to the residents of the buildings. Such data may be obtained for research via non-disclosure agreements, often after lengthy bureaucratic approval. This situation makes the development of open-access benchmark datasets complex. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
U2: Usability > Aggregation
The different variables needed may not always be available together. One may need to match energy demand with building stock information and climatic data. Reusable open-source tools may ease this process.
S2: Sufficiency > Coverage
Most datasets are from test beds, buildings, and contributing households from the United States. Similar data from other regions would require data collection as household usage behavior may differ depending on culture, location, building age, and weather. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected. To overcome this gap, interpolation techniques may be employed and recorded.
S6: Sufficiency > Missing Components
Certain detailed variables about the building design and occupancy may not be recorded. Such data points are difficult to obtain without new data collection. Building data typically does not include grid interactive data or signals from the utility side with respect to control or demand side management. Such data can be difficult to obtain or require special permissions. By enabling the collection of utility side signals, utility-initiated auto-demand response (auto-DR) and load shifting could be better assessed.
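The aggregation and granularity gaps above (U2 and S3) often come down to matching datasets with different temporal resolutions. A minimal sketch, assuming hypothetical 15-minute energy demand and hourly weather data, shows one reusable pattern: resample to a common resolution, join on timestamps, and record the aggregation choices made.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs (names and values are illustrative assumptions):
# 15-minute energy demand and hourly outdoor temperature for one building.
rng = np.random.default_rng(0)
demand = pd.DataFrame(
    {"kwh": rng.random(96)},
    index=pd.date_range("2024-01-01", periods=96, freq="15min"),
)
weather = pd.DataFrame(
    {"temp_c": np.linspace(-2, 5, 24)},
    index=pd.date_range("2024-01-01", periods=24, freq="h"),
)

# Harmonize resolutions: aggregate demand to hourly, then join on timestamp.
hourly = demand["kwh"].resample("h").sum().to_frame()
merged = hourly.join(weather, how="inner")

# Record the aggregation/interpolation choices alongside the data.
merged.attrs["provenance"] = "demand summed 15min->1h; weather as observed"
print(merged.shape)  # (24, 2)
```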
Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties that may not be available for traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data produced for different wind directions can become challenging to manage. Future data collection aimed at verifying simulation output would benefit surrogate or proxy approaches to the computationally expensive Navier-Stokes equations. Moreover, coverage is often restricted to modern building approaches, leaving out passive building techniques from vernacular architecture developed by indigenous communities.
Data Gap Type
Data Gap Details
W: Wish
Such datasets do not exist and require dedicated work to gather inputs, generate the data via simulations, and ensure that the simulations are reliable by verifying them with real-world data. Licensing and privacy issues may also be important aspects of such efforts.
While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings receive no similar focus, which is surprising given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics for direct sunlight access, rely on annual averages rather than seasonal values, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyle. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in traditional urban residential spaces and its effects on thermal comfort and operational consumption; vernacular architecture, which is specific to a local region and culture, may not share these objectives, preferring more indoor-outdoor transitional spaces, earthen materials, and less emphasis on windows and incident natural sunlight.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Depending on the software selected, the intended use, and the number of features required, simulation software is typically available only for purchase.
S2: Sufficiency > Coverage
Vernacular architecture, characterized by traditional building styles and techniques specific to a local region or culture, is not covered in simulation tools. Most simulation output focuses on residential areas in primarily urban regions, aiming to minimize future operational costs based on assumed illuminance thresholds that may not be universal. Adding the ability to evaluate passive design strategies adapted to a specific climate, and expanding the material palette to include high-thermal-inertia walls and roofs such as earthen or thatched construction, would enable additional thermal comfort studies for a given incident illuminance. Considering cultural relationships between outdoor and indoor spaces would provide even greater context for simulation studies and their usefulness in new construction across diverse regions.
S3: Sufficiency > Granularity
Simulations use fixed occupancy schedules which work well in the context of commercial buildings but are overly prescriptive in the context of residential buildings where user occupancy may vary depending on the number of occupants, time of day, day of week, and season. Residential buildings are multipurpose and can be characterized with a member spending more time in some areas rather than others depending on activity. This gap can be alleviated by adapting and expanding simulation inputs to take diverse occupancy scenarios into consideration.
Current DPMs take into account annual averages rather than granular information about seasonal variations in daylight availability. While some advances have been made to incorporate this information through tools like Daysim, which defines new DPMs for residential buildings, further work is needed for regions where occupants may want to minimize direct light access and focus more on diffuse lighting. Expanding studies to clients in warmer, more arid climates may yield different thresholds and comfort parameters depending on preferences and lifestyle, and may even account for daylight oversupply, glare, and thermal discomfort.
Materials used in the construction process of the building may change after initial simulation development depending on availability. Finalized building materials and interior absorption and reflectance may diverge from those simulated. Use of dynamic shading devices could also decrease indoor temperature due to incident irradiance. Simulated results could be provided over a range.
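The occupancy-granularity gap described above can be addressed by replacing fixed schedules with stochastic ones. Below is a toy sketch of a stochastic hourly occupancy generator; the weekday/weekend probabilities are illustrative assumptions, not calibrated survey data, and a real study would fit them to observed household behavior.

```python
import numpy as np

def sample_occupancy(n_days, seed=0):
    """Sample a stochastic hourly occupancy schedule (1 = occupied).

    A toy alternative to fixed schedules: weekday evenings/nights and
    weekend daytimes are more likely to be occupied. All probabilities
    below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    hours = np.arange(24)
    weekday_p = np.where((hours >= 18) | (hours < 8), 0.9, 0.3)
    weekend_p = np.where((hours >= 9) & (hours < 23), 0.7, 0.9)
    out = []
    for d in range(n_days):
        p = weekend_p if d % 7 >= 5 else weekday_p
        out.append(rng.random(24) < p)
    return np.concatenate(out).astype(float)

schedule = sample_occupancy(7)  # one week of hourly occupancy
print(schedule.shape)  # (168,)
```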
Accelerating data-driven generation of climate simulations
Climate simulation using physics-based Earth system models is computationally intensive and time-consuming, limiting the exploration of different climate scenarios.
ML can accelerate this process by creating surrogate models that approximate complex Earth system model simulations, enabling rapid generation of climate projections under various greenhouse gas emission scenarios.
Current ML approaches are limited by the availability of diverse training data from multiple climate models, with most datasets featuring only single-model simulations or inconsistent data structures across models.
Addressing these gaps requires standardizing data formats across climate models, making high-volume data more accessible through cloud-based solutions, and improving model quality to reduce biases and uncertainties in simulations. Closing these data gaps would enable more robust ML emulators capable of producing reliable climate projections at a fraction of the computational cost, accelerating climate research and supporting more informed policy decisions.
The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The data imposes massive computational requirements; cloud-based platforms and data subsetting tools can improve accessibility.
U1: Usability > Structure
Formats are inconsistent across models; standardized naming conventions and preprocessing pipelines can enable seamless multi-model integration.
R1: Reliability > Quality
Future projections carry large uncertainties; model evaluation frameworks and ensemble weighting methods can help quantify and reduce them.
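The standardization point above (U1) is often solved with an explicit per-model variable mapping. A minimal sketch, assuming hypothetical model names and native variable conventions (the mapping below is an illustration, not an official standard), converts each model's output to one common name and unit before multi-model training.

```python
import numpy as np

# Per-model metadata describing native variable names and units; the
# mappings below are illustrative assumptions, not an official convention.
VARIABLE_MAP = {
    "model_a": {"surface_temp": ("tas", "K")},
    "model_b": {"surface_temp": ("t2m", "degC")},
}

def to_standard(model, variable, data):
    """Return (native_name, data) with data converted to a common unit (degC)."""
    native_name, unit = VARIABLE_MAP[model][variable]
    data = np.asarray(data, dtype=float)
    if unit == "K":
        data = data - 273.15  # Kelvin -> Celsius
    return native_name, data

name_a, temp_a = to_standard("model_a", "surface_temp", [273.15, 293.15])
name_b, temp_b = to_standard("model_b", "surface_temp", [0.0, 20.0])
print(temp_a, temp_b)  # [ 0. 20.] [ 0. 20.]
```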
Accelerating distribution-side hosting capacity estimations
Transitioning power grids from carbon-based generation to renewable sources requires restructuring from unidirectional to bidirectional energy networks, which stresses existing systems—especially at the low-voltage distribution level. The hosting capacity of distribution feeders determines how much distributed renewable generation can be safely integrated without triggering safety equipment or compromising power quality.
Traditional methods for assessing distribution network hosting capacity rely on computationally expensive power flow simulations that are difficult to perform in real-time. Machine learning models can serve as surrogate models by capturing spatio-temporal patterns across multiple data streams, enabling real-time hosting capacity estimation and accelerated scenario evaluation through reinforcement learning.
A significant data gap is the limited availability of real distribution feeder data, requiring researchers to rely on simulations that may not accurately reflect actual grid conditions due to differences in load patterns, environmental factors, and distributed energy resource (DER) penetration levels.
Distribution system operators, utilities, and researchers can collaborate to improve data sharing while protecting sensitive information, thereby enabling more accurate hosting capacity assessments and facilitating higher renewable energy integration in distribution networks.
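The surrogate-model idea above can be illustrated with a toy regression. The feeder features and the "true" hosting capacity below are synthetic assumptions standing in for power-flow study output; the point is that once fitted, the surrogate returns an estimate for a new operating point without rerunning a power flow.

```python
import numpy as np

# Synthetic training data standing in for power-flow results (assumptions).
rng = np.random.default_rng(0)
n = 500
load = rng.uniform(0.2, 1.0, n)   # feeder loading (p.u.)
der = rng.uniform(0.0, 0.8, n)    # existing DER penetration (p.u.)
# Pretend hosting capacity from a power-flow study, plus measurement noise.
capacity = 2.0 - 0.8 * load - 1.1 * der + rng.normal(0, 0.02, n)

# Fit a linear surrogate with ordinary least squares.
X = np.column_stack([np.ones(n), load, der])
coef, *_ = np.linalg.lstsq(X, capacity, rcond=None)

# Real-time estimate for a new operating point -- no power flow needed.
estimate = coef @ np.array([1.0, 0.5, 0.3])
print(round(float(estimate), 2))
```

Real studies would use richer features (topology, voltage profiles) and nonlinear models, but the fit-then-query workflow is the same.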
While OpenDSS and GridLab-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Realistic distribution system studies require aggregating and collating data from multiple external sources regarding network topology, load profiles, and DER penetration for the specific region of interest.
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities and/or the Distribution System Operator (DSO) to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
Simulator studies require real deployment data from substations for verification, as actual hosting capacity may vary based on load conditions, environmental factors, and DER penetration levels in the service area.
Accelerating post-disaster damage assessments
Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies.
ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.
Current datasets for ML-based damage assessment face significant geographic bias and granularity issues, limiting their effectiveness in global contexts and for detailed damage classification.
Expanding geographic coverage beyond North America and enhancing damage severity classifications would enable more accurate and globally applicable ML damage assessment models, improving disaster response worldwide.
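At its simplest, ML-assisted damage assessment compares pre- and post-disaster imagery. The sketch below is a toy change-detection example on synthetic arrays standing in for co-registered satellite tiles; the threshold and image values are illustrative assumptions, and real systems use learned models rather than a fixed threshold.

```python
import numpy as np

# Toy pre/post-disaster image pair (single band, values in [0, 1]).
# Real pipelines would use co-registered satellite tiles.
rng = np.random.default_rng(0)
pre = rng.random((64, 64))
post = pre.copy()
post[20:30, 20:30] -= 0.5  # simulate a damaged block (reflectance change)

# Flag pixels whose reflectance changed beyond an assumed threshold.
change = np.abs(post - pre) > 0.2
damaged_fraction = change.mean()
print(change.sum())  # 100 flagged pixels
```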
Financial loss data for disasters is primarily proprietary and inaccessible to researchers, limiting the development of comprehensive disaster impact assessment models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes to improve disaster response and planning.
Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.
Data Gap Type
Data Gap Details
S4: Sufficiency > Timeliness
Both pre- and post-disaster imagery are needed, but pre-disaster imagery is sometimes outdated and does not reflect conditions immediately before the disaster.
S3: Sufficiency > Granularity
Accurate damage assessment requires high-resolution images, but the resolution of current publicly open datasets is inadequate for this purpose. Some commercial high-resolution images should be made available for research purposes at no cost.
The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is no differentiation of grades of damage. More granular information about the severity of damage is needed for more precise assessments.
S2: Sufficiency > Coverage
Data is heavily biased towards North America. Similar data from other parts of the world is urgently needed.
Accelerating the design of new carbon-absorbing materials
Carbon sequestration through absorption methods can effectively reduce CO2 levels in the atmosphere. Engineered molecules, known as carbon sorbents, can be designed to bind selectively to CO2. Traditionally, developing these molecules requires in-lab experimentation, which can be time- and resource-intensive because replicated trials are needed to characterize adsorbent properties. Additionally, the search space of possible molecules is very large and non-trivial to explore directly through experiment.
Machine learning can significantly accelerate materials discovery by systematically generating and evaluating candidate molecule properties based on structure, thereby facilitating rapid iteration.
There is a lack of openly-accessible lab measurements to train ML simulation models.
Multiple initiatives could be taken to close this gap, including creating industry-research data sharing initiatives or establishing mandatory data sharing requirements for scientific publications.
The major challenge is that data is not shared with the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data related to carbon absorption materials is often not readily accessible to the public, as it is typically withheld until commercial products are developed. While it is possible to scrape data from published literature, this approach can be cumbersome, especially for large datasets. To advance research and innovation in this field, establishing mandatory data sharing as a requirement for publication is essential. When a paper is published, authors should be required to provide their data in open, machine-readable formats to facilitate accessibility and usability.
Creating open initiatives where companies and institutions recognize the mutual benefits of data sharing is also vital. Until such initiatives demonstrate clear advantages for all stakeholders, private companies may be hesitant to share proprietary data. Initiatives like OpenDAC are promising steps toward fostering collaboration and transparency in the field.
Assessment of climate impacts on public health
Climate change has major implications for public health. ML can help analyze the relationships between climate variables and health outcomes to assess how changes in climate conditions affect public health.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
There are, in general, few datasets that cover the full spectrum of population characteristics (age, gender, economic status, etc.). To make good use of available data, there should be more effort to integrate data from disparate sources, such as through the creation of data repositories and open community data standards.
U4: Usability > Documentation
Some data repositories are available, but the data is not always accompanied by the source code that created it or by other forms of good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data is usually in raster files or gridded format, whereas health data is usually in tabular format. Mapping climate data to the same geospatial entity of health data is also computationally expensive.
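The raster-to-tabular mapping problem described above can be illustrated with a toy aggregation. Below, a small gridded temperature field is averaged over two hypothetical districts (the grid, labels, and values are illustrative assumptions), producing one value per spatial unit of the health data; real pipelines would use area-weighted zonal statistics over actual administrative boundaries.

```python
import numpy as np

# Toy gridded climate field (e.g., daily max temperature in degC).
temp_grid = np.array([[30.0, 31.0, 25.0],
                      [29.0, 32.0, 24.0],
                      [28.0, 27.0, 23.0]])

# Region label for each grid cell (two hypothetical districts: 0 and 1).
regions = np.array([[0, 0, 1],
                    [0, 0, 1],
                    [1, 1, 1]])

# Average the gridded field over each district so it can be joined
# against tabular health data keyed by district.
district_means = {r: float(temp_grid[regions == r].mean()) for r in (0, 1)}
print(district_means)  # {0: 30.5, 1: 25.4}
```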
Processing climate data and integrating it with health data is a major challenge.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform covering all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
Automating individual re-identification for wildlife
Identifying individual animals within wildlife populations is critical for monitoring endangered species, understanding their behaviors, and developing effective conservation strategies for biodiversity preservation.
Computer vision and machine learning techniques enable automatic individual identification at scale, helping researchers track specific animals over time without invasive tagging methods.
The scarcity of publicly available and well-annotated datasets poses a significant challenge for applying ML in wildlife identification, with the most valuable data scattered across individual research labs or organizations rather than centralized repositories.
Addressing this requires fostering a culture of data sharing in the ecological community through incentives like financial rewards and recognition for data collectors, while establishing standardized pipelines and infrastructures to aggregate existing annotated data for model training.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in collected data, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a problem intertwined with gaps in taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
A significant challenge for eDNA-based monitoring is the incomplete barcoding reference databases, limiting the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Incomplete barcoding reference databases limit the identification of many species from eDNA samples, particularly in biodiverse regions.
Bias-correction of climate projections
Details (click to expand)
Climate projections provide essential information about future climate conditions, guiding critical mitigation and adaptation efforts such as disaster risk assessments and power grid optimization.
ML enhances the accuracy of these projections by bias-correcting forecasts from physics-based climate models like CMIP6, learning relationships between historical simulations and observed ground truth data.
Large uncertainties in climate projections and inconsistent data formats across models create significant barriers for developing robust ML bias-correction methods.
Improved model ensemble techniques and standardized data formats can enhance projection reliability and enable more effective climate risk planning.
Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Large uncertainties in future projections - Model evaluation frameworks and ensemble weighting methods can help quantify and reduce uncertainties
U1: Usability > Structure
Inconsistent formats across models - Standardized naming conventions and preprocessing pipelines can enable seamless multi-model integration
U6: Usability > Large Volume
Massive computational requirements - Cloud-based platforms and data subsetting tools can improve accessibility
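As one concrete illustration, a common starting point for bias-correcting projections is empirical quantile mapping, which the ML-based methods above typically extend. This is a minimal sketch, not a production implementation: each future model value is placed at its quantile within the model's historical distribution, and the observed value at that same quantile is returned.

```python
import numpy as np

def quantile_map(model_future, model_hist, obs_hist):
    """Empirical quantile mapping for bias correction.

    model_future: model projections to correct.
    model_hist:   model output over a historical reference period.
    obs_hist:     observations over the same reference period.
    """
    mh = np.sort(np.asarray(model_hist))
    oh = np.sort(np.asarray(obs_hist))
    q_model = np.linspace(0.0, 1.0, len(mh))
    q_obs = np.linspace(0.0, 1.0, len(oh))
    # Quantile of each future value within the model-historical CDF.
    fq = np.interp(model_future, mh, q_model)
    # Read off the observed value at that same quantile.
    return np.interp(fq, q_obs, oh)
```

For a model with a constant +2 bias relative to observations, this mapping subtracts 2 from future values; for more realistic, distribution-dependent biases it corrects each quantile separately, which is why it is preferred over a simple mean shift.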
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from Copernicus Climate Data Store - Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days/months to hours
U6: Usability > Large Volume
Massive storage and processing requirements - Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Point measurements require gridding - Statistical interpolation methods and geostatistical techniques can convert station data to regular grids
S2: Sufficiency > Coverage
Sparse coverage in remote regions - Expanded observation networks and satellite-derived proxies can fill spatial gaps
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
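As an example of the statistical interpolation mentioned above, inverse-distance weighting is one of the simplest ways to turn scattered station readings into values on a regular grid. The coordinates and values below are made up for illustration; operational gridding typically uses geostatistical methods such as kriging.

```python
def idw(stations, grid_points, power=2):
    """Inverse-distance-weighted interpolation of station data.

    stations:    list of (x, y, value) tuples for point measurements.
    grid_points: list of (x, y) locations to interpolate onto.
    Returns one interpolated value per grid point.
    """
    out = []
    for gx, gy in grid_points:
        num = den = 0.0
        for sx, sy, v in stations:
            d2 = (gx - sx) ** 2 + (gy - sy) ** 2
            if d2 == 0.0:  # grid point coincides with a station
                num, den = v, 1.0
                break
            w = 1.0 / d2 ** (power / 2)
            num += w * v
            den += w
        out.append(num / den)
    return out
```

Nearby stations dominate each grid value, so the sparse-coverage problem noted above translates directly into large interpolation errors far from any station.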
Bias-correction of weather forecasts
Details (click to expand)
ML can be used to improve the fidelity of high-impact weather forecasts by post-processing outputs from physics-based numerical forecast models and by learning to correct the systematic biases associated with physics-based numerical forecasting models.
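A minimal form of such post-processing is a linear correction fit on past forecast/observation pairs. The sketch below is illustrative only; operational post-processing uses far richer statistical and ML models, but the idea of learning a mapping from model output to observed truth is the same.

```python
def fit_linear_correction(forecasts, observations):
    """Fit corrected = a * forecast + b by least squares on historical pairs."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_o = sum(observations) / n
    cov = sum((f - mean_f) * (o - mean_o)
              for f, o in zip(forecasts, observations))
    var = sum((f - mean_f) ** 2 for f in forecasts)
    a = cov / var
    b = mean_o - a * mean_f
    return a, b

def correct(forecast, a, b):
    """Apply the learned correction to a new forecast value."""
    return a * forecast + b
```

If the numerical model has a systematic bias (say, forecasts consistently running too high), the fitted slope and intercept absorb it, and the correction transfers to new forecasts as long as the bias is stable.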
As with HRES, the biggest challenge with ENS is that only a portion of it is freely available to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
U1: Usability > Structure
Point measurements require gridding - Statistical interpolation methods and geostatistical techniques can convert station data to regular grids
S2: Sufficiency > Coverage
Sparse coverage in remote regions - Expanded observation networks and satellite-derived proxies can fill spatial gaps
Earth observation for climate-related applications
Details (click to expand)
Many climate-related applications suffer from a lack of real-time and/or on-the-ground data. ML can be used to analyze satellite imagery at scale in order to fill some of these gaps, via applications such as land cover classification, footprint detection for buildings, solar panel detection, deforestation detection, and emissions monitoring.
Satellite images are used intensively for Earth system monitoring. One of the two biggest challenges of using satellite imagery is the sheer volume of data, which makes downloading, transferring, and processing difficult; the other is the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Publicly available datasets often lack sufficient granularity. This is particularly challenging for the Global South, which typically lacks the funding for high-resolution commercial satellite imagery.
U6: Usability > Large Volume
The sheer volume of data now poses one of the biggest challenges for satellite imagery. When data reaches the terabyte scale, downloading, transferring, and hosting become extremely difficult. Those who create these datasets often lack the storage capacity to share the data. This challenge can potentially be addressed by one or more of the following strategies:
Data compression: Compress the data while retaining lower-dimensional information.
Lightweight models: Build models with fewer features selected through feature extraction.
Large foundation models for remote sensing data: Purposefully construct large models (e.g., foundation models) that can handle vast amounts of data. This requires changes to research infrastructure, such as modifications to preprocessing architectures.
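As a toy illustration of the compression strategy above, spatial block-averaging reduces data volume while retaining coarse information. Real pipelines would use purpose-built codecs or learned compression; this sketch only shows the volume/information trade-off.

```python
def block_average(image, k):
    """Downsample a 2D image (list of lists) by averaging k x k blocks.

    Reduces storage by roughly a factor of k*k while keeping a
    lower-resolution summary of the scene.
    """
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - h % k, k):
        row = []
        for j in range(0, w - w % k, k):
            block = [image[i + di][j + dj]
                     for di in range(k) for dj in range(k)]
            row.append(sum(block) / (k * k))
        out.append(row)
    return out
```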
O2: Obtainability > Accessibility
Very high-resolution satellite images (e.g., finer than 10 meters) typically come from commercial satellites and are not publicly available. One exception is the NICFI dataset, which offers high-resolution, analysis-ready mosaics of the world’s tropics.
U5: Usability > Pre-processing
Satellite images often contain a lot of redundant information, such as large amounts of data over the ocean that do not always contain useful information. It is usually necessary to filter out some of this data during model training.
U2: Usability > Aggregation
Due to differences in orbits, instruments, and sensors, imagery from different satellites can vary in projection, temporal and spatial coverage, and cloud blockage, each with its own pros and cons. To overcome data gaps (e.g., cloud blockage) or errors, multiple satellite images are often assimilated. Harmonizing these differences is challenging, and sometimes arbitrary decisions must be made.
U5: Usability > Pre-processing
The lack of annotated data presents another major challenge for satellite imagery. It is suggested that collaboration and coordination at the sector level should be organized to facilitate annotation efforts across multiple sectors and use cases. Additionally, the granularity of annotations needs to be increased. For example, specifying crop types instead of just “crops” and detailing flood damage levels rather than general “damaged” are necessary for more precise analysis.
M: Misc/Other
Cloud cover presents a major technical challenge for satellite imagery, significantly reducing its usability. To obtain information beneath the clouds, pixels from clear-sky images captured by other satellites are often used. However, this method can introduce noise and errors.
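The gap-filling approach described above can be sketched as a per-pixel composite that takes, for each pixel, the first cloud-free observation available across a stack of co-registered images. This is a simplified illustration: real compositing must also handle radiometric differences between sensors and acquisition dates, which is the noise source noted above.

```python
def composite(stack, cloud_masks):
    """Fill cloudy pixels from other images in a co-registered stack.

    stack:       list of 2D images (lists of lists), same shape.
    cloud_masks: matching list of 2D boolean masks, True = cloudy pixel.
    Returns a single image taking, per pixel, the first cloud-free value;
    pixels cloudy in every image stay None.
    """
    h, w = len(stack[0]), len(stack[0][0])
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for img, mask in zip(stack, cloud_masks):
                if not mask[i][j]:
                    out[i][j] = img[i][j]
                    break
    return out
```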
M: Misc/Other
There is also a lack of technical capacity in the Global South to effectively utilize satellite imagery.
Enabling 2D to 3D shape recovery and pose estimation of animals
Details (click to expand)
3D shape recovery and pose estimation refer to the reconstruction of the 3D shapes and poses of animals from 2D images. This information can provide non-invasive insights into animals’ health, age, or reproductive status in their natural environment, which are important for biodiversity monitoring.
ML-based computer vision techniques have been used to construct more accurate estimations of 3D animal shapes and poses.
However, there is a lack of open annotated datasets to train models.
More efforts going into the curation and release of such datasets could be pivotal towards unlocking this use case.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in collected data, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a problem intertwined with gaps in taxonomy. There is now increasing work on insect camera traps, but this field is still in its infancy and data remains limited.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Enabling non-intrusive electricity load monitoring
Details (click to expand)
Non-intrusive load monitoring (NILM) is critical for disaggregating building electricity consumption into individual appliance profiles, enabling targeted energy efficiency strategies, demand response, and better supply/demand matching to reduce carbon emissions and maintain grid stability.
AI techniques can analyze patterns in aggregate electricity data to identify individual appliance signatures without requiring separate meters for each device, providing cost-effective insights for both consumers and utilities.
The effectiveness of AI-based NILM is hindered by insufficient training data that represents diverse appliance types, usage patterns, and building characteristics across different regions, limiting model accuracy and generalizability in real-world settings.
Utilities, researchers, and manufacturers can collaborate to create standardized, privacy-preserving datasets through controlled data collection campaigns and by developing synthetic data generation techniques that capture the diversity of appliance signatures and usage patterns.
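A minimal event-based sketch of the disaggregation idea: detect step changes in the aggregate power signal, then match them against known appliance signatures. The appliance names, wattages, and thresholds below are hypothetical; real NILM models learn far richer temporal and spectral features from the kind of diverse training data discussed above.

```python
def detect_events(power, threshold=50.0):
    """Flag step changes in an aggregate power signal (watts) that exceed
    threshold; returns (index, delta) pairs as candidate appliance events."""
    events = []
    for t in range(1, len(power)):
        delta = power[t] - power[t - 1]
        if abs(delta) >= threshold:
            events.append((t, delta))
    return events

def match_appliance(delta, signatures, tol=20.0):
    """Match an event's step size against typical appliance wattages.

    signatures: dict of appliance name -> typical step size in watts.
    Returns the first matching appliance, or None.
    """
    for name, size in signatures.items():
        if abs(abs(delta) - size) <= tol:
            return name
    return None
```

Positive deltas correspond to an appliance switching on and negative deltas to it switching off; the main failure mode, which richer training data helps with, is distinguishing appliances with similar wattages.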
Pecan Street DataPort requires both non-academic and academic users to purchase access via licensing, which varies depending on the building data features requested. Data coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of the older historical buildings that may most need energy-efficient upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, the annual socio-demographic survey data may be too coarse to provide insight into the behavioral effects of household members on consumption profiles over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Usage rights vary depending on the agreed-upon licensing agreement.
S6: Sufficiency > Missing Components
The data does not track real-time occupancy of individuals in the household, which could provide insight into behavioral effects on energy consumption. Adding this data could enable improved consumption-based customer segmentation models, as patterns change with the time and day of the week. The data would also be amenable to consumer-in-the-loop energy management studies that consider comfort based on customers' habitual activity, location in the house, and number of occupants.
S3: Sufficiency > Granularity
Disaggregated data may provide greater granular context for customer segmentation studies than aggregate data alone. However, such segmentation studies ultimately depend on the number of household members who may be using appliances at a given time. Pecan Street data contains annual survey responses on household demographics and home features, which may be too coarse in granularity to track how customer segments change over time as members move in or out of a building. Jointly collecting occupancy data could address this granularity gap, but may limit volunteer engagement, as privacy concerns would need to be evaluated.
S2: Sufficiency > Coverage
Data coverage primarily focuses on Texas, with limited coverage in New York and California. Though there are efforts to include Puerto Rico, data collection hinges on volunteer participation. This could introduce self-selection bias, as households that participate are likely more interested in energy conservation than the general population. Furthermore, a majority of the dataset covers the Mueller community in Austin, a planned community developed after 1999 with modern building types. Enrolling homes from older built environments and from different climate regions, within the United States and globally, could provide greater insight into household appliance usage and generation patterns, which vary with climate as well as appliance age. Identifying high-consumption older appliances can assist in identifying upgrades.
O2: Obtainability > Accessibility
Data is downloadable as a static file or accessible via the DataPort API. Depending on the licensing agreement, a small dataset is available free of charge to academic users, with pricing for larger datasets. Commercial use requires paid access based on requested features, ranging from the standard to the unlimited customer tier and plan.
For accurate NILM studies, benchmark datasets must include not only consumption but also local power generation (e.g., from rooftop solar), as it affects the overall aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are typically not included. The majority of building types are single-family housing units, limiting the diversity of representation. Furthermore, most datasets are no longer maintained after their study closes.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Sub-metered data relies heavily on the sensor network installed to monitor the building. Depending on the technology used, some sensors require calibration or are prone to malfunctions and delays. Additionally, interference from other devices can be present in the aggregate building-level readings, as experienced by REFIT, and needs to be addressed manually to enhance the usability of the dataset. These issues vary depending on the sub-metered dataset used, requiring a clear understanding of the metadata and documentation specific to the testbed the study was built upon. Exploratory data analysis of the time series may help identify outliers resulting from sensor drift.
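The exploratory-analysis step suggested above can be as simple as a rolling z-score scan. This is a minimal sketch under the assumption that sensor glitches show up as large deviations from a trailing window; the function name and thresholds are illustrative, not part of any dataset's tooling:

```python
import numpy as np

def rolling_zscore_outliers(series, window=60, threshold=4.0):
    """Flag points that deviate strongly from a trailing rolling window,
    a simple first pass for spotting glitches in sub-metered data."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        win = series[i - window:i]
        mu, sigma = win.mean(), win.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flags[i] = True
    return flags

# Steady ~150 W fridge-like signal with one injected 5 kW glitch
signal = np.full(500, 150.0) + np.random.default_rng(0).normal(0, 5.0, 500)
signal[300] = 5000.0
outliers = rolling_zscore_outliers(signal)
```

Slow sensor drift would call for detrending (e.g., comparing long- and short-window means) rather than a point-wise z-score, but the same windowed approach applies.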
U1: Usability > Structure
When retrieving NILM data from a variety of sources, from pre-existing studies as well as custom data collection, the structure of the received data can vary. Testbed design, hardware, and the variables monitored depend on sensor availability, which ultimately influences schemas and data formats. Data structure may also differ based on the level of disaggregation, at the plug level or the individual appliance level. When building future testbeds for data collection, it may help to follow the standards set by APIs such as NILMTK, which has successfully integrated multiple datasets from different sources. Using the REDD dataset format as inspiration, the toolkit developers created a standard energy disaggregation data format (NILMTK-DF) with common metadata and structure, which requires manual dataset-specific preprocessing. When working with non-standardized data that requires aggregation, machine-learning-based data fusion strategies may help automate schema matching and data integration.
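For the non-standardized case, a first pass at schema matching can be done with fuzzy string matching before manual review. The canonical field names below are hypothetical stand-ins, loosely inspired by the idea of a common NILM format rather than NILMTK's actual schema:

```python
import difflib

# Hypothetical canonical schema for illustration only
CANONICAL_FIELDS = ["timestamp", "aggregate_power_w", "appliance", "power_w"]

def map_columns(source_columns, cutoff=0.5):
    """Suggest a mapping from dataset-specific column names to the
    canonical fields via fuzzy string matching. A manual review pass
    is still expected, as real converters are dataset-specific."""
    mapping = {}
    for col in source_columns:
        matches = difflib.get_close_matches(
            col.lower().replace(" ", "_"), CANONICAL_FIELDS, n=1, cutoff=cutoff)
        mapping[col] = matches[0] if matches else None
    return mapping

suggested = map_columns(["Timestamp", "Aggregate Power W", "Appliance Name"])
```

Unmapped columns (`None` entries) flag exactly the fields that need a human decision, which keeps the manual preprocessing effort focused.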
S6: Sufficiency > Missing Components
While sub-metered data provides a means of verifying non-intrusive load monitoring techniques, it does not capture the hidden human motivators driving appliance usage (such as comfort, utility cost, and daily activities) as well as other important factors contributing to the aggregate load seen at the building level meter. The key to improving these studies is to provide greater context to the sub-metered data by taking additional joint measurements such as rooftop solar power production, electric vehicle load, occupancy related information, and battery storage. Some dataset-specific missing data components are highlighted below.
None of the datasets mentioned include electric vehicle loads. REDD, AMPds2, COMBED, DEDDIAG, DRED, GREEND, iAWE, and UK-DALE do not include generation from rooftop solar. REFIT contains solar from three homes, but these were not the focus of the study and were treated as solar interference with the aggregate load. The UMass smart home dataset includes only one home with solar and wind generation, and that home has significantly larger square footage and a different build than the other two homes featured.
DRED provided occupancy information collected from wearable devices within the home, and ECO and IDEAL provided it through self-reporting and an infrared entryway sensor; the other studies did not capture occupancy.
Due to this lack of representation, the majority of datasets are not amenable to human-in-the-loop analysis of user behavior with respect to consumption patterns, response to feedback, and the effectiveness of load shifting in promoting energy-conserving behaviors.
While AMPds2 includes some utility data, most datasets do not incorporate billing or real-time pricing. This type of data would be beneficial, as it varies with time, season, region, and utility.
Battery storage was not taken into account in any of the building consumption datasets.
S2: Sufficiency > Coverage
Gaps in dataset coverage are specific to each sub-metered dataset. These gaps may be due to unaccounted loads, the level of disaggregation (e.g., circuit level, plug level, or individual appliance level), or limited appliance types. The diversity of building types is limited, as most studies take place in single-family residences. Some dataset-specific gaps are detailed below; these may be addressed by collecting new data on existing testbeds or by augmenting already collected data with synthetic information. Future data collection efforts should be mindful of avoiding the kinds of gaps associated with existing datasets.
In the AMPds2 data, some electricity, water, and natural gas readings were missing. Additionally, some un-metered household loads were not accounted for in the aggregate building-level readings, and dishwasher water consumption was not directly metered. REFIT did not monitor appliances that could not be accessed through wall plugs, such as electric ovens. Depending on the built environment and building type, larger loads may not be connectable to building-level meters. For example, in the GREEND dataset, electric boilers in Austria were connected to separate external meters, and in the UMass smart home dataset, gas furnaces, exhaust fans, and recirculator pump loads could not be monitored.
AMPds2, DEDDIAG, DRED, iAWE, REDD, REFIT, and the UMass smart home dataset all gather data in single-family homes, which may not represent the diversity of buildings in terms of age, location, construction, and household demographics. REFIT covers different single-family home types, such as detached, semi-detached, and mid-terrace homes ranging from 2 to 6 bedrooms and built between the 1850s and 2005. GREEND covers apartments in addition to single-family homes, but included only 9 households, while AMPds2, DRED, and iAWE each cover a single household. Additionally, datasets are specific to the location where the measurements were taken and are thus shaped by the environmental conditions of the region as well as the culture of the population. For example, REDD consists of data from 10 monitored homes, whose appliances may not be representative of those contributing to the overall load of the broader population outside of Boston.
COMBED contains complex load types that may rely on variable-speed drives as well as multi-state devices, which the other datasets do not contain. This may be due to the difference in building type, but could also reflect the lack of diversity in appliance representation.
ECO relied on smart plugs for disaggregated load consumption measurements, with appliance coverage varying between households. For all households, the total consumption was not equal to the sum of the consumption measured from the plugs alone, indicating a high proportion of non-attributed consumption.
R1: Reliability > Quality
In the AMPds2 data, the sum of the sub-metered consumption did not add up to the whole-house consumption due to rounding error in the meter measurements, highlighting the importance not only of sub-metered ground truth for NILM studies but also of the type of building-level meter used. Future data collection efforts may want to retrieve not only utility-side building meter data but also supplemental aggregate meter data to detect mismatches in measurements.
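A minimal consistency check of the kind described, comparing the summed sub-metered channels against the whole-house reading, might look like this (the function name and tolerance are illustrative assumptions):

```python
def submeter_mismatch(aggregate_w, channels_w, tol_frac=0.05):
    """Return time indices where the summed sub-metered channels disagree
    with the whole-house reading by more than a relative tolerance,
    a simple sanity check before using sub-metered data as ground truth."""
    mismatches = []
    for i, total in enumerate(aggregate_w):
        sub_sum = sum(ch[i] for ch in channels_w)
        if total > 0 and abs(total - sub_sum) / total > tol_frac:
            mismatches.append(i)
    return mismatches

# Toy readings (watts): the third interval has a large unattributed residual
aggregate = [1000.0, 1200.0, 1500.0]
channels = [[600.0, 700.0, 700.0], [390.0, 480.0, 300.0]]
bad = submeter_mismatch(aggregate, channels)
```

Flagged intervals can then be attributed to rounding, un-metered loads, or meter faults before the data is used for model training.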
Datasets or studies that rely on self-reporting by customers may introduce participant bias, as the frequency with which households update voluntary information can vary. For example, if the number of household members, occupancy schedule, and addition of new plug loads are self-reported, update frequency depends on volunteer engagement. Additionally, volunteers who participate in NILM studies may have a particular propensity for energy-efficient actions and may not be representative of the general population. For example, some participants in UK-DALE were Imperial College graduate students motivated to participate to advance their own projects. To ensure that recorded electricity usage represents the general population, future case studies can recruit volunteer communities with diverse socioeconomic backgrounds and locations.
Enhancing digital reconstructions of the environment
Details (click to expand)
Digital reconstruction of the environment using remote sensing data is crucial for understanding habitat conditions and their impacts on wildlife, enabling more effective conservation strategies in the face of climate change.
ML enhances this process by efficiently analyzing large volumes of data from multiple sources, producing more detailed and accurate environmental reconstructions.
A key data gap is the limited availability of high-resolution imagery, with most high-quality data being commercial and not freely accessible, particularly affecting studies that require detailed environmental monitoring.
Fostering a data-sharing culture through incentives for collectors, creating standardized annotation pipelines, and making commercial high-resolution satellite imagery more accessible would significantly advance ML-enabled environmental monitoring for biodiversity conservation.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets encompassing a wide array of species remains limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g., bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed, but most are not freely open to the public.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
The resolution of publicly open satellite images is not sufficient for some environmental reconstruction studies.
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough. High-resolution images are usually commercial and not freely available.
Enhancing energy policy and market analysis
Details (click to expand)
Energy transition policies require comprehensive data on generation, emissions, and financial performance across power systems, but fragmented government datasets make evidence-based policymaking challenging.
AI and data fusion techniques can integrate scattered regulatory data from utilities and energy companies to create analysis-ready datasets that inform carbon pricing, renewable incentives, and grid modernization policies.
Inconsistent data formats, missing identifiers, and poor documentation across government agencies create significant barriers for automated data processing and analysis.
Standardized reporting formats, improved documentation, and centralized data platforms could enable more effective AI-driven policy analysis and accelerate evidence-based energy transitions.
Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Source data format changes (e.g., FERC’s shift from PDF to XBRL) and semi-structured formats require extensive preprocessing. PDF-based data extraction faces OCR challenges due to scan quality and inconsistent formatting. Standardized reporting formats and machine-readable data standards across agencies could reduce preprocessing burden.
U4: Usability > Documentation
Documentation updates lag behind source data changes, requiring continuous monitoring by maintainers. Proactive documentation standards and change notification systems from data providers could improve maintenance efficiency.
U3: Usability > Usage Rights
While PUDL uses Creative Commons licensing, some utility operator data has unclear public use rights despite being provided to regulatory agencies. Explicit public use licensing statements from government agencies could clarify usage permissions.
U2: Usability > Aggregation
Varying schema and naming conventions across agencies complicate data joining. Probabilistic entity matching helps but requires manual verification. Universal relational database standards and common identifiers across agencies could streamline aggregation.
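As a sketch of the entity-matching step, the toy function below pairs plant names across two reporting sources by string similarity. PUDL's actual probabilistic record linkage is considerably more sophisticated; the names and cutoff here are purely illustrative:

```python
import difflib

def match_entities(left_names, right_names, cutoff=0.8):
    """Pair entity names (e.g., plant names reported to different
    agencies) by string similarity, keeping pairs above the cutoff.
    Results still need the manual verification the text describes."""
    pairs = []
    for left in left_names:
        best, best_score = None, 0.0
        for right in right_names:
            score = difflib.SequenceMatcher(
                None, left.lower(), right.lower()).ratio()
            if score > best_score:
                best, best_score = right, score
        if best_score >= cutoff:
            pairs.append((left, best, round(best_score, 2)))
    return pairs

candidates = match_entities(["Comanche Peak Unit 1"],
                            ["COMANCHE PEAK 1", "Four Corners"])
```

Scores near the cutoff are exactly the cases a human reviewer should inspect; common identifiers across agencies would remove the need for this matching entirely.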
U1: Usability > Structure
Source data structures vary significantly between reporting years and agencies, with inconsistent plant identification systems. Standardized data schemas and versioning practices could improve structural consistency.
S6: Sufficiency > Missing Components
Weather model data and transmission/congestion information from grid operators would enhance analysis capabilities. Integration partnerships with weather services and grid operators could expand dataset utility.
S3: Sufficiency > Granularity
Temporal resolution varies from hourly to annual across sources, requiring interpolation techniques. More frequent and standardized reporting intervals could improve data granularity.
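Aligning mixed temporal resolutions onto a common axis is often done with simple linear interpolation, as sketched below. Note that interpolation adds no information beyond the coarse series; more frequent reporting intervals remain the real fix. The function and values are illustrative:

```python
import numpy as np

def upsample_linear(values, factor):
    """Linearly interpolate a coarse series onto a grid `factor` times
    finer, e.g. hourly data onto 15-minute steps."""
    coarse_x = np.arange(len(values), dtype=float)
    fine_x = np.linspace(0.0, len(values) - 1.0, (len(values) - 1) * factor + 1)
    return np.interp(fine_x, coarse_x, np.asarray(values, dtype=float))

# Hourly generation (MW) aligned onto a 15-minute grid (factor 4)
hourly = [100.0, 120.0, 90.0]
quarter_hourly = upsample_linear(hourly, factor=4)
```

For quantities reported as totals rather than instantaneous values (e.g., annual generation), the coarse total should instead be apportioned across fine steps so that sums are preserved.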
S2: Sufficiency > Coverage
Dataset coverage is limited to US regulatory agencies and organizations. International data partnerships could expand the geographic scope for comparative analysis.
Enhancing estimations of methane emissions from rice paddies
Details (click to expand)
Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially how they vary with different agricultural practices, is crucial for addressing climate change.
ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.
Currently, there is a lack of direct observation of methane emissions from rice paddies that could be used to train ML models.
Real-world data collection is needed to unlock this use case.
There is a lack of direct observation of methane emissions from rice paddies.
Data Gap Type
Data Gap Details
W: Wish
Direct measurement of methane emissions is often expensive and labor-intensive. But this data is essential as it provides the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Enhancing marine wildlife detection and species classification
Details (click to expand)
Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These tasks involve identifying and categorizing different marine species.
ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.
Current bottlenecks related to data availability include the lack of sufficient labeled data and the lack of open data. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Although a lot of ocean data is collected, there are massive gaps in coverage, with heavy biases toward coastal regions. Collecting data from the deep ocean is technologically challenging, and financial incentives are lacking. The high seas fall outside national jurisdictions, so data collection often occurs only through mining companies, military operations, or ad hoc research expeditions. The absence of marine protected areas on the high seas and the migratory nature of species like phytoplankton further complicate data collection.
Open-source databases containing labeled data and label editors such as FathomNet can increase the amount of relevant data for training ML models. Initiatives like the Ocean Biodiversity Information System (OBIS) and the Integrated Ocean Observing System (IOOS) contribute to data availability more broadly. Data collection efforts could strategically target places where biodiversity is high but currently available data is sparse. Financial tools or regulations could incentivize data collection.
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Enhancing power grid-vegetation management for wildfire risk mitigation
Details (click to expand)
Vegetation encroachment near high-voltage transmission lines can lead to outages and pose major fire risks, compromising the safety and reliability of the power grid and potentially igniting dangerous wildfires that release stored carbon and endanger wildlife.
Machine learning, especially computer vision applied to remote sensing imagery and historic management records, can accelerate vegetation management by identifying overgrowth areas and tracking dynamic seasonal vegetation growth near grid infrastructure.
Key data gaps include limited access to proprietary utility data, sparse LiDAR captures leading to incomplete scans, insufficient temporal and spatial coverage, and preprocessing requirements for imagery from multiple sensor platforms.
Solutions include establishing partnerships with utilities for data sharing, coordinating multiple robot/UAV inspection trips for improved coverage, developing preprocessing pipelines for diverse sensor data, and implementing regular monitoring schedules to capture seasonal vegetation changes.
UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Once collected, data is private as RoWs represent critical energy infrastructure. Private partnerships may allow for extended usage rights within a predefined scope.
S4: Sufficiency > Timeliness
Measurements should be taken at multiple time periods to relate transmission line characteristics to both vegetation growth and line sag caused by overvoltage conditions.
S2: Sufficiency > Coverage
Coverage can vary depending on the RoW examined. Often, multiple datasets containing UAV image data from multiple transmission RoWs are necessary to increase the number of image examples available.
O1: Obtainability > Findability
Must be involved in an active study with a partnering utility or transmission owner to get access to pre-existing drone data or to get permission to collect drone data.
Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration, and it may be limited by camera resolution for detecting subtle degradation patterns.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data needs to be aggregated from multiple cable inspection robots for improved generalizability of detection models. Multiple robot trips over areas of interest can help identify target locations needing further inspection.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks.
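A minimal example of the thresholding step might look like the following; real pipelines would typically use learned segmentation models rather than a global intensity threshold, and the function, percentile, and synthetic frame are purely illustrative:

```python
import numpy as np

def threshold_segment(image, percentile=95):
    """Crude foreground mask: keep pixels above an intensity percentile,
    a first preprocessing pass before a proper segmentation model."""
    thr = np.percentile(image, percentile)
    return image >= thr

# 8x8 synthetic frame with a small bright patch standing in for an obstruction
img = np.zeros((8, 8))
img[2:4, 2:4] = 255.0
mask = threshold_segment(img)
```

Preprocessing like artifact removal and illumination correction matters precisely because a fixed threshold is so sensitive to ambient variation across robot trips.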
S2: Sufficiency > Coverage
Data must be supplemented with position orientation system information for accurate robot localization, potentially requiring preliminary inspections followed by detailed autonomous inspection of targets.
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized. Data from multiple multispectral imagers, drones, cable-mounted sensors, and additional robots may be employed to improve the level of detail needed for specific obstructions.
Enhancing wind power grid integration and stability
Details (click to expand)
The integration of low-inertia distributed energy resources like wind power into the grid creates critical stability and reliability challenges, particularly for maintaining system frequency at nominal levels to prevent damage and blackouts.
AI and machine learning can enhance wind power’s contribution to grid stability by optimizing synthetic inertial and primary frequency response capabilities through advanced modeling and control strategies.
Key data gaps include limited accessibility to simulation tools, insufficient temporal granularity in models that operate on hourly rather than sub-hourly scales, and reliability concerns due to the lack of real-world validation data for model outputs.
Grid operators and research institutions can collaborate to improve model accessibility, increase temporal resolution to capture sub-hourly dynamics, and validate simulations with operational data, enabling more effective AI-driven solutions for grid stability as renewable penetration increases.
Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the FESTIV model requires permission, which can be requested by contacting the group manager.
R1: Reliability > Quality
The model may not account for all real-time system dynamics and complexities, requiring verification from operational data. Scenario-based forecasting may not capture real-world uncertainties, and operating reserve values may be inaccurate without practical validation.
S3: Sufficiency > Granularity
FESTIV operates on hourly unit commitment time resolution, which cannot capture reliability impacts occurring on sub-hourly scales including frequency response, voltage magnitudes, and reactive power flows that affect system stability.
Facilitating grid reliability events analysis
Details (click to expand)
Due to rapid fluctuations in power generation, renewables introduce variability into the grid. These signals are capable of triggering safety monitoring systems related to grid stability. Power grid control centers receive multiple streams of data from these systems (e.g. alarms, sensors, and field reports) that are semi-structured and arriving at a high volume. For operators, these alarm triggers and associated data can be overwhelming to rationalize, reduce, and contextualize to diagnose grid conditions.
ML can assist in interpreting these data to better understand the sequence of events leading up to an incident as well as to identify and detect the causes behind system disturbances affecting grid reliability.
Access to grid reliability data remains limited, the amount of preprocessing needed constitutes a hurdle, and not all alarm triggers have been validated, which can introduce noise.
More open data releases and open community work regarding data preprocessing can help further advance this use case.
Access to EPRI grid alarm data is currently limited within EPRI. Data gaps with respect to usability result from redundancies in grid alarm codes, requiring significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to distinguish fault from non-fault events.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Operational alarm data volume is large, with measurements arriving every millisecond. The result is high-volume data that is tabular in nature but also unstructured with respect to the text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains locations and grid asset information, spatio-temporal analysis can be performed with respect to a single sensor and the conditions under which that sensor is operating. Indexing and mining time series data can therefore facilitate faster search over alarm data leading up to a fault event, and natural language processing and text mining techniques can likewise facilitate search over alarm text and details.
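The text-search side can be sketched with a simple inverted index over alarm descriptions; the event texts below are invented for illustration, not taken from any real alarm system.

```python
import re
from collections import defaultdict

def build_alarm_index(events):
    """Build an inverted index: token -> set of event ids containing it.

    events: iterable of (event_id, free-text description) pairs.
    """
    index = defaultdict(set)
    for eid, text in events:
        for tok in re.findall(r"[a-z0-9]+", text.lower()):
            index[tok].add(eid)
    return index

def search(index, query):
    """Return event ids whose description contains every query token."""
    toks = re.findall(r"[a-z0-9]+", query.lower())
    if not toks:
        return set()
    hits = set(index.get(toks[0], set()))
    for t in toks[1:]:
        hits &= index.get(t, set())
    return hits

events = [
    (1, "Line 12 overcurrent trip, breaker opened"),
    (2, "Transformer temperature high, fan started"),
    (3, "Line 12 breaker reclose after transient"),
]
idx = build_alarm_index(events)
# search(idx, "line breaker") -> {1, 3}
```

A production system would add timestamps and spatial keys to the index so that operators can scope searches to a sensor and time window around a fault.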
U5: Usability > Pre-processing
In addition to challenges with respect to the decoding of remote signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of text detail provided. Typically, the details cover information with respect to the grid asset and its action. For example, a text description from a line monitoring device may describe the power, temperature, and the action taken in response to the grid alarm trigger event. Often, in real-world systems, the majority of grid alarm trigger events are short circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To combat these issues, data pre-processing becomes necessary. For remote signal identification data, this includes parsing and hashing through text codes, assessing code components for redundancies, and building an associated reduced dictionary of alarm codes. For textual description fields and post-fault field reports, the use of natural language processing techniques to extract key information can provide more consistency between sensor data. Additionally, techniques like diverse sampling can account for the class imbalance with respect to the associated fault that can trigger the alarm.
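The code-reduction step can be sketched as follows; the alarm code convention here (separator and case variants, trailing numeric suffixes) is hypothetical and not EPRI's actual scheme.

```python
import re
from collections import defaultdict

def reduce_alarm_codes(raw_codes):
    """Map raw, vendor-specific alarm codes to a reduced canonical dictionary.

    Hypothetical convention: 'XFMR_OVERTEMP_HI', 'xfmr-overtemp-hi', and
    'XFMR.OVERTEMP.HI_2' all denote the same event, differing only in case,
    separators, and a numeric suffix.
    """
    canonical = {}
    groups = defaultdict(list)
    for code in raw_codes:
        # Normalize case and separators, then strip trailing numeric suffixes
        key = re.sub(r"[^A-Z0-9]+", "_", code.upper()).strip("_")
        key = re.sub(r"_\d+$", "", key)
        canonical[code] = key
        groups[key].append(code)
    return canonical, dict(groups)

raw = ["XFMR_OVERTEMP_HI", "xfmr-overtemp-hi", "XFMR.OVERTEMP.HI_2", "LINE_FAULT"]
mapping, groups = reduce_alarm_codes(raw)
```

Real alarm dictionaries would need domain review before merging codes, since trailing digits sometimes carry meaning (e.g., distinct lines) rather than being redundant.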
U4: Usability > Documentation
Remote signaling identification information from monitoring sensors and devices encodes data about the alarm trigger event in the context of fault priority. Based on the asset, line, or sensor, this identification code can vary depending on the naming conventions used. Documentation on remote signal IDs, associated with a dictionary of finite alarm code types, can facilitate pre-processing of alarm data and assessment of the diversity of fault events occurring in real-time systems (as different alarm trigger codes may correspond to redundant events similar in nature).
U3: Usability > Usage Rights
Usage rights are currently constrained to those working within EPRI at this time.
U2: Usability > Aggregation
Reports on location, asset, and time can result in false alarm triggers requiring operators to send field workers to investigate, fix, and recalibrate field sensors. The data with respect to field assessments can be incorporated into the original data to provide greater context resulting in compilation of multimodal datasets which can enhance alarm data understanding.
U1: Usability > Structure
Grid alarm codes may be non-unique across different lines and grid assets: two different codes can represent equivalent information due to differences in naming conventions, requiring significant pre-processing and analysis to identify unique labels from over 2000 code words. Additional labels expressing alarm priority (for example, a high-priority alarm type indicating events such as fire, gas, or lightning) are also encoded into the grid alarm trigger event code. Creating a standard structure for operational text data, such as those already used in operational systems by companies like General Electric or Siemens, can avoid inconsistencies in the data.
R1: Reliability > Quality
Alarm trigger events, and the corresponding actions taken in response to them, require post-hoc assessment by field workers for verification, especially in cases of actual or perceived faults.
O2: Obtainability > Accessibility
Data access is limited within EPRI due to restrictions with respect to data provided by utilities. Anonymization and aggregation of data to a benchmark or toy dataset by EPRI to the wider community can be a means of circumventing the security issues at the cost of operational context.
Facilitating disaster risk assessments
Details (click to expand)
As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. This involves better mapping of which populations and assets are exposed to given risks.
ML can be used to facilitate disaster risk assessments by helping analyze satellite imagery and geographic data in order to pinpoint vulnerable areas and produce more detailed risk maps. In this way, ML can overcome some limitations of traditional ground surveys, which are time- and cost-intensive.
There is a general lack of data from the Global South where, for many regions, collection capabilities are lower while climate impacts are forecasted to be disproportionately high. Existing data are typically incomplete, even in most high-income countries, limiting the depth of potential analyses and generating uncertainties in assessments, for example, about monetary losses due to disasters.
Closing these data gaps involves inter alia deploying ML techniques that perform well in the Global South, collecting high-quality data involving local knowledge in a variety of contexts, and making the best remote sensing and cadaster data available to these efforts.
These datasets are mainly available in rich countries from Europe, North America, and Asia, leaving large parts of the world with timely challenges involving their building stock without appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data. Research efforts in particular in Europe, including EUBUCCO (eubucco.com) or the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), are addressing several of the existing data gaps.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Certain datasets require searching and navigating websites in a foreign language.
O2: Obtainability > Accessibility
Some datasets are not publicly available and require either payment or governmental authorization. This situation is changing in Europe via the high-value dataset regulation in the European Union, which mandates member states to release their building stock data with permissive licenses (https://data.europa.eu/en/news-events/news/unlocking-potential-high-value-datasets-impact-hvd-implementing-regulation).
U1: Usability > Structure
Datasets are released under a multitude of formats. Despite the existence of standards such as CityGML, one typically needs a particular pipeline for processing every new dataset.
U2: Usability > Aggregation
Datasets are typically released by local authorities and require aggregation. Some efforts, in particular in Europe, including EUBUCCO (eubucco.com) or the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), have made this process easier, but do not yet enable seamless updates.
U3: Usability > Usage Rights
Most datasets use attribution-based licenses, but some datasets use custom licenses, unclear licenses, or restrictive licenses.
U4: Usability > Documentation
Most datasets do not provide appropriate documentation to fully understand how the dataset was created.
U5: Usability > Pre-processing
Certain fields may contain local codes that need to be translated and understood. Numerical values may contain encodings for NAs, such as -1 or 1000, that need to be cleaned.
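A minimal sketch of this cleaning step, assuming records arrive as key-value rows and that -1 and 1000 are the sentinel encodings in question; the field names (`height`, `floors`) are hypothetical.

```python
def clean_sentinels(records, sentinel_values=(-1, 1000), fields=("height", "floors")):
    """Replace sentinel encodings of missing values with None.

    records: list of dicts, one per building. Only the named fields are
    checked, so legitimate values in other fields are never touched.
    """
    cleaned = []
    for rec in records:
        rec = dict(rec)  # copy, so the caller's data is not mutated
        for f in fields:
            if rec.get(f) in sentinel_values:
                rec[f] = None
        cleaned.append(rec)
    return cleaned

rows = [{"id": "b1", "height": 12.5, "floors": 4},
        {"id": "b2", "height": -1, "floors": 1000}]
rows = clean_sentinels(rows)
```

The safe direction is always sentinel-to-missing; guessing a replacement value (e.g., a regional mean height) should be a separate, documented imputation step.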
U6: Usability > Large Volume
Precise 3D datasets can be voluminous for a city. Country-level datasets also tend to require significant computing resources.
R1: Reliability > Quality
The height estimation from LiDAR data may contain large errors, e.g., due to surrounding objects such as trees.
S2: Sufficiency > Coverage
There are very few datasets outside of rich countries from Europe, North America, and Asia. Precise 3D models and attribute-rich datasets are available for even fewer countries.
S4: Sufficiency > Timeliness
Practices vary widely, from multiple updates per year to a one-off release that may be more than ten years old. Aerial surveys with LiDAR are expensive and are rarely done more than once every ten years.
These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Some datasets have not been published as scientific datasets and lack appropriate documentation about the methodology. Users should be aware of uncertainties in case of insufficient documentation of potential errors.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent the accurate identification of buildings.
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
S3: Sufficiency > Granularity
Raster datasets provide a noisy view of the building stock.
S4: Sufficiency > Timeliness
The data depends on the availability of satellite surveys. Some datasets may mix images from different years. The surveys may be more than 5 years old, mischaracterizing fast-growing areas. In case of disasters, the imagery pre-disaster may not be representative of the current building stock.
S6: Sufficiency > Missing Components
More attributes inferred with high confidence would unlock new use cases.
Very high-resolution reference data is currently not freely open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Surface elevation data, defined by a digital elevation model (DEM), is one of the most essential types of reference data, and high-resolution elevation data is of great value for disaster risk assessment, particularly in the Global South.
Open DEM data with global coverage now reaches a resolution of 30 m, but this is still insufficient for many disaster risk assessments. Higher-resolution datasets exist, but they either have limited spatial coverage or are expensive commercial products.
The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Climate hazard data (e.g., floods, tropical cyclones, droughts) is often too coarse for effective physical risk assessments, which focus on evaluating damage to infrastructure such as buildings and power grids. While exposure data, including information on buildings and power grids, is available at resolutions ranging from 25 meters to 250 meters, climate hazard projections, especially those extending beyond a year, are typically at resolutions of 25 kilometers or more.
To provide meaningful risk assessments, more granular data is required. This necessitates downscaling efforts, both dynamical and statistical, to refine the resolution of climate hazard data. Machine learning (ML) can play a valuable role in these downscaling processes. Additionally, the downscaled data should be made publicly available, and a dedicated portal should be established to facilitate access and sharing of this refined information.
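As a minimal sketch of the statistical side, bilinear interpolation refines a coarse hazard grid onto a finer one; a real downscaling pipeline would add physical predictors (elevation, land cover) and a trained correction model on top of this purely geometric step.

```python
import numpy as np

def bilinear_downscale(coarse: np.ndarray, factor: int) -> np.ndarray:
    """Interpolate a coarse hazard field onto a grid `factor`x finer per axis."""
    ny, nx = coarse.shape
    yi = np.linspace(0, ny - 1, ny * factor)
    xi = np.linspace(0, nx - 1, nx * factor)
    # Interpolate along rows first, then along columns
    tmp = np.empty((ny, nx * factor))
    for r in range(ny):
        tmp[r] = np.interp(xi, np.arange(nx), coarse[r])
    fine = np.empty((ny * factor, nx * factor))
    for c in range(nx * factor):
        fine[:, c] = np.interp(yi, np.arange(ny), tmp[:, c])
    return fine

coarse = np.array([[0.0, 1.0], [2.0, 3.0]])  # toy hazard field on a coarse grid
fine = bilinear_downscale(coarse, factor=4)
```

Interpolation alone cannot add information below the original resolution; that is precisely where ML-based statistical downscaling earns its keep.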
R1: Reliability > Quality
Projecting future climate hazards is crucial for assessing long-term risks. Climate simulations from CMIP models are currently our primary source for future climate projections. However, these simulations come with significant uncertainties, due to uncertainties in both the models and the emission scenarios. To improve their utility for disaster risk assessment and other applications, increased funding and effort are needed to advance climate model development for greater accuracy. Additionally, machine learning methods can help mitigate some of these uncertainties by bias-correcting the simulations.
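Bias correction via empirical quantile mapping can be sketched as follows; the data below is synthetic, whereas a real application would use CMIP model output and station or reanalysis observations over a common reference period.

```python
import numpy as np

def quantile_map(model: np.ndarray, obs: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Empirical quantile mapping: adjust `target` model output so its
    distribution matches observations over a shared reference period."""
    q = np.linspace(0, 1, 101)
    model_q = np.quantile(model, q)
    obs_q = np.quantile(obs, q)
    # For each target value, find its quantile in the model distribution,
    # then read off the observed value at that same quantile.
    ranks = np.interp(target, model_q, q)
    return np.interp(ranks, q, obs_q)

rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 2.0, 5000)   # synthetic "observed" daily precipitation
model = obs * 1.5 + 1.0           # synthetic model output with a wet bias
corrected = quantile_map(model, obs, model)
```

Quantile mapping corrects the marginal distribution only; it does not repair errors in temporal sequencing or spatial structure, which is one reason model development remains necessary alongside post-processing.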
S6: Sufficiency > Missing Components
Seasonal climate hazard forecasts are crucial for disaster risk assessment, management, and preparation. However, high-resolution data at this scale is often lacking for many hazards. This challenge is likely due to the difficulty in generating accurate seasonal weather forecasts. ML has the potential to address this gap by improving forecast accuracy and granularity.
The quality of OpenStreetMap is highly variable in terms of coverage of geometries (e.g., buildings) and attributes; roads are generally better mapped than buildings. OpenStreetMap's very permissive data model enables users to provide a variety of information, but it is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
U4: Usability > Documentation
The origin of attributes is often unknown, creating uncertainty about values.
U5: Usability > Pre-processing
The flexible data model lacks type enforcement, requiring additional processing for analysis.
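A sketch of such type coercion for two commonly used OSM tags; the handled variants (a "12 m" height string, semicolon-separated level values) are illustrative, not an exhaustive treatment of real-world tag values.

```python
def coerce_tags(tags):
    """Coerce free-form OSM tag strings into typed fields, or None on failure."""
    out = {}
    h = tags.get("height")
    if h is not None:
        try:
            # Accept plain numbers and metre-suffixed strings like "12 m"
            out["height_m"] = float(h.lower().replace("m", "").strip())
        except ValueError:
            out["height_m"] = None
    lv = tags.get("building:levels")
    if lv is not None:
        try:
            # Semicolon-separated multi-values: keep the maximum level count
            out["levels"] = max(int(float(x)) for x in lv.split(";"))
        except ValueError:
            out["levels"] = None
    return out

parsed = coerce_tags({"height": "12 m", "building:levels": "3;4"})
# -> {"height_m": 12.0, "levels": 4}
```

Flagging unparseable values as None, rather than dropping the feature, preserves the geometry for analyses that do not need the attribute.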
R1: Reliability > Quality
Data quality ranges from excellent (sometimes surpassing official sources) to very low (including mapping vandalism).
S2: Sufficiency > Coverage
Street coverage is generally good, while building coverage varies widely.
S4: Sufficiency > Timeliness
Update frequency varies from multiple times per year to decades-old data, with disaster areas often updated quickly by active communities.
S6: Sufficiency > Missing Components
Most attributes remain incomplete, with completeness levels below 10%.
Accessibility and reliability are the most significant challenges with exposure data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Country-specific exposure data varies widely in availability, with some existing only as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
Population datasets show significant discrepancies, requiring validation before confident use. Some geospatial socioeconomic data from sources like UNEP are outdated or incomplete.
S3: Sufficiency > Granularity
Open global data, such as World Bank or US CIA GDP data, often lacks sufficient resolution and completeness for hazard risk assessment.
Facilitating fault detection in low voltage distribution grids
Details (click to expand)
The low-voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources and dynamic loads (such as electric vehicles), low-voltage distribution systems are susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them.
Machine learning methods can recognize patterns to automate fault diagnoses agnostic to specific line thresholds and topologies. If integrated into advanced monitoring systems, detecting and localizing faults can accelerate adaptive protection and network reconfiguration efforts to ensure reliability and stability.
Data gaps for this use case include lack of coverage (spatial and temporal), noise in the data and high data volume.
New data collection and further analyses of existing data to better understand its pitfalls have the potential to help mitigate the existing gaps for this use case.
For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
For µPMU data to be utilized for fault localization, the distribution circuit model must be provided by the partnering utility or DSO. Typically, these models lack notation of phase identification and impedance values, often providing only rough approximations, which can ultimately reduce the accuracy of localization as well as of the time series contextualization of a fault. Decreased localization accuracy can in turn affect downstream control mechanisms meant to ensure operational reliability.
U5: Usability > Pre-processing
µPMU data is sensitive to noise especially from geomagnetic storms which can induce electric currents in the atmosphere and impact measurement accuracy. Data can also be compromised by errors introduced by current and potential transformers. One way to mitigate this error is to monitor and re-calibrate transformers or deploy redundant µPMUs to verify measurements.
Depending on whether additional data from other sensors or field reports is being used to classify µPMU time series data, creation of a joint sensor dataset may improve quality based on the overall sampling rate and format of the additional non-µPMU data.
U6: Usability > Large Volume
Because of high sampling rates and continuous capture, the data volume from each individual µPMU can be challenging to manage and analyze. Coupled with the number of µPMUs required to monitor a portion of the distribution network, the amount of data can easily exceed terabytes. Automated indexing and mining of time series by transient characteristics can facilitate domain specialists' verification efforts.
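One simple way to index a continuous stream by transient characteristics is to flag outlier sample-to-sample jumps; the sketch below runs on synthetic data and is not a production µPMU pipeline.

```python
import numpy as np

def find_transients(signal: np.ndarray, z: float = 8.0) -> np.ndarray:
    """Flag sample indices whose first difference is an extreme outlier.

    Sample-to-sample steps of a smooth waveform are small; a transient shows
    up as a large jump. The flagged positions can serve as a lightweight index
    over a continuous stream instead of retaining the full recording.
    """
    d = np.diff(signal)
    # Robust standard-deviation estimate via the median absolute deviation
    sigma = np.median(np.abs(d - np.median(d))) / 0.6745
    return np.where(np.abs(d) > z * sigma)[0] + 1

rng = np.random.default_rng(2)
x = np.sin(np.linspace(0, 40 * np.pi, 4000)) + rng.normal(0, 0.01, 4000)
x[1234] += 2.0  # injected transient
hits = find_transients(x)
```

Using a robust (median-based) noise estimate keeps the threshold stable even when the stream already contains a few large transients.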
R1: Reliability > Quality
Since µPMU data is continuously captured, time series data leading up to or even identifying a fault or potential fault requires verification from other data sources.
Digital Fault Recorders (DFRs) capture high resolution event driven data such as disturbances due to faults, switching and transients. They are able to detect rapid events like lightning strikes and breaker trips while also recording the current and voltage magnitude with respect to time. Additionally, system dynamics over a longer period following a disturbance can also be captured. When used in conjunction with µPMU data, DFR data can assist in verifying significant transients found in the µPMU data which can facilitate improved analysis of both signals leading up to and after an event from the perspective of distribution-side state.
S2: Sufficiency > Coverage
Currently, µPMU installations in existing distribution grids carry significant financial costs, so most deployments have been pilot projects with utilities. Based on North American Synchrophasor Initiative (NASPI) reports, pilot studies include the Flexgrid testing facility at Lawrence Berkeley National Laboratory (LBNL), the Philadelphia Navy Yard microgrid (2016-2017), the micro-synchrophasors for distribution systems plus-up project (2016-2018), resilient electricity grids in the Philippines (2016), the GMLC 1.2.5 sensing and measurement strategy (2016), and the bi-national laboratory for the intelligent management of energy sustainability and technology education in Mexico City (2017-2018).
Coverage is also limited by acceptance of this technology, due to a pre-existing reliance on SCADA systems that measure grid conditions on a 15-minute cadence. As transients become more common, especially on the low-voltage distribution grid, a transition to higher-resolution monitoring will become necessary. Multi-objective evaluation of the value proposition of further µPMU sensor monitoring networks can provide utilities and DSOs with a framework for assessing the economic, environmental, and operational benefits of pursuing larger-scale studies.
S4: Sufficiency > Timeliness
µPMU data can suffer from multiple latencies within the grid's monitoring system, which may be unable to keep up with the high sampling rate of the continuous measurements that µPMUs generate. Latencies arise as signals are recorded, processed, sent, and received, and depend on the communication medium used, cable distance, amount of processing, and computational delay. More specifically, these latencies are measurement-, transmission-, channel-, receiver-, and algorithm-related. Identifying characteristics that precede fault events, with enough lead time to overcome potential latencies, through machine learning or other techniques can be of benefit.
Facilitating forest restoration monitoring
Details (click to expand)
Efforts are being made to restore ecosystems like forests and mangroves.
ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.
A significant data gap is the lack of standardized protocols to guide data collection for restoration projects, making it difficult to consistently assess biodiversity outcomes using ML across different restoration initiatives.
Developing standardized data collection protocols, fostering a culture of data sharing, and implementing incentives for data collectors would enable more effective ML applications, leading to better assessment of restoration successes and failures on a global scale.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
Facilitating the detection of climate-induced ecosystem changes
Climate change is causing significant alterations in ecosystems worldwide, threatening biodiversity and ecosystem services that are critical for both nature and human well-being.
Machine learning can analyze complex ecological data from multiple sources to detect climate change impacts, identify vulnerable regions, and inform targeted conservation efforts.
Key data gaps include insufficient high-resolution climate and biodiversity data, restricted access to ground survey data, and limited institutional capacity to process collected data efficiently.
Addressing these gaps requires establishing decentralized monitoring networks, improving data accessibility through legislative reforms, and developing sustainable funding models for long-term ecosystem monitoring initiatives.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also do not cover a diverse spectrum of species, a gap intertwined with insufficiencies in taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
- Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
- Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
- Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
- Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
- Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
- Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge that applies to almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and other variables important for biodiversity patterns. This is because observation systems are not dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, a task that cannot be accomplished by a single country.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Hybrid ML-physics climate models for enhanced simulations
Physics-based climate models incorporate numerous complex components that are computationally intensive, which limits the spatial resolution achievable in climate simulations.
ML models can emulate these physical processes, providing a more efficient alternative to traditional methods, enabling faster simulations and enhanced model performance.
The most significant data gaps are the enormous volume of climate data, which creates challenges for storage, transfer, and processing, and insufficient granularity in existing datasets to resolve fine-scale physical processes like turbulence.
Developing improved computational infrastructure for handling large datasets and creating ultra-high-resolution benchmark simulations would significantly enhance hybrid climate modeling capabilities.
ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
While ERA5 is widely used due to its good structure and global coverage, users face significant challenges with downloading times that can take days to months, and the sheer data volume presents processing difficulties for many users.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Massive storage and processing requirements pose a significant barrier. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
Current high-resolution simulations cannot resolve many physical processes like turbulence. Extremely high-resolution simulations (sub-kilometer or tens of meters) are needed to serve as ground truth for training ML models as they provide a more realistic representation of atmospheric processes. Creating and sharing benchmark datasets based on these simulations would facilitate model development and validation.
While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasting and climate modeling because of its high resolution and ready-for-analysis characteristics. But large volumes of observations, e.g., data from radiosondes, balloons, and weather stations, are largely under-utilized. It would be valuable to create a dataset that is as well-structured as ERA5 but built from more observations.
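As an illustration of one processing step such a dataset would involve, the sketch below grids scattered station observations onto a regular lon/lat grid using inverse-distance weighting. This is a deliberately minimal stand-in: real reanalyses like ERA5 rely on full data assimilation, and the station locations and temperature values here are hypothetical.

```python
import numpy as np

def idw_grid(lons, lats, values, grid_lon, grid_lat, power=2.0):
    """Inverse-distance-weighted interpolation of scattered station
    observations onto a regular lon/lat grid (illustrative only --
    real reanalyses use full data assimilation)."""
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    out = np.empty(glon.shape)
    for i in range(glon.shape[0]):
        for j in range(glon.shape[1]):
            d = np.hypot(lons - glon[i, j], lats - glat[i, j])
            if d.min() < 1e-9:
                # Grid point coincides with a station: take its value directly
                out[i, j] = values[d.argmin()]
            else:
                w = 1.0 / d**power
                out[i, j] = np.sum(w * values) / np.sum(w)
    return out

# Hypothetical station observations (lon, lat, 2 m temperature in K)
lons = np.array([0.0, 1.0, 0.0, 1.0])
lats = np.array([50.0, 50.0, 51.0, 51.0])
temps = np.array([280.0, 281.0, 282.0, 283.0])

grid = idw_grid(lons, lats, temps, np.linspace(0, 1, 5), np.linspace(50, 51, 5))
```

Because IDW is a weighted average, every gridded value stays within the range of the input observations, which makes sanity checks straightforward.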
Improving battery management systems
Battery storage is crucial for transitioning to renewable energy and electrifying transportation, with efficiency and lifetime directly impacting these sustainability efforts.
Machine learning can improve battery management systems by accurately estimating state of charge (SoC), state of health (SoH), and remaining useful life (RUL), and optimizing charging and discharging strategies.
Key data gaps include oversimplified battery models that don’t account for real-world operating conditions and insufficient validation data from physical battery systems in diverse operational environments.
Enhancing model complexity and collecting comprehensive real-world performance data can significantly improve battery management predictions, leading to extended battery lifetimes, more efficient energy use, and accelerated adoption of electric vehicles and renewable energy storage.
While ECMs enable real-time battery SoC predictions due to their computational efficiency, they often oversimplify real-life operating conditions, which limits the accuracy of SoH and RUL estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Due to their simplified nature and assumptions based on ideal laboratory conditions, ECMs have limited accuracy in predicting battery aging and dynamics in real systems. Verification with real-life battery system data from diverse operational environments is essential for improving state of health (SoH) and remaining useful life (RUL) predictions.
S3: Sufficiency > Granularity
The resolution of SoH and SoC predictions of ECMs is impacted by assumptions made with respect to battery performance. These include constant internal resistance assumptions that don’t account for sensitivity to complex current profiles or temperature variations, leading to inaccurate voltage and subsequent SoH/SoC calculations. ECMs also simplify electrochemical processes by ignoring electrode polarization, diffusion, and transfer kinetics, while neglecting battery aging effects like capacity fade. Linearity assumptions in simpler ECMs do not hold under high charge/discharge rates. Solutions include increasing the complexity of ECMs by adding parallel RC networks to model the internal resistance of the battery with different time constants, introducing non-linear elements for different operating conditions, incorporating adaptive hysteresis models, and integrating aging parameters.
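To make the ECM discussion concrete, the sketch below simulates a first-order Thevenin model with a single parallel RC network, the simplest of the model-complexity upgrades mentioned above. All parameter values (open-circuit voltage, R0, R1, C1) are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_ecm(current, dt, ocv=3.7, r0=0.05, r1=0.02, c1=2000.0):
    """First-order Thevenin equivalent-circuit model: terminal voltage
    V = OCV - I*R0 - V1, where the RC branch obeys
    dV1/dt = -V1/(R1*C1) + I/C1. Parameter values are hypothetical."""
    v1 = 0.0
    terminal = []
    tau = r1 * c1  # RC time constant
    for i in current:
        # Exact discrete update of the RC branch assuming constant current over dt
        v1 = v1 * np.exp(-dt / tau) + r1 * i * (1.0 - np.exp(-dt / tau))
        terminal.append(ocv - i * r0 - v1)
    return np.array(terminal)

# 1 A constant discharge for 100 s at 1 s resolution: the RC branch charges
# up and the terminal voltage relaxes downward toward OCV - I*(R0 + R1)
v = simulate_ecm(np.ones(100), dt=1.0)
```

More elaborate ECMs stack additional RC pairs with different time constants and make the parameters functions of SoC and temperature, which is exactly where the constant-resistance and linearity assumptions criticized above break down.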
Improving estimations of forest carbon stock
Forests are one of Earth’s major carbon sinks, making accurate estimation of forest carbon stocks essential for climate change mitigation efforts and carbon accounting.
ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery, LiDAR data, and ground surveys.
Ground truth data for forest carbon stock estimation is often limited in geographical coverage and temporal frequency due to the high costs and labor-intensive nature of manual data collection. Additionally, remotely sensed data (satellite, airborne LiDAR) requires significant domain expertise for proper preprocessing and interpretation.
Governments and research institutions can address these gaps by investing in more comprehensive ground survey programs, making airborne LiDAR data more widely available, and developing standardized preprocessing tools for non-experts to utilize remote sensing data effectively.
Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Ground-survey data often contains missing values, measurement errors, and duplicates that require cleaning before use. Standardizing collection protocols and developing automated quality control procedures could improve data usability.
S2: Sufficiency > Coverage
Manual collection methods limit geographical coverage and collection frequency. Integrating ground surveys with remote sensing approaches and developing citizen science initiatives could help expand coverage while maintaining data quality.
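The automated quality control described above could start very simply: deduplicate, drop records missing key fields, and apply plausibility range checks. The sketch below assumes hypothetical column names for a forest-plot survey table; real protocols would add many more checks.

```python
import pandas as pd

def clean_survey(df):
    """Basic quality control for hypothetical forest-plot survey records:
    drop exact duplicates, remove rows missing key fields, and filter out
    physically implausible measurements via simple range checks."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["plot_id", "dbh_cm", "height_m"])
    # Range checks: diameter-at-breast-height and tree height outside
    # plausible bounds are treated as measurement errors
    ok = df["dbh_cm"].between(1, 500) & df["height_m"].between(1, 120)
    return df[ok].reset_index(drop=True)

# Hypothetical raw records: one duplicate, one negative DBH, one missing ID
raw = pd.DataFrame({
    "plot_id":  ["A", "A", "B", "B", None],
    "dbh_cm":   [32.0, 32.0, -5.0, 48.0, 20.0],
    "height_m": [21.0, 21.0, 12.0, 30.0, 15.0],
})
clean = clean_survey(raw)
```

Encoding such checks in a shared, versioned function is one concrete form the "standardized collection protocols" above could take.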
Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to process raw LiDAR point clouds and generate canopy height metrics used for training ML models. Developing open-source processing tools with standardized workflows would make this data more accessible to non-experts.
S2: Sufficiency > Coverage
Airborne LiDAR provides the most accurate measurements of canopy height but is not collected everywhere due to the high costs of aircraft or drone operations. Coordinated efforts to expand coverage and make existing data publicly available would significantly improve forest carbon stock estimation capabilities.
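The point-cloud processing that currently requires domain expertise can be illustrated with a minimal sketch: rasterizing an already ground-normalized point cloud into a canopy height model by keeping the highest return per grid cell. The coordinates, heights, and 10 m cell size below are hypothetical.

```python
import numpy as np

def canopy_height_grid(x, y, z_above_ground, cell=10.0):
    """Rasterize a (hypothetical, already ground-normalized) LiDAR point
    cloud into a canopy height model: the highest return per grid cell,
    one of the standard inputs to forest carbon models."""
    ix = (x // cell).astype(int)
    iy = (y // cell).astype(int)
    nx, ny = ix.max() + 1, iy.max() + 1
    chm = np.zeros((ny, nx))
    for i, j, z in zip(ix, iy, z_above_ground):
        chm[j, i] = max(chm[j, i], z)
    return chm

# Four returns in a 20 m x 20 m area with 10 m cells -> a 2 x 2 grid
x = np.array([2.0, 12.0, 3.0, 15.0])
y = np.array([2.0, 3.0, 14.0, 16.0])
z = np.array([18.5, 25.0, 7.0, 31.2])
chm = canopy_height_grid(x, y, z)
```

Open-source tools with standardized workflows would wrap steps like this, plus ground classification and noise filtering, behind a single interface for non-experts.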
Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
GEDI data contains inherent uncertainties including geolocation errors and weak return signals in dense forests, which introduce errors into canopy height estimates and subsequent carbon calculations. Combining GEDI with other data sources like airborne LiDAR for validation and developing region-specific calibration methods could improve data reliability.
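One simple form of the calibration mentioned above is a linear bias correction of GEDI canopy heights against co-located airborne LiDAR, sketched below on synthetic data. Real region-specific calibration would need careful footprint matching and uncertainty propagation; the numbers here are illustrative.

```python
import numpy as np

def fit_bias_correction(gedi_h, als_h):
    """Least-squares linear calibration of (hypothetical) GEDI canopy
    heights against co-located airborne laser scanning (ALS) heights:
    als ~ a * gedi + b. Returns the slope a and intercept b."""
    A = np.vstack([gedi_h, np.ones_like(gedi_h)]).T
    (a, b), *_ = np.linalg.lstsq(A, als_h, rcond=None)
    return a, b

# Synthetic example: assume GEDI systematically underestimates tall canopies
gedi = np.array([10.0, 20.0, 30.0, 40.0])
als = 1.1 * gedi + 2.0  # "true" relationship for the toy data
a, b = fit_bias_correction(gedi, als)
corrected = a * gedi + b
```
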
Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to understand the raw radar data and preprocess it properly for use in ML models for forest carbon estimation. Developing standardized preprocessing pipelines and tools could make this valuable data more accessible to the broader ML and climate science communities.
Improving long-term extreme heat prediction
Extreme heat events are becoming more frequent and intense due to climate change, posing serious risks to human health, infrastructure, and ecosystems worldwide.
Machine learning can improve long-term extreme heat prediction by identifying complex patterns in climate data and enhancing the accuracy and resolution of projections beyond what traditional physics-based models can achieve.
Working with climate projection datasets presents significant challenges due to their massive size, which requires substantial computational resources for storage, transfer, and processing, limiting accessibility for many researchers and stakeholders.
Cloud computing providers, research institutions, and funding agencies can collaborate to develop accessible platforms and tools for efficiently managing large climate datasets, enabling broader use of AI for extreme heat prediction and adaptation planning.
The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The NEX-GDDP-CMIP6 dataset requires substantial computational resources for processing and analysis. While cloud platforms provide access, they involve usage costs that may be prohibitive for some researchers. Processing such large datasets requires specialized techniques like distributed computing frameworks (e.g., Dask, Spark) and occasionally large-memory computing nodes for certain statistical analyses. Many researchers and practitioners lack either the technical expertise or computational resources to effectively utilize this valuable data.
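The core idea behind the distributed frameworks mentioned above can be sketched in plain NumPy: stream over the data in chunks and accumulate statistics, so the full array never has to fit in memory. Dask and Spark automate this pattern at scale; the threshold and synthetic data below are illustrative only.

```python
import numpy as np

def chunked_exceedance_days(chunks, threshold=308.15):
    """Count days exceeding a temperature threshold (308.15 K = 35 C)
    while streaming over chunks, so the full dataset never resides in
    memory -- the pattern frameworks like Dask automate at scale."""
    count = 0
    total = 0
    for chunk in chunks:  # each chunk: a 1-D array of daily temperatures
        count += int((chunk > threshold).sum())
        total += chunk.size
    return count, total

# Simulate streaming three chunks of synthetic daily maximum temperatures (K)
rng = np.random.default_rng(0)
chunks = (rng.normal(300.0, 8.0, size=1000) for _ in range(3))
hot, n = chunked_exceedance_days(chunks)
```

Because only one chunk is held at a time, the same loop works whether each chunk comes from a local file, an object store, or a cloud-hosted archive.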
Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)
Wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.
Machine learning can improve spatio-temporal forecasts at different horizons, given the availability of high-quality training data.
Current data gaps include limited coverage, noisy data, and difficulties in accessing data.
Efforts to bring more of this data out of silos, mainly those of energy companies, may help alleviate this gap.
Due to their location, FINO platform sensors are prone to failure under adverse outdoor conditions such as high winds and waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather conditions amenable to human intervention. The resulting data-quality issues can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires sign up through a login account at: https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to failures of measurement sensors. Issues with data loggers, power supplies, and effects of adverse conditions such as low aerosol concentrations can influence data quality. High wind and wave conditions impact the ability to correct or recalibrate sensors creating data gaps that can last for several weeks or seasons.
S2: Sufficiency > Coverage
Coverage is limited to the dimensions of the platform itself and the wind farm it is built in proximity to. For locations with different offshore characteristics, similar testbed platforms or buoys could be developed.
S5: Sufficiency > Proxy
Because FINO sensors are exposed to ocean conditions and storms, they often need maintenance and repair but are difficult to access physically. The resulting gaps in the data can be addressed by utilizing mesoscale wind modeling output.
The spatiotemporal coverage of the offshore windspeed mast data is restricted to the dimensions of the platform/tower itself as well as the time of construction. Depending on the data provider access to the data may require the signing of a non-disclosure agreement.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to data must be requested with different data providers having varying levels of restrictions. For data obtained from Orsted, access is only provided by signing a standard non-disclosure agreement. For more information mail R&D at datasharing@orsted.com.
S2: Sufficiency > Coverage
Spatiotemporal coverage of the dataset varies depending on the construction of the platform testbed and location but overall data is available from 2014 to the present. While measurements from LiDAR have higher resolution than wind mast data, sensor information is still restricted to the dimensions of the platform and the associated off-shore windfarm when present. Data provided by Orsted from LiDAR sensors includes 10 minute statistics.
Improving offshore wind power nowcasting (10 min)
Wind nowcasting can enable estimations of the active power generated by wind farms in the absence of curtailment and facilitate operations, potentially making them more efficient.
Machine learning can improve such very short-term spatio-temporal forecasts, given the availability of high-quality training data.
High-resolution wind data measured at wind farms currently remains limited to a few datasets.
Efforts to get such data out of silos, mainly from energy companies, may help alleviate this gap.
Data can be accessed by requesting access via the Orsted form. Sufficiency of the dataset is constrained by volume where only a finite amount of short term off-shore wind farms exist to which expanding the coverage area, volume and time granularity of data to under 10 minutes may enable transient detection from generated active power.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access must be requested via a form from Orsted.
S1: Sufficiency > Insufficient Volume
Data from multiple wind farms over a variety of regions would be required to get a more accurate comparison against simulated weather data.
S2: Sufficiency > Coverage
Coverage spans parts of Europe; offshore wind conditions vary with the local environment, so models may not scale or transfer to other temperate regions of the world.
S3: Sufficiency > Granularity
The time granularity of 10 min is too coarse to capture transients in active power generated.
S4: Sufficiency > Timeliness
Only two years' worth of data (2016–2018) is provided. Additional data collection from offshore wind farms, or simulations, would be needed.
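The granularity gap noted above can be made concrete with a small sketch (pure Python, purely illustrative, not tied to any provider's actual format): aggregating high-rate wind-speed samples into per-window statistics, as 10-minute products do, discards any transient shorter than the window.

```python
def window_stats(samples, per_window):
    """Collapse a regular high-rate series into per-window
    (mean, min, max) tuples. Whatever happens inside a window,
    e.g. a short gust driving a transient in active power,
    survives only as the min/max envelope."""
    stats = []
    for i in range(0, len(samples) - per_window + 1, per_window):
        window = samples[i:i + per_window]
        stats.append((sum(window) / per_window, min(window), max(window)))
    return stats
```

A 12 m/s gust inside an otherwise calm window, for example, is reduced to a slightly higher mean and a max value; its timing and shape are unrecoverable, which is why sub-10-minute granularity matters for transient detection.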
Improving power grid optimization
Details (click to expand)
Optimal Power Flow (OPF) is used to find the cheapest way to generate electricity while meeting demand and staying within system limits like voltage and line capacity. Traditionally, OPF is a complex math problem solved separately for AC and DC systems. As more renewable energy is added, the grid is shifting toward hybrid AC/DC systems to better handle long-distance power flow and new challenges like two-way power movement.
Changes in the grid due to renewable sources make OPF harder to solve. ML can be used to approximate OPF problems in order to allow them to be solved at greater speed, scale, and fidelity.
Data gaps for this use case are numerous, spanning mainly usability, reliability, and sufficiency.
Closing these gaps requires an array of gap-specific actions; further industry engagement may have a significant impact on many of the gaps.
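To make the surrogate idea above concrete, here is a minimal, self-contained sketch (pure Python; a toy merit-order dispatch stands in for a full OPF solver, and all names are illustrative): an expensive solver generates training pairs offline, and a cheap learned model answers new queries quickly.

```python
def merit_order_dispatch(demand_mw, generators):
    """Toy economic dispatch: serve demand from the cheapest units first.
    `generators` is a list of (capacity_mw, cost_per_mwh) tuples; the
    returned dispatch is ordered by increasing cost. Real OPF adds
    voltage, line-flow, and network constraints omitted here."""
    dispatch, remaining = [], demand_mw
    for capacity, _cost in sorted(generators, key=lambda g: g[1]):
        p = min(capacity, max(remaining, 0.0))
        dispatch.append(p)
        remaining -= p
    if remaining > 1e-9:
        raise ValueError("demand exceeds total capacity")
    return dispatch

def build_surrogate(training_demands, generators):
    """Tabulate solver outputs offline, then answer queries by nearest
    neighbour: a crude stand-in for the ML models that approximate OPF."""
    table = [(d, merit_order_dispatch(d, generators)) for d in training_demands]
    return lambda demand_mw: min(table, key=lambda row: abs(row[0] - demand_mw))[1]
```

With two 100 MW units priced at 10 and 20 $/MWh, a 150 MW demand dispatches 100 MW from the cheap unit and 50 MW from the expensive one; the surrogate simply returns the stored solution for the closest demand it was trained on, trading exactness for speed, which is the essential trade-off of ML-for-OPF.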
Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
The customization of the reward function contains several TODOs concerning the units and attributes of the redispatching-related reward. Documentation and code comments can sometimes provide conflicting information. Modularity of reward, adversary, action, environment, and backend is non-intuitive, requiring pregenerated dictionaries rather than dynamic inputs or conversion from single-agent to multi-agent functionality. Refactoring the documentation and comments to reflect updates would assist users and avoid the need to cross-reference information from the “Learning to Run a Power Network” Discord channel and GitHub issues.
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading-failure scenarios and more complex adversaries such as natural disasters. Codebase variations between versions, especially between the native and Gym-formatted frameworks, lose features present in the legacy version, including topology graphics. Open-source refactoring efforts could help update the codebase so that the latest and previous versions run without loss of features.
R1: Reliability > Quality
The grid2op framework relies on mathematically robust control laws and rewards that train the RL agent on fixed observation assumptions rather than actual system dynamics, which are susceptible to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal modeling of the grid equations, nor can it suggest which solver should be adopted to solve traditional nonlinear optimal power flow equations; specifics of modeling and solver choice require users to customize or create a new “Backend.” Additionally, such human-in-the-loop RL systems in practice require trustworthiness and quantification of risk. A library of open-source contributed “Backends” from independent projects that customize the framework, with supplemental documentation and paper references, could assist further development of the environment for different conditions. Human-in-the-loop studies can be conducted by testing the environment scenario and the system’s control response over a model of a real grid; generated observations and control actions can then be compared to historical event sequences and grid operator responses.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the grid2op environment, which are based on different IEEE bus topologies. While customization of the environment via the “Backend,” “Parameters,” and “Rules” is possible, dependent modules may still enforce game-over rules. Furthermore, since backend modeling is not the focus of grid2op, verification that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observation time series (chronics) of grid data. Furthermore, this granularity may limit the effectiveness of specific actions in the provided action space. For example, using energy storage devices in the presence of overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage. Expanding the framework, with efforts from the open-source community, to include multiple time resolutions may allow the tool to generalize across different forecasting time horizons and support action evaluation.
Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
In MATPOWER and PowerWorld, outside data may be required to simulate conditions over a specific region with a given number of DERs, generating sources, bus topology, and line limits. This requires collating pre-existing synthetic grid data with additional data to model specific scenarios.
U3: Usability > Usage Rights
Depending on whether proprietary simulators are pursued (e.g., PowerWorld), there may be licensing costs for the use of certain features.
R1: Reliability > Quality
Traditional OPF simulation software simplifies the power system and makes assumptions about the system behavior such as perfect power factor correction or constant system parameters. Simulation results may need to be verified with real-world results.
S3: Sufficiency > Granularity
In PowerWorld, bus topologies available may be simplified representations of actual grids to simplify the modeling and simulation techniques to represent overall system behavior. MATPOWER requires the user to define the bus matrix. As the number of buses in a power system increases the computational complexity of OPF increases, requiring more resources and time to solve. Additional parameters such as line limits, number of generating sources, number of DERs, and load demand also increase the complexity of the model as more constraints and assets are introduced.
While network datasets are open source, maintaining the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though such data may be hard to find without cooperative effort.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Industry engagement can assist in developing detailed and realistic networked datasets and operating conditions, limits, and constraints.
U2: Usability > Aggregation
Repository maintenance requires continuous curation of more complex networked benchmark data for more realistic AC-OPF simulation studies.
Improving short-term electricity load forecasting
Details (click to expand)
Short-term load forecasting is critical for utilities to balance power demand with supply. Utilities need accurate forecasts (e.g. on the scale of hours, days, weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions.
ML is well suited to handle large amounts of data such as historical electricity load data, weather forecasts, and continuous streams of advanced metering infrastructure (AMI) data, from which it may capture non-linearities which traditional linear models often struggle with.
Several data gaps for this use case revolve around the difficulty of accessing varied data, due among other things to privacy concerns and a lack of willingness from private actors to share data for research.
ML can help with the development of synthetic, privacy-preserving datasets that can accelerate research in this space.
AMI data is challenging to obtain without pilot study partnerships with utilities since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. Granularity of time series data can also vary depending on the level of access to the data, whether it be aggregated and anonymized or based on the resolution of readings and system. Additionally, the coverage of data will be limited to utility pilot test service areas, thereby restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult to obtain due to privacy concerns. Even when partnering with a utility, AMI data may be anonymized and aggregated to protect individual customers. Some ISOs can distribute data provided that a written records request is submitted. When requesting personal consumption data, enrollment in pricing programs may limit the temporal resolution of data that a utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
AMI data, when used jointly with other data that may influence demand (such as weather, rooftop solar availability, presence of electric vehicles, building specifications, and appliance inventory), may require significant additional data collection or retrieval. Non-intrusive load monitoring techniques can be employed to disaggregate AMI data, with some assumptions based on additional data. For example, satellite imagery over a region of interest can help identify buildings that have solar panels.
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted for evaluation by issda@ucd.ie. For data obtained through utility collaborative partnerships, usage rights may vary. Please contact the data provider for more information.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on the output formats of the installed sensor network system. When designing the testbed data format, it is recommended to develop comprehensive, well-structured metadata for the study to encourage further development.
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open-source. Further data collection for verification purposes is recommended.
S2: Sufficiency > Coverage
Coverage is limited to utility pilot test service areas, restricting the scope and scale of demand studies.
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware, ranging from 1-hour to 30-minute to 15-minute measurement intervals. Depending on the level of anonymization and aggregation, granularity may be constrained by other factors such as the cadence of time-of-use pricing and other tiered demand-response programs employed by the partnering utility. Interpolation may be used to combat resolution issues but may require uncertainty considerations when reporting results.
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gais Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change with time. Similarly, pilot programs through participating utilities are finite in duration. To address this data gap in previous pilot study locations, studies and testbeds can be reopened or revisited. For new studies in different locations, previous data can still be used to pre-train models; fine-tuning, however, would still require new data collection.
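Where meter resolutions differ across hardware, the interpolation mentioned above can align series to a common cadence. A minimal sketch (pure Python, names illustrative), returning flags so interpolated points can carry extra uncertainty downstream:

```python
def upsample_linear(readings, factor):
    """Linearly interpolate a regular meter series by an integer factor
    (e.g. factor=2 turns 30-minute readings into 15-minute estimates).
    Returns (values, synthetic_flags); flagged points are estimates,
    not measurements, and should be treated accordingly when
    reporting results."""
    values, synthetic = [], []
    for a, b in zip(readings, readings[1:]):
        for k in range(factor):
            values.append(a + (b - a) * k / factor)
            synthetic.append(k != 0)
    values.append(readings[-1])
    synthetic.append(False)
    return values, synthetic
```

Keeping the flags alongside the values is the design point: interpolation cannot recover true sub-interval behavior, so results derived from flagged points should be reported with that caveat.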
While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data was collated from 7 open-access public data sources as well as 12 privately curated datasets from facilities management at different college sites, which required manual site visits and are not included in the data repository at this time.
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, limiting the diversity of building representation. To overcome this lack of diversity, data-sharing incentives and community open-source contributions can enable expansion of the dataset.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies at higher resolution. Assumptions about conditions would have to be made prior to interpolating.
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018. While this may be adequate for pre-training models, further data collection through a reinitiation of the study may be needed to fine-tune models for more up to date periods of time.
Despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge with respect to building type, efficiency, and presence of low-carbon technologies. Furthermore, since the dataset is trained on UK building data, AMI time series generated may not be an accurate representation of load demand in regions outside the UK. Finally, since data is synthetically generated, studies will require validation and verification using real data or aggregated per substation level data to assess its effectiveness.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The Variational Autoencoder model can generate synthetic AMI data conditioned on several inputs. The presence of low-carbon technology (LCT) for a given household or property type depends on access to battery storage solutions, rooftop solar panels, and electric vehicles; this type of data may require curation of LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric based on actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
S2: Sufficiency > Coverage
Faraday is trained from utility provided AMI data from the UK which may not be representative of load demand and corresponding building type and temperate zone of other global regions. To generate similar synthetic data, custom data may be retrieved through a pilot test bed for private collection or the result of a partnership with a local utility. Additionally, pre-existing AMI data over an area of interest can be utilized to generate similar synthetic data.
Existing datasets are restricted to past pilot-study coverage areas, requiring further data collection to fine-tune models for a different coverage area.
S3: Sufficiency > Granularity
Data granularity is limited to that of the data the model was trained on. Generative modeling approaches similar to Faraday can be built using higher-resolution data, or interpolation methods could be employed.
S4: Sufficiency > Timeliness
Maintaining the timeliness of the dataset would require continuous integration and development of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy’s OpenSynth initiative, Centre for Net Zero hopes to build a global community of contributors to facilitate research.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid-modeling manner. For example, load demand at the substation level can be estimated as the sum of the individual building loads that the substation serves; this value can then be compared to actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, assessing the accuracy of a specific demand profile for a property or set of properties would require identifying a population of buildings, a connected real-world substation, and the residential low-carbon technology investment for the properties under study.
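The bottom-up check described above can be sketched as follows (pure Python, illustrative only; a real comparison would also handle time alignment, network losses, and metering error):

```python
def bottom_up_mape(building_profiles, substation_profile):
    """Sum per-building load profiles at each timestep and compare the
    aggregate against measured substation demand, returning the mean
    absolute percentage error over steps with nonzero demand."""
    aggregate = [sum(step) for step in zip(*building_profiles)]
    errors = [abs(a - m) / m
              for a, m in zip(aggregate, substation_profile) if m > 0]
    return sum(errors) / len(errors)
```

A low error suggests the synthetic profiles are plausible in aggregate; it does not validate any individual building's profile, which is exactly the limitation noted above.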
Improving solar power forecasting: long-term (>24 hours)
Details (click to expand)
Accurately forecasting solar power generation beyond 24 hours is critical for energy market pricing, investment decisions, and coordinating renewable energy sources in an increasingly decarbonized grid.
Machine learning approaches can improve longer-term solar forecasting by combining weather predictions, historical generation data, and other relevant variables to create more accurate models than traditional methods.
The primary data gaps include limited geographic coverage of existing datasets, reliance on simulated rather than measured data, and quality concerns when adapting models to specific regions.
Expanding data collection networks, validating simulated data with real measurements, and creating standardized datasets for diverse regions would enable more reliable ML-based solar forecasting systems that could significantly improve grid stability and accelerate renewable energy adoption.
While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
The dataset uses simulated outputs based on weather predictions rather than actual PV measurements, which may introduce systematic biases. Site-specific projects require additional validation with real measurements from solar power inverters. Developers can improve model accuracy by supplementing with local measurements and adapting simulation parameters to better represent specific regions.
S2: Sufficiency > Coverage
The dataset is limited to US locations based on 2006 solar conditions and is not representative of other geographic regions or more recent climate patterns. Expanding data collection to include diverse global regions and updating with more recent measurements would improve model transferability.
S4: Sufficiency > Timeliness
The dataset only covers 2006, which may not capture recent climate trends or technology improvements in PV systems. Updated datasets with more recent time periods would better represent current conditions and improve forecasting accuracy.
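One common way to reconcile simulated with measured outputs, as the validation suggestion above implies, is a simple linear bias correction fitted against real inverter measurements. A minimal least-squares sketch (pure Python, illustrative; assumes paired simulated/measured series):

```python
def fit_bias_correction(simulated, measured):
    """Fit measured ~= a * simulated + b by ordinary least squares,
    so simulated PV output can be rescaled toward local reality.
    Systematic bias beyond a scale and offset needs richer models."""
    n = len(simulated)
    sx, sy = sum(simulated), sum(measured)
    sxx = sum(x * x for x in simulated)
    sxy = sum(x * y for x, y in zip(simulated, measured))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b
```

The fitted (a, b) pair can then be applied to the full simulated series; residuals after correction indicate how much region-specific bias remains beyond what a linear adjustment can capture.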
Improving solar power forecasting: medium-term (6-24 hours)
Details (click to expand)
Medium-term solar forecasting (6-24 hours ahead) is essential for efficient grid management, especially as solar power integration increases, impacting energy markets, demand response, and microgrid operations.
Machine learning techniques can significantly improve these forecasts by integrating satellite data with weather predictions and historical patterns to provide more accurate solar irradiance estimates.
A key data gap is the inconsistency in satellite data resolutions and coverage, alongside challenges in processing multispectral data and accurately modeling how different cloud types affect ground irradiance.
Combining satellite observations with ground-based measurements and developing standardized preprocessing approaches would substantially improve forecast accuracy, enabling better grid management and renewable energy integration.
Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data from different satellite sources (both geostationary and polar-orbiting) needs to be collated and harmonized when analyzing multiple regions of interest, creating challenges in data integration and standardization.
U5: Usability > Pre-processing
Multispectral remote sensing data requires preprocessing, including atmospheric correction and band combinations in the visible and infrared spectra, before it can be effectively used for solar forecasting models.
R1: Reliability > Quality
Different cloud types affect ground-level solar irradiance in varying ways that satellite imagery alone cannot fully capture, necessitating verification and supplementation with ground-based measurements for improved model accuracy.
S3: Sufficiency > Granularity
Spatial and temporal resolution varies significantly between satellite sources, limiting the ability to capture rapid changes in cloud cover that impact solar irradiance, particularly during partly cloudy conditions which create high variability in short timeframes.
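A standard preprocessing step relevant to the gaps above (a sketch under the usual definition, not specific to any dataset named here) is the clear-sky index, which divides out deterministic solar geometry so models see only cloud-driven attenuation:

```python
def clear_sky_index(measured_ghi, clear_sky_ghi):
    """kc = measured GHI / modeled clear-sky GHI, elementwise.
    Values near 1 mean clear sky; lower values indicate cloud
    attenuation. Near-zero clear-sky values (night, very low sun)
    are masked to 0.0 to avoid division blow-ups."""
    return [m / c if c > 1.0 else 0.0
            for m, c in zip(measured_ghi, clear_sky_ghi)]
```

Forecasting kc rather than raw irradiance, then multiplying back by the clear-sky model at prediction time, is a common way to let the ML model focus on the cloud signal that satellite imagery actually observes.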
Improving solar power forecasting: nowcasting/very-short-term (0-30min)
Details (click to expand)
Very-short-term solar power forecasting is critical for grid stability and efficiency as sudden changes in solar irradiance (ramp events) can cause abrupt fluctuations in power generation.
AI techniques can analyze cloud dynamics through segmentation and classification to predict solar irradiance attenuation, enabling more accurate forecasting for real-time electricity markets, dispatch of other generating sources, and energy storage control.
Key data gaps include limited spatial coverage of ground monitoring stations, insufficient time resolution for sub-5-minute forecasting, challenges with large data volumes from sensor networks, and data quality issues related to sensor calibration.
Expanding sensor networks to diverse environments, implementing AI-based data compression and quality control, and integrating multi-source data can close these gaps, ultimately enabling more reliable integration of solar power into electricity grids.
ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. AI-based data compression and novel indexing can improve data management.
S3: Sufficiency > Granularity
Enhanced aerosol composition and ice nucleating particle measurements are needed for a better understanding of cloud dynamics and solar irradiance for DER site planning.
S2: Sufficiency > Coverage
Spatial coverage is limited to ARM sites within the United States. Collaboration with partner networks can expand coverage both within and outside the US.
R1: Reliability > Quality
Sensor data can be sensitive to noise and calibration issues, requiring automated systems to identify measurement drift.
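The automated drift identification mentioned above could, in its simplest form, be a rolling-mean check against an early baseline (pure Python, illustrative; production systems would use proper change-point detection and reference instruments):

```python
def detect_drift(values, window, threshold):
    """Flag each trailing window whose mean departs from the baseline
    (the mean of the first window) by more than `threshold`, a minimal
    automated check for slow sensor calibration drift."""
    baseline = sum(values[:window]) / window
    flags = []
    for i in range(window, len(values) + 1):
        mean = sum(values[i - window:i]) / window
        flags.append(abs(mean - baseline) > threshold)
    return flags
```

Flagged windows would then be routed for recalibration review rather than silently entering the training archive.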
The dataset has limited spatial coverage (Gaithersburg, MD only) and is no longer maintained after July 2017, limiting its usefulness for current applications.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Since testbeds are located on the NIST campus, spatial coverage is limited to the institution’s site. Similar datasets combining sensor measurements of solar irradiance conditions with the associated solar power generated at the inverter output would require investment in comparable site-specific testbeds in different regions.
S4: Sufficiency > Timeliness
The dataset has not been maintained since July 2017 and may not reflect current conditions; reinstating data collection, or comparable testbeds elsewhere, would be needed for up-to-date studies.
Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, insufficient for very-short-term forecasting.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Time resolution ranges from 5 to 60 minutes, which is insufficient for sub-5-minute forecasting needs.
S2: Sufficiency > Coverage
Coverage is limited to 33 global sites (18 tropical/subtropical, 15 temperate), requiring expansion to other regions and environmental conditions.
R1: Reliability > Quality
Significant elevation differences between ground sites and cell height affect clear-sky irradiance estimation accuracy.
O1: Obtainability > Findability
Data is only accessible through collaborating academic or research institutions.
Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and image resolution is limited to 352x288 24-bit jpeg images.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Images have limited resolution (352x288 pixels) with 10-minute capture intervals, potentially insufficient for very-short-term forecasting.
S2: Sufficiency > Coverage
Coverage is constrained by sensor network location and density. Expanded networks in diverse environments would improve coverage.
S2: Sufficiency > Coverage
The current dataset derives from sky imager datasets in Singapore, requiring similar networks in other regions or alternative data sources.
The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
The dataset needs more manually annotated cloud mask labels and is imbalanced with fewer nighttime samples.
O2: Obtainability > Accessibility
The dataset is under a Creative Commons license that prohibits commercial use, and access must be requested.
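One mitigation for the day/night imbalance described above is class weighting during model training; a minimal sketch with hypothetical label counts (not the dataset's actual proportions):

```python
# Sketch: class weighting to counter a day/night imbalance.
# Label counts are hypothetical, not the dataset's actual proportions.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 0 = daytime image, 1 = nighttime image (under-represented)
labels = np.array([0] * 900 + [1] * 100)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=labels
)
# "balanced" weights are n_samples / (n_classes * class_count),
# so the rare nighttime class gets a proportionally larger weight.
print(dict(zip([0, 1], weights)))
```

Passing such weights to the training loss upweights nighttime samples without collecting new data, though it cannot substitute for genuinely more nighttime imagery.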
Improving solar power forecasting: short-term (30 min-6 hours)
Details (click to expand)
Solar irradiance forecasting at hourly intervals is critical for managing intermittent solar energy resources and ensuring grid stability and reliability.
Machine learning approaches can enhance forecasting accuracy by leveraging multiple data sources, including measured irradiance, PV inverter outputs, and meteorological variables.
Important data gaps include limited spatial coverage, with most high-quality data concentrated in specific regions, and inconsistent temporal resolution that affects forecasting precision.
By expanding sensor networks globally and harmonizing data collection standards, forecasting models can better support real-time energy management, demand response, and grid stability across diverse geographical areas.
While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include lower-quality hourly averages compared to native-resolution data, and limited geographic coverage, with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The coverage area is constrained to nine SOLRAD network locations in the United States (Albuquerque, NM; Bismarck, ND; Hanford, CA; Madison, WI; Oak Ridge, TN; Salt Lake City, UT; Seattle, WA; Sterling, VA; Tallahassee, FL). Generalizing to other regions would require identifying locations with similar climates and temperate zones.
S3: Sufficiency > Granularity
Data quality of the hourly averages is lower than that of the native resolution data, impacting effective short-term forecasting for real-time energy management, grid stability, demand response, and market operations. To address this gap, using very short-term data or supplementing with data from sky imagers and other sensors with frequent measurement outputs would be beneficial.
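The granularity gap above can be illustrated by comparing an hourly-averaged series with its native-resolution source; the 1-minute irradiance below is synthetic, not actual SOLRAD data:

```python
# Sketch: what hourly averaging discards relative to native-resolution data.
# The 1-minute irradiance series is synthetic, not actual SOLRAD output.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-06-01 06:00", periods=12 * 60, freq="1min")
# Smooth diurnal curve plus cloud-like minute-scale fluctuations
ghi = 800 * np.sin(np.linspace(0, np.pi, len(idx))) + rng.normal(0, 60, len(idx))
ghi = pd.Series(ghi.clip(0), index=idx, name="ghi_wm2")

hourly_mean = ghi.resample("1h").mean()  # what an hourly product retains
hourly_std = ghi.resample("1h").std()    # within-hour variability lost
print(pd.DataFrame({"mean": hourly_mean, "lost_variability": hourly_std}))
```

The within-hour standard deviation is exactly the ramping signal that short-term forecasting needs and that hourly products discard, which is why supplementing with sky imagers or other fast sensors helps.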
While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Data derived from satellite imagery requires pre-processing to account for pixel variability, parallax effects, and additional modeling using radiative transfer to improve solar radiation estimates.
S4: Sufficiency > Timeliness
Data flow from satellite imagery to solar radiation measurement output from FARMS needs to be recalculated and updated to expand beyond the current coverage years of the represented global regions.
R1: Reliability > Quality
Satellite-based estimation of solar resource information for sites susceptible to cloud cover, snow, and bright surfaces may not be accurate, requiring verification from ground-based measurements.
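Ground-based verification of satellite estimates, as suggested above, typically reduces to computing bias statistics between co-located series; a sketch with synthetic stand-ins for both:

```python
# Sketch: verifying satellite-derived irradiance against a ground station
# by computing bias statistics. Both daily series are synthetic stand-ins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2024-01-01", periods=365, freq="D")
ground = pd.Series(5.0 + rng.normal(0.0, 0.5, len(idx)), index=idx)
# Satellite estimate with a systematic positive offset (e.g., snow or
# bright surfaces misread by the retrieval)
satellite = ground + 0.6 + rng.normal(0.0, 0.3, len(idx))

residual = satellite - ground
mbe = residual.mean()                    # mean bias error
rmse = np.sqrt((residual ** 2).mean())   # root-mean-square error
print(f"MBE = {mbe:.2f}, RMSE = {rmse:.2f}")
```

A persistent nonzero MBE at a snowy or frequently cloudy site would flag exactly the retrieval problem the gap description warns about.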
While NREL’s SRRL BMS provides real-time joint-variable data from ground-based sensors, its coverage is limited to a single location: Golden, CO, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Instrument malfunctions and calibration drift require human intervention; if detection is delayed, the affected measurements introduce inaccuracies that degrade solar forecast accuracy. The dataset nonetheless continues to be actively maintained.
S2: Sufficiency > Coverage
Coverage is restricted to Golden, CO. Other locations would benefit from similar sensor monitoring systems, especially those with variations in weather patterns that could affect solar irradiance forecasting and energy harvesting.
The SMA PV monitoring system requires user profile creation and specific system access requests, with documentation primarily in German creating potential language barriers. Data representation is geographically unbalanced with stronger coverage in Germany, Netherlands, and Australia despite its global presence. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
Documentation is primarily in German and lacks the same detail in the English version of the website. Companion research utilizing the data is not readily cited or linked. Language barriers can challenge the interpretation of displayed data values when accessed through the portal interface.
S2: Sufficiency > Coverage
Coverage varies significantly by country, with representation ranging from single systems to over 43,000 systems per country. Systems in Germany, the Netherlands, and Australia are more comprehensively represented than other regions. Additionally, battery storage information is inconsistently available across monitored systems. This gap could be addressed by increasing private user-contributed system data from diverse regions.
O2: Obtainability > Accessibility
Users must utilize the web interface or create a user profile to request access to additional data or preferred formats. Data cannot be freely downloaded in bulk or raw format and must be scraped from the web portal. Contact with SMA is required for membership or extended usage rights.
While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several sufficiency gaps limit its application. The dataset’s 15-month temporal coverage doesn’t capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analysis of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing, which would enhance realistic system dynamic modeling. Supplementing with external data sources or simulation could address these limitations.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
SOLETE lacks maintenance schedule data and system downtime information. Retroactively supplementing this data through simulation or SYSLAB records would improve system forecasting to account for scheduled maintenance uncertainties.
S3: Sufficiency > Granularity
Varying resolution and sampling rates (seconds to hours) can impact analysis precision, particularly when fusing data of different temporal resolutions. Aggregating second-level data to hourly intervals may affect joint short-term solar and wind forecasting outcomes.
S2: Sufficiency > Coverage
The 15-month temporal coverage is insufficient to capture long-term seasonal variations in joint wind and irradiance patterns.
S1: Sufficiency > Insufficient Volume
The dataset covers only a single wind turbine and PV array, limiting insights into coordination between multiple generation sources. This gap could be addressed by physically expanding the network or combining SOLETE with external datasets from utility and energy technology companies to enable larger grid control studies.
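A common way to handle mixed sampling rates like SOLETE's before joint forecasting is to resample the fast channel onto the slow channel's grid and then join; the series below are synthetic stand-ins, not SOLETE's actual channels:

```python
# Sketch: fusing channels logged at different rates before joint forecasting.
# Both series are synthetic stand-ins, not SOLETE's actual channels.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
sec_idx = pd.date_range("2024-03-01", periods=6 * 3600, freq="1s")  # 6 h at 1 s
wind_kw = pd.Series(10 + rng.normal(0, 1, len(sec_idx)), index=sec_idx)

hr_idx = pd.date_range("2024-03-01", periods=6, freq="1h")
pv_kw = pd.Series([0.0, 0.5, 2.0, 3.5, 3.0, 1.5], index=hr_idx)

# Downsample the fast channel onto the slow channel's grid, then join.
fused = pd.DataFrame({
    "wind_kw": wind_kw.resample("1h").mean(),
    "pv_kw": pv_kw,
})
print(fused)
```

The cost of this alignment is exactly the gap noted above: second-level wind dynamics are averaged away, which can blur short-term joint forecasts.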
Improving terrestrial wildlife detection and species classification
Details (click to expand)
Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems.
ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.
The primary data gaps include insufficient publicly available annotated datasets and challenges with sharing large-volume bioacoustic data due to storage limitations and high costs.
Solutions include developing affordable data hosting platforms, incentivizing data sharing through recognition and funding, and establishing standardized protocols for data integration.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased toward more abundant species.
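One standard mitigation for this geographic concentration is spatial thinning, capping the number of records per grid cell; a sketch on synthetic coordinates (the 0.5-degree cell size and cap of 5 records are arbitrary illustrative choices):

```python
# Sketch: grid-based spatial thinning to damp geographic concentration in
# community-science records. Coordinates are synthetic; the 0.5-degree cell
# and cap of 5 records per cell are arbitrary illustrative choices.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# 950 records clustered near one city, 50 scattered across a wider region
lat = np.concatenate([rng.normal(40.0, 0.05, 950), rng.uniform(35, 45, 50)])
lon = np.concatenate([rng.normal(-105.0, 0.05, 950), rng.uniform(-110, -100, 50)])
obs = pd.DataFrame({"lat": lat, "lon": lon})

# Assign each record to a 0.5-degree cell, then keep at most 5 per cell
cell = (obs["lat"] // 0.5).astype(str) + "_" + (obs["lon"] // 0.5).astype(str)
thinned = obs.groupby(cell).head(5)
print(len(obs), "records ->", len(thinned), "after thinning")
```

Thinning reduces sampling bias in downstream models but cannot create coverage where none exists, so it complements rather than replaces targeted data collection.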
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Museum specimens only become valuable to ML studies when they are digitized. Many museum specimens remain to be digitized, and this task presents significant challenges. Much of the information about these specimens, such as species traits and occurrence data, is often recorded in handwritten notes, making parsing and recognizing this information a complex and error-prone process.
Digitizing these specimens has become a priority for many museums. To support this effort, adequate funding and technical and scientific assistance should be provided. Machine learning itself can support some of these efforts, e.g., in parsing and transcribing handwritten notes.
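As one illustration of how structured records might be extracted from transcribed labels, here is a toy rule-based parser; the label format and regexes are hypothetical, and real labels are far messier, needing OCR plus a much more robust parser or an ML sequence model:

```python
# Sketch: pulling structured fields out of a transcribed specimen label with
# simple rules. The label format and regexes are hypothetical; real labels
# are far messier and would need OCR plus far more robust handling.
import re

label = "Danaus plexippus, coll. 12 Jun 1954, Boulder Co., Colorado, leg. A. Smith"

species = re.match(r"([A-Z][a-z]+ [a-z]+)", label)     # binomial at label start
date = re.search(r"(\d{1,2} [A-Z][a-z]{2} \d{4})", label)
locality = re.search(r"\d{4}, ([^,]+, [^,]+)", label)  # text after the year

record = {
    "species": species.group(1) if species else None,
    "date": date.group(1) if date else None,
    "locality": locality.group(1) if locality else None,
}
print(record)
```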
The first and foremost challenge of bioacoustic data is its sheer volume, which makes its data sharing especially difficult due to limited storage options and high costs. Urgent solutions are needed for cheaper and more reliable data hosting and sharing platforms.
Additionally, there’s a significant shortage of large and diverse annotated datasets, much more severe compared to image data like camera trap, drone, and crowd-sourced images.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges in bioacoustic data lies in its sheer volume, stemming from continuous monitoring processes. Researchers face significant hurdles in sharing and hosting this data, as existing online platforms often don’t provide sufficient long-term storage capacity or are very expensive. Solutions are needed to provide cheaper and more reliable hosting options. Moreover, accessing these extensive datasets demands advanced computing infrastructure and solutions. The availability of more funding sources may push more people to start sharing their bioacoustic data.
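One way to ease the volume problem is to archive compact spectral features rather than raw waveforms; a sketch on synthetic audio (real pipelines might instead use lossless FLAC or learned embeddings):

```python
# Sketch: archiving compact spectral features instead of raw waveforms to
# cut storage. The audio is a synthetic tone standing in for field
# recordings; real pipelines might use lossless FLAC or learned embeddings.
import io
import numpy as np
from scipy import signal

fs = 22050
t = np.arange(60 * fs) / fs                      # one minute of "audio"
audio = np.sin(2 * np.pi * 3000 * t).astype(np.float32)

# STFT magnitude on non-overlapping windows: far fewer values than raw audio
freqs, times, sxx = signal.spectrogram(audio, fs=fs, nperseg=1024, noverlap=0)

buf = io.BytesIO()
np.savez_compressed(buf, freqs=freqs.astype(np.float32), sxx=sxx.astype(np.float16))
print(f"raw: {audio.nbytes} B, compressed features: {buf.getbuffer().nbytes} B")
```

The trade-off is that spectral features are lossy: they suit species classification, but tasks needing the raw waveform still require full-resolution archives, which is where cheaper hosting platforms come in.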
Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The resolution of publicly available satellite images is not high enough. High-resolution images are usually commercial and not freely available.
Modeling effects of soil processes on soil organic carbon
Details (click to expand)
Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies.
ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses.
The insufficient data coverage and granularity of soil organic carbon measurements severely limit the development of well-generalized ML models for accurately predicting soil carbon dynamics.
Expanding monitoring networks and developing cost-effective measurement technologies, combined with better data standardization across different collection efforts, would enable more effective ML applications for soil carbon management and climate-smart agriculture.
Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Soil carbon values generated by simulators are not reliable, because the underlying process-based models may be outdated or carry systematic biases that are reflected in the simulated variables. Moreover, ML scientists who use those simulated variables usually lack the domain knowledge needed to calibrate these process-based models.
The biggest challenges common to use cases involving soil organic carbon are insufficient data and the lack of high-granularity data.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Soil carbon data is available only as quarterly values, which is not enough to capture the weekly changes in soil carbon that occur when fertilizer amounts or tilling practices change.
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is collected by different farmers on different farms, leading to consistency issues and a need to better structure the data.
S3: Sufficiency > Granularity
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
S2: Sufficiency > Coverage
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
Optimizing electrified bus fleet in urban vehicle-to-grid systems
Details (click to expand)
Diesel-powered school buses contribute significant carbon emissions and air pollution in urban areas, while electric bus adoption faces high upfront costs that challenge school district budgets.
AI-powered optimization systems can manage electric school bus charging and discharging schedules to create virtual power plants, offsetting electrification costs through grid services revenue.
Key data gaps include inconsistent bus fleet reporting across states, limited access to proprietary charging profiles, and fragmented charge station data that prevent comprehensive fleet optimization modeling.
Standardizing state-level fleet reporting, fostering manufacturer partnerships for charging data access, and creating centralized charge station databases can enable scalable AI solutions for urban transit electrification.
Critical gaps include limited findability of station-specific usage data due to proprietary restrictions, and scattered data sources requiring aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Charging station usage profiles and vehicle-specific load data are often proprietary. Solution: Establish manufacturer partnerships and utility pilot programs to access detailed charging profiles.
U2: Usability > Aggregation
Charging data is scattered across multiple providers and systems. Solution: Create standardized APIs and data sharing agreements between charging network operators.
The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Inconsistent state-level reporting creates varying data structures and fields, with some states excluding contractor-owned buses. Solution: Develop federal reporting standards for consistent data collection across all states.
S2: Sufficiency > Coverage
Inconsistent state-level reporting creates varying data structures and fields, with some states excluding contractor-owned buses. Solution: Develop federal reporting standards for consistent data collection across all states.
S4: Sufficiency > Timeliness
Dataset maintenance discontinued after November 2022. Solution: Establish ongoing federal or industry-supported data collection mechanisms.
Optimizing smart inverter management for distributed energy resources
Details (click to expand)
Solar panels and batteries are part of new power systems that don’t use traditional spinning generators. They use inverters to convert DC to AC power. Smart inverters can do more than just convert power—they help manage changes in energy supply and keep the grid stable by adjusting voltage and power levels. This prevents issues like sudden drops or spikes in voltage when solar and other sources are added to the grid.
Machine learning can help better monitor and control smart inverters, with the potential to make efficiency gains.
One key data gap towards unlocking this use case is the access to relevant data.
Partnerships between research labs, utilities, and smart inverter manufacturers may help alleviate this bottleneck.
There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models within the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulation and hardware-in-the-loop facilities requires submitting a user access proposal for NREL's Energy Systems Integration Facility; similar testing laboratories may require access requests and funding.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Contact NREL at precise@nrel.gov for access to the PRECISE model.
Submit an Energy Systems Integration Facility (ESIF) laboratory request form to userprogram.esif@nrel.gov to gain access to hardware in the loop inverter simulation systems. Access to particular hardware may require collaboration with inverter manufacturers which may have additional permission requirements.
R1: Reliability > Quality
The optimization routine of the simulation model may face challenges in determining the precise balance between grid operation criteria and impacts on customer PV generation. Generation may still require curtailment by the utility to prioritize grid stability. To circumvent this gap, external data on distribution-side operating conditions, load demand, solar generation, and utility-initiated generation curtailment can be collected and introduced into expanded simulation studies.
Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant manufacturers and smart inverter models that can be contacted for research partnerships. In terms of coverage, while California and Hawaii are moving towards standardizing smart inverter technology in their power systems, researchers in regions outside the United States may locate similar manufacturers through partnerships and collaborations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Particularly for the CEC database, one will need to contact the CEC or manufacturer to receive additional information for a particular smart inverter. Detailed studies using smart inverter hardware may require collaboration with a utility and research organization to perform advanced research studies.
U2: Usability > Aggregation
To retrieve additional data beyond the single entry model and manufacturer of a particular smart inverter, one may need to contact a variety of manufacturers to get access to datasets and specifications for operational smart inverter data, laboratories to get access to hardware in the loop test centers, and utilities or local energy commissions for smart inverter safety compliance and standards.
S2: Sufficiency > Coverage
New grid support functions defined by UL 1741-SA and UL 1741-SB are optional nationally but are now required in California and Hawaii; public manufacturer data is available only via the CEC website. Collaborations with manufacturers outside the US may be necessary to compile a similar database, and contact with utilities can provide a better understanding of UL 1741-SB criteria adoption elsewhere.
Scaling identification and mapping of climate policy
Details (click to expand)
Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions.
ML can be employed to identify climate-related policies and categorize them according to different focus areas.
Law corpora are published in various languages and formats by a variety of actors, including cities, national governments and other agencies. They are not all digitized, may be hard to access and require ample harmonization work.
These data gaps may be addressed through aggregation initiatives and ML may be a key component by automating lengthy processes such as translation or screening for relevance.
Laws and regulations for climate action are published in various formats by national and subnational governments, and most are not labeled as a “climate policy”. A number of initiatives take on the challenge of selecting, aggregating, and structuring these laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and constant updating, and the resulting datasets are not complete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be structured into a machine-readable format. Much of the data is also in the original language of the publishing country and needs to be translated into English.
U2: Usability > Aggregation
Legislation data is published through national and subnational governments, and often is not explicitly labeled as “climate policy”. Determining whether it is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
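As a rough illustration of the screening step such initiatives automate, a first-pass relevance filter can be sketched with a simple keyword score. The seed terms, threshold, and scoring below are illustrative assumptions; production systems rely on trained multilingual classifiers rather than fixed keyword lists:

```python
# Illustrative seed terms; a real pipeline would use a trained classifier
# and handle documents in their original languages.
CLIMATE_TERMS = [
    "climate change", "greenhouse gas", "emission", "carbon",
    "renewable energy", "adaptation", "mitigation", "net zero",
]

def climate_relevance_score(text: str) -> float:
    """Fraction of seed terms that appear in the document text."""
    lowered = text.lower()
    hits = sum(1 for term in CLIMATE_TERMS if term in lowered)
    return hits / len(CLIMATE_TERMS)

def screen_documents(docs: list[str], threshold: float = 0.2) -> list[str]:
    """Return documents whose score passes the threshold, for human review."""
    return [d for d in docs if climate_relevance_score(d) >= threshold]
```

Even a crude filter like this can cut the volume of documents that human analysts or heavier ML models must examine, which is why screening is a natural first stage in policy-aggregation pipelines.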
Scaling methane emission detection
Details (click to expand)
Methane is a highly potent greenhouse gas and the second-largest contributor to climate change, with emissions from the oil and gas industry accounting for 20% of global methane emissions.
Advanced machine learning techniques applied to satellite imagery enable the detection, quantification, and monitoring of methane emissions at scale, supporting more effective mitigation efforts across global oil and gas operations.
The primary data gap for methane detection is insufficient spatial resolution in widely available satellite data, making it difficult to pinpoint smaller or localized emission sources and accurately quantify their contribution.
Developing higher-resolution satellite systems like MethaneSAT and creating benchmark datasets with synthetic methane plume data can significantly improve detection capabilities, enabling more targeted mitigation efforts and potentially reducing a substantial portion of global methane emissions.
Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Images of methane plumes in hyperspectral satellite data are very rare, leading to insufficient data for developing and training robust detection algorithms. Consequently, researchers often use synthetic data, transposing high-resolution methane plume images from other sources such as Sentinel-2 onto hyperspectral images from platforms like PRISMA. Expanding the collection of actual hyperspectral methane plume observations or developing more sophisticated methods for generating realistic synthetic data would significantly improve detection capabilities.
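The transposition approach described above can be sketched in simplified form: a synthetic concentration map modulates the radiance of a background scene in absorption-sensitive bands. The plume shape, band count, and attenuation model below are illustrative assumptions, not the parameters of any specific pipeline:

```python
import numpy as np

def gaussian_plume(h, w, cy, cx, sigma):
    """Toy 2D Gaussian used as a stand-in for a methane concentration map."""
    y, x = np.mgrid[0:h, 0:w]
    return np.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))

def inject_plume(cube, plume, band_weights, strength=0.2):
    """Attenuate radiance in absorption bands proportionally to concentration.

    cube:         (bands, h, w) hyperspectral radiance array
    plume:        (h, w) synthetic concentration map in [0, 1]
    band_weights: (bands,) per-band methane absorption weights in [0, 1]
    """
    attenuation = 1.0 - strength * band_weights[:, None, None] * plume[None, :, :]
    return cube * attenuation

rng = np.random.default_rng(0)
cube = rng.uniform(0.5, 1.0, size=(8, 64, 64))      # toy 8-band scene
plume = gaussian_plume(64, 64, cy=32, cx=40, sigma=6)
weights = np.array([0, 0, 0, 0, 0, 0.5, 1.0, 0.8])  # only SWIR-like bands absorb
augmented = inject_plume(cube, plume, weights)
```

The plume mask doubles as a pixel-level training label, which is why this style of augmentation is attractive when real annotated hyperspectral plumes are scarce.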
Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Many current satellites have limited spatial resolution, making it challenging to detect smaller or localized methane sources. This low resolution can result in inaccurate assessments, potentially missing smaller leaks or misidentifying emission sources. Higher resolution is necessary for accurately identifying and quantifying methane emissions from specific facilities or small-scale sources.
Scaling solar photovoltaics site assessments
Details (click to expand)
Statistical analysis on solar photovoltaic (PV) systems with respect to pricing, logistics, planning, and site capacity studies is an important part of the process for siting solar PV systems.
Spatiotemporal generation forecasting using pre-existing site data can be used to inform future site recommendations, policy, and decision-making with respect to new developments.
Existing data exhibits coverage gaps that limit the applicability and generalization capacity of ML models across regions.
The availability of open datasets in more regions would help alleviate these gaps.
The solar PV system dataset excluded third-party-owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap: it includes historical data that may not reflect current pricing and costs of PV systems.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excluded third-party-owned systems, systems with battery backup, self-installed systems, and data that was missing installation prices. Data was self-reported and may be inconsistent based on the reporting of component costs. Furthermore, some state markets were underrepresented or missing, which can be alleviated by new data collection jointly with simulation studies.
S4: Sufficiency > Timeliness
The dataset includes historical data that may not reflect current pricing for PV systems. To alleviate this, updated pricing may be incorporated in the form of external data or as additional synthetic data from simulation.
Only the US is covered in this dataset. Supplementing the data with international large-scale photovoltaic satellite imagery could expand its coverage.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data may be accessed through the USGS’s designated USPVDB mapper or downloaded as GIS shapefiles, tabular data, or XML metadata. Data is open and easily obtainable.
S2: Sufficiency > Coverage
Coverage is over the US and specifically over densely populated regions that may or may not correlate to areas of low cloud cover and high solar irradiance. Representation of smaller scale private PV systems could expand the current dataset to less populated areas as well as regions outside the US.
Weather forecasting: Short-to-medium term (1-14 days)
Details (click to expand)
Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.
The biggest challenge of ENS is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only and are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from Copernicus Climate Data Store - Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days/months to hours
U6: Usability > Large Volume
Massive storage and processing requirements - Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
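One practical way to reduce transfer volume is to request only the variables, region, and time span actually needed. Below is a minimal sketch using the official `cdsapi` client; the dataset name follows the CDS “reanalysis-era5-single-levels” convention, but exact request keys should be verified against the current Climate Data Store documentation:

```python
def build_era5_request(variables, year, months, area):
    """Build a CDS request for a spatial/temporal subset of ERA5 single levels.

    area is [north, west, south, east] in degrees; restricting the bounding
    box avoids downloading the full global grid. Requesting all 31 day values
    is common practice, but behavior for nonexistent dates should be checked
    against the current CDS documentation.
    """
    return {
        "product_type": "reanalysis",
        "format": "netcdf",
        "variable": variables,
        "year": str(year),
        "month": [f"{m:02d}" for m in months],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "area": area,
    }

if __name__ == "__main__":
    # Requires `pip install cdsapi` and CDS credentials in ~/.cdsapirc.
    import cdsapi

    request = build_era5_request(
        ["2m_temperature", "total_precipitation"],
        year=2020, months=[6, 7, 8],
        area=[60, -10, 35, 30],  # bounding box over Europe
    )
    client = cdsapi.Client()
    client.retrieve("reanalysis-era5-single-levels", request, "era5_subset.nc")
```

Subsetting at request time trades one large transfer for many small, resumable ones, which also makes cloud-hosted mirrors and regional caches more effective.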
The biggest challenge of HRES is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only and are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
WeatherBench 2 is based on ERA5, so it inherits ERA5’s issues; in particular, the data has biases over regions where there are no observations.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
Weather forecasting: Subseasonal horizon
Details (click to expand)
High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from Copernicus Climate Data Store - Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days/months to hours
U6: Usability > Large Volume
Massive storage and processing requirements - Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on this data.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Larger models generally offer improved performance for developing data-driven sub-seasonal forecast models. However, with only a limited number of models contributing to the SubX dataset, there is a scarcity of training data. To enhance ML model performance, more SubX data generated by physics-based numerical weather forecast models is required.
Weather forecasting: Subseasonal-to-seasonal horizon
Details (click to expand)
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station measurements. Large biases occur over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
Resolution is 0.5 degrees (roughly 50 km), which is not sufficiently fine for many applications.
Camera trap wildlife image collections
Details (click to expand)
Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. This medium offers close-range monitoring over long-time scales. The image sequences can be used to not only classify species but to identify specifics about the individual, e.g. sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density.
In general, the raw images from camera traps need to be annotated before they can be used to train ML models. Some of the available annotated camera trap images are shared via Wildlife Insights (www.wildlifeinsights.org) and LILA BC (www.lila.science), while others are listed on GBIF (https://www.gbif.org/dataset/search?q=). However, the majority of camera trap data is likely scattered across individual research labs or organizations and not publicly available. Sharing such images could provide significant progress towards filling the gaps associated with the lack of annotated data that currently hinders the progress of efficiently using ML in biodiversity studies. This is what initiatives like Wildlife Insights are looking to do.
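Several datasets shared via LILA BC use a COCO-style annotation format. As a sketch (assuming the common “categories”/“annotations” JSON layout; field names should be checked against each dataset’s documentation), per-species image counts can be tallied as follows:

```python
import json
from collections import Counter

def species_counts(annotation_path):
    """Count annotations per species label in a COCO-style camera trap file.

    Assumes a top-level "categories" list (id -> name) and an "annotations"
    list whose entries carry a "category_id", as in the COCO Camera Traps
    layout used by several LILA BC datasets.
    """
    with open(annotation_path) as f:
        coco = json.load(f)
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])
```

Simple summaries like this are often the first step in diagnosing the taxonomic and geographic imbalances discussed below, since they reveal which species dominate a dataset before any model is trained.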
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. It is worth mentioning that the lack of annotated datasets is a common and major challenge applied to almost every modality of biodiversity data and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open, well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise needed to accurately identify species. This holds even for relatively well-studied taxa such as birds, and is more pronounced still for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
Collected data exhibits a significant geographic imbalance, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a gap intertwined with taxonomic incompleteness. Work on insect camera traps is growing, but the field is still in its infancy and data remains limited.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: standardized tools and workflows are lacking for converting raw data into analysis-ready form and analyzing it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: standardized tools and workflows are lacking for converting raw data into analysis-ready form and analyzing it consistently across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
Collected data exhibits a significant geographic imbalance, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a gap intertwined with taxonomic incompleteness.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
- Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
- Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
- Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
- Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
- Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
- Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open, well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise needed to accurately identify species. This holds even for relatively well-studied taxa such as birds, and is more pronounced still for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open, well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise needed to accurately identify species. This holds even for relatively well-studied taxa such as birds, and is more pronounced still for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
Collected data exhibits a significant geographic imbalance, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a gap intertwined with taxonomic incompleteness.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Academic literature databases
Academic literature databases, such as OpenAlex, Web of Science, and Scopus.
Active fire data – satellite-derived
Active fire data derived from images taken by satellites such as MODIS, VIIRS, and Landsat at different spatial resolutions and temporal frequencies. These datasets provide near real-time detection of active fires globally and can be downloaded from https://firms.modaps.eosdis.nasa.gov/active_fire.
Advanced metering infrastructure data
Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter systems that collect, store, and analyze per-building energy consumption.
AMI data can be retrieved through public data portals, individual data collection, or research partnerships with local utilities. Examples of utility research partnerships include the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) and the smart meter pilot test from the Sacramento Municipal Utility District (SMUD). An example of publicly available data that is aggregated and anonymized is the Commission for Energy Regulation (CER) Smart Metering Project hosted by the Irish Social Science Data Archive (ISSDA).
AMI data is challenging to obtain without pilot-study partnerships with utilities, since data collection on individual buildings' consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time-series data can also vary with the level of access granted (aggregated and anonymized versus raw) and with the resolution of the readings and metering system. Additionally, coverage is limited to utility pilot-test service areas, restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult to obtain due to privacy concerns. Even when partnered with a utility, the AMI data may undergo anonymization and aggregation to protect individual customers. Some ISOs are able to distribute data provided that a written records request is submitted. When requesting personal consumption data, enrollment in a pricing program may limit the temporal resolution of the data a utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
Using AMI data jointly with other data that may influence demand, such as weather, availability of rooftop solar, presence of electric vehicles, building specifications, and appliance inventory, may require significant additional data collection or retrieval. Non-intrusive load monitoring (NILM) techniques may be employed to disaggregate AMI data, given some assumptions based on additional data. For example, satellite imagery over a region of interest can assist in identifying buildings that have solar panels.
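To make the disaggregation idea concrete, here is a minimal event-based NILM sketch: it detects step changes in an aggregate meter series and matches them against rough appliance signatures. The signatures, threshold, and readings are illustrative assumptions only; real NILM systems use far richer features and probabilistic models.

```python
# Minimal sketch of event-based load disaggregation (NILM). All appliance
# wattages and readings below are illustrative assumptions, not real data.

# Illustrative appliance signatures in watts (assumed values).
SIGNATURES = {"fridge": 150, "kettle": 2000, "ev_charger": 7000}

def detect_events(series, threshold=100):
    """Return (index, delta) for step changes larger than `threshold` watts."""
    events = []
    for i in range(1, len(series)):
        delta = series[i] - series[i - 1]
        if abs(delta) >= threshold:
            events.append((i, delta))
    return events

def label_events(events, signatures=SIGNATURES, tolerance=0.2):
    """Match each step change to the closest signature within a relative
    tolerance; unmatched events are silently dropped in this sketch."""
    labels = []
    for i, delta in events:
        magnitude = abs(delta)
        best = min(signatures, key=lambda k: abs(signatures[k] - magnitude))
        if abs(signatures[best] - magnitude) <= tolerance * signatures[best]:
            labels.append((i, best, "on" if delta > 0 else "off"))
    return labels

# Aggregate readings (watts) with a kettle switching on, then off.
readings = [300, 310, 2320, 2330, 320, 310]
events = detect_events(readings)
print(label_events(events))  # -> [(2, 'kettle', 'on'), (4, 'kettle', 'off')]
```

In practice the additional data mentioned above (appliance inventories, solar presence from imagery) is what supplies plausible signatures for such matching.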
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted to issda@ucd.ie for evaluation. For data obtained through collaborative utility partnerships, usage rights may vary; contact the data provider for more information.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on the formats of the data streams output by the installed sensor network. When designing the testbed data format, it is recommended to develop comprehensive, well-structured metadata for the study to encourage further development.
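As a sketch of what structured testbed metadata might look like, the record below uses hypothetical field names and values (not a published standard), with a small check that the minimum fields are present:

```python
import json

# Hedged sketch of study metadata for a testbed AMI data stream; every field
# name and value here is an illustrative assumption, not a real schema.
metadata = {
    "study": "residential-ami-testbed",            # hypothetical study name
    "sampling": {"interval_minutes": 15, "timezone": "UTC"},
    "units": {"power": "kW", "energy": "kWh"},
    "anonymization": "household IDs pseudonymized",
    "collection_period": {"start": "2024-01-01", "end": "2024-12-31"},
}

def validate(record, required=("study", "sampling", "units")):
    """Check that a metadata record carries the minimum required fields."""
    missing = [k for k in required if k not in record]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return True

validate(metadata)
print(json.dumps(metadata, indent=2))
```

Shipping such a record alongside each data release makes downstream cleaning and reuse far easier than inferring conventions from the raw streams.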
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open-source. Further data collection for verification purposes is recommended.
S2: Sufficiency > Coverage
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware, ranging from 1-hour to 30-minute to 15-minute measurement intervals. Depending on the level of anonymization and aggregation, granularity may also be constrained by factors such as the cadence of time-of-use pricing and other tiered demand-response programs employed by the partnering utility. Interpolation may be used to address resolution issues, but uncertainty must then be considered when reporting results.
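As a concrete illustration of that interpolation caveat, the sketch below (pure Python, illustrative timestamps in minutes and readings in kW) upsamples 30-minute readings onto a 15-minute grid and flags which points are interpolated, so downstream analyses can track the added uncertainty:

```python
# Sketch: linearly interpolate irregular or coarse meter readings onto a
# regular grid, flagging interpolated points for uncertainty reporting.

def interpolate_readings(times, values, step):
    """Interpolate (time, value) samples onto a regular `step` grid.

    Returns (grid_time, value, is_interpolated) triples; observed samples
    that fall on the grid are passed through unflagged.
    """
    out = []
    t = times[0]
    j = 0
    while t <= times[-1]:
        # Advance j so that times[j] is the latest observation <= t.
        while j + 1 < len(times) and times[j + 1] <= t:
            j += 1
        if t == times[j]:
            out.append((t, values[j], False))
        else:
            # Linear interpolation between the bracketing observations.
            frac = (t - times[j]) / (times[j + 1] - times[j])
            v = values[j] + frac * (values[j + 1] - values[j])
            out.append((t, v, True))
        t += step
    return out

# 30-minute readings (kW) upsampled to a 15-minute grid.
print(interpolate_readings([0, 30, 60], [1.0, 2.0, 1.5], 15))
# -> [(0, 1.0, False), (15, 1.5, True), (30, 2.0, False), (45, 1.75, True), (60, 1.5, False)]
```

Keeping the `is_interpolated` flag in the output is one simple way to make the uncertainty considerations mentioned above reportable.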
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gáis Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change over time. Similarly, pilot programs through participating utilities are finite in nature. To address this gap, studies and testbeds at previous pilot locations can be reopened or revisited. For new studies in different locations, previous data can still be used to pre-train models; however, fine-tuning would still require new data collection.
Aerial power line corridor inspection data
LiDAR and image data collected from unmanned aerial vehicles (UAVs) for power line right-of-way (RoW) inspection can be accessed from private providers such as LUMA Energy and COR3, as well as sources like China Southern Power Grid, with datasets from Yunnan RoW-1, Yunnan RoW-2, and Hubei RoW 4. Open-source EPRI distribution inspection imagery is also available, labeled with information on conductors, poles, crossarms, insulators, and other infrastructure components. These datasets pair images with geolocated GIS data to identify priority vegetation management areas near transmission lines.
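A simplified sketch of that pairing step: flagging geolocated vegetation points within a buffer of a corridor centerline, assuming coordinates are already projected to planar meters. The geometry and points below are invented; real pipelines would use GIS tooling and the utility's actual RoW geometry.

```python
import math

# Sketch: flag vegetation points within `buffer_m` meters of a transmission
# line polyline. Assumes planar (projected) coordinates in meters.

def point_segment_distance(p, a, b):
    """Shortest distance from point p to segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamping to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def priority_points(points, line, buffer_m=10.0):
    """Return points within `buffer_m` of any segment of the polyline `line`."""
    flagged = []
    for p in points:
        d = min(point_segment_distance(p, line[i], line[i + 1])
                for i in range(len(line) - 1))
        if d <= buffer_m:
            flagged.append(p)
    return flagged

line = [(0, 0), (100, 0), (100, 100)]   # simplified RoW centerline (meters)
trees = [(50, 5), (50, 40), (105, 50)]  # hypothetical detected vegetation
print(priority_points(trees, line))     # -> [(50, 5), (105, 50)]
```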
UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse, with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Once collected, data is private as RoWs represent critical energy infrastructure. Private partnerships may allow for extended usage rights within a predefined scope.
S4: Sufficiency > Timeliness
Measurements should be taken at multiple time periods to examine transmission line characteristics with respect to both vegetation growth and line sag caused by overvoltage conditions.
S2: Sufficiency > Coverage
Coverage can vary depending on the RoW examined. Often multiple datasets that contain multiple transmission RoW UAV image data would be necessary to increase the number of image examples in the dataset.
O1: Obtainability > Findability
Must be involved in an active study with a partnering utility or transmission owner to get access to pre-existing drone data or to get permission to collect drone data.
Automated surface observation system (ASOS)
This dataset contains one- and five-minute observations from automated surface observation system stations in the US. The ASOS network provides near real-time surface weather measurements including wind speed and direction, dew point, air temperature, station pressure, precipitation, visibility, and cloud characteristics. See https://madis.ncep.noaa.gov/madis_OMO.shtml
Data volume is large and only data specific to the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
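One partial mitigation on the processing side is to stream observations rather than load them wholesale. The sketch below computes per-station mean temperatures in a single pass with constant memory; the column names are assumptions for illustration, not the actual MADIS/ASOS schema.

```python
import csv
import io

# Sketch: stream a large station-observation CSV row by row instead of
# loading it whole, accumulating running sums and counts per station.

def station_means(lines):
    """One-pass, constant-memory mean temperature per station."""
    sums, counts = {}, {}
    for row in csv.DictReader(lines):
        sid, temp = row["station"], float(row["temp_c"])
        sums[sid] = sums.get(sid, 0.0) + temp
        counts[sid] = counts.get(sid, 0) + 1
    return {sid: sums[sid] / counts[sid] for sid in sums}

# Tiny in-memory stand-in for a multi-gigabyte observation file.
sample = io.StringIO("station,temp_c\nKNYC,20.0\nKNYC,22.0\nKBOS,18.0\n")
print(station_means(sample))  # -> {'KNYC': 21.0, 'KBOS': 18.0}
```

The same pattern works over a file handle or a chunked remote download, which is what makes remote, compute-adjacent access to the data so valuable.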
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Benchmark datasets for building energy modeling
Building energy modeling datasets provide measurements of energy demand profiles for a sample of buildings, as well as relevant input variables for traditional and ML-based models, enabling benchmarking of different models on energy prediction tasks. For example, the US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states covering 7 climate zones and 11 different building types (https://bbd.labworks.org/dataset-search). The data covers energy consumption, indoor air quality, occupancy, environmental conditions, HVAC, and lighting, among other variables. Datasets are organized by name and points of contact.
All data featured on the platform is open access, with standardization on metadata format to allow for ease of use and information specific to buildings based on type, location, and climate zone. Data quality and guidance on curation and cleaning, in addition to access restrictions, are specified in the metadata of each hosted dataset. Licensing information for each individual featured dataset is provided.
Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Most of the energy demand data is not freely available. Reasons include the reluctance of private companies to share the data and privacy concerns with respect to the residents of the buildings. Such data may be obtained for research via non-disclosure agreements, often after lengthy bureaucratic approval. This situation makes the development of open-access benchmark datasets complex. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
U2: Usability > Aggregation
The different variables needed may not always be available together. One may need to match energy demand with building stock information and climatic data. Reusable open-source tools may ease this process.
S2: Sufficiency > Coverage
Most datasets are from test beds, buildings, and contributing households from the United States. Similar data from other regions would require data collection as household usage behavior may differ depending on culture, location, building age, and weather. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected. To overcome this gap, interpolation techniques may be employed, with the interpolated values recorded as such.
S6: Sufficiency > Missing Components
Certain detailed variables about building design and occupancy may not be recorded, and such data points are difficult to obtain without new data collection. Building data typically does not include grid-interactive data or utility-side signals related to control or demand-side management; such data can be difficult to obtain or require special permissions. Enabling the collection of utility-side signals would allow utility-initiated automated demand response (auto-DR) and load shifting to be better assessed.
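As a concrete illustration of the aggregation gap above (matching energy demand with building stock and climatic data), the following is a minimal Python sketch of a timestamp join. The record layout and field names (kwh, temp_c) are hypothetical, not drawn from any of the hosted datasets.

```python
# Sketch: aligning building energy demand with weather observations by
# timestamp, one step of the aggregation this gap describes.
# Field names ("kwh", "temp_c") are illustrative placeholders.

def merge_by_timestamp(demand, weather):
    """Join two lists of (timestamp, value) records on matching timestamps."""
    weather_by_ts = {ts: v for ts, v in weather}
    return [
        {"timestamp": ts, "kwh": kwh, "temp_c": weather_by_ts[ts]}
        for ts, kwh in demand
        if ts in weather_by_ts  # drop hours lacking a weather observation
    ]

demand = [("2018-01-01T00:00", 12.4), ("2018-01-01T01:00", 11.9)]
weather = [("2018-01-01T00:00", -3.2), ("2018-01-01T01:00", -3.5)]
merged = merge_by_timestamp(demand, weather)
```

Real pipelines must additionally reconcile time zones, units, and sampling intervals across sources before such a join is meaningful, which is where reusable open-source tooling would help.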
Biodiversity images and recordings – community science data
Details (click to expand)
Images and recordings contributed by volunteers represent another significant source of data on biodiversity and ecosystems. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased towards more densely populated species.
Building data genome project (hourly building-level metered data)
Details (click to expand)
The Building Data Genome Project 2 dataset contains hourly building-level data from 3,053 energy meters in 1,636 non-residential buildings, covering two years' worth of metered data for electricity, water, and solar, in addition to metadata on area, primary building use category, floor area, time zone, weather, and smart meter type. The goal of the dataset is to allow for the development of generalizable building models for energy efficiency analysis studies. The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data from universities and higher-education institutions.
While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data was collated from 7 open-access public data sources as well as 12 privately curated datasets obtained from facilities management at different college sites via manual site visits; the latter are not included in the data repository at this time.
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, limiting the diversity of building representation. To overcome this lack of diversity, data-sharing incentives and community open-source contributions could allow the dataset to expand.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies at higher resolution. Assumptions about conditions would have to be made prior to interpolating.
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018. While this may be adequate for pre-training models, further data collection through a reinitiation of the study may be needed to fine-tune models for more recent periods.
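Where the hourly granularity noted above must be harmonized with finer- or coarser-grained sources, simple interpolation is one option. A minimal Python sketch follows, handling interior gaps only and flagging imputed values so they are not mistaken for observations; the series values are invented for illustration.

```python
def interpolate_gaps(series):
    """Linearly fill None gaps in an evenly spaced series (e.g., hourly
    meter readings), recording which indices were imputed rather than
    observed. Assumes gaps are interior (bounded by real values)."""
    filled, imputed = list(series), []
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1  # find the end of the gap
            lo, hi = filled[i - 1], filled[j]
            step = (hi - lo) / (j - i + 1)
            for k in range(i, j):
                filled[k] = lo + step * (k - i + 1)
                imputed.append(k)
            i = j
        i += 1
    return filled, imputed

vals, flagged = interpolate_gaps([10.0, None, None, 16.0])
# vals -> [10.0, 12.0, 14.0, 16.0]; flagged -> [1, 2]
```

Keeping the imputed-index list alongside the filled series makes downstream efficiency studies aware of which readings are synthetic.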
Building stock – from cadaster and aerial imagery
Details (click to expand)
Building stock maps enable a geolocalized understanding of where and which kind of buildings stand, relevant both to climate change mitigation and adaptation. Building stock data from cadasters and aerial imagery provide the most precise available data. In addition to precise building footprints, the 3D geometry of walls and roofs may be available thanks to LiDAR aerial surveys. Further high-quality information from the cadaster may be available as attributes, such as the current usage or the construction year of the building.
These datasets are mainly available in rich countries in Europe, North America, and Asia, leaving large parts of the world that face pressing building stock challenges without appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data. Research efforts, in particular in Europe, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), are addressing several of the existing data gaps.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Certain datasets require searching and navigating websites in a foreign language.
O2: Obtainability > Accessibility
Some datasets are not publicly available and require either payment or governmental authorization. This situation is changing in Europe via the high-value dataset regulation of the European Union, which mandates member states to release their building stock data with permissive licenses (https://data.europa.eu/en/news-events/news/unlocking-potential-high-value-datasets-impact-hvd-implementing-regulation).
U1: Usability > Structure
Datasets are released under a multitude of formats. Despite the existence of standards such as CityGML, one typically needs a particular pipeline for processing every new dataset.
U2: Usability > Aggregation
Datasets are typically released by local authorities and require aggregation. Some efforts, in particular in Europe, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), have made this process easier, but without yet enabling seamless updates.
U3: Usability > Usage Rights
Most datasets use attribution-based licenses, but some datasets use custom licenses, unclear licenses, or restrictive licenses.
U4: Usability > Documentation
Most datasets do not provide appropriate documentation to fully understand how the dataset was created.
U5: Usability > Pre-processing
Certain fields may contain local codes that need to be translated and understood. Numerical values may contain encodings for NAs, such as -1 or 1000, that need to be cleaned.
U6: Usability > Large Volume
Precise 3D datasets can be voluminous for a city. Country-level datasets also tend to require significant computing resources.
R1: Reliability > Quality
The height estimation from LiDAR data may contain large errors, e.g., due to surrounding objects such as trees.
S2: Sufficiency > Coverage
There are very few datasets outside of rich countries from Europe, North America, and Asia. Precise 3D models and attribute-rich datasets are available for even fewer countries.
S4: Sufficiency > Timeliness
Practices vary widely, from multiple updates per year to a one-off release that may be more than 10 years old. Aerial surveys with LiDAR are expensive and are rarely done more than once every ten years.
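The pre-processing gap above (local codes and NA encodings such as -1 or 1000) can be handled with a small cleaning pass. A hedged Python sketch follows; the field names and the sentinel set are illustrative rather than taken from any specific cadaster, which documents its own codes that must be checked field by field.

```python
# Sketch: replacing sentinel encodings of missing values before analysis.
# The sentinel set (-1, 1000) mirrors the examples in the entry above;
# real datasets define their own codes per field.

SENTINELS = {-1, 1000}

def clean_record(record, numeric_fields):
    """Return a copy of the record with sentinel-encoded numeric values
    replaced by None, leaving genuine values untouched."""
    cleaned = dict(record)
    for field in numeric_fields:
        if cleaned.get(field) in SENTINELS:
            cleaned[field] = None
    return cleaned

row = {"height_m": -1, "floors": 3}
assert clean_record(row, ["height_m", "floors"]) == {"height_m": None, "floors": 3}
```

Applying such a pass before model training prevents sentinel codes from being learned as if they were physical measurements.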
Building stock – satellite-derived
Details (click to expand)
Building stock maps enable a geolocalized understanding of where and which kind of buildings stand, relevant both to climate change mitigation and adaptation. Satellite-derived datasets, which often use ML for processing satellite imagery, can provide such maps on a global scale. Coarser-resolution maps come as raster data at resolutions varying from 10 to more than 100 m, while the maps with the highest resolution provide details on building footprint geometries as vector data. Some of these datasets may have a temporal resolution and some inferred attributes describing the building characteristics.
These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Some datasets have not been published as scientific datasets and lack appropriate documentation about the methodology. Users should be aware of uncertainties in case of insufficient documentation of potential errors.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent the accurate identification of buildings.
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
S3: Sufficiency > Granularity
Raster datasets provide a noisy view of the building stock.
S4: Sufficiency > Timeliness
The data depends on the availability of satellite surveys. Some datasets may mix images from different years. The surveys may be more than 5 years old, mischaracterizing fast-growing areas. In case of disasters, the imagery pre-disaster may not be representative of the current building stock.
S6: Sufficiency > Missing Components
More attributes inferred with high confidence would unlock new use cases.
CMIP6 (earth system model intercomparison data)
Details (click to expand)
CMIP6 (Coupled Model Intercomparison Project Phase 6) provides climate simulations from a consortium of state-of-the-art global climate models, covering historical periods and future scenarios through 2100. The dataset includes multiple climate variables at various spatial and temporal resolutions from modeling centers worldwide. Data can be found at https://pcmdi.llnl.gov/CMIP6/.
The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Massive computational requirements - Cloud-based platforms and data subsetting tools can improve accessibility
U1: Usability > Structure
Inconsistent formats across models - Standardized naming conventions and preprocessing pipelines can enable seamless multi-model integration
R1: Reliability > Quality
Large uncertainties in future projections - Model evaluation frameworks and ensemble weighting methods can help quantify and reduce uncertainties
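The standardized naming suggested under U1 above can be sketched as a rename table applied before multi-model analysis. In the Python sketch below, the per-model spellings are illustrative: CMIP6 output largely follows CMOR names such as tas and pr, but auxiliary fields and derived products often deviate, and the mapping here is a hypothetical example rather than an official vocabulary.

```python
# Sketch: mapping each model's variable labels onto one canonical
# vocabulary before multi-model analysis. The alias list is illustrative.

CANONICAL = {
    "tas": "tas", "t2m": "tas", "air_temperature": "tas",
    "pr": "pr", "precip": "pr", "precipitation_flux": "pr",
}

def harmonize(variables):
    """Rename a model's variables to canonical names, flagging unknowns
    instead of silently dropping or guessing them."""
    renamed, unknown = {}, []
    for name, data in variables.items():
        if name in CANONICAL:
            renamed[CANONICAL[name]] = data
        else:
            unknown.append(name)
    return renamed, unknown

renamed, unknown = harmonize({"t2m": [280.1], "precip": [1e-5], "zos": [0.2]})
```

Flagging unknown names explicitly keeps the preprocessing pipeline auditable, which matters when dozens of models feed one analysis.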
Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.
CPC Precipitation (global unified daily precipitation)
Details (click to expand)
The CPC Global Unified gauge-based analysis of daily precipitation provides gridded daily precipitation estimates derived from rain gauge observations: https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station data, and there are large biases over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
Resolution is 0.5 deg (roughly 50km) and not sufficiently fine for many applications.
ClimSim (benchmark data for hybrid ML-physics research)
Details (click to expand)
ClimSim is an ML-ready benchmark dataset designed for hybrid ML-physics research, for example, for emulating subgrid clouds and convection processes in climate models.
ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, like large-eddy simulations, are needed.
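One mitigation for the volume gap above is streaming computation that never materializes the full dataset. The Python sketch below accumulates a mean over chunks; the chunk source is a stand-in for files or lazily loaded array blocks, and the numbers are illustrative.

```python
# Sketch: streaming statistics over a dataset too large to hold in
# memory. Each chunk could be one file or one lazily loaded array block.

def streaming_mean(chunks):
    """Accumulate a mean over an iterable of numeric chunks without
    materializing the full dataset."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count

def chunk_source():
    # Generator: yields pieces one at a time instead of one big list.
    yield [1.0, 2.0]
    yield [3.0, 4.0, 5.0]

assert streaming_mean(chunk_source()) == 3.0
```

The same pattern extends to variances, histograms, and per-variable summaries, keeping peak memory proportional to one chunk rather than the whole archive.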
Climate-related laws and regulations
Details (click to expand)
Laws and regulations for climate action are published by national and subnational governments. There are some centralized databases, such as Climate Policy Radar, the International Energy Agency, and the New Climate Institute, that have selected, aggregated, and structured these data into comprehensive resources.
Laws and regulations for climate action are published in various formats through national and subnational governments, and most are not labeled as a “climate policy”. There are a number of initiatives that take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and permanent updating, and the resulting datasets are not complete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be structured into a machine-readable format. Much of the data is in the original languages of the publishing countries and needs to be translated into English.
U2: Usability > Aggregation
Legislation data is published through national and subnational governments, and often is not explicitly labeled as “climate policy”. Determining whether a given law is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
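Determining whether a law is climate-related, as the aggregation gap above notes, often starts from a crude first-pass filter that later stages refine. The naive Python sketch below flags text by keyword match; the keyword list is illustrative, and real aggregation efforts use multilingual ML classifiers rather than exact word matching.

```python
# Naive sketch: flagging legislation text as potentially climate-related
# via keyword matching. Exact word matching misses inflections
# ("emissions" vs "emission"); this is a first-pass filter only.

KEYWORDS = {"climate", "emission", "renewable", "carbon", "adaptation"}

def maybe_climate_related(text):
    """Return True if any climate keyword appears as a word in the text."""
    words = set(text.lower().split())
    return bool(words & KEYWORDS)

assert maybe_climate_related("An act on carbon pricing") is True
assert maybe_climate_related("An act on road tolls") is False
```

Such a filter trades recall for simplicity; its misses are one reason the resulting policy datasets remain incomplete.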
ClimateBench v1.0 (benchmark dataset for earth system models)
Details (click to expand)
ClimateBench v1.0 is a benchmark dataset derived from the NorESM2 Earth System Model (a participant in CMIP6) designed specifically for evaluating machine learning methods that emulate key climate variables. The dataset is publicly available at https://zenodo.org/records/7064308.
The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
ClimateSet (ML-ready earth system model inputs/outputs)
Details (click to expand)
ClimateSet is an ML-ready benchmark dataset compiled from inputs and outputs of the Input4MIPS and CMIP6 archives, structured for various machine learning tasks including climate model emulation, downscaling, and prediction. More information is available at https://arxiv.org/pdf/2311.03721.pdf.
Computational fluid dynamics simulation for building energy models
Details (click to expand)
Computational fluid dynamics (CFD) simulation output from building energy models is a means of precisely assessing the thermal (e.g., insulation of the walls) and ventilation (e.g., natural ventilation or HVAC) properties of a building. Given the building geometry, terrain, presence of neighboring buildings, and boundary conditions, the Navier-Stokes equations are typically solved. Datasets including precise building inputs and outputs from CFD would help build ML surrogate models. Surrogate models, such as GANs or physics-constrained deep neural network architectures, have shown promising results, though further research on turbulence representation is needed.
Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate inputs on material properties that may not be documented for traditional urban building types. Model outputs require domain knowledge to interpret, and the large volumes of synthetic data produced for different wind directions become challenging to manage. Future data collection aimed at verifying simulation outputs would benefit surrogate approaches to the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern construction approaches, leaving passive building techniques from indigenous communities, known as vernacular architecture, out of design consideration.
Data Gap Type
Data Gap Details
W: Wish
Such datasets do not exist and require dedicated work to gather inputs, generate the data via simulations, and ensure that the simulations are reliable by verifying them with real-world data. Licensing and privacy issues may also be important aspects of such efforts.
DOE Atmospheric Radiation Measurement research facility data products
Details (click to expand)
The DOE Atmospheric Radiation Measurement (ARM) dataset comprises ground-based measurements from various field programs sponsored by the US Department of Energy, including sun-tracking photometers, radiometers, and spectrometer data useful for solar radiation time series forecasting and solar potential assessment.
ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. AI-based data compression and novel indexing can improve data management.
S3: Sufficiency > Granularity
Enhanced aerosol composition and ice nucleating particle measurements are needed for a better understanding of cloud dynamics and solar irradiance for DER site planning.
S2: Sufficiency > Coverage
Spatial coverage is limited to ARM sites within the United States. Collaboration with partner networks can expand coverage both within and outside the US.
R1: Reliability > Quality
Sensor data can be sensitive to noise and calibration issues, requiring automated systems to identify measurement drift.
DYAMOND (global atmospheric circulation model intercomparison data)
Details (click to expand)
DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains) is an intercomparison of global storm-resolving model simulations at 5 km resolution or less, used as targets for climate model emulators.
DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, like large-eddy simulations, are needed.
Digital elevation model
Details (click to expand)
Surface elevation data, often called digital elevation model or terrain surface model, provide a 3D representation of the bare surface of the Earth. These topographic inputs are important for disaster risk assessments and modeling to assess risks due to floods, sea level rise, or landslides, where the elevation of a given location determines whether it is at risk. These digital models are typically estimated from remote sensing data, for example, the Shuttle Radar Topography Mission. They are often provided as raster but may also be provided as points (vector).
Very high-resolution reference data is currently not freely open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Surface elevation data, defined by a digital elevation model (DEM), is one of the most essential types of reference data, and high-resolution elevation data has great value for disaster risk assessment, particularly in the Global South.
Open DEM data with global coverage now reaches a resolution of 30 m, but this is still insufficient for many disaster risk assessments. Higher-resolution datasets exist, but they either have limited spatial coverage or are commercial products that are very expensive to obtain.
Direct measurement of methane emission of rice paddies
Details (click to expand)
With sampling systems placed in rice paddies, methane concentrations can be directly measured in the air above the fields or in the soil.
There is a lack of direct observation of methane emissions from rice paddies.
Data Gap Type
Data Gap Details
W: Wish
Direct measurement of methane emissions is often expensive and labor-intensive. But this data is essential as it provides the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Distribution system simulators
Details (click to expand)
Distribution system simulators such as OpenDSS and GridLab-D enable analysis of hosting capacity for distribution-level substation feeders by simulating how various factors affect grid stability and reliability. These open-source tools allow researchers to model voltage limits, thermal capabilities, control parameters, and fault currents under different scenarios, providing insights into how distribution grids can safely accommodate distributed energy resources like solar panels. These simulators serve as critical alternatives when real circuit feeder data from utilities is unavailable.
While OpenDSS and GridLab-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Realistic distribution system studies require aggregating and collating data from multiple external sources regarding network topology, load profiles, and DER penetration for the specific region of interest.
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities and/or the Distribution System Operator (DSO) to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
Simulator studies require real deployment data from substations for verification, as actual hosting capacity may vary based on load conditions, environmental factors, and DER penetration levels in the service area.
Drone imagery
Details (click to expand)
Drone imagery provides high-resolution, close-range visual data for species identification, individual tracking, and environmental reconstruction. These images offer detailed insights into habitats and wildlife populations, similar to camera traps but with greater flexibility in coverage. Currently, most drone imagery data is scattered across disparate sources, with some collections hosted on platforms like www.lila.science.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with the incompleteness of current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
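The volunteer-driven annotation described above ultimately requires reconciling conflicting labels from multiple annotators. A minimal sketch of one common approach, majority-vote consensus (the image IDs and species labels are illustrative):

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote consensus from volunteer annotations.

    annotations: dict mapping item id -> list of labels from different
    annotators. Returns item id -> (consensus_label, agreement_fraction).
    """
    consensus = {}
    for item, labels in annotations.items():
        (label, votes), = Counter(labels).most_common(1)
        consensus[item] = (label, votes / len(labels))
    return consensus

# Illustrative camera-trap labels from three volunteers per image.
votes = {
    "img_001": ["lion", "lion", "leopard"],
    "img_002": ["zebra", "zebra", "zebra"],
}
result = aggregate_labels(votes)
```

The agreement fraction is useful downstream: low-agreement items can be routed to domain experts rather than trusted blindly as training labels.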
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with the incompleteness of current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with the incompleteness of current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
ECMWF ENS (global 9-km 15-day ahead weather model ensemble)
Details (click to expand)
Ensemble forecast up to 15 days ahead, generated by the ECMWF numerical weather prediction model; used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
As with HRES, the biggest challenge of ENS is that only a portion of it is available to the public for free.
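When ENS serves as a baseline for ML forecasts, ensemble skill is commonly summarized with the continuous ranked probability score (CRPS). A minimal sketch of the standard empirical CRPS estimator on a toy ensemble (member values are illustrative):

```python
def ensemble_crps(members, obs):
    """Empirical CRPS for an ensemble forecast against one observation:
    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|.
    Lower is better; 0 means a perfect, fully confident forecast."""
    m = len(members)
    spread_to_obs = sum(abs(x - obs) for x in members) / m
    internal_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return spread_to_obs - 0.5 * internal_spread

# Toy 3-member ensemble verified against an observed value of 2.0.
score = ensemble_crps([1.0, 2.0, 3.0], 2.0)
```

In practice the score is computed per grid point and aggregated over space and lead time; libraries provide vectorized versions of this same estimator.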
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available only for purchase, and obtaining it is expensive.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
ECMWF ERA5 Atmospheric Reanalysis
Details (click to expand)
ERA5 is a comprehensive atmospheric reanalysis dataset covering 1940 to present that integrates in-situ and remote sensing observations from weather stations, satellites, and radar into a global, hourly gridded product at 31 km resolution. The dataset is continuously updated and available for download through the Copernicus Climate Data Store.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from the Copernicus Climate Data Store. Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days or months to hours.
U6: Usability > Large Volume
Massive storage and processing requirements. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
R1: Reliability > Quality
Inherent biases limit ground-truth applications. ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy.
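One way to work around both the download delays and the volume is to request only the spatial and temporal subset needed via the CDS API rather than pulling full global fields. A sketch that builds such a subset request (the region, variable, and helper function are illustrative assumptions; actual retrieval requires a free CDS account and the `cdsapi` client):

```python
def build_era5_request(variables, year, months, area_nwse, grid=None):
    """Build a CDS request for the 'reanalysis-era5-single-levels'
    dataset that asks only for the region and period needed, instead of
    the full global archive."""
    request = {
        "product_type": "reanalysis",
        "variable": variables,
        "year": str(year),
        "month": [f"{m:02d}" for m in months],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "area": area_nwse,      # [North, West, South, East] in degrees
        "format": "netcdf",
    }
    if grid:
        request["grid"] = grid  # e.g. [0.5, 0.5] coarsens on the server side
    return request

# Illustrative: hourly 2m temperature over Europe for June 2020.
req = build_era5_request(["2m_temperature"], 2020, [6], [72, -25, 34, 45])

# With a configured CDS account, the request would then be submitted via:
#   import cdsapi
#   cdsapi.Client().retrieve("reanalysis-era5-single-levels", req, "era5_subset.nc")
```

Requesting a bounding box and a single month this way can shrink a retrieval by several orders of magnitude relative to a full-archive download.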
While ERA5 is widely used due to its good structure and global coverage, users face significant challenges: download times can run from days to months, and the sheer data volume presents processing difficulties for many users.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Massive storage and processing requirements. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from the Copernicus Climate Data Store. Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days or months to hours.
U6: Usability > Large Volume
Massive storage and processing requirements. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
R1: Reliability > Quality
Inherent biases limit ground-truth applications. ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy.
ECMWF HRES (global 9-km 10-day ahead high-resolution weather forecast)
Details (click to expand)
Single high-resolution forecast up to 10 days ahead, generated by the ECMWF numerical weather prediction model, the Integrated Forecasting System (IFS). It is usually used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available only for purchase, and obtaining it is expensive.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
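A routine volume-reduction step for large gridded archives like HRES (or its WeatherBench 2 repackaging) is to map a geographic bounding box onto array index slices so that only the needed sub-array is read from storage. A minimal sketch for a regular latitude/longitude grid (the grid spacing and region are illustrative):

```python
def region_slices(lats, lons, south, north, west, east):
    """Map a geographic bounding box onto index slices of a regular
    lat/lon grid, so only the needed sub-array is read from storage."""
    lat_idx = [i for i, v in enumerate(lats) if south <= v <= north]
    lon_idx = [i for i, v in enumerate(lons) if west <= v <= east]
    return slice(lat_idx[0], lat_idx[-1] + 1), slice(lon_idx[0], lon_idx[-1] + 1)

# Illustrative 1-degree global grid with descending latitudes, as in many archives.
lats = [90 - i for i in range(181)]
lons = list(range(360))
lat_sl, lon_sl = region_slices(lats, lons, south=34, north=72, west=0, east=45)
# A chunked reader would then fetch only data[..., lat_sl, lon_sl].
```

Chunk-aware formats such as Zarr (used by WeatherBench 2) make this kind of sliced access efficient, since only the chunks overlapping the slices are transferred.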
EPRI10 (transmission control center alarm and operational data set)
Details (click to expand)
Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format, which includes semi-structured text descriptions of individual alarm events.
Often, the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.
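The record layout described above can be illustrated with a small parser. A sketch assuming a pipe-delimited encoding of the fields (the delimiter, field order, and example values are illustrative assumptions; real SCADA exports vary by utility):

```python
from datetime import datetime, timezone

def parse_alarm_row(row):
    """Parse one semi-structured alarm record of the illustrative form:
    epoch_ms|station|signal_id|location|description|action."""
    ts_ms, station, signal_id, location, description, action = row.split("|")
    return {
        "time": datetime.fromtimestamp(int(ts_ms) / 1000, tz=timezone.utc),
        "station": station,
        "signal_id": signal_id,
        "location": location,
        "description": description.strip(),
        "action": action.strip(),
    }

rec = parse_alarm_row(
    "1700000000123|SUB_A|BRKR_52-1_TRIP|Feeder 7|Overcurrent phase B|OPEN"
)
```

Keeping millisecond timestamps as timezone-aware datetimes up front avoids the ordering bugs that plague downstream event-sequence analysis.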
Access to EPRI10 grid alarm data is currently limited within EPRI. Usability gaps result from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to assess whether a fault or non-fault event occurred.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Operational alarm data volume is large, given the millisecond cadence of measurements in the system. The result is high-volume data that is tabular in nature but unstructured with respect to the text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains locations and grid asset information, spatio-temporal analysis can be performed with respect to a single sensor and the conditions under which that sensor is operating. Indexing and mining the time series data can therefore facilitate faster search over alarm data leading up to a fault event. Additionally, natural language processing and text mining techniques can be utilized to facilitate search over alarm text and details.
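The time-series indexing idea can be sketched as bucketing alarms into fixed windows so that the events leading up to a fault are retrievable without a full scan (timestamps, window sizes, and alarm codes here are illustrative):

```python
from collections import defaultdict

def build_window_index(events, window_ms=60_000):
    """Bucket (timestamp_ms, alarm_code) events into fixed time windows
    so any interval can be queried without scanning the full stream."""
    index = defaultdict(list)
    for ts, code in events:
        index[ts // window_ms].append((ts, code))
    return index

def alarms_before(index, fault_ts, lookback_ms=120_000, window_ms=60_000):
    """Return alarms in [fault_ts - lookback_ms, fault_ts), sorted by time."""
    start = fault_ts - lookback_ms
    hits = []
    for w in range(start // window_ms, fault_ts // window_ms + 1):
        hits.extend((ts, c) for ts, c in index.get(w, []) if start <= ts < fault_ts)
    return sorted(hits)

events = [(10_000, "OC_TRIP"), (70_000, "VOLT_DIP"), (130_000, "BRKR_OPEN")]
idx = build_window_index(events)
pre_fault = alarms_before(idx, fault_ts=130_000)
```

Production systems would back this with a time-series database, but the access pattern — window lookup rather than full scan — is the same.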
U5: Usability > Pre-processing
In addition to challenges with respect to the decoding of remote signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of text detail provided. Typically, the details cover information with respect to the grid asset and its action. For example, a text description from a line monitoring device may describe the power, temperature, and the action taken in response to the grid alarm trigger event. Often, in real-world systems, the majority of grid alarm trigger events are short circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To combat these issues, data pre-processing becomes necessary. For remote signal identification data, this includes parsing and hashing through text codes, assessing code components for redundancies, and building an associated reduced dictionary of alarm codes. For textual description fields and post-fault field reports, the use of natural language processing techniques to extract key information can provide more consistency between sensor data. Additionally, techniques like diverse sampling can account for the class imbalance with respect to the associated fault that can trigger the alarm.
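The reduced dictionary of alarm codes mentioned above can be sketched as a normalization pass that collapses naming-convention variants of the same alarm (the normalization rules and example codes are illustrative assumptions, not EPRI's actual conventions):

```python
import re

def normalize_alarm_code(code):
    """Collapse naming-convention variants of the same alarm into one
    canonical key: uppercase, unify separators, drop numeric suffixes."""
    code = code.upper()
    code = re.sub(r"[\s\-/]+", "_", code)
    code = re.sub(r"_\d+$", "", code)  # per-device numeric suffixes
    return code

def reduced_dictionary(codes):
    """Map each raw code to its canonical form; redundant codes collapse."""
    return {c: normalize_alarm_code(c) for c in codes}

# Three raw variants that all denote the same overcurrent trip alarm.
mapping = reduced_dictionary(["OC-TRIP-1", "OC_TRIP_2", "oc trip 3"])
```

On a real corpus the rules would be tuned against the code dictionary and validated with operators, since an over-aggressive normalization can merge genuinely distinct alarms.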
U4: Usability > Documentation
Remote signaling identification information from monitoring sensors and devices encodes data with respect to the alarm trigger event in the context of fault priority. Based on the asset, line, or sensor, this identification code can vary depending on the naming conventions used. Documentation on remote signal ids associated with a dictionary of finite alarm code types can facilitate pre-processing of alarm data and assessment on the diversity of fault events occurring in real-time systems (as different alarm trigger codes may correspond to redundant events similar in nature).
U3: Usability > Usage Rights
Usage rights are currently constrained to those working within EPRI at this time.
U2: Usability > Aggregation
Reports on location, asset, and time can result in false alarm triggers requiring operators to send field workers to investigate, fix, and recalibrate field sensors. The data with respect to field assessments can be incorporated into the original data to provide greater context resulting in compilation of multimodal datasets which can enhance alarm data understanding.
U1: Usability > Structure
Grid alarm codes may be non-unique across lines and grid assets: two different codes can represent equivalent information due to differing naming conventions, requiring significant pre-processing and analysis to identify unique labels from over 2000 code words. Additional labels expressing alarm priority (for example, a high alarm type indicating events such as fire, gas, or lightning) are also encoded into the grid alarm trigger event code. Creating a standard structure for operational text data, such as those already used in operational systems by companies like General Electric or Siemens, can avoid inconsistencies in the data.
R1: Reliability > Quality
Alarm trigger events, and the corresponding actions taken in response, require post-event assessment by field workers for verification, especially in cases of faults or perceived faults.
O2: Obtainability > Accessibility
Data access is limited within EPRI due to restrictions on data provided by utilities. Anonymizing and aggregating the data into a benchmark or toy dataset that EPRI releases to the wider community could circumvent these security issues, at the cost of operational context.
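One common building block for such an anonymized release is a salted one-way hash of asset identifiers, so records stay linkable within the dataset without exposing the utility's naming scheme. This is a generic sketch, not an EPRI method; the identifier format and salt handling are illustrative.

```python
import hashlib

def anonymize_asset_id(asset_id: str, salt: str) -> str:
    """Replace an asset identifier with a salted one-way hash.
    The same (salt, asset_id) pair always yields the same token,
    preserving linkability within one release."""
    digest = hashlib.sha256((salt + asset_id).encode("utf-8")).hexdigest()
    return digest[:12]  # truncated for readability; the salt must stay secret

anon = anonymize_asset_id("LN12-FEEDER-07", salt="keep-this-secret")
```

Truncation and salt rotation policies would need review against the actual re-identification risk of the released data.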
Electric vehicle charge station data
Details (click to expand)
Electric vehicle charging station datasets typically include location, charger specifications, energy delivery amounts, charge duration, costs, and usage patterns for both AC slow charging (depot-based) and DC fast charging (en-route) stations, though specific datasets vary by provider and region.
Critical gaps include limited findability of station-specific usage data due to proprietary restrictions and scattered data sources requiring aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Charging station usage profiles and vehicle-specific load data are often proprietary. Solution: Establish manufacturer partnerships and utility pilot programs to access detailed charging profiles.
U2: Usability > Aggregation
Charging data scattered across multiple providers and systems. Solution: Create standardized APIs and data sharing agreements between charging network operators.
Emission dataset compiled from FAO statistics
Details (click to expand)
Dataset Introduction: This dataset comprises agricultural emissions data compiled from Food and Agriculture Organization (FAO) statistics and spatially extrapolated to provide geospatial coverage. It includes estimates of greenhouse gas emissions related to agricultural practices across different regions worldwide and is periodically updated as new FAO statistics become available.
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Environmental DNA (eDNA)
Details (click to expand)
Environmental DNA (eDNA) datasets consist of genetic material obtained from environmental samples, like soil and water, after being shed by living or dead organisms. By analyzing this genetic material, researchers can detect and monitor species present in a non-invasive and efficient manner, aiding biodiversity studies, conservation efforts, and environmental monitoring. Some eDNA data can be found in GBIF (the Global Biodiversity Information Facility). BIOSCAN-5M is another relevant, comprehensive dataset containing multi-modal information, including DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.
A significant challenge for eDNA-based monitoring is the incomplete barcoding reference databases, limiting the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Incomplete barcoding reference databases limit the identification of many species from eDNA samples, particularly in biodiverse regions.
Equivalent circuit models
Details (click to expand)
Equivalent circuit models (ECMs) are simplified representations of batteries that use networks of resistors and capacitors to model battery behavior arising from electrochemical reactions. Due to their ease of use, they integrate readily into battery management control systems and can be customized to model a variety of battery chemistries and conditions. Types of equivalent circuit models include the Rint model, hysteresis models, Randles models, and Thevenin models. These models differ in complexity with respect to the extent to which battery behavior is captured: for example, the simplest model, the Rint model, is static, while other models vary in their representation of dynamic properties such as state of charge and battery lifetime.
While ECMs enable real-time battery SoC predictions due to their computational efficiency, they often oversimplify real-life operating conditions which limits the accuracy of SoH and RUL estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Due to their simplified nature and assumptions based on ideal laboratory conditions, ECMs have limited accuracy in predicting battery aging and dynamics in real systems. Verification with real-life battery system data from diverse operational environments is essential for improving state of health (SoH) and remaining useful life (RUL) predictions.
S3: Sufficiency > Granularity
The resolution of SoH and SoC predictions from ECMs is impacted by assumptions made about battery performance. These include constant internal resistance assumptions that do not account for sensitivity to complex current profiles or temperature variations, leading to inaccurate voltage and subsequent SoH/SoC calculations. ECMs also simplify electrochemical processes by ignoring electrode polarization, diffusion, and transfer kinetics, while neglecting battery aging effects like capacity fade. Linearity assumptions in simpler ECMs do not hold under high charge/discharge rates. Solutions include increasing the complexity of ECMs by adding parallel RC networks to model the internal resistance of the battery with different time constants, introducing non-linear elements for different operating conditions, incorporating adaptive hysteresis models, and integrating aging parameters.
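To make the RC-network idea concrete, a first-order Thevenin model (series resistance R0 plus one parallel RC pair) can be simulated in discrete time. This is a textbook sketch; the parameter values and the linear open-circuit-voltage curve are placeholder assumptions, not measured characteristics of any real cell.

```python
import math

def simulate_thevenin(current, dt, capacity_ah, r0, r1, c1, soc0=1.0):
    """First-order Thevenin ECM: terminal voltage = OCV(soc) - I*R0 - V_RC,
    where the RC-pair voltage V_RC relaxes with time constant R1*C1.
    `current` lists discharge currents in A (positive = discharge);
    SoC is tracked by simple coulomb counting."""
    def ocv(soc):
        return 3.0 + 1.2 * soc  # assumed linear open-circuit voltage [V]

    alpha = math.exp(-dt / (r1 * c1))  # per-step RC relaxation factor
    soc, v_rc = soc0, 0.0
    socs, volts = [], []
    for i in current:
        # RC branch: exponential decay plus excitation by the load current
        v_rc = alpha * v_rc + r1 * (1.0 - alpha) * i
        soc -= i * dt / (capacity_ah * 3600.0)  # coulomb counting
        volts.append(ocv(soc) - i * r0 - v_rc)
        socs.append(soc)
    return socs, volts

# Constant 2 A discharge of an assumed 2 Ah cell for 60 seconds.
socs, volts = simulate_thevenin([2.0] * 60, dt=1.0, capacity_ah=2.0,
                                r0=0.05, r1=0.02, c1=500.0)
```

Adding more RC pairs with different time constants, or making R0/R1 functions of temperature and SoC, is exactly the complexity increase the text describes.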
Faraday (Synthetic smart meter data)
Details (click to expand)
Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open up smart meter data, Octopus Energy's Centre for Net Zero has generated a synthetic dataset conditioned on the presence of low carbon technologies, energy efficiency, and property type from a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier. Faraday is currently accessible through the Centre for Net Zero's API.
Despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge with respect to building type, efficiency, and presence of low-carbon technologies. Furthermore, since the dataset is trained on UK building data, AMI time series generated may not be an accurate representation of load demand in regions outside the UK. Finally, since data is synthetically generated, studies will require validation and verification using real data or aggregated per substation level data to assess its effectiveness.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The variational autoencoder model can generate synthetic AMI data conditioned on several inputs. The presence of low carbon technology (LCT) for a given household or property type depends on access to battery storage solutions, solar rooftop panels, and electric vehicles; this type of data may require curation of LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric based on actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
S2: Sufficiency > Coverage
Faraday is trained on utility-provided AMI data from the UK, which may not be representative of the load demand, building types, and climate zones of other global regions. To generate similar synthetic data, custom data may be retrieved through a pilot test bed for private collection or through a partnership with a local utility. Additionally, pre-existing AMI data over an area of interest can be used to generate similar synthetic data.
Datasets are restricted to past pilot study coverage areas, requiring further data collection to fine-tune models for a different coverage area.
S3: Sufficiency > Granularity
Data granularity is limited to that of the data the model was trained on. Generative modeling approaches similar to Faraday can be built using higher-resolution data, or interpolation methods can be employed.
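As a simple example of the interpolation option, a coarse half-hourly profile can be upsampled to 5-minute resolution with linear interpolation. The profile values below are synthetic, and this approach only smooths between samples; it cannot recover sub-half-hour spikes the meter never recorded.

```python
import numpy as np

# Half-hourly synthetic AMI readings (kW) over 2 hours: samples at t = 0..120 min.
t_coarse = np.arange(0, 121, 30)
load_coarse = np.array([0.4, 0.6, 1.1, 0.9, 0.5])

# Upsample to 5-minute resolution by linear interpolation between readings.
t_fine = np.arange(0, 121, 5)
load_fine = np.interp(t_fine, t_coarse, load_coarse)
```

For genuinely finer dynamics, retraining a generative model on higher-resolution data is the more faithful route.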
S4: Sufficiency > Timeliness
Keeping the dataset timely would require continuous integration and development of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy's OpenSynth initiative, the Centre for Net Zero hopes to build a global community of contributors to facilitate research.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid modeling manner. For example, load demand at the substation level can be estimated as the sum of the individual building loads that the substation serves; this value can then be compared to the actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, verifying a specific demand profile for a property or set of properties would require identifying a population of buildings, a connected real-world substation, and the residential low carbon technology investments for the properties under study.
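The bottom-up check described above can be sketched as follows. The synthetic building profiles and the noisy "measured" substation profile are randomly generated stand-ins, and the error metrics are illustrative choices rather than an established acceptance criterion.

```python
import numpy as np

def validate_bottom_up(building_loads, substation_load):
    """Compare the sum of synthetic per-building load profiles against a
    measured substation profile on the same time base.
    Returns absolute RMSE and RMSE normalized by mean demand."""
    aggregate = building_loads.sum(axis=0)           # sum over buildings
    rmse = float(np.sqrt(np.mean((aggregate - substation_load) ** 2)))
    nrmse = rmse / float(np.mean(substation_load))   # relative to mean demand
    return rmse, nrmse

rng = np.random.default_rng(0)
synthetic = rng.uniform(0.2, 1.5, size=(100, 48))    # 100 buildings, 48 half-hours
measured = synthetic.sum(axis=0) + rng.normal(0, 2.0, size=48)  # noisy "truth"
rmse, nrmse = validate_bottom_up(synthetic, measured)
```

In a real study, `measured` would come from DNO substation telemetry, and the acceptable error level would depend on the downstream application.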
FathomNet (marine wildlife annotated imagery)
Details (click to expand)
FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. The data can be used to train, test, and validate ML algorithms to help us understand our ocean and its inhabitants.
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Financial loss datasets related to the impacts of disasters
Details (click to expand)
Financial loss datasets related to disasters track the economic impacts of catastrophic events, including insurance claims and damages to infrastructure. They help assess financial repercussions and guide risk management and preparedness strategies.
Financial loss data for disasters is primarily proprietary and inaccessible to researchers, limiting the development of comprehensive disaster impact assessment models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes to improve disaster response and planning.
Financial loss data is typically proprietary and not publicly accessible.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Most consistent loss data is produced by the insurance industry and remains proprietary.
O2: Obtainability > Accessibility
Collecting robust, homogeneous loss data even for a single event presents significant challenges.
U4: Usability > Documentation
Loss data frequently lacks metadata, making it difficult to determine data completeness.
Grid2Op and PandaPower (power systems simulation outputs)
Details (click to expand)
Grid2Op is a power systems simulation framework to perform reinforcement learning for electricity network operation that focuses on the use of topology to control the flows of the grid.
Grid2Op allows users to control voltages by manipulating shunts or changing generator setpoint values, to influence active generation through redispatching, and to operate storage units such as batteries or pumped storage to produce or absorb energy from the grid when needed. The grid is represented as a graph whose nodes are buses and whose edges correspond to power lines and transformers. Grid2Op provides several environments with different network topologies, as well as variables that can be monitored as observations. The environment is designed for reinforcement learning agents to act upon through a variety of actions, some binary and some continuous, including topology changes such as changing bus, changing line status, setting storage, curtailment, redispatching, setting bus values, and setting line status. Multiple reward functions are also available in the platform for experimentation with different agents. It is important to note that Grid2Op has no internal modeling of the grid equations, nor does it prescribe which solver to adopt. Data on how the power grid evolves is represented by the “Chronics,” and the solver that computes the state of the grid is represented by the “Backend,” which uses PandaPower to compute power flows.
Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
Several TODOs remain in the customization of the reward function concerning its units and attributes related to redispatching. Documentation and code comments sometimes provide conflicting information. The modularity of reward, adversary, action, environment, and backend is non-intuitive, requiring pre-generated dictionaries rather than dynamic inputs or conversion from single-agent to multi-agent functionality. Refactoring documentation and comments to reflect updates would assist users and avoid the need to cross-reference information from the “Learning to Run a Power Network” Discord channel and GitHub issues.
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading failure scenarios or more complex adversaries such as natural disasters. Code base variations between versions, especially between the native and Gym-formatted frameworks, lose features present in the legacy version, including topology graphics. Open-source refactoring efforts could help update the code base so that the latest and previous versions run without loss of features.
R1: Reliability > Quality
The Grid2Op framework relies on mathematically robust control laws and rewards that train the RL agent on fixed observation assumptions rather than actual system dynamics, which are susceptible to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal modeling of the grid equations, nor can it suggest which solver should be adopted to solve traditional nonlinear optimal power flow equations; specifics of modeling and preferred solvers require users to customize or create a new “Backend.” Additionally, such human-in-the-loop RL systems in practice require trustworthiness and quantification of risk. A library of open-source contributed “Backends” from independent projects that customize the framework, with supplemental documentation and paper references, could assist further development of the environment for different conditions. Human-in-the-loop studies can be completed by testing environment scenarios and the control response of the system over a model of a real grid; generated observations and control actions can then be compared to historical event sequences and grid operator responses.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the Grid2Op environment, which are based on different IEEE bus topologies. While customization of the “Backend,” “Parameters,” and “Rules” is possible, dependent modules may still enforce game-over rules. Furthermore, since backend modeling is not the focus of Grid2Op, verification that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observation time series grid data, or chronics. Furthermore, the granularity may limit the effectiveness of specific actions in the provided action space: for example, the use of energy storage devices in the presence of overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage. Expanding the framework, with efforts from the open-source community, to include multiple time resolutions may allow generalization of the tool to different forecasting time horizons and action evaluations.
Ground survey of land use and land management
Details (click to expand)
Ground surveys collect direct field observations on land use practices and management approaches, providing critical ground-truth data that complements remote sensing. This information is essential for understanding human impacts on ecosystems and validating satellite-derived land cover classifications.
Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
Ground-Based Weather Station Observations
Details (click to expand)
Ground-based weather station data provides point measurements of atmospheric variables including temperature, precipitation, and humidity from meteorological networks worldwide. These observations serve as ground truth for validating and bias-correcting climate model outputs, though spatial coverage varies significantly by region and is particularly sparse in developing countries.
Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Point measurements require gridding: statistical interpolation methods and geostatistical techniques can convert station data to regular grids.
S2: Sufficiency > Coverage
Sparse coverage in remote regions: expanded observation networks and satellite-derived proxies can fill spatial gaps.
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
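As one concrete baseline for the gridding step, inverse-distance weighting (IDW) interpolates station values onto a regular grid; kriging or spline-based methods are common, more sophisticated alternatives. The station coordinates and temperatures below are made up for illustration.

```python
import numpy as np

def idw_grid(st_lon, st_lat, st_val, grid_lon, grid_lat, power=2.0):
    """Inverse-distance-weighted interpolation of station values onto a
    regular lon/lat grid. Uses planar distances, which is only a
    reasonable approximation over small domains."""
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    out = np.empty(glon.shape)
    for idx in np.ndindex(*glon.shape):
        d = np.hypot(st_lon - glon[idx], st_lat - glat[idx])
        if d.min() < 1e-9:                  # grid node coincides with a station
            out[idx] = st_val[d.argmin()]
        else:
            w = 1.0 / d ** power            # closer stations weigh more
            out[idx] = (w * st_val).sum() / w.sum()
    return out

stations_lon = np.array([0.0, 1.0, 0.0])
stations_lat = np.array([0.0, 0.0, 1.0])
temps = np.array([10.0, 14.0, 12.0])        # e.g., temperatures in deg C
grid = idw_grid(stations_lon, stations_lat, temps,
                np.linspace(0, 1, 5), np.linspace(0, 1, 5))
```

IDW never extrapolates outside the range of station values, which is why sparse networks in remote regions cap what any interpolation can recover.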
Ground-survey based forest inventory data
Details (click to expand)
Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models. See https://research.fs.usda.gov/programs/fia#data-and-tools
Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Ground-survey data often contains missing values, measurement errors, and duplicates that require cleaning before use. Standardizing collection protocols and developing automated quality control procedures could improve data usability.
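An automated quality-control pass might look like the following sketch. The field names (`plot_id`, `tree_id`, `height_m`, `dbh_cm`) and the plausibility ranges are illustrative assumptions, not an established inventory protocol.

```python
def quality_control(records, height_range=(0.0, 120.0), dbh_range=(0.0, 400.0)):
    """Flag ground-survey tree records with missing values, out-of-range
    measurements (height in m, diameter at breast height in cm), or
    duplicate plot/tree IDs. Returns (clean, flagged) lists."""
    seen, clean, flagged = set(), [], []
    for rec in records:
        key = (rec.get("plot_id"), rec.get("tree_id"))
        h, d = rec.get("height_m"), rec.get("dbh_cm")
        if (None in (h, d)
                or not height_range[0] < h <= height_range[1]
                or not dbh_range[0] < d <= dbh_range[1]
                or key in seen):
            flagged.append(rec)
        else:
            seen.add(key)
            clean.append(rec)
    return clean, flagged

records = [
    {"plot_id": 1, "tree_id": 1, "height_m": 22.5, "dbh_cm": 31.0},
    {"plot_id": 1, "tree_id": 1, "height_m": 22.5, "dbh_cm": 31.0},   # duplicate
    {"plot_id": 1, "tree_id": 2, "height_m": 250.0, "dbh_cm": 28.0},  # implausible
    {"plot_id": 1, "tree_id": 3, "height_m": None, "dbh_cm": 12.0},   # missing
]
clean, flagged = quality_control(records)
```

Flagged records would normally be routed back for field re-measurement rather than silently dropped.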
S2: Sufficiency > Coverage
Manual collection methods limit geographical coverage and collection frequency. Integrating ground surveys with remote sensing approaches and developing citizen science initiatives could help expand coverage while maintaining data quality.
Health data
Details (click to expand)
Health data refers to information related to individuals’ physical and mental well-being. This can include a wide range of data, such as medical records, health surveys, healthcare utilization, and epidemiological data.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
There are, in general, few datasets one can use to cover the spectrum of population, age, gender, economic status, etc. To make good use of available data, there should be more effort to integrate data from disparate sources, such as creating data repositories and an open community data standard.
U4: Usability > Documentation
Some data repositories are available. The existing issue is that the data is not always accompanied by the source code that created it or by other good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data is usually in raster files or gridded format, whereas health data is usually in tabular format. Mapping climate data to the same geospatial entity of health data is also computationally expensive.
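Once health-data administrative boundaries are rasterized onto the climate grid, the mapping reduces to zonal aggregation: averaging each gridded variable over the cells belonging to each geospatial entity. The toy temperature grid and two-zone map below are illustrative.

```python
import numpy as np

def zonal_mean(raster, zone_ids):
    """Average a gridded climate variable over each zone of a rasterized
    region map of the same shape, yielding one tabular value per
    geospatial entity (e.g., a health-reporting district)."""
    return {int(z): float(raster[zone_ids == z].mean())
            for z in np.unique(zone_ids)}

# 4x4 temperature grid and a matching map of two administrative zones.
temperature = np.arange(16, dtype=float).reshape(4, 4)
zones = np.array([[1, 1, 2, 2]] * 4)
means = zonal_mean(temperature, zones)  # one mean temperature per zone
```

For real rasters, the expensive step is the boundary rasterization and handling of cells that straddle zone borders (area-weighted averaging), which this sketch omits.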
High-Resolution Rapid Refresh (HRRR) weather forecast
Details (click to expand)
The High-Resolution Rapid Refresh (HRRR) dataset contains near-term weather forecasts produced at 3-km resolution with hourly updates. It is a cloud-resolving, convection-allowing atmospheric model that assimilates radar data every 15 minutes over a 1-hour period. See https://rapidrefresh.noaa.gov/hrrr/
Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Historical climate observations
Details (click to expand)
Historical climate observations provide essential baseline data for tracking ecosystem changes over time. This dataset includes both global reanalysis products like ERA5 that offer comprehensive but coarse-resolution data, and more granular observations aggregated from local weather stations that provide detailed climate information at specific locations.
Processing climate data and integrating it with health data are significant challenges.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform covering all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and so on, which are important for biodiversity patterns. This is because observation systems are not dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, which cannot be done by any single country.
Lab measurements of material property and carbon absorption
Details (click to expand)
Lab measurements of material properties (such as chemical composition and physical properties) and their performance on carbon absorption (such as absorption capacity).
The major challenge is that data is not shared with the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data related to carbon absorption materials is often not readily accessible to the public, as it is typically withheld until commercial products are developed. While it is possible to scrape data from published literature, this approach can be cumbersome, especially for large datasets. To advance research and innovation in this field, establishing mandatory data sharing as a requirement for publication is essential. When a paper is published, authors should be required to provide their data in open, machine-readable formats to facilitate accessibility and usability.
Creating open initiatives where companies and institutions recognize the mutual benefits of data sharing is also vital. Until such initiatives demonstrate clear advantages for all stakeholders, private companies may be hesitant to share proprietary data. Initiatives like OpenDAC are promising steps toward fostering collaboration and transparency in the field.
Large-eddy simulations (atmospheric processes)
Details (click to expand)
Large-eddy simulations are very high-resolution atmospheric simulations (finer than 150 m) where atmospheric turbulence is explicitly resolved in the model, providing detailed insights into small-scale atmospheric processes.
These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
Current high-resolution simulations cannot resolve many physical processes like turbulence. Extremely high-resolution simulations (sub-kilometer or tens of meters) are needed to serve as ground truth for training ML models as they provide a more realistic representation of atmospheric processes. Creating and sharing benchmark datasets based on these simulations would facilitate model development and validation.
LiDAR point cloud – airborne
Details (click to expand)
Airborne LiDAR (Light Detection and Ranging) collects high-resolution, three-dimensional point clouds of forest structure using sensors mounted on aircraft or drones. This technology captures precise data about forest canopies, enabling detailed assessment of biomass and carbon stocks at local to regional scales.
Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to process raw LiDAR point clouds and generate canopy height metrics used for training ML models. Developing open-source processing tools with standardized workflows would make this data more accessible to non-experts.
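As an illustration of the kind of standardized workflow such tools could provide, the sketch below rasterizes a synthetic, height-normalized point cloud into a canopy height model by taking the maximum return per grid cell. A real pipeline (e.g., built on laspy or PDAL) would also handle ground classification and noise filtering; all values here are synthetic:

```python
import numpy as np

# Sketch: rasterize a synthetic LiDAR point cloud into a canopy height
# model (CHM) by taking the highest return in each grid cell. Heights
# are assumed to be already normalized above ground.
rng = np.random.default_rng(0)
n_points = 10_000
x = rng.uniform(0, 100, n_points)        # metres, easting
y = rng.uniform(0, 100, n_points)        # metres, northing
z = rng.uniform(0, 30, n_points)         # height above ground, metres

cell = 10.0                              # 10 m output resolution
nx, ny = int(100 / cell), int(100 / cell)
ix = np.minimum((x / cell).astype(int), nx - 1)
iy = np.minimum((y / cell).astype(int), ny - 1)

chm = np.zeros((ny, nx))
# Max height per cell; np.maximum.at handles repeated indices correctly.
np.maximum.at(chm, (iy, ix), z)

print(chm.shape)   # (10, 10) grid of canopy heights
```

From such a grid, standard canopy metrics (maximum height, cover fraction above a height threshold) follow with simple array operations.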
S2: Sufficiency > Coverage
Airborne LiDAR provides the most accurate measurements of canopy height but is not collected everywhere due to the high costs of aircraft or drone operations. Coordinated efforts to expand coverage and make existing data publicly available would significantly improve forest carbon stock estimation capabilities.
Micro-synchrophasors (µPMU data)
Details (click to expand)
Micro-phasor measurement units (µPMUs) provide synchronized voltage and current measurements with higher accuracy, precision, and sampling rates than conventional PMUs, making them ideal for distribution network monitoring.
For example, µPMUs have an angle accuracy allowance of 0.01 degrees and a total vector error allowance of 0.05%, in contrast to the 1-degree and 1% total vector error allowances for classic PMUs. With sampling rates of 10-120 samples per second, µPMUs can capture dynamic and transient states within the low-voltage distribution network, allowing for improved event and fault detection and localization. Today, most µPMU datasets are accessed through manual field deployments in test beds, collaborative research studies, or publicly available datasets.
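The total vector error (TVE) figures above can be made concrete: TVE compares the measured and true phasors as complex numbers, |X_meas − X_true| / |X_true| (per IEEE C37.118). The sketch below, with synthetic values, shows that a 0.01-degree angle error alone stays well inside the 0.05% µPMU allowance:

```python
import numpy as np

# Total vector error (TVE) combines magnitude and angle error of a
# phasor measurement into a single fraction. Values here are synthetic.
def tve(measured: complex, true: complex) -> float:
    """Return TVE as a fraction (0.0005 == 0.05 %)."""
    return abs(measured - true) / abs(true)

true_phasor = 1.0 * np.exp(1j * 0.0)                 # 1.0 p.u. at 0 degrees
# Measurement with a 0.01-degree angle error and no magnitude error:
meas_phasor = 1.0 * np.exp(1j * np.deg2rad(0.01))

error = tve(meas_phasor, true_phasor)
print(f"TVE = {error * 100:.4f} %")   # well inside the 0.05 % µPMU allowance
```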
For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Typically, the distribution circuit model lacks annotation of phase identification and impedance values, often providing only rough approximations, which can ultimately influence the accuracy of fault localization as well as the time series contextualization of a fault. Decreased localization accuracy can then affect downstream control mechanisms meant to ensure operational reliability. For µPMU data to be utilized for fault localization, the distribution circuit model must be provided by the partnering utility or DSO.
U5: Usability > Pre-processing
µPMU data is sensitive to noise especially from geomagnetic storms which can induce electric currents in the atmosphere and impact measurement accuracy. Data can also be compromised by errors introduced by current and potential transformers. One way to mitigate this error is to monitor and re-calibrate transformers or deploy redundant µPMUs to verify measurements.
If additional data from other sensors or field reports is used to classify µPMU time series, creating a joint sensor dataset may improve quality, depending on the overall sampling rate and format of the additional non-µPMU data.
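One way to realize the redundant-µPMU cross-check mentioned above is a simple disagreement flag between two co-located measurement streams. The sketch below uses synthetic per-unit voltage magnitudes and an illustrative tolerance; real deployments would also account for legitimate fast dynamics:

```python
import numpy as np

# Sketch: cross-check two redundant µPMU voltage-magnitude streams and
# flag samples where they disagree beyond a tolerance, a simple way to
# catch transformer-induced errors. All data is synthetic.
rng = np.random.default_rng(1)
n = 1200                                    # 10 s at 120 samples/s
v_a = 1.0 + 0.001 * rng.standard_normal(n)  # p.u., sensor A
v_b = v_a + 0.0005 * rng.standard_normal(n) # p.u., redundant sensor B
v_b[600:650] += 0.02                        # inject a calibration drift on B

tolerance = 0.005                           # p.u. disagreement threshold
disagree = np.abs(v_a - v_b) > tolerance
print(int(disagree.sum()), "samples flagged for review")
```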
U6: Usability > Large Volume
Due to high sampling rates and continuous capture, the data volume from each individual µPMU can be challenging to manage and analyze. Coupled with the number of µPMUs required to monitor a portion of the distribution network, the amount of data can easily exceed terabytes. Automated indexing and mining of time series by transient characteristics can facilitate verification efforts by domain specialists.
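A minimal sketch of such automated indexing, on a synthetic 120-samples-per-second stream: a threshold on the first difference flags the samples bounding an injected voltage sag, so only those windows need full-resolution review or long-term storage:

```python
import numpy as np

# Sketch: index a continuous µPMU stream by transient characteristics,
# so only windows with candidate events need full-resolution review.
# The signal and the sag are synthetic.
rng = np.random.default_rng(2)
fs = 120                                              # samples per second
signal = 1.0 + 0.0005 * rng.standard_normal(fs * 60)  # 1 minute of data
signal[3000:3010] -= 0.05                             # inject a voltage sag

diff = np.abs(np.diff(signal))
threshold = diff.mean() + 6 * diff.std()
event_idx = np.flatnonzero(diff > threshold)
print("candidate transient samples:", event_idx)
```

The flagged indices bracket the injected sag; a production indexer would store window metadata (timestamp, magnitude, duration) rather than raw samples.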
R1: Reliability > Quality
Since µPMU data is continuously captured, time series data leading up to or even identifying a fault or potential fault requires verification from other data sources.
Digital Fault Recorders (DFRs) capture high resolution event driven data such as disturbances due to faults, switching and transients. They are able to detect rapid events like lightning strikes and breaker trips while also recording the current and voltage magnitude with respect to time. Additionally, system dynamics over a longer period following a disturbance can also be captured. When used in conjunction with µPMU data, DFR data can assist in verifying significant transients found in the µPMU data which can facilitate improved analysis of both signals leading up to and after an event from the perspective of distribution-side state.
S2: Sufficiency > Coverage
Currently, installing µPMUs in existing distribution grids carries significant financial costs, so most deployments have been pilot projects with utilities. Based on North American Synchrophasor Initiative (NASPI) reports, pilot studies include the Flexgrid testing facility at Lawrence Berkeley National Laboratory (LBNL), the Philadelphia Navy Yard microgrid (2016-2017), the micro-synchrophasors for distribution systems plus-up project (2016-2018), resilient electricity grids in the Philippines (2016), the GMLC 1.2.5 sensing and measurement strategy (2016), and the bi-national laboratory for the intelligent management of energy sustainability and technology education in Mexico City (2017-2018).
Coverage is also limited by acceptance of this technology, due to a pre-existing reliance on SCADA systems that measure grid conditions on a 15-minute cadence. As transients become more common, especially on the low-voltage distribution grid, a transition to higher-resolution monitoring will become necessary. Multi-objective evaluation of the value proposition of further µPMU sensor networks can provide utilities and DSOs with a framework for assessing the economic, environmental, and operational benefits of pursuing larger-scale studies.
S4: Sufficiency > Timeliness
µPMU data can suffer from multiple latencies within the grid monitoring system, which cannot keep up with the high sampling rate of the continuous measurements that µPMUs generate. Latencies arise as signals are recorded, processed, sent, and received, and depend on the communication medium used, cable distance, amount of processing, and computational delay. More specifically, the named latencies are measurement, transmission, channel, receiver, and algorithm related. Identifying characteristics that precede fault events, with lead times sufficient to overcome these latencies, through machine learning or other techniques would be of benefit.
Museum specimens
Details (click to expand)
Museum specimens contain detailed biological records documenting species’ characteristics, including morphological traits. Data on where and when they were collected is also often recorded. This offers documentation on the occurrence of species in both space and time. Museum specimens are valuable resources for various applications, such as species classification and species distribution modeling.
The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Museum specimens only become valuable to ML studies when they are digitized. Many museum specimens remain to be digitized, and this task presents significant challenges. Much of the information about these specimens, such as species traits and occurrence data, is often recorded in handwritten notes, making parsing and recognizing this information a complex and error-prone process.
Digitizing these specimens has become a priority for many museums. To support this effort, adequate funding and technical and scientific assistance should be provided. Machine learning itself can support some of these efforts, e.g., in digitizing handwritten notes.
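As a toy example of the parsing step, the sketch below turns one transcribed label into a structured occurrence record with a regular expression. The label format and field names are hypothetical; real digitization pipelines pair handwriting recognition with far more robust parsers:

```python
import re

# Hypothetical transcribed specimen-label text (pipe-delimited format
# is an assumption for illustration, not a museum standard).
label = "Quercus alba L. | Coll. J. Smith | 12 Jun 1932 | Madison Co., Iowa"

pattern = re.compile(
    r"(?P<species>[A-Z][a-z]+ [a-z]+)[^|]*\|"     # binomial name
    r"\s*Coll\.\s*(?P<collector>[^|]+)\|"         # collector
    r"\s*(?P<date>\d{1,2} [A-Za-z]{3} \d{4})\s*\|" # collection date
    r"\s*(?P<locality>.+)$"                       # locality string
)
m = pattern.match(label)
record = {k: v.strip() for k, v in m.groupdict().items()} if m else {}
print(record)
```

Structured records like this (species, collector, date, locality) are what make specimens usable for species distribution modeling.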
NEX-GDDP-CMIP6 downscaled climate projections
Details (click to expand)
The NEX-GDDP-CMIP6 dataset provides high-resolution, bias-corrected global climate projections derived from Coupled Model Intercomparison Project Phase 6 (CMIP6) across four greenhouse gas emissions scenarios (Shared Socioeconomic Pathways). It includes daily climate variables such as temperature, precipitation, humidity, and radiation from 2015 to 2100 at approximately 25 km resolution, enabling detailed analysis of climate change impacts sensitive to local topography and fine-scale climate gradients. For more information, see https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6.
The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The NEX-GDDP-CMIP6 dataset requires substantial computational resources for processing and analysis. While cloud platforms provide access, they involve usage costs that may be prohibitive for some researchers. Processing such large datasets requires specialized techniques like distributed computing frameworks (e.g., Dask, Spark) and occasionally large-memory computing nodes for certain statistical analyses. Many researchers and practitioners lack either the technical expertise or computational resources to effectively utilize this valuable data.
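A minimal sketch of one mitigation, chunked processing, is shown below: an extreme-heat statistic is accumulated one block of days at a time so the full record never sits in memory at once. Plain numpy arrays stand in here for a lazy reader such as xarray with Dask over the NEX-GDDP-CMIP6 NetCDF files, and the grid, values, and 35 °C threshold are illustrative:

```python
import numpy as np

# Sketch: chunk-wise extraction of an extreme-heat statistic. Synthetic
# numpy blocks stand in for lazily loaded daily-maximum temperature
# (tasmax) chunks; in practice xarray + Dask would supply these.
rng = np.random.default_rng(3)
n_days, ny, nx = 365, 60, 60              # one year on a small grid
threshold_k = 308.15                      # 35 degrees C in kelvin

hot_days = np.zeros((ny, nx), dtype=int)
for start in range(0, n_days, 73):        # process 73-day chunks
    chunk = 290 + 25 * rng.random((73, ny, nx))  # stand-in for tasmax [K]
    hot_days += (chunk > threshold_k).sum(axis=0)

print("max hot days per cell:", hot_days.max())
```

The key design point is that only per-cell counters persist between chunks, so memory use is independent of the record length.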
NIST campus photovoltaic arrays and weather station data
Details (click to expand)
This dataset contains measurements from PV arrays at the National Institute of Standards and Technology campus from August 2014-July 2017, including electrical, temperature, meteorological, and radiation data sampled at high frequency with one-minute averages.
The dataset has limited spatial coverage (Gaithersburg, MD only) and is no longer maintained after July 2017, limiting its usefulness for current applications.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Since the testbeds are located on the NIST campus, spatial coverage is limited to the institution’s site. Similar datasets combining sensor measurements of solar irradiance conditions with the associated solar power generated at the inverter output would require investment in comparable site-specific testbeds in other regions.
S4: Sufficiency > Timeliness
The dataset is no longer maintained after July 2017, so it does not capture recent weather conditions or improvements in PV technology; renewed or continued data collection would improve its usefulness for current applications.
NOAA's SOLRAD Network Solar Radiation Data
Details (click to expand)
The National Oceanic and Atmospheric Administration’s SOLRAD Network monitors surface radiation at nine locations across the United States. The data includes high-precision measurements from various instruments, including pyrheliometers, pyranometers, and UV radiometers that collect minute-interval measurements of incoming solar radiation. These measurements characterize the Earth’s surface radiation budget and can be used to accurately forecast solar energy generation for grid planning and management.
While NOAA’s SOLRAD is an excellent data source for long term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include lower quality hourly averages compared to native resolution data, and limited geographic coverage with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The coverage area is constrained to nine SOLRAD network locations in the United States (Albuquerque, NM; Bismarck, ND; Hanford, CA; Madison, WI; Oak Ridge, TN; Salt Lake City, UT; Seattle, WA; Sterling, VA; Tallahassee, FL). For generalization to other regions, locations with similar climates and temperate zones would need to be identified.
S3: Sufficiency > Granularity
Data quality of the hourly averages is lower than that of the native resolution data, impacting effective short-term forecasting for real-time energy management, grid stability, demand response, and market operations. To address this gap, using very short-term data or supplementing with data from sky imagers and other sensors with frequent measurement outputs would be beneficial.
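The loss of information in hourly averaging can be illustrated directly: resampling a synthetic minute-interval irradiance series to hourly means removes most of the short-term variability (e.g., cloud-driven ramps) that matters for intra-hour forecasting:

```python
import numpy as np
import pandas as pd

# Sketch: hourly averaging of minute-interval irradiance smooths out
# the short-lived ramps relevant to short-term forecasting. Synthetic.
rng = np.random.default_rng(4)
idx = pd.date_range("2024-06-01 10:00", periods=120, freq="min")
ghi = pd.Series(800 + 150 * rng.standard_normal(120), index=idx)  # W/m^2

hourly = ghi.resample("h").mean()   # two hourly averages
print("native 1-min std:", round(ghi.std(), 1))
print("hourly-mean std: ", round(hourly.std(), 1))
```

The hourly means vary far less than the native series, which is exactly the variability a short-term forecaster needs to see.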
NREL Physical Solar Model Solar Radiation Database
Details (click to expand)
The National Renewable Energy Laboratory (NREL)’s National Solar Radiation Database (NSRDB) provides hourly and half-hourly solar radiation data modeled using NREL’s Physical Solar Model (PSM). The data is derived from multiple satellite sources including NOAA’s Geostationary Operational Environmental Satellites (GOES), the Interactive Multisensor Snow and Ice Mapping System (IMS), MODIS, and MERRA-2 reanalysis. The PSM derives cloud and aerosol properties as inputs for the Fast All-sky Radiation Model for Solar applications (FARMS), enabling users to access spectral irradiance data based on time, location, and PV orientation.
While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Data derived from satellite imagery requires pre-processing to account for pixel variability, parallax effects, and additional modeling using radiative transfer to improve solar radiation estimates.
S4: Sufficiency > Timeliness
Data flow from satellite imagery to solar radiation measurement output from FARMS needs to be recalculated and updated to expand beyond the current coverage years of the represented global regions.
R1: Reliability > Quality
Satellite-based estimation of solar resource information for sites susceptible to cloud cover, snow, and bright surfaces may not be accurate, requiring verification from ground-based measurements.
NREL SRRL Baseline Measurement System for Multi-Variable Solar Research
Details (click to expand)
The NREL Solar Radiation Research Laboratory’s Baseline Measurement System (SRRL BMS) provides 130 variables at 60-second intervals for site-specific environmental factors at its Golden, Colorado facility. This comprehensive dataset includes co-located measurements of temperature, pressure, precipitation, wind parameters, humidity, UV index, aerosol optical depth, albedo, and cloud cover categorized as opaque, thin, and clear. This multi-variable dataset supports photovoltaic potential studies and renewable resource climatology research.
While NREL’s SRRL BMS provides real-time joint-variable data from ground-based sensors, its coverage is limited to a single location in Golden, CO, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Instrument malfunctions or calibration issues require human intervention; if detection is delayed, the resulting inaccuracies in measured quantities can degrade solar forecast accuracy. Despite this, the dataset continues to be maintained.
S2: Sufficiency > Coverage
Coverage is limited to Golden, CO. Other locations would benefit from similar sensor monitoring systems, especially those with variations in weather patterns that could affect solar irradiance forecasting and energy harvesting.
NREL Wind Active Power Control Simulation Tools
Details (click to expand)
NREL has developed simulation tools to understand the effects of wind power on interconnection system frequency, including the Flexible Energy Scheduling Tool for Integrating Variable Generation (FESTIV) and Multi-Area Frequency Response Integration Tool (MAFRIT). These tools use traditional commercial software and custom-developed models to perform dynamic simulations and wind generation studies for active power control of the grid.
Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the FESTIV model requires permission, obtained by contacting the group manager.
R1: Reliability > Quality
The model may not account for all real-time system dynamics and complexities, requiring verification from operational data. Scenario-based forecasting may not capture real-world uncertainties, and operating reserve values may be inaccurate without practical validation.
S3: Sufficiency > Granularity
FESTIV operates on hourly unit commitment time resolution, which cannot capture reliability impacts occurring on sub-hourly scales including frequency response, voltage magnitudes, and reactive power flows that affect system stability.
NREL solar power data for integration studies
Details (click to expand)
The NREL Solar Power Data for Integration Studies provides one year (2006) of 5-minute solar power data and hourly day-ahead forecasts for 6,000 simulated PV plants across the United States. The dataset was created using sub-hour irradiance algorithms and Numerical Weather Prediction simulations, covering both utility-scale (with single-axis tracking) and distributed-scale (fixed-tilt) PV systems.
While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
The dataset uses simulated outputs based on weather predictions rather than actual PV measurements, which may introduce systematic biases. Site-specific projects require additional validation with real measurements from solar power inverters. Developers can improve model accuracy by supplementing with local measurements and adapting simulation parameters to better represent specific regions.
S2: Sufficiency > Coverage
The dataset is limited to US locations based on 2006 solar conditions and is not representative of other geographic regions or more recent climate patterns. Expanding data collection to include diverse global regions and updating with more recent measurements would improve model transferability.
S4: Sufficiency > Timeliness
The dataset only covers 2006, which may not capture recent climate trends or technology improvements in PV systems. Updated datasets with more recent time periods would better represent current conditions and improve forecasting accuracy.
Natural hazards forecasts
Details (click to expand)
Natural hazard data used for risk assessments can usually be modeled with characteristics derived from, and statistically consistent with, the observational record. Some hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse, as well as from the Risk Data Library of the World Bank.
The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Climate hazard data (e.g., floods, tropical cyclones, droughts) is often too coarse for effective physical risk assessments, which focus on evaluating damage to infrastructure such as buildings and power grids. While exposure data, including information on buildings and power grids, is available at resolutions ranging from 25 meters to 250 meters, climate hazard projections, especially those extending beyond a year, are typically at resolutions of 25 kilometers or more.
To provide meaningful risk assessments, more granular data is required. This necessitates downscaling efforts, both dynamical and statistical, to refine the resolution of climate hazard data. Machine learning (ML) can play a valuable role in these downscaling processes. Additionally, the downscaled data should be made publicly available, and a dedicated portal should be established to facilitate access and sharing of this refined information.
R1: Reliability > Quality
Projecting future climate hazards is crucial for assessing long-term risks. Climate simulations from CMIP models are currently our primary source for future climate projections. However, these simulations come with significant uncertainties, stemming both from the models themselves and from the emission scenarios. To improve their utility for disaster risk assessment and other applications, increased funding and effort are needed to advance climate model development for greater accuracy. Additionally, machine learning methods can help mitigate some of these uncertainties by bias-correcting the simulations.
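One widely used statistical baseline for such bias correction is empirical quantile mapping, sketched below on synthetic temperatures: each simulated value is replaced by the observed value at the same quantile, which removes the bulk of the distributional bias. ML-based methods generalize this idea:

```python
import numpy as np

# Sketch: empirical quantile mapping for bias correction. All data
# here is synthetic; real applications map model output onto
# station or reanalysis observations over a calibration period.
rng = np.random.default_rng(5)
obs = rng.normal(15.0, 4.0, 5000)   # "observed" temperatures, degrees C
sim = rng.normal(17.5, 5.0, 5000)   # biased model output

def quantile_map(x, sim_ref, obs_ref, n_q=101):
    """Map values x from the model distribution onto the observed one."""
    q = np.linspace(0, 1, n_q)
    sim_q = np.quantile(sim_ref, q)
    obs_q = np.quantile(obs_ref, q)
    return np.interp(x, sim_q, obs_q)   # piecewise-linear transfer function

corrected = quantile_map(sim, sim, obs)
print("bias before:", round(sim.mean() - obs.mean(), 2))
print("bias after: ", round(corrected.mean() - obs.mean(), 2))
```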
S6: Sufficiency > Missing Components
Seasonal climate hazard forecasts are crucial for disaster risk assessment, management, and preparation. However, high-resolution data at this scale is often lacking for many hazards. This challenge is likely due to the difficulty in generating accurate seasonal weather forecasts. ML has the potential to address this gap by improving forecast accuracy and granularity.
Ocean observations from floating infrastructure (FINO3)
Details (click to expand)
FINO3 is an offshore research platform whose wind mast provides wind speed and wind direction datasets, along with time series of temperature, air pressure, relative humidity, global radiation, and precipitation. Images taken from the platform's perspective provide a direct snapshot of environmental conditions.
The platform is located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Wind measurements are taken between 32 and 102 meters above sea level, with wind speed measured every 10 meters. Data has been collected from August 2009 to the present day.
Due to its location, the FINO platform's sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather conditions amenable to human intervention. The resulting degradation in data quality can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires signing up for a login account at https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to measurement sensor failures. Issues with data loggers, power supplies, and adverse conditions such as low aerosol concentrations can influence data quality. High wind and wave conditions hamper correcting or recalibrating sensors, creating data gaps that can last for several weeks or a season.
S2: Sufficiency > Coverage
Coverage is limited to the platform itself and the wind farm it is built in proximity to. For locations with different offshore characteristics, similar testbed platforms or buoys could be developed.
S5: Sufficiency > Proxy
Because the sensors are exposed to ocean conditions and storms, FINO sensors often need maintenance and repair but are difficult to physically access. The resulting gaps in the data can be addressed by utilizing mesoscale wind modeling output.
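A minimal sketch of filling such gaps with mesoscale model output, on synthetic data: the mast-versus-model offset is estimated where the two series overlap, and the bias-adjusted model series fills the outage. Real workflows would use more careful bias adjustment than a single mean offset:

```python
import numpy as np

# Sketch: fill a sensor outage in a mast wind-speed series with
# bias-adjusted mesoscale model output. All data here is synthetic.
rng = np.random.default_rng(6)
n = 500
model = 8 + 2 * np.sin(np.linspace(0, 20, n))      # mesoscale proxy, m/s
obs = model + 0.8 + 0.3 * rng.standard_normal(n)   # mast observations, m/s
obs[200:320] = np.nan                              # simulated sensor outage

valid = ~np.isnan(obs)
offset = np.mean(obs[valid] - model[valid])        # mast-vs-model bias
filled = np.where(valid, obs, model + offset)      # gap-filled series

print("gap samples filled:", int((~valid).sum()))
print("estimated offset:", round(offset, 2))
```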
Offshore wind data from masts and LiDAR
Details (click to expand)
Offshore wind data from mast measurements and LiDAR can be found from several providers.
LiDAR-based wind mapping has advantages over traditional wind mast tower measurements, namely higher resolution, larger coverage, and improved data quality. This is because LiDAR can measure wind speeds at various heights above the ground, reducing the impact of turbulence on measurements that would typically affect mast readings. Furthermore, LiDAR-based wind mapping can provide near-real-time wind data suitable for control optimization and load forecasting applications.
The spatiotemporal coverage of the offshore wind speed mast data is restricted to the dimensions of the platform/tower itself as well as its time of construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to data must be requested, with different data providers having varying levels of restriction. For data obtained from Orsted, access is only provided by signing a standard non-disclosure agreement. For more information, email R&D at datasharing@orsted.com.
S2: Sufficiency > Coverage
Spatiotemporal coverage of the dataset varies depending on the construction and location of the platform testbed, but overall data is available from 2014 to the present. While measurements from LiDAR have higher resolution than wind mast data, sensor information is still restricted to the dimensions of the platform and the associated offshore wind farm when present. Data provided by Orsted from LiDAR sensors includes 10-minute statistics.
Offshore wind farm operation data (Orsted)
Details (click to expand)
The offshore operation data from the Danish energy company Orsted provides two years' worth of 10-minute Supervisory Control and Data Acquisition (SCADA) information for nacelle wind speed, electrical power, rotor speed, yaw position, and pitch angle for turbines, with on-site wave buoy data and ground-based LiDAR from different offshore wind farm sites.
For one site, the Anholt Westermost Rough offshore wind farm, data is collected from 111 Siemens SWT-120-3.6 MW wind turbines arranged in a 20 km by 8 km layout, with internal spacing between turbines of 5-7 rotor diameters and a water depth of 15-19 m. Another site, northeast of Withernsea off the Holderness coast in the North Sea, England, has a wind farm with a 35 km by 35 km spatial coverage area.
Data can be accessed by requesting access via the Orsted form. Sufficiency of the dataset is constrained by volume, as only a finite number of such offshore wind farm datasets exist; expanding the coverage area, volume, and time granularity of the data to under 10 minutes may enable transient detection from generated active power.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access requests are needed via a form from Orsted.
S1: Sufficiency > Insufficient Volume
Data from multiple wind farms over a variety of regions would be required to get a more accurate comparison against simulated weather data.
S2: Sufficiency > Coverage
The coverage is over parts of Europe; offshore wind conditions vary depending on the environment and cannot scale or transfer to other temperate regions of the world.
S3: Sufficiency > Granularity
The time granularity of 10 min is too coarse to capture transients in active power generated.
S4: Sufficiency > Timeliness
Only two years' worth of data (2016–2018) is provided. Additional data collection from offshore wind farms or simulations is needed.
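To see why 10-minute statistics can hide transients, consider a toy 1 Hz active-power trace with a short spike (all numbers below are made up, not Orsted data):

```python
import numpy as np

# One hour of 1 Hz active power around 3 MW, with a 30 s transient spike.
rng = np.random.default_rng(1)
power = 3.0 + 0.05 * rng.standard_normal(3600)
power[1000:1030] += 1.5  # 30 s transient of +1.5 MW

raw_peak = power.max()
# Averaging to 10-minute resolution: six blocks of 600 samples each.
ten_min_means = power.reshape(6, 600).mean(axis=1)
print(raw_peak, ten_min_means.max())
```

The 30-second excursion dominates the raw trace but contributes only 30/600 of one averaging window, so it almost vanishes at 10-minute resolution.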
OpenStreetMap (land use map)
Details (click to expand)
OpenStreetMap is an open-source map database providing worldwide geographic features such as buildings, roads, and land uses, maintained by a community of mappers who add objects manually or trace them from remote sensing imagery.
The quality of OpenStreetMap is highly variable in terms of coverage of geometries (e.g., buildings) and attributes. Roads are generally better mapped than buildings. OpenStreetMap's very permissive data model enables users to provide a variety of information, but it is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
U4: Usability > Documentation
The origin of attributes is often unknown, creating uncertainty about values.
U5: Usability > Pre-processing
The flexible data model lacks type enforcement, requiring additional processing for analysis.
R1: Reliability > Quality
Data quality ranges from excellent (sometimes surpassing official sources) to very low (including mapping vandalism).
S2: Sufficiency > Coverage
Street coverage is generally good, while building coverage varies widely.
S4: Sufficiency > Timeliness
Update frequency varies from multiple times per year to decades-old data, with disaster areas often updated quickly by active communities.
S6: Sufficiency > Missing Components
Most attributes remain incomplete, with completeness levels below 10%.
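Attribute completeness like the figure above can be measured directly from feature tags. The snippet below uses a handful of hand-written building features for illustration; a real analysis would stream an .osm.pbf extract with a parser such as pyosmium.

```python
# Hand-written OSM-style building features (tag dictionaries), illustrative only.
buildings = [
    {"building": "yes"},
    {"building": "house", "building:levels": "2"},
    {"building": "yes"},
    {"building": "apartments", "building:levels": "5", "height": "15"},
    {"building": "yes"},
]

def completeness(features, key):
    """Share of features that carry a value for the given tag."""
    return sum(key in f for f in features) / len(features)

print(completeness(buildings, "building:levels"))  # 0.4
print(completeness(buildings, "height"))           # 0.2
```

Running such a check per region makes the coverage and completeness gaps quantifiable rather than anecdotal.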
Optimal power flow simulation outputs
Details (click to expand)
PowerWorld Simulator and MATPOWER are software tools used for optimizing power systems and include representation of both alternating current (AC) and direct current (DC) systems. PowerWorld Simulator models, analyzes, and optimizes power systems for a wide range of configurations and scenarios with the ability to model small distribution networks as well as transmission systems.
MATPOWER is an open-source alternative that also solves both the AC and DC versions of optimal power flow (OPF). Its DC OPF is simplified into a quadratic program using DC modeling assumptions, reducing polynomial costs to second order and expressing real power flows as a function of voltage angles (thereby eliminating voltage magnitudes and reactive power). PowerWorld Simulator uses iterative algorithms (e.g., Newton-Raphson) with traditional power flow equations.
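The DC assumption described above, real line flow as a function of voltage angle differences only, can be illustrated on a hypothetical 3-bus network (all values are made up; this is a plain DC power flow solve, not a full OPF):

```python
import numpy as np

# Lines (from-bus, to-bus, reactance x in p.u.); bus 0 is the slack.
lines = [(0, 1, 0.1), (0, 2, 0.2), (1, 2, 0.25)]
n = 3
P = np.array([0.0, -0.6, -0.4])  # net injections (p.u.): two load buses

# Build the susceptance matrix B' (DC approximation: b = 1/x).
B = np.zeros((n, n))
for i, j, x in lines:
    b = 1.0 / x
    B[i, i] += b
    B[j, j] += b
    B[i, j] -= b
    B[j, i] -= b

# Fix the slack-bus angle to zero and solve B'[1:,1:] theta = P[1:].
theta = np.zeros(n)
theta[1:] = np.linalg.solve(B[1:, 1:], P[1:])

# Real power flow on each line: f_ij = (theta_i - theta_j) / x_ij.
flows = {(i, j): (theta[i] - theta[j]) / x for i, j, x in lines}
print(flows)
```

Note how voltage magnitudes and reactive power never appear: the linear system in angles is what makes DC OPF a tractable quadratic program.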
MATPOWER is open source, and PowerWorld Simulator has several options for industry practitioners as well as academic users. Demo software licensed for educational use includes simulator features such as available transfer capability, optimal power flow, security-constrained OPF, OPF reserves, the PV/QV curve tool, transient stability, and geomagnetically induced current. In terms of topology, the free version supports up to 13 buses while the full version of the simulator can handle 250,000 buses.
Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
In MATPOWER and PowerWorld, outside data may be required to simulate conditions over a specific region with a given amount of DERs, generating sources, bus topology, and line limits. This requires collating pre-existing synthetic grid data with additional data to model specific scenarios.
U3: Usability > Usage Rights
Depending on whether proprietary simulators are pursued (e.g., PowerWorld), there may be licensing costs for use of certain features.
R1: Reliability > Quality
Traditional OPF simulation software simplifies the power system and makes assumptions about the system behavior such as perfect power factor correction or constant system parameters. Simulation results may need to be verified with real-world results.
S3: Sufficiency > Granularity
In PowerWorld, bus topologies available may be simplified representations of actual grids to simplify the modeling and simulation techniques to represent overall system behavior. MATPOWER requires the user to define the bus matrix. As the number of buses in a power system increases the computational complexity of OPF increases, requiring more resources and time to solve. Additional parameters such as line limits, number of generating sources, number of DERs, and load demand also increase the complexity of the model as more constraints and assets are introduced.
Outputs from distribution connected inverter systems simulations
Details (click to expand)
There is a need to enhance existing simulation tools to study inverter-based rather than traditional machine-based power systems. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing.
NREL’s PREconfiguring and Controlling Inverter SEt-points (PRECISE) tool can identify the interconnection location on the network based on a PV customer’s address, model the distribution feeder, and preconfigure advanced inverter modes to provide grid support and minimize energy curtailment. The tool allows utilities to perform power flow analysis and analyze inverter modes.
Furthermore, NREL’s Energy Systems Integration Facility (ESIF) has real-time simulation connected with power hardware that allows for smart inverter manufacturers to test operational control with simulated dynamics and scenarios.
There is a need to enhance existing simulation tools to study inverter-based rather than traditional machine-based power systems. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulations and hardware-in-the-loop facilities requires submitting a user access proposal to NREL’s Energy Systems Integration Facility. Similar testing laboratories may require access requests and funding.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Contact NREL at precise@nrel.gov for access to the PRECISE model.
Submit an Energy Systems Integration Facility (ESIF) laboratory request form to userprogram.esif@nrel.gov to gain access to hardware-in-the-loop inverter simulation systems. Access to particular hardware may require collaboration with inverter manufacturers, which may have additional permission requirements.
R1: Reliability > Quality
The optimization routine of the simulation model may face challenges in determining the precise balance between grid operation criteria and impacts on customer PV generation. Generation may still require curtailment by the utility to prioritize grid stability. To address this gap, external data on distribution-side operating conditions, load demand, solar generation, and utility-initiated generation curtailment can be collected and introduced into expanded simulation studies.
Passive acoustic monitoring for biodiversity assessment
Details (click to expand)
Passive acoustic recording provides continuous monitoring of both the environment and species vocalizations. While some annotated datasets are available through repositories like ARBIMON (https://arbimon.org/), Macaulay Library (www.macaulaylibrary.org), and Xeno-canto (www.xeno-canto.org), there remains a general lack of robust, large, and diverse annotated bioacoustic datasets for machine learning applications.
There is a lack of standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready data and analyze it consistently across projects.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
The first and foremost challenge of bioacoustic data is its sheer volume, which makes its data sharing especially difficult due to limited storage options and high costs. Urgent solutions are needed for cheaper and more reliable data hosting and sharing platforms.
Additionally, there’s a significant shortage of large and diverse annotated datasets, far more severe than for image data such as camera trap, drone, and crowd-sourced images.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge for almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets encompassing a wide array of species remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges in bioacoustic data lies in its sheer volume, stemming from continuous monitoring processes. Researchers face significant hurdles in sharing and hosting this data, as existing online platforms often don’t provide sufficient long-term storage capacity or are very expensive. Solutions are needed to provide cheaper and more reliable hosting options. Moreover, accessing these extensive datasets demands advanced computing infrastructure and solutions. The availability of more funding sources may push more people to start sharing their bioacoustic data.
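The volume problem is easy to quantify with back-of-envelope arithmetic; the recording parameters below (48 kHz mono, 16-bit uncompressed PCM) are illustrative assumptions, not a statement about any particular recorder.

```python
# Storage required for one continuously recording acoustic sensor.
sample_rate = 48_000        # samples per second (assumed)
bytes_per_sample = 2        # 16-bit PCM, mono (assumed)
seconds_per_day = 86_400

gb_per_day = sample_rate * bytes_per_sample * seconds_per_day / 1e9
tb_per_recorder_year = gb_per_day * 365 / 1000
print(round(gb_per_day, 2), round(tb_per_recorder_year, 1))  # 8.29 3.0
```

At roughly 3 TB per recorder per year before compression, a modest network of stations quickly outgrows typical research data-sharing platforms, which is the hosting problem described above.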
Pecan Street (appliance-level consumption data)
Details (click to expand)
Pecan Street DataPort began as a Smart Grid Demonstration program through the Pecan Street energy research nonprofit organization, which worked closely with the University of Texas at Austin. Funded by the DOE in 2014, the project signed up 1,000 research participants from the Mueller community in Austin, Texas to share Green Button, smart meter, and home energy management system (HEMS) data in 750 homes and 25 commercial properties. Financial incentivization of plug-in electric vehicle use and rooftop solar installation by Austin Energy encouraged residential lifestyle shifts. In addition to providing access to sub-metered appliance-level consumption data, Pecan Street includes electric vehicle charging, rooftop solar, heating, cooling, and water usage data. Data coverage has expanded to volunteer households in California, New York, and Colorado. Previously open for use, Pecan Street has since been privatized, and data access and products are now available for commercial and academic purchase depending on the level of access requested.
Pecan Street DataPort requires non-academic and academic users to purchase access via licensing, which varies depending on the building data features requested. The coverage area of the data is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may be in need of energy-efficient upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual socio-demographic survey data may be too coarse to provide insight into the behavioral effects of household members on consumption profiles over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Usage rights vary depending on the agreed upon licensing agreement.
S6: Sufficiency > Missing Components
The data does not track real-time occupancy of individuals in the household, which could provide insight into behavioral effects on energy consumption. Adding this data could enable improved consumption-based customer segmentation models, as patterns change with respect to time and day of the week. The data would also be amenable to consumer-in-the-loop energy management studies with respect to comfort, based on customers' habitual activity, location in the house, and number of occupants.
S3: Sufficiency > Granularity
Disaggregated data may provide greater granular context for customer segmentation studies than aggregate data alone. However, such segmentation studies ultimately depend on the number of household members using appliances at a given time. Pecan Street data contains annual survey responses on household demographics and home features, which may be too coarse to track how customer segments change over time as members move in or out of a building. Jointly collecting occupancy data could address the granularity gap but could limit volunteer engagement, as privacy concerns would need to be evaluated.
S2: Sufficiency > Coverage
Data coverage primarily focuses on Texas, with limited coverage in New York and California. Though there are efforts to include Puerto Rico, data coverage hinges on volunteer participation. This could introduce self-selection bias, as participating households are likely more interested in energy conservation than the general population. Furthermore, a majority of the dataset covers the Mueller community in Austin, a planned community developed after 1999 with modern building types. Enrolling homes from older built environments and from different temperate regions, within the United States and globally, could provide greater insight into household appliance usage and generation patterns, which vary with region and appliance age. Identifying high-consumption older appliances can assist in identifying upgrades.
O2: Obtainability > Accessibility
Data is downloadable as a static file or accessible via the DataPort API. Based on the licensing agreement, a small dataset is available for free for academic individuals with pricing for larger datasets. Commercial use requires paid access based on requested features ranging from the standard to unlimited customer tier and plan.
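As a sketch of the consumption-based segmentation discussed above, the following clusters synthetic 24-hour load profiles into morning-peaked and evening-peaked groups with a minimal k-means loop; the profiles are generated for illustration, not Pecan Street data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 24-hour load profiles (arbitrary units) for 20 homes:
# ten morning-peaked and ten evening-peaked, plus small noise.
hours = np.arange(24)
morning = np.exp(-((hours - 7) ** 2) / 8.0)
evening = np.exp(-((hours - 19) ** 2) / 8.0)
profiles = np.vstack(
    [morning + 0.05 * rng.standard_normal(24) for _ in range(10)]
    + [evening + 0.05 * rng.standard_normal(24) for _ in range(10)]
)

# Minimal 2-means clustering, seeding one centroid from each group.
centroids = profiles[[0, 10]].copy()
for _ in range(10):
    dists = np.linalg.norm(profiles[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([profiles[labels == k].mean(axis=0) for k in (0, 1)])

print(labels)
```

Real studies would cluster many months of smart-meter data and typically use a library implementation, but the mechanic is the same: households group by the shape of their daily profile.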
Population and asset exposure to natural hazards
Details (click to expand)
Exposure is defined as the representative value of populations and assets potentially exposed to a natural hazard occurrence, such as population, physical assets (e.g., buildings), economic output (e.g., measured by GDP), or agricultural output, depending on the risk in question.
There are open datasets with global coverage, for example, the Global Exposure Model, as well as proprietary data with more detailed information from well-established insurance companies.
Accessibility and reliability are the most significant challenges with exposure data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Country-specific exposure data varies widely in availability, with some existing only as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
Population datasets show significant discrepancies, requiring validation before confident use. Some geospatial socioeconomic data from sources like UNEP are outdated or incomplete.
S3: Sufficiency > Granularity
Open global data, such as World Bank or US CIA GDP data, often lacks sufficient resolution and completeness for hazard risk assessment.
Power Grid Lib (optimal power flow benchmark library)
Details (click to expand)
The Power Grid Library (PGLib-OPF) is a collection of git repositories that house benchmark data for validating power system simulations.
It contains 36 networks with 3 to 13,659 buses sourced from the IEEE Power Flow Test Cases, IEEE Dynamic Test Cases, IEEE Reliability Test System, Polish Test Cases, PEGASE Test Cases, and RTE Test Cases, which have been modified to raise optimality gaps to values between 1% and 10%, thereby creating more challenging AC-OPF test cases.
By curating and collecting this data, users who want to study more realistic AC-OPF simulation scenarios can directly retrieve compiled bus IDs, branch IDs, generator IDs, power demand, shunt admittance, voltage magnitude ranges for buses, power injection ranges for generators, quadratic active power cost function coefficients for generators, and branch parameters such as series admittance, line charging, transformer parameters, thermal limits, and branch voltage angle difference ranges. All parameters are conveniently standardized to the MATPOWER data file format for direct use. PGLib-OPF is open source.
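Since PGLib-OPF files follow the MATPOWER case format, fields can be pulled out with ordinary text parsing. The case fragment below is a tiny hand-written example in that format, not a real PGLib network.

```python
import re

# Minimal MATPOWER-style case fragment: a two-row bus matrix
# (columns: BUS_I, TYPE, PD, QD, GS, BS, AREA, VM, VA, BASE_KV, ZONE, VMAX, VMIN).
case_text = """
mpc.bus = [
    1  3  0.0   0.0  0 0 1 1.0 0.0 230 1 1.1 0.9;
    2  2  21.7  12.7 0 0 1 1.0 0.0 230 1 1.1 0.9;
];
"""

# Extract the bus matrix block and parse each row into floats.
match = re.search(r"mpc\.bus\s*=\s*\[(.*?)\];", case_text, re.S)
rows = [
    [float(tok) for tok in line.strip().rstrip(";").split()]
    for line in match.group(1).strip().splitlines()
]
bus_ids = [int(r[0]) for r in rows]
print(bus_ids)  # [1, 2]
```

In practice, tools such as MATPOWER itself or converters in other power system libraries load these files directly, but a text-level parse like this is enough to feed bus and branch tables into a custom ML pipeline.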
While the network datasets are open source, maintenance of the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though without cooperative effort such data may be hard to find.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Industry engagement can assist in developing detailed and realistic networked datasets and operating conditions, limits, and constraints.
U2: Usability > Aggregation
Repository maintenance requires continuous curation of more complex networked benchmark data for more realistic AC-OPF simulation studies.
Power line robot inspection imagery
Details (click to expand)
Cable inspection robot data includes LiDAR and image captures of Specific Power Line (SPL) components such as dampers, insulators, broken strands, and attachments that may have degraded due to exposure to natural elements. The data also focuses on assessing risk at the lowest part of power lines near trees, roofs, and other crossing power lines. Since the robots physically traverse the lines, this data is particularly valuable for degradation detection of high voltage transmission lines and for maintenance scheduling.
Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration, and may be limited by camera resolution for detecting subtle degradation patterns.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data needs to be aggregated from multiple cable inspection robots for improved generalizability of detection models. Multiple robot trips over areas of interest can help identify target locations needing further inspection.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks.
S2: Sufficiency > Coverage
Data must be supplemented with position orientation system information for accurate robot localization, potentially requiring preliminary inspections followed by detailed autonomous inspection of targets.
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized. Data from multiple multispectral imagers, drones, cable-mounted sensors, and additional robots may be employed to improve the level of detail needed for specific obstructions.
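The thresholding step mentioned under the pre-processing gap above can be sketched in a few lines. This is a generic illustration (Otsu's method producing a binary mask from a grayscale inspection image), not the pipeline any particular utility or robot vendor uses:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the gray level that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    cum_count = np.cumsum(hist)
    cum_sum = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum_count[t - 1]          # pixels below the candidate threshold
        w1 = total - w0                # pixels at or above it
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_sum[t - 1] / w0
        mu1 = (cum_sum[255] - cum_sum[t - 1]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def segment(gray: np.ndarray) -> np.ndarray:
    """Binary mask separating bright component pixels from darker background."""
    return gray >= otsu_threshold(gray)
```

In a real pipeline this global threshold would be only a first pass before a learned segmentation model; it illustrates why significant preprocessing is needed before masks are usable as training labels.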
Regularly gridded high-resolution atmospheric observations
Details (click to expand)
Though a great deal of data is available, a set of regularly gridded 3D high-resolution observations of the atmospheric state (like a higher-resolution version of ERA5) is still needed. Such a dataset is essential both for an improved understanding of atmospheric processes and for the development of ML-based weather forecast models and climate models.
An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasts and climate modeling because of its high resolution and ready-for-analysis characteristics. However, large volumes of observations from radiosondes, balloons, and weather stations are largely underutilized. Creating a well-structured dataset like ERA5 but with more observational data would be valuable.
While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasting and climate modeling because of its high resolution and ready-for-analysis characteristics. However, large volumes of observations, e.g., data from radiosondes, balloons, and weather stations, are largely underutilized. Creating a dataset that is as well-structured as ERA5 but built from more observations would be valuable.
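As a toy sketch of the gridding such a dataset requires, the snippet below interpolates scattered station observations onto a regular latitude/longitude grid with naive inverse-distance weighting. This stands in for the far more sophisticated variational data assimilation a real reanalysis uses; the function and variable names are illustrative only:

```python
import numpy as np

def idw_regrid(obs_lat, obs_lon, obs_val, grid_lat, grid_lon, power=2.0):
    """Inverse-distance-weighted interpolation of scattered station
    observations onto a regular lat/lon grid."""
    obs_lat, obs_lon, obs_val = map(np.asarray, (obs_lat, obs_lon, obs_val))
    glat, glon = np.meshgrid(grid_lat, grid_lon, indexing="ij")
    out = np.empty(glat.shape)
    for i in np.ndindex(out.shape):
        d2 = (obs_lat - glat[i]) ** 2 + (obs_lon - glon[i]) ** 2
        if np.any(d2 == 0):            # grid point coincides with a station
            out[i] = obs_val[np.argmin(d2)]
        else:
            w = d2 ** (-power / 2)     # weight ~ 1 / distance**power
            out[i] = np.sum(w * obs_val) / np.sum(w)
    return out
```

A production reanalysis additionally propagates a physical model between observation times and weights observations by instrument error, which is precisely why building an enhanced ERA5-like product is a substantial undertaking rather than a simple interpolation exercise.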
Residential daylight performance metric (DPM) data
Details (click to expand)
The amount of daylight that buildings receive through windows is an important parameter for heating demand (via heat gains from solar radiation) and electricity demand for lighting (via the illumination of indoor spaces by natural light). Architects can optimize daylight access by adjusting window placement and window-to-wall ratios.
Daylight performance metrics (DPMs) have been developed by building researchers and architects based on daylight access simulation output to quantify the illumination of indoor spaces by natural light.
Residential daylight performance metric (DPM) data with respect to daylight autonomy (DA), continuous daylight autonomy (cDA), spatial daylight autonomy (sDA), and useful daylight illuminance (UDI) can be generated using physics-based ray-tracing simulations that calculate illuminances over a prototype building layout. Simulation software available to calculate DPMs includes the IES Virtual Environment (IESVE), DesignBuilder, the VELUX Daylight Visualizer, and the open-source RADIANCE 5.0. To generate synthetic data from these simulation frameworks, the user must provide a geometric model of the building, climate data for the building location, reflectance and transmittance values for materials, desired radiance parameters, an occupancy schedule, and a virtual sensor grid over which the incident illuminance is to be calculated. Strategies based on the simulation output can assist architects in optimizing window placement and size, incorporating shading devices, and designing floor plans to control direct and diffuse natural light in the building.
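As a minimal illustration of how two of these metrics are derived from simulation output, the sketch below computes DA and cDA from an hourly illuminance series at a single sensor point. The 300 lux threshold is a common commercial-oriented choice and, as the coverage gaps below note, may not suit residential spaces:

```python
import numpy as np

LUX_THRESHOLD = 300.0  # common office target; residential thresholds may differ

def daylight_autonomy(lux: np.ndarray, occupied: np.ndarray) -> float:
    """DA: fraction of occupied hours meeting the illuminance threshold."""
    occ = lux[occupied]
    return float(np.mean(occ >= LUX_THRESHOLD))

def continuous_daylight_autonomy(lux: np.ndarray, occupied: np.ndarray) -> float:
    """cDA: like DA, but hours below the threshold earn partial credit
    proportional to how close they come to it."""
    occ = lux[occupied]
    return float(np.mean(np.clip(occ / LUX_THRESHOLD, 0.0, 1.0)))
```

sDA then aggregates point-wise DA over the sensor grid (e.g., the fraction of sensors with DA above 50%), which is why the simulation must output illuminance at every virtual sensor, not just a room average.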
While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings do not receive a similar focus, which is surprising given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics associated with direct sunlight access, rely on annual averages rather than seasonal values, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyle. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in traditional urban residential spaces and its effects on thermal comfort and operational consumption; vernacular architecture, which is specific to a local region and culture, may not share these objectives, favoring indoor-outdoor transitional spaces, earthen materials, and less emphasis on windows and incident natural sunlight.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Simulation software is typically available for purchase, with cost depending on the package selected, the intended use, and the number of features requested.
S2: Sufficiency > Coverage
Vernacular architecture, characterized by traditional building styles and techniques specific to a local region or culture, is not covered by simulation tools. In fact, most simulation outputs focus on residential areas in primarily urban regions, aiming to minimize future operational costs under assumed illuminance thresholds that may not be universal. By adding the ability to evaluate passive design strategies adapted to a specific climate, and by expanding the available materials to include high-thermal-inertia walls and roofs such as those of earthen or thatched construction, additional thermal comfort studies could be performed for a given incident illuminance. Considering cultural attitudes toward outdoor spaces in relation to indoor spaces would provide even greater context for simulation studies and their usefulness in new construction across diverse regions.
S3: Sufficiency > Granularity
Simulations use fixed occupancy schedules, which work well in the context of commercial buildings but are overly prescriptive for residential buildings, where occupancy may vary with the number of occupants, time of day, day of week, and season. Residential buildings are multipurpose, and occupants may spend more time in some areas than others depending on activity. This gap can be alleviated by adapting and expanding simulation inputs to take diverse occupancy scenarios into consideration.
Current DPMs rely on annual averages rather than granular information about seasonal variations in daylight availability. While some advances have been made to incorporate this information through tools like Daysim, which defines new DPMs for residential buildings, further work is needed for regions where occupants may want to minimize direct light access and rely more on diffuse lighting. Expanding studies for clients in warmer, more arid climates may yield different thresholds and comfort parameters depending on preferences and lifestyle, and may also need to account for daylight oversupply, glare, and thermal discomfort.
Materials used in the construction process of the building may change after initial simulation development depending on availability. Finalized building materials and interior absorption and reflectance may diverge from those simulated. Use of dynamic shading devices could also decrease indoor temperature due to incident irradiance. Simulated results could be provided over a range.
More data is needed to take advantage of large ML models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Currently available data is not sufficient for training large ML models. More data is needed.
SMA Solar Technology PV System Performance database
Details (click to expand)
PV Anlage-Reinhart System provides hourly photovoltaic power, energy production, CO2 emissions avoided, and system configuration information for publicly available PV installations worldwide. SMA, a leading German manufacturer of solar inverters, has compiled data from their international deployments across multiple countries including Germany, the US, Chile, Brazil, Mexico, Canada, Spain, Italy, France, China, Australia, Belgium, India, Poland, Japan, UK, South Africa, Türkiye, and the UAE. This dataset, which includes inverter specifications, module information, and sometimes battery data, supports microgrid studies and distributed energy resource forecasting.
The SMA PV monitoring system requires user profile creation and specific system access requests, with documentation primarily in German creating potential language barriers. Despite SMA's global presence, data representation is geographically unbalanced, with stronger coverage in Germany, the Netherlands, and Australia. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
Documentation is primarily in German and lacks the same detail in the English version of the website. Companion research utilizing the data is not readily cited or linked. Language barriers can challenge the interpretation of displayed data values when accessed through the portal interface.
S2: Sufficiency > Coverage
Coverage varies significantly by country, with representation ranging from single systems to over 43,000 systems per country. Systems in Germany, the Netherlands, and Australia are more comprehensively represented than other regions. Additionally, battery storage information is inconsistently available across monitored systems. This gap could be addressed by increasing private user-contributed system data from diverse regions.
O2: Obtainability > Accessibility
Users must utilize the web interface or create a user profile to request access to additional data or preferred formats. Data cannot be freely downloaded in bulk or raw format and must be scraped from the web portal. Contact with SMA is required for membership or extended usage rights.
SOLETE Hybrid Solar-Wind Generation dataset
Details (click to expand)
SOLETE, developed by the Energy System Integration Lab (SYSLAB) at the Technical University of Denmark, provides 15 months of measurements at multiple resolutions (seconds to hours) from June 2018 to September 2019. The dataset includes timestamps, meteorological data (temperature, humidity, pressure, wind speed and direction), solar irradiance measurements (global horizontal and plane of array), and active power generated by an 11 kW Gaia wind turbine and a 10 kW PV inverter. This comprehensive dataset supports time-series forecasting for hybrid solar-wind distributed energy resource systems.
While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several sufficiency gaps limit its application. The dataset's 15-month temporal coverage does not capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analyses of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing, which would enhance realistic modeling of system dynamics. Supplementing with external data sources or simulation could address these limitations.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
SOLETE lacks maintenance schedule data and system downtime information. Retroactively supplementing this data through simulation or SYSLAB records would improve system forecasting to account for scheduled maintenance uncertainties.
S3: Sufficiency > Granularity
Varying resolution and sampling rates (seconds to hours) can impact analysis precision, particularly when fusing data of different temporal resolutions. Aggregating second-level data to hourly intervals may affect joint short-term solar and wind forecasting outcomes.
S2: Sufficiency > Coverage
The 15-month temporal coverage is insufficient to capture long-term seasonal variations in joint wind and irradiance patterns.
S1: Sufficiency > Insufficient Volume
The dataset covers only a single wind turbine and PV array, limiting insights into coordination between multiple generation sources. This gap could be addressed by physically expanding the network or combining SOLETE with external datasets from utility and energy technology companies to enable larger grid control studies.
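The granularity gap above — fusing second-level generation data with hourly meteorological channels — typically forces an aggregation choice. The sketch below uses a synthetic series (not actual SOLETE data) and assumes pandas ≥ 2.2 for the lowercase frequency aliases; keeping an intra-hour spread statistic alongside the mean preserves some of the short-term variability that plain averaging discards:

```python
import numpy as np
import pandas as pd

# Synthetic second-resolution PV power series standing in for a SOLETE channel
idx = pd.date_range("2018-06-01", periods=7200, freq="s")  # two hours of data
power_kw = pd.Series(np.random.default_rng(0).uniform(0, 10, len(idx)), index=idx)

# Mean-aggregate to hourly for fusion with hourly meteorological data;
# the per-hour standard deviation retains a summary of sub-hourly variability
hourly = power_kw.resample("1h").agg(["mean", "std"])
```

For joint short-term solar and wind forecasting, whether to aggregate up or interpolate the coarse channels down is a modeling decision that directly affects the outcomes noted in the granularity gap.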
SRRL TSI-880 (sky imagery)
Details (click to expand)
The SRRL TSI-880 contains data from ground-based sky imagers that provide high temporal and spatial resolution (<1 km) information at single locations to support cloud detection and solar forecasting.
Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and image resolution is limited to 352×288-pixel 24-bit JPEG images.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Images have limited resolution (352×288 pixels) and 10-minute capture intervals, which may be insufficient for very-short-term forecasting.
S2: Sufficiency > Coverage
Coverage is constrained by sensor network location and density. Expanded networks in diverse environments would improve coverage.
S2: Sufficiency > Coverage
The current dataset derives from sky imager datasets in Singapore, requiring similar networks in other regions or alternative data sources.
SWINySEG (sky imagery)
Details (click to expand)
SWINySEG (Singapore whole sky Nychthemeron image SEGmentation database) contains 6,768 daytime and nighttime sky/cloud images with corresponding binary ground truth maps taken in Singapore over 12 months in 2016, with annotations by the Singapore Meteorological Services.
The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
The dataset needs more manually annotated cloud mask labels and is imbalanced with fewer nighttime samples.
O2: Obtainability > Accessibility
The dataset is under a Creative Commons license that prohibits commercial use, and access must be requested.
Satellite imagery
Details (click to expand)
This category encompasses satellite imagery of various spatial and spectral resolutions with global coverage captured at different time intervals. Open-access options include Sentinel-1/2, MODIS, VIIRS, and Landsat (resolution down to 5m), while commercial providers like Maxar offer higher resolutions (down to 30cm). Planet NICFI provides free high-resolution mosaics of the world’s tropics for non-commercial use.
Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.
Data Gap Type
Data Gap Details
S4: Sufficiency > Timeliness
Both pre- and post-disaster imagery are needed, but pre-disaster imagery is sometimes outdated and does not reflect conditions immediately before the disaster.
S3: Sufficiency > Granularity
Accurate damage assessment requires high-resolution images, but the resolution of current publicly open datasets is inadequate for this purpose. Some commercial high-resolution images should be made available for research purposes at no cost.
Satellite images are used intensively for Earth system monitoring. The two biggest challenges in using satellite images are the sheer volume of data, which makes downloading, transferring, and processing difficult, and the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Publicly available datasets often lack sufficient granularity. This is particularly challenging for the Global South, which typically lacks the funding for high-resolution commercial satellite imagery.
U6: Usability > Large Volume
The sheer volume of data now poses one of the biggest challenges for satellite imagery. When data reaches the terabyte scale, downloading, transferring, and hosting become extremely difficult. Those who create these datasets often lack the storage capacity to share the data. This challenge can potentially be addressed by one or more of the following strategies:
Data compression: Compress the data while retaining lower-dimensional information.
Lightweight models: Build models with fewer features selected through feature extraction.
Large foundation models for remote sensing data: Purposefully construct large models (e.g., foundation models) that can handle vast amounts of data. This requires changes to the research architecture, such as modifications to the preprocessing pipeline.
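The compression strategy above can be illustrated with a spectral PCA sketch: project the band dimension of an image cube onto a few principal components, keeping a low-dimensional representation that can later be approximately reconstructed. This is a generic technique, not any specific mission's pipeline:

```python
import numpy as np

def compress_bands(cube: np.ndarray, k: int):
    """Project an (H, W, B) image cube onto its top-k spectral principal
    components, shrinking B bands to k channels per pixel."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(float)
    mean = flat.mean(axis=0)
    centered = flat - mean
    # right-singular vectors = principal directions in band space
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = vt[:k]                       # (k, B) component matrix
    scores = centered @ comps.T          # (H*W, k) compressed representation
    return scores.reshape(h, w, k), comps, mean

def decompress(scores: np.ndarray, comps: np.ndarray, mean: np.ndarray):
    """Approximate reconstruction of the original cube from the k channels."""
    h, w, k = scores.shape
    return (scores.reshape(-1, k) @ comps + mean).reshape(h, w, comps.shape[1])
```

Because neighboring spectral bands are highly correlated, a handful of components often retains most of the variance, which is what makes "compress while retaining lower-dimensional information" a viable answer to terabyte-scale archives.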
O2: Obtainability > Accessibility
Very high-resolution satellite images (e.g., finer than 10 meters) typically come from commercial satellites and are not publicly available. One exception is the NICFI dataset, which offers high-resolution, analysis-ready mosaics of the world’s tropics.
U5: Usability > Pre-processing
Satellite images often contain a lot of redundant information, such as large amounts of data over the ocean that do not always contain useful information. It is usually necessary to filter out some of this data during model training.
U2: Usability > Aggregation
Due to differences in orbits, instruments, and sensors, imagery from different satellites can vary in projection, temporal and spatial zones, and cloud blockage, each with its own pros and cons. To overcome data gaps (e.g. cloud blocking) or errors, multiple satellite images are often assimilated. Harmonizing these differences is challenging, and sometimes arbitrary decisions must be made.
U5: Usability > Pre-processing
The lack of annotated data presents another major challenge for satellite imagery. Collaboration and coordination at the sector level should be organized to facilitate annotation efforts across multiple sectors and use cases. Additionally, the granularity of annotations needs to be increased: for example, specifying crop types instead of just “crops”, and detailing flood damage levels rather than a general “damaged” label, is necessary for more precise analysis.
M: Misc/Other
Cloud cover presents a major technical challenge for satellite imagery, significantly reducing its usability. To obtain information beneath the clouds, pixels from clear-sky images captured by other satellites are often used. However, this method can introduce noise and errors.
M: Misc/Other
There is also a lack of technical capacity in the Global South to effectively utilize satellite imagery.
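The pre-processing gap above — filtering out redundant ocean imagery before training — can be sketched as a simple land-fraction filter over training tiles. Tile and mask shapes and the 10% cutoff are illustrative assumptions:

```python
import numpy as np

def drop_ocean_tiles(tiles: np.ndarray, land_masks: np.ndarray,
                     min_land_frac: float = 0.1):
    """Keep only tiles whose land-pixel fraction meets a minimum, so
    training compute is not spent on mostly open-ocean imagery.

    tiles:      (N, H, W, C) image tiles
    land_masks: (N, H, W) binary masks, 1 = land pixel
    """
    frac = land_masks.reshape(len(land_masks), -1).mean(axis=1)
    keep = frac >= min_land_frac
    return tiles[keep], keep
```

For ocean-focused applications the filter would of course be inverted; the point is that a cheap mask-based pass can shrink the training set substantially before any expensive processing runs.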
Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
The resolution of publicly open satellite images is not sufficient for some environment reconstruction studies.
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough. High-resolution images are usually commercial and not open for free.
Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data from different satellite sources (both geostationary and polar-orbiting) needs to be collated and harmonized when analyzing multiple regions of interest, creating challenges in data integration and standardization.
U5: Usability > Pre-processing
Multispectral remote sensing data requires preprocessing, including atmospheric correction and band combinations in the visible and infrared spectra, before it can be effectively used for solar forecasting models.
R1: Reliability > Quality
Different cloud types affect ground-level solar irradiance in varying ways that satellite imagery alone cannot fully capture, necessitating verification and supplementation with ground-based measurements for improved model accuracy.
S3: Sufficiency > Granularity
Spatial and temporal resolution varies significantly between satellite sources, limiting the ability to capture rapid changes in cloud cover that impact solar irradiance, particularly during partly cloudy conditions which create high variability in short timeframes.
Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough. High-resolution images are usually commercial and not open for free.
Satellite imagery – GEDI LiDAR
Details (click to expand)
The Global Ecosystem Dynamics Investigation (GEDI) is a NASA/University of Maryland mission that uses LiDAR to create detailed 3D maps of forest canopy height and structure. By measuring forests in 3D, GEDI data enables accurate estimation of forest biomass and carbon storage across global scales.
Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
GEDI data contains inherent uncertainties including geolocation errors and weak return signals in dense forests, which introduce errors into canopy height estimates and subsequent carbon calculations. Combining GEDI with other data sources like airborne LiDAR for validation and developing region-specific calibration methods could improve data reliability.
Satellite imagery – Hyperspectral
Details (click to expand)
This dataset consists of hyperspectral satellite imagery from platforms such as PRISMA and EnMAP, which capture hundreds of narrow spectral bands across the electromagnetic spectrum, providing detailed spectral information for detecting methane plumes with greater sensitivity than multispectral systems.
Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Images of methane plumes in hyperspectral satellite data are very rare, leading to insufficient data for developing and training robust detection algorithms. Consequently, researchers often use synthetic data, transposing high-resolution methane plume images from other sources such as Sentinel-2 onto hyperspectral images from platforms like PRISMA. Expanding the collection of actual hyperspectral methane plume observations or developing more sophisticated methods for generating realistic synthetic data would significantly improve detection capabilities.
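The synthetic-data workaround described above can be caricatured as follows: composite a plume concentration map onto a clean background cube via Beer-Lambert-style per-band attenuation. The band count and absorption spectrum here are placeholders, not PRISMA's or any instrument's actual spectral response:

```python
import numpy as np

def composite_plume(background: np.ndarray, plume: np.ndarray,
                    absorption: np.ndarray) -> np.ndarray:
    """Attenuate each band of a clean hyperspectral cube (H, W, B) under a
    plume column-enhancement map (H, W), using per-band absorption
    coefficients (B,): radiance drops where the plume sits, more strongly
    in bands where methane absorbs."""
    return background * np.exp(-plume[..., None] * absorption)
```

Real pipelines transpose plume shapes observed by other sensors (e.g., Sentinel-2) and use laboratory methane absorption spectra for the per-band coefficients, but the compositing step is essentially this multiplication.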
Satellite imagery – Multi-Radar/Multi-Sensor System
Details (click to expand)
The Multi-Radar Multi-Sensor (MRMS) system combines data from multiple radars, satellites, surface observations, lightning reports, rain gauges, and numerical weather prediction models to produce decision-support products every two minutes. It provides detailed depictions of high-impact weather events such as heavy rain, snow, hail, and tornadoes, enabling forecasters to issue more accurate and earlier warnings. See https://www.nssl.noaa.gov/projects/mrms/
Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Much radar data is restricted to academic and research purposes only.
O2: Obtainability > Accessibility
Radar data from many countries are not open to the public. They must be purchased or formally requested. Different agencies apply differing quality control protocols, making global-scale analysis challenging.
U1: Usability > Structure
Radar data from different sources vary in format, spatial resolution, and temporal resolution, making data assimilation difficult.
S2: Sufficiency > Coverage
There is insufficient data or no data available from the Global South.
Satellite imagery – Multispectral
Details (click to expand)
This dataset contains images captured by spectrometer-equipped satellites that record data at specific wavelengths to detect the spectral signatures associated with methane. Notable missions include the Sentinel-5P TROPOMI instrument and the upcoming MethaneSAT, which provide global coverage of methane concentrations in the atmosphere.
Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Many current satellites have limited spatial resolution, making it challenging to detect smaller or localized methane sources. This low resolution can result in inaccurate assessments, potentially missing smaller leaks or misidentifying emission sources. Higher resolution is necessary for accurately identifying and quantifying methane emissions from specific facilities or small-scale sources.
Satellite imagery – PALSAR radar images
Details (click to expand)
PALSAR (Phased Array type L-band Synthetic Aperture Radar) provides radar imagery that can capture the 3D structure of forests by penetrating cloud cover and forest canopies. This technology enables consistent monitoring regardless of weather conditions or time of day, making it valuable for continuous forest carbon stock estimation.
Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to understand the raw radar data and preprocess it properly for use in ML models for forest carbon estimation. Developing standardized preprocessing pipelines and tools could make this valuable data more accessible to the broader ML and climate science communities.
Simulated variables from process-based models of soil organic carbon dynamics
Details (click to expand)
This dataset contains soil data generated by physics-based or process-based soil models that simulate soil organic carbon dynamics based on environmental and management inputs. These simulations provide alternatives to direct measurements where field data collection is prohibitively expensive or impractical.
Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Soil carbon values generated by simulators can be unreliable: the underlying process-based models may be outdated or may carry systematic biases, which are reflected in the simulated variables. Moreover, the ML scientists who use these simulated variables usually lack the domain knowledge needed to properly calibrate the process-based models.
Smart inverter devices database
Details (click to expand)
The California Energy Commission keeps a list of smart inverters that meet strict standards for safety and communication. These inverters must pass extra tests to show they can handle things like voltage, frequency, timing, and how they connect or disconnect from the grid, along with other technical functions to keep the power system safe and stable.
Those include: CEC Grid Support Solar Inverters, CEC Grid Support Battery Inverters, CEC Grid Support Solar/Battery Inverters, CEC Inverters with Power Control Systems functionality.
Additional vendors can also be contacted for smart inverter information:
Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant manufacturers and smart inverter models that can then be contacted for research partnerships. In terms of coverage area, while California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, other regions outside of the United States may locate similar manufacturers through partnerships and collaborations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Particularly for the CEC database, one will need to contact the CEC or manufacturer to receive additional information for a particular smart inverter. Detailed studies using smart inverter hardware may require collaboration with a utility and research organization to perform advanced research studies.
U2: Usability > Aggregation
To retrieve additional data beyond the single entry model and manufacturer of a particular smart inverter, one may need to contact a variety of manufacturers to get access to datasets and specifications for operational smart inverter data, laboratories to get access to hardware in the loop test centers, and utilities or local energy commissions for smart inverter safety compliance and standards.
S2: Sufficiency > Coverage
The new grid support functions defined by UL 1741-SA and UL 1741-SB are optional in most jurisdictions but are now required in California and Hawaii; public manufacturing data is available only via the CEC website. Collaborations with manufacturers outside the US may be necessary to compile a similar database, and contact with utilities can provide a better understanding of UL 1741-SB adoption elsewhere.
Soil Survey Geographic Database (SSURGO)
Details (click to expand)
The Soil Survey Geographic Database (SSURGO) contains soil organic carbon data collected through field observations and laboratory analysis of soil samples. It provides comprehensive soil information for the United States, including physical and chemical soil properties.
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is collected by different farmers on different farms, leading to consistency issues and a need to better structure the data.
S3: Sufficiency > Granularity
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
S2: Sufficiency > Coverage
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
Soil measurements – North Wyke Farm Platform
Details (click to expand)
The NorthWyke Farms platform data is a collection of soil measurements from the UK’s North Wyke Farm Platform, providing quarterly soil organic carbon values along with other environmental parameters. The dataset covers experimental farm plots under different management practices and is continuously updated with new measurements.
The biggest common challenges for use cases involving soil organic carbon are insufficient data volume and the lack of high-granularity data.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Soil carbon values are reported quarterly, which is not frequent enough to capture the weekly changes in soil carbon that follow adjustments to fertilizer amounts or tilling practices.
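Where only quarterly values exist, a simple linear interpolation to a weekly grid is sometimes used as a stopgap; note that this is purely a gap-filler and cannot recover genuine week-scale responses to management changes. A minimal sketch:

```python
def interpolate_weekly(quarterly, weeks_per_quarter=13):
    """Linearly interpolate successive quarterly soil-carbon values to a
    weekly series. This is a crude gap-filler: it cannot recover real
    week-scale responses to fertilizer or tillage changes."""
    weekly = []
    for q0, q1 in zip(quarterly, quarterly[1:]):
        for w in range(weeks_per_quarter):
            weekly.append(q0 + (q1 - q0) * w / weeks_per_quarter)
    weekly.append(quarterly[-1])  # close the series on the last observation
    return weekly

series = interpolate_weekly([20.0, 22.6])
print(len(series), series[0], series[-1])  # 14 20.0 22.6
```

A model trained on such interpolated targets will inherit the smoothing; the gap description above argues for collecting genuinely higher-frequency measurements instead.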
Solar panel PV system dataset
Details (click to expand)
The Solar Panel PV System Dataset (https://www.kaggle.com/datasets/arnavsharmaas/solar-panel-pv-system-dataset/data) is a tabular dataset from the National Renewable Energy Laboratory that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery-rated capacity.
The solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes.
The solar panel PV system dataset excluded third-party-owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a data gap in timeliness as some of the data includes historical data, which may not reflect current pricing and costs of PV systems.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excluded third-party-owned systems, systems with battery backup, self-installed systems, and data that was missing installation prices. Data was self-reported and may be inconsistent based on the reporting of component costs. Furthermore, some state markets were underrepresented or missing, which can be alleviated by new data collection jointly with simulation studies.
S4: Sufficiency > Timeliness
The dataset includes historical data that may not reflect current pricing for PV systems. To alleviate this, updated pricing may be incorporated in the form of external data or as additional synthetic data from simulation.
Solcast (global solar forecasting and historical solar irradiance data)
Details (click to expand)
Solcast is a global solar forecasting and historical solar irradiance data provider that combines satellite imagery from Himawari 8, GOES-16, and GOES-17 with Numerical Weather Prediction models to deliver 10-15 minute scale solar irradiance data products.
Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, insufficient for very-short-term forecasting.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Time resolution ranges from 5 to 60 minutes, which is insufficient for sub-5-minute forecasting needs.
S2: Sufficiency > Coverage
Coverage is limited to 33 global sites (18 tropical/subtropical, 15 temperate), requiring expansion to other regions and environmental conditions.
R1: Reliability > Quality
Significant elevation differences between ground sites and the corresponding model grid cells affect clear-sky irradiance estimation accuracy.
O1: Obtainability > Findability
Data is only accessible through collaborating academic or research institutions.
Sub-metered appliance-level data
Details (click to expand)
This collection includes multiple international datasets of sub-metered building electricity consumption, primarily from residential buildings across North America, Europe, and Asia collected between 2011-2020. These datasets provide granular appliance-level energy consumption data at varying sampling frequencies (1Hz-15kHz) along with aggregate building-level measurements. Some datasets include additional measurements such as occupancy information, environmental conditions, and utility billing data. The datasets vary in coverage from single households to hundreds of homes, with monitoring periods ranging from two months to several years.
- Almanac of Minutely Power dataset (AMPds2): A single building electricity, water, and natural gas consumption dataset from a home in Burnaby, British Columbia, Canada from 2012-2014 which includes environment and utility billing data as well.
- Commercial building energy dataset (COMBED): A dataset of 6 commercial buildings on the Indraprastha Institute of Information Technology (IIIT-Delhi) campus from August 2013 to the present, containing total power consumption as well as sub-metered data for elevators, air handling units (AHUs), uninterruptible power supplies (UPS), and central campus heating, ventilation, and air conditioning (HVAC) pumps and chillers at a 30-second cadence.
- DEDDIAG: A dataset comprised of aggregate and disaggregated power consumption from 15 southern German homes monitored at 1Hz containing 50 appliances including dishwashers, washing machines, refrigerators and dryers over a span of 3.5 years (2016-2020). Aggregated data includes three-phase measurements. This dataset also contains event start and stop timestamps for 14 appliances.
- Dutch Residential Energy Dataset (DRED): Consists of data collected from a single household in the Netherlands, containing appliance-level and total energy consumption over two months. Appliances measured were a refrigerator, washing machine, central heating, microwave, oven, cooker, blender, toaster, television, fan, living room outlets, and a laptop, recorded at a sampling frequency of 1 Hz. DRED additionally has occupancy data based on WiFi and Bluetooth signals received from occupant smartphones and wearable devices, allowing consumers to be located without outfitting the home with more intrusive monitoring devices. DRED can be accessed by request.
- Electricity Consumption and Occupation (ECO): A dataset collected from June 2012-January 2013 covering 6 homes in Switzerland, where 6-10 smart plugs were deployed in each household. Aggregate consumption at the building level was measured in three phases to capture voltage, current, and phase shifts. Occupancy data was tracked manually by residents and via a passive infrared entry door sensor.
- GREEND: A dataset of 9 households in Austria and Italy covering December 2013-April 2014. It includes aggregated and sub-metered appliance-level data, which varied depending on the appliance inventory of each household, with active power measurements taken at a frequency of 1Hz. GREEND can be requested by form.
- HIPE: A dataset from October 2017-December 2017 recording smart meter measurements from 10 machines and the main terminal of an electronics production site operated by the Institute of Data Processing and Electronics (IPE) at Karlsruhe Institute of Technology (KIT) in Germany at a cadence of 5 seconds with measurements with respect to active power, reactive power, voltage, frequency, and distortion.
- Indian data for Ambient Water and Electricity Sensing (iAWE): Total consumption, appliance-level, and circuit-panel-level data from a single-family home in New Delhi, India, collected in the summer of 2013 over the course of 73 days. Additional quantities, such as water usage from an overhead tank and network strength based on packet loss, were also jointly measured.
- IDEAL: A joint electricity, gas, temperature, humidity, and light dataset for 255 homes in the UK from August 2016 to June 2018. Aggregate and sub-metered consumption was measured at 1 second intervals, while temperature, humidity and light were measured at 12 second intervals. Household occupancy was measured through initial surveys with respect to socio-demographic data and self-reported updates to the data in the event that there was a change in occupancy.
- Reference Energy Disaggregation Dataset (REDD): Contains 119 days' worth of aggregate consumption taken in 2011 from 10 residential buildings located in the greater Boston area. The data includes meter-level power phases and voltage recorded at 15kHz, as well as 24 sub-metered circuits labeled by appliance category and measured at a cadence of 0.5Hz and 1Hz for large and small plug-level appliances respectively.
- REFIT: A dataset containing aggregate and individual-appliance-monitor sub-meter data taken every 8 seconds from 20 UK households from September 2013 to September 2015. Six of the households had rooftop solar panels; however, 3 were rewired to remove the effect of generation.
- UMass Smart Home data set: This dataset is comprised of metered and sub-metered data from three homes in western Massachusetts taken over a period of three years. Measurements included average household load, circuit-level load, and plug load per second. Accompanying generation data from solar panels and wind turbines is available for one of the three homes. Environmental data on outdoor weather and indoor temperature and humidity is provided, as well as occupancy information through wall switch data, doors, and motion sensors. HVAC trigger events and corresponding temperature settings and operational status are also provided.
- UK Domestic Appliance-Level Electricity data set (UK-DALE): A dataset comprised of measurements of aggregate as well as individual appliance-level consumption recorded every 6 seconds from 5 UK homes, collected by researchers at Imperial College. Continuous coverage varied per house, ranging from 39 to 786 days spanning 2012 to 2015. Data included whole-house active power, apparent power, and RMS voltage. Appliance-level measurements were taken every 6 seconds using individual appliance monitors for up to 54 appliances per residence.
For accurate NILM studies, benchmark datasets are required to include not only consumption but local power generation (e.g., from rooftop solar), as it can affect the overall aggregate load observed at the building level. While some datasets may include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave both as a load and generator such as electric vehicles or stationary batteries were also not included. The majority of building types are single family housing units limiting the diversity of representation. Furthermore, most datasets are no longer maintained following study close.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Sub-metered data relies heavily on the sensor network installed to monitor the building. Depending on the technology used, some sensors require calibration or are prone to malfunctions and delays. Additionally, interference from other devices can be present in the aggregate building-level readings, such as that experienced by REFIT, which needs to be addressed manually to enhance the usability of the dataset. These issues vary depending on the sub-meter dataset used, requiring a clear understanding of the metadata and documentation specific to the testbed the study was built upon. Exploratory data analysis of the time series may assist in identifying outliers resulting from sensor drift.
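For such a first-pass screen, a robust (median/MAD-based) outlier check is often preferable to mean/standard-deviation rules, since a large spike inflates the standard deviation and can mask itself. A minimal sketch, not tied to any particular dataset:

```python
from statistics import median

def flag_outliers(readings, z=3.5):
    """Flag indices whose value lies far from the median in robust
    (MAD-scaled) units. This is a first-pass screen for sensor spikes;
    slow sensor drift needs windowed or trend-based checks instead."""
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    if mad == 0:
        return []  # more than half the readings equal the median
    # 1.4826 scales MAD to be comparable to a standard deviation
    return [i for i, r in enumerate(readings)
            if abs(r - med) / (1.4826 * mad) > z]

print(flag_outliers([100, 101, 99, 100, 5000, 100, 98]))  # [4]
```

Flagged indices would then be cross-checked against the testbed's metadata (sensor swaps, calibration events) rather than dropped blindly.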
U1: Usability > Structure
When retrieving NILM data from a variety of sources, both from pre-existing studies and through custom data collection, the structure of the received data can vary. Testbed design, hardware, and the variables monitored depend on sensor availability, which can ultimately influence schemas and data formats. Data structure may also differ based on the level of disaggregation, at the plug level or the individual-appliance level. When building future testbeds for data collection, it may help to follow the standards set by APIs such as NILMTK, which has successfully integrated multiple datasets from different sources. Using the REDD dataset format as inspiration, the toolkit developers created a standard energy disaggregation data format (NILMTK-DF) with common metadata and structure, which requires manual dataset-specific preprocessing. When working with non-standardized data that may require aggregation, machine-learning-based data fusion strategies may help automate schema matching and data integration.
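The dataset-specific converter step can be sketched as a per-dataset column mapping onto a shared schema, in the spirit of NILMTK's converters. The dataset names and column names below are hypothetical, purely for illustration:

```python
import csv
import io

# Hypothetical per-dataset column mappings onto a shared
# (timestamp, watts) schema, in the spirit of NILMTK's converters.
COLUMN_MAPS = {
    "dataset_a": {"ts": "timestamp", "active_power_W": "watts"},
    "dataset_b": {"time": "timestamp", "P": "watts"},
}

def to_common_schema(raw_csv, dataset):
    """Rename a dataset's columns to the shared schema, dropping
    any columns the mapping does not cover."""
    mapping = COLUMN_MAPS[dataset]
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({mapping[k]: v for k, v in row.items() if k in mapping})
    return rows

sample = "time,P\n2015-01-01T00:00:00,132.5\n"
print(to_common_schema(sample, "dataset_b"))
```

Real converters also handle unit conversion, timezone normalization, and metadata (sampling rate, appliance labels), which is where most of the manual per-dataset effort goes.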
S6: Sufficiency > Missing Components
While sub-metered data provides a means of verifying non-intrusive load monitoring techniques, it does not capture the hidden human motivators driving appliance usage (such as comfort, utility cost, and daily activities) as well as other important factors contributing to the aggregate load seen at the building level meter. The key to improving these studies is to provide greater context to the sub-metered data by taking additional joint measurements such as rooftop solar power production, electric vehicle load, occupancy related information, and battery storage. Some dataset-specific missing data components are highlighted below.
All datasets mentioned do not include electric vehicle loads. REDD, AMPds2, COMBED, DEDDIAG, DRED, GREEND, iAWE, UK-DALE do not include generation from rooftop solar. REFIT contains solar from three homes but they were not the focus of the study and were treated as solar interference to the aggregate load. The UMass smart home dataset only had representation of one home with solar and wind generation, though at a significantly larger square footage and build compared to the other two homes that were featured.
While DRED provided occupancy information through data collected from wearable devices with respect to the home and ECO and IDEAL through self-reporting and an infrared entryway sensor, all other studies did not.
The majority of datasets are not amenable to human-in-the-loop analysis of user behavior (consumption patterns, response to feedback, and the effectiveness of load shifting in promoting energy-conserving behaviors) because such behavioral context is not represented.
While AMPds2 includes some utility data, most datasets do not incorporate billing or real-time pricing. This type of data would be beneficial, as it varies by time, season, region, and utility.
Battery storage was not taken into account in any of the building consumption datasets.
S2: Sufficiency > Coverage
Gaps in dataset coverage are specific to the sub-metered dataset. These gaps may be due to unaccounted loads, level of disaggregation (e.g. circuit level, plug level, or individual appliance level), or limited appliance types. Diversity of building types are limited as most studies take place in single family residences. Some dataset specific gaps are detailed below that may be addressed by collecting new data on existing testbeds or by augmenting already collected data with synthetic information. Future data collection efforts should be mindful of avoiding the kinds of gaps associated with existing datasets.
In the AMPds2 data, there was some missing data in the electricity, water, and natural gas readings. Additionally, some un-metered household loads were not accounted for in the aggregate building-level readings, and dishwasher consumption lacked direct water-meter monitoring. REFIT did not monitor appliances that could not be accessed through wall plugs, such as electric ovens. Depending on the built environment and building type, larger loads may not be connectable to building-level meters. For example, in the GREEND dataset, electric boilers in Austria were connected to separate external meters, and in the UMass smart home dataset, gas furnace, exhaust fan, and recirculator pump loads could not be monitored.
AMPds2, DEDDIAG, DRED, iAWE, REDD, REFIT, and the UMass smart home dataset all gather data in single-family homes, which may not be representative of the diversity of buildings in terms of age, location, construction, and household demographics. REFIT covers different single-family home types (detached, semi-detached, and mid-terrace) ranging from 2-6 bedrooms with builds from the 1850s to 2005. GREEND covers apartments in addition to single-family homes, but spans only 9 households; AMPds2, DRED, and iAWE each cover a single household. Additionally, each dataset is specific to the location where the measurements were taken and is thus shaped by the environmental conditions of the region as well as the culture of the population. For example, REDD consists of data from 10 monitored homes, which may not be representative of the appliances contributing to overall load in populations outside of Boston.
COMBED contains complex load types that may rely on variable-speed drives as well as multi-state devices, which the other datasets do not contain. This may be due to the difference in building type, but could also stem from the limited diversity of appliance representation.
ECO data relied on smart plugs for disaggregated load consumption measurements which varied between households depending on the smart plug appliance coverage. For all households the total consumption was not equal to the sum of the consumption measured from the plugs alone, indicative of a high proportion of non-attributed consumption.
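The size of this non-attributed consumption can be quantified directly by comparing the building-level meter against the sum of the plug readings over a common interval; a minimal sketch:

```python
def non_attributed_fraction(aggregate_kwh, plug_kwh):
    """Fraction of whole-building consumption not explained by the sum
    of sub-metered plug readings over the same interval. Clamped at
    zero, since meter rounding can make the plug sum exceed the total."""
    residual = aggregate_kwh - sum(plug_kwh)
    return max(residual, 0.0) / aggregate_kwh

# e.g. plugs explain 7.5 of 10 kWh, leaving 25% unattributed
print(non_attributed_fraction(10.0, [3.0, 2.5, 2.0]))  # 0.25
```

Tracking this fraction per household and per interval is a simple way to decide whether a dataset's plug coverage is sufficient for a given NILM study.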
R1: Reliability > Quality
In the AMPds2 data, the sum of the sub-metered consumption data did not add up to the whole house consumption due to some rounding error in the meter measurements, highlighting not only the need for NILM studies with sub-metered data as ground truth, but also the type of building level meter. Future data collection efforts may want to not only focus on retrieval of utility-side building meter data but also supplemental aggregate meter data to detect mismatches in measurements.
Datasets or studies that require self-reporting by customers may introduce participant bias, as the diligence with which households update voluntary information may vary. For example, if the number of household members, occupancy schedule, and the addition of new plug loads are self-reported, the frequency of updates varies with volunteer engagement. Additionally, volunteers who participate in NILM studies may have a particular propensity for energy-efficient actions and may not be representative of the general population; for example, some participants in UK-DALE were Imperial College graduate students who were motivated to participate to advance their own projects. To ensure that measured electricity usage represents the general population, future case studies can recruit volunteer communities with diverse socioeconomic backgrounds and locations.
The Public Utility Data Liberation (PUDL)
Details (click to expand)
The Public Utility Data Liberation (PUDL) project, maintained by Catalyst Cooperative, integrates and standardizes energy sector data from US government agencies including EIA, FERC, EPA, and system operators into analysis-ready formats. This continuously updated database covers power generation, fuel consumption, emissions, and financial data from 2009 to present across the United States.
Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Source data format changes (e.g., FERC’s shift from PDF to XBRL) and semi-structured formats require extensive preprocessing. PDF-based data extraction faces OCR challenges due to scan quality and inconsistent formatting. Standardized reporting formats and machine-readable data standards across agencies could reduce preprocessing burden.
U4: Usability > Documentation
Documentation updates lag behind source data changes, requiring continuous monitoring by maintainers. Proactive documentation standards and change notification systems from data providers could improve maintenance efficiency.
U3: Usability > Usage Rights
While PUDL uses Creative Commons licensing, some utility operator data has unclear public use rights despite being provided to regulatory agencies. Explicit public use licensing statements from government agencies could clarify usage permissions.
U2: Usability > Aggregation
Varying schema and naming conventions across agencies complicate data joining. Probabilistic entity matching helps but requires manual verification. Universal relational database standards and common identifiers across agencies could streamline aggregation.
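A rough sketch of such fuzzy matching using Python's standard-library difflib is shown below; real pipelines (including PUDL's) use more sophisticated probabilistic record linkage, and low-scoring matches are flagged for manual review rather than accepted automatically. The plant names are illustrative:

```python
import difflib

def match_plant(name, candidates, threshold=0.85):
    """Return (best_candidate, score) if the best fuzzy match clears the
    threshold, else (None, score) to flag the record for manual review."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = difflib.SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return (best, best_score) if best_score >= threshold else (None, best_score)

eia_names = ["Barry", "Comanche Peak", "Big Bend"]
print(match_plant("comanche peak", eia_names))  # ('Comanche Peak', 1.0)
```

Universal common identifiers across agencies, as the gap description suggests, would make this matching step unnecessary.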
U1: Usability > Structure
Source data structures vary significantly between reporting years and agencies, with inconsistent plant identification systems. Standardized data schemas and versioning practices could improve structural consistency.
S6: Sufficiency > Missing Components
Weather model data and transmission/congestion information from grid operators would enhance analysis capabilities. Integration partnerships with weather services and grid operators could expand dataset utility.
S3: Sufficiency > Granularity
Temporal resolution varies from hourly to annual across sources, requiring interpolation techniques. More frequent and standardized reporting intervals could improve data granularity.
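One simple alternative to interpolation, appropriate when annual values represent reported totals or regulatory states rather than smoothly varying quantities, is to repeat each value across the finer grid; a minimal sketch:

```python
def forward_fill_annual(annual, periods=12):
    """Expand an annual series to monthly resolution by repeating each
    annual value. A simple alternative to interpolation when values are
    reported totals or states rather than smoothly varying quantities."""
    return [v for v in annual for _ in range(periods)]

monthly = forward_fill_annual([5.0, 6.0])
print(len(monthly), monthly[0], monthly[12])  # 24 5.0 6.0
```

Whether to forward-fill, interpolate, or disaggregate with an auxiliary signal (e.g., monthly fuel receipts) depends on what the annual figure actually measures; mixing strategies silently is a common source of error when joining sources of different resolution.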
S2: Sufficiency > Coverage
Dataset coverage limited to US regulatory agencies and organizations. International data partnerships could expand geographic scope for comparative analysis.
US large-scale solar photovoltaic database
Details (click to expand)
The US Large-scale Solar Photovoltaic Database (USPVDB) contains polygon representation of large-scale photovoltaic installations, associated with facility-specific data attributes.
They were mined from the US Energy Information Administration (EIA) form 860 and facility type designation by the US Environmental Protection Agency (EPA). The dataset also has information on whether the large-scale PV installations are for agrivoltaic purposes. Overall, 3,699 US ground mounted facilities with capacity greater than or equal to 1MWdc are represented. The USPVDB data must be accessed through the United States Geological Survey (USGS) mapper browser application or for download as GIS data in the form of shapefiles or geojsons. Tabular data and metadata are provided in CSV and XML format.
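Once downloaded, the GeoJSON form of the data can be aggregated with only the Python standard library. The attribute name `p_cap_dc` used here is an assumption about the capacity field and should be checked against the field list shipped with the actual USGS download; the inline sample stands in for a real file:

```python
import json

def total_capacity_mwdc(geojson_text, field="p_cap_dc"):
    """Sum installed capacity from a USPVDB-style GeoJSON download.
    The attribute name "p_cap_dc" (DC capacity, MWdc) is an assumption;
    verify it against the metadata shipped with the dataset."""
    fc = json.loads(geojson_text)
    return sum(f["properties"].get(field, 0.0) for f in fc["features"])

# Tiny inline stand-in for a downloaded USPVDB GeoJSON file
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"p_cap_dc": 2.5}, "geometry": None},
        {"type": "Feature", "properties": {"p_cap_dc": 10.0}, "geometry": None},
    ],
})
print(total_capacity_mwdc(sample))  # 12.5
```

For the shapefile form, or for spatial joins against other layers, a GIS library such as geopandas would be used instead.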
Only the US is covered in this dataset. Supplementing it with international large-scale photovoltaic satellite imagery could expand its coverage.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data may be accessed through the USGS's designated USPVDB mapper or downloaded as GIS shapefiles, tabular data, or XML metadata. Data is open and easily obtainable.
S2: Sufficiency > Coverage
Coverage is over the US and specifically over densely populated regions that may or may not correlate to areas of low cloud cover and high solar irradiance. Representation of smaller scale private PV systems could expand the current dataset to less populated areas as well as regions outside the US.
US school bus fleet dataset
Details (click to expand)
The US school bus fleet dataset compiled by the World Resources Institute contains information on school district, model year, fuel type, manufacturer, seating capacity, and ownership mode for over 450,000 buses from 46 states and the District of Columbia, covering data collected from March to November 2022.
The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Inconsistent state-level reporting creates varying data structures and fields, with some states excluding contractor-owned buses. Solution: Develop federal reporting standards for consistent data collection across all states.
S2: Sufficiency > Coverage
Four US states are absent from the dataset, and some participating states exclude contractor-owned buses, leaving parts of the national fleet unrepresented. Solution: Expand state participation and require reporting of all ownership modes.
S4: Sufficiency > Timeliness
Dataset maintenance discontinued after November 2022. Solution: Establish ongoing federal or industry-supported data collection mechanisms.
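The federal reporting standard proposed above amounts to mapping each state's idiosyncratic fields onto one common schema. The sketch below illustrates that harmonization step; the field names, state formats, and records are invented for illustration and are not taken from the WRI dataset.

```python
# Target schema a federal reporting standard might define (illustrative).
STANDARD_FIELDS = ["state", "district", "model_year", "fuel_type", "owner"]

# Each state maps its own column names onto the standard schema
# (hypothetical examples of inconsistent state-level reporting).
STATE_FIELD_MAPS = {
    "NY": {"district": "school_district", "model_year": "yr",
           "fuel_type": "fuel", "owner": "ownership"},
    "TX": {"district": "district_name", "model_year": "model_year",
           "fuel_type": "fuel_type", "owner": "operator"},
}

def normalize(state, record):
    """Map one raw state record onto the standard schema; missing fields become None."""
    mapping = STATE_FIELD_MAPS.get(state, {})
    row = {"state": state}
    for field in STANDARD_FIELDS[1:]:
        # Fall back to the standard name if the state has no custom mapping.
        row[field] = record.get(mapping.get(field, field))
    return row

ny_raw = {"school_district": "Albany CSD", "yr": "2015",
          "fuel": "diesel", "ownership": "district"}
print(normalize("NY", ny_raw))
```

Once every state's records pass through such a mapping, fleet-wide analyses (e.g., electrification planning models) can run on a single consistent table.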
WeatherBench 2 is based on ERA5, so it inherits ERA5's issues: in particular, the data is biased over regions with no observations.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Inherent biases limit its use as ground truth. ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving the accuracy of precipitation and cloud fields.
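One of the simplest forms of the bias reduction mentioned above is mean-bias removal against station observations. The sketch below is a toy illustration on synthetic numbers: real reanalysis bias correction (let alone ML-enhanced data assimilation) is far more sophisticated, and nothing here reflects actual ERA5 or WeatherBench 2 values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" temperature field and a warm-biased reanalysis of it.
truth = 15 + 5 * rng.standard_normal(500)
reanalysis = truth + 1.5 + 0.5 * rng.standard_normal(500)  # +1.5 degC bias

# Stations observe only ~10% of grid points, mimicking sparse coverage.
observed = rng.random(500) < 0.1

# Estimate the bias where observations exist, then subtract it everywhere.
bias = np.mean(reanalysis[observed] - truth[observed])
corrected = reanalysis - bias

print(round(bias, 2))  # close to the injected +1.5
```

The catch this toy example hides is exactly the gap described above: in regions with no observations, there is nothing to estimate the bias against, so any correction there is an extrapolation.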
More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on it.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Larger training datasets generally improve the performance of data-driven sub-seasonal forecast models. However, with only a limited number of models contributing to the SubX dataset, training data is scarce. To enhance ML model performance, more SubX data generated by physics-based numerical weather forecast models is required.
xBD Dataset (pre- and post-disaster satellite imagery)
Details (click to expand)
xBD is an annotated benchmark dataset containing pre- and post-disaster satellite imagery, used for training and evaluating ML models for disaster damage assessment. The dataset is publicly available at https://paperswithcode.com/dataset/xbd.
The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is no differentiation of grades of damage. More granular information about the severity of damage is needed for more precise assessments.
S2: Sufficiency > Coverage
Data is highly biased towards North America. Similar data from other parts of the world is urgently needed.