Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.
In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.
This project is currently in its beta phase, with ongoing improvements to content and usability. The information provided is not exhaustive, and may contain errors. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.
Accelerating and improving weather forecasting: Near-term (< 24 hours)
Accurate near-term (< 24 hours ahead) weather forecasting is critical for climate change mitigation (e.g., solar panel deployment) and adaptation (e.g., crisis management during disasters), with applications requiring high spatial and temporal resolution of temperature, precipitation, wind, and cloud coverage.
Machine learning can help make these forecasts more computationally efficient and accurate while maintaining or improving the high resolution needed for climate applications.
The main data gaps include limited geographic coverage (primarily US-centric data), extremely large data volumes that are difficult to transfer and process, and inconsistent data formats from different sources.
Addressing these gaps requires expanding coverage to global regions (especially the Global South), providing cloud-based computational resources alongside the data, and developing standardized formats for multi-source data integration.
Data volume is large and only data specific to the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
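The remote-access pattern suggested above can be sketched in a few lines. This is a toy illustration using a local memory-mapped array as a stand-in for a cloud-hosted store (in practice, a lazily opened Zarr or NetCDF archive would play this role); all dimensions and values are assumptions, not a real dataset. The point is that only the requested spatio-temporal window is read, not the full high-volume archive.

```python
import numpy as np

# Hypothetical dimensions for a high-resolution gridded weather product
# (illustrative assumptions, not a real dataset).
n_times, n_lat, n_lon = 24, 200, 300

# Simulate a large on-disk array; a cloud-hosted Zarr/NetCDF store opened
# lazily would serve the same purpose without local transfer.
arr = np.memmap("weather.dat", dtype=np.float32, mode="w+",
                shape=(n_times, n_lat, n_lon))
arr[:] = 0.0
arr[5, 100:110, 200:210] = 1.5  # pretend precipitation signal
arr.flush()

# Reopen read-only and pull just a small spatio-temporal window --
# only the requested bytes are read, not the full array.
ro = np.memmap("weather.dat", dtype=np.float32, mode="r",
               shape=(n_times, n_lat, n_lon))
window = np.array(ro[5, 100:110, 200:210])
print(window.mean())  # 1.5
```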
Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasts and climate modeling because of its high resolution and ready-for-analysis characteristics. However, large volumes of observations from radiosondes, balloons, and weather stations are largely underutilized. Creating a well-structured dataset like ERA5 but with more observational data would be valuable.
Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Much radar data is restricted to academic and research purposes only.
O2: Obtainability > Accessibility
Radar data from many countries are not open to the public. They must be purchased or formally requested. Different agencies apply differing quality control protocols, making global-scale analysis challenging.
U1: Usability > Structure
Radar data from different sources vary in format, spatial resolution, and temporal resolution, making data assimilation difficult.
S2: Sufficiency > Coverage
There is insufficient data or no data available from the Global South.
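One piece of the structure gap above, differing spatial resolution, can be bridged with regridding. The sketch below uses simple nearest-neighbour regridding to bring two hypothetical radar products onto one common analysis grid; the grids, values, and resolutions are illustrative assumptions, and operational pipelines would typically use more careful interpolation.

```python
import numpy as np

def regrid_nearest(field, src_lat, src_lon, dst_lat, dst_lon):
    """Nearest-neighbour regridding of a 2-D field onto a target grid."""
    i = np.abs(src_lat[:, None] - dst_lat[None, :]).argmin(axis=0)
    j = np.abs(src_lon[:, None] - dst_lon[None, :]).argmin(axis=0)
    return field[np.ix_(i, j)]

# Two hypothetical radar products on different grids (illustrative values).
lat_a, lon_a = np.linspace(30, 40, 101), np.linspace(-100, -90, 101)
lat_b, lon_b = np.linspace(30, 40, 51), np.linspace(-100, -90, 51)
refl_a = np.random.default_rng(0).random((101, 101))
refl_b = np.random.default_rng(1).random((51, 51))

# Common analysis grid: both products become directly comparable.
lat_c, lon_c = np.linspace(30, 40, 81), np.linspace(-100, -90, 81)
a_on_c = regrid_nearest(refl_a, lat_a, lon_a, lat_c, lon_c)
b_on_c = regrid_nearest(refl_b, lat_b, lon_b, lat_c, lon_c)
print(a_on_c.shape, b_on_c.shape)  # (81, 81) (81, 81)
```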
Accelerating building energy models
Building energy modeling (also called building performance simulation) is key across an array of use cases that can help reduce energy demand in buildings, including architectural design, heating, ventilation, and air conditioning (typically abbreviated HVAC) design and control, building performance rating, and building stock analysis.
Traditional building energy modeling, such as with the software EnergyPlus, relies on detailed physics models with significant computational complexity and processing time. Machine learning models can significantly enhance evaluation by providing fast emulators for these models based on synthetic and real-world data, enabling faster prototyping and optimization of building design and operations along multiple comfort, consumption, and environmental objectives.
Traditional models and ML-based emulation both require precise inputs about the building design, its usage, and the physical and environmental conditions surrounding it. However, information about building usage and design is often kept in silos, while information about the surroundings is, when available, dispersed across various datasets. Very few benchmarks gather all of this information for given buildings.
Closing these gaps involves releasing anonymized usage data, working on building bridges between relevant datasets, and developing benchmark datasets. This may enable testing models across more geographies and building types to reduce existing biases and uncertainties attached to building energy models.
Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Most of the energy demand data is not freely available. Reasons include the reluctance of private companies to share the data and privacy concerns with respect to the residents of the buildings. Such data may be obtained for research via non-disclosure agreements, often after lengthy bureaucratic approval. This situation makes the development of open-access benchmark datasets complex. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
U2: Usability > Aggregation
The different variables needed may not always be available together. One may need to match energy demand with building stock information and climatic data. Reusable open-source tools may ease this process.
S2: Sufficiency > Coverage
Most datasets are from test beds, buildings, and contributing households from the United States. Similar data from other regions would require data collection as household usage behavior may differ depending on culture, location, building age, and weather. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected. To overcome this gap, interpolation techniques may be employed and recorded.
S6: Sufficiency > Missing Components
Certain detailed variables about the building design and occupancy may not be recorded. Such data points are difficult to obtain without new data collection. Building data typically does not include grid interactive data or signals from the utility side with respect to control or demand side management. Such data can be difficult to obtain or require special permissions. By enabling the collection of utility side signals, utility-initiated auto-demand response (auto-DR) and load shifting could be better assessed.
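The aggregation and granularity gaps above (U2 and S3) often come down to matching datasets with different temporal resolutions. A minimal sketch, assuming hypothetical 15-minute energy demand and hourly weather data, shows one reusable pattern: resample to a common resolution, join on timestamps, and record the aggregation choices made.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs (names and values are illustrative assumptions):
# 15-minute energy demand and hourly outdoor temperature for one building.
rng = np.random.default_rng(0)
demand = pd.DataFrame(
    {"kwh": rng.random(96)},
    index=pd.date_range("2024-01-01", periods=96, freq="15min"),
)
weather = pd.DataFrame(
    {"temp_c": np.linspace(-2, 5, 24)},
    index=pd.date_range("2024-01-01", periods=24, freq="h"),
)

# Harmonize resolutions: aggregate demand to hourly, then join on timestamp.
hourly = demand["kwh"].resample("h").sum().to_frame()
merged = hourly.join(weather, how="inner")

# Record the aggregation/interpolation choices alongside the data.
merged.attrs["provenance"] = "demand summed 15min->1h; weather as observed"
print(merged.shape)  # (24, 2)
```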
Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties that may not be available for traditional urban building types. Interpreting model output requires domain knowledge, and the large volumes of synthetic data produced for different wind directions can become challenging to manage. Future data collection aimed at verifying simulation output would benefit surrogate or proxy approaches to the computationally expensive Navier-Stokes equations. Moreover, coverage is often restricted to modern building approaches, leaving out passive building techniques from vernacular architecture developed by indigenous communities.
Data Gap Type
Data Gap Details
W: Wish
Such datasets do not exist and require dedicated work to gather inputs, generate the data via simulations, and ensure that the simulations are reliable by verifying them with real-world data. Licensing and privacy issues may also be important aspects of such efforts.
While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings receive no similar focus, which is surprising given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics for direct sunlight access, rely on annual averages rather than seasonal values, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyle. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in traditional urban residential spaces and its effects on thermal comfort and operational consumption; vernacular architecture, which is specific to a local region and culture, may not share these objectives, preferring more indoor-outdoor transitional spaces, earthen materials, and less emphasis on windows and incident natural sunlight.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Depending on the software selected, the intended use, and the number of features required, simulation software is typically available only for purchase.
S2: Sufficiency > Coverage
Vernacular architecture, characterized by traditional building styles and techniques specific to a local region or culture, is not covered in simulation tools. Most simulation output focuses on residential areas in primarily urban regions, aiming to minimize future operational costs based on assumed illuminance thresholds that may not be universal. Adding the ability to evaluate passive design strategies adapted to a specific climate, and expanding the material palette to include high-thermal-inertia walls and roofs such as earthen or thatched construction, would enable additional thermal comfort studies for a given incident illuminance. Considering cultural relationships between outdoor and indoor spaces would provide even greater context for simulation studies and their usefulness in new construction across diverse regions.
S3: Sufficiency > Granularity
Simulations use fixed occupancy schedules which work well in the context of commercial buildings but are overly prescriptive in the context of residential buildings where user occupancy may vary depending on the number of occupants, time of day, day of week, and season. Residential buildings are multipurpose and can be characterized with a member spending more time in some areas rather than others depending on activity. This gap can be alleviated by adapting and expanding simulation inputs to take diverse occupancy scenarios into consideration.
Current DPMs take into account annual averages rather than granular information about seasonal variations in daylight availability. While some advances have been made to incorporate this information through tools like Daysim, which defines new DPMs for residential buildings, further work is needed for regions where occupants may want to minimize direct light access and focus more on diffuse lighting. Expanding studies to clients in warmer, more arid climates may yield different thresholds and comfort parameters depending on preferences and lifestyle, and may even account for daylight oversupply, glare, and thermal discomfort.
Materials used in the construction process of the building may change after initial simulation development depending on availability. Finalized building materials and interior absorption and reflectance may diverge from those simulated. Use of dynamic shading devices could also decrease indoor temperature due to incident irradiance. Simulated results could be provided over a range.
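The occupancy-granularity gap described above can be addressed by replacing fixed schedules with stochastic ones. Below is a toy sketch of a stochastic hourly occupancy generator; the weekday/weekend probabilities are illustrative assumptions, not calibrated survey data, and a real study would fit them to observed household behavior.

```python
import numpy as np

def sample_occupancy(n_days, seed=0):
    """Sample a stochastic hourly occupancy schedule (1 = occupied).

    A toy alternative to fixed schedules: weekday evenings/nights and
    weekend daytimes are more likely to be occupied. All probabilities
    below are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    hours = np.arange(24)
    weekday_p = np.where((hours >= 18) | (hours < 8), 0.9, 0.3)
    weekend_p = np.where((hours >= 9) & (hours < 23), 0.7, 0.9)
    out = []
    for d in range(n_days):
        p = weekend_p if d % 7 >= 5 else weekday_p
        out.append(rng.random(24) < p)
    return np.concatenate(out).astype(float)

schedule = sample_occupancy(7)  # one week of hourly occupancy
print(schedule.shape)  # (168,)
```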
Accelerating data-driven generation of climate simulations
Climate simulation using physics-based Earth system models is computationally intensive and time-consuming, limiting the exploration of different climate scenarios.
ML can accelerate this process by creating surrogate models that approximate complex Earth system model simulations, enabling rapid generation of climate projections under various greenhouse gas emission scenarios.
Current ML approaches are limited by the availability of diverse training data from multiple climate models, with most datasets featuring only single-model simulations or inconsistent data structures across models.
Addressing these gaps requires standardizing data formats across climate models, making high-volume data more accessible through cloud-based solutions, and improving model quality to reduce biases and uncertainties in simulations. Closing these data gaps would enable more robust ML emulators capable of producing reliable climate projections at a fraction of the computational cost, accelerating climate research and supporting more informed policy decisions.
The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The data imposes massive computational requirements; cloud-based platforms and data subsetting tools can improve accessibility.
U1: Usability > Structure
Formats are inconsistent across models; standardized naming conventions and preprocessing pipelines can enable seamless multi-model integration.
R1: Reliability > Quality
Future projections carry large uncertainties; model evaluation frameworks and ensemble weighting methods can help quantify and reduce them.
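The standardization point above (U1) is often solved with an explicit per-model variable mapping. A minimal sketch, assuming hypothetical model names and native variable conventions (the mapping below is an illustration, not an official standard), converts each model's output to one common name and unit before multi-model training.

```python
import numpy as np

# Per-model metadata describing native variable names and units; the
# mappings below are illustrative assumptions, not an official convention.
VARIABLE_MAP = {
    "model_a": {"surface_temp": ("tas", "K")},
    "model_b": {"surface_temp": ("t2m", "degC")},
}

def to_standard(model, variable, data):
    """Return (native_name, data) with data converted to a common unit (degC)."""
    native_name, unit = VARIABLE_MAP[model][variable]
    data = np.asarray(data, dtype=float)
    if unit == "K":
        data = data - 273.15  # Kelvin -> Celsius
    return native_name, data

name_a, temp_a = to_standard("model_a", "surface_temp", [273.15, 293.15])
name_b, temp_b = to_standard("model_b", "surface_temp", [0.0, 20.0])
print(temp_a, temp_b)  # [ 0. 20.] [ 0. 20.]
```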
Accelerating distribution-side hosting capacity estimations
Transitioning power grids from carbon-based generation to renewable sources requires restructuring from unidirectional to bidirectional energy networks, which stresses existing systems—especially at the low-voltage distribution level. The hosting capacity of distribution feeders determines how much distributed renewable generation can be safely integrated without triggering safety equipment or compromising power quality.
Traditional methods for assessing distribution network hosting capacity rely on computationally expensive power flow simulations that are difficult to perform in real-time. Machine learning models can serve as surrogate models by capturing spatio-temporal patterns across multiple data streams, enabling real-time hosting capacity estimation and accelerated scenario evaluation through reinforcement learning.
A significant data gap is the limited availability of real distribution feeder data, requiring researchers to rely on simulations that may not accurately reflect actual grid conditions due to differences in load patterns, environmental factors, and distributed energy resource (DER) penetration levels.
Distribution system operators, utilities, and researchers can collaborate to improve data sharing while protecting sensitive information, thereby enabling more accurate hosting capacity assessments and facilitating higher renewable energy integration in distribution networks.
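The surrogate-model idea above can be illustrated with a toy regression. The feeder features and the "true" hosting capacity below are synthetic assumptions standing in for power-flow study output; the point is that once fitted, the surrogate returns an estimate for a new operating point without rerunning a power flow.

```python
import numpy as np

# Synthetic training data standing in for power-flow results (assumptions).
rng = np.random.default_rng(0)
n = 500
load = rng.uniform(0.2, 1.0, n)   # feeder loading (p.u.)
der = rng.uniform(0.0, 0.8, n)    # existing DER penetration (p.u.)
# Pretend hosting capacity from a power-flow study, plus measurement noise.
capacity = 2.0 - 0.8 * load - 1.1 * der + rng.normal(0, 0.02, n)

# Fit a linear surrogate with ordinary least squares.
X = np.column_stack([np.ones(n), load, der])
coef, *_ = np.linalg.lstsq(X, capacity, rcond=None)

# Real-time estimate for a new operating point -- no power flow needed.
estimate = coef @ np.array([1.0, 0.5, 0.3])
print(round(float(estimate), 2))
```

Real studies would use richer features (topology, voltage profiles) and nonlinear models, but the fit-then-query workflow is the same.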
While OpenDSS and GridLab-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Realistic distribution system studies require aggregating and collating data from multiple external sources regarding network topology, load profiles, and DER penetration for the specific region of interest.
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities and/or the Distribution System Operator (DSO) to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
Simulator studies require real deployment data from substations for verification, as actual hosting capacity may vary based on load conditions, environmental factors, and DER penetration levels in the service area.
Accelerating post-disaster damage assessments
Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies.
ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.
Current datasets for ML-based damage assessment face significant geographic bias and granularity issues, limiting their effectiveness in global contexts and for detailed damage classification.
Expanding geographic coverage beyond North America and enhancing damage severity classifications would enable more accurate and globally applicable ML damage assessment models, improving disaster response worldwide.
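At its simplest, ML-assisted damage assessment compares pre- and post-disaster imagery. The sketch below is a toy change-detection example on synthetic arrays standing in for co-registered satellite tiles; the threshold and image values are illustrative assumptions, and real systems use learned models rather than a fixed threshold.

```python
import numpy as np

# Toy pre/post-disaster image pair (single band, values in [0, 1]).
# Real pipelines would use co-registered satellite tiles.
rng = np.random.default_rng(0)
pre = rng.random((64, 64))
post = pre.copy()
post[20:30, 20:30] -= 0.5  # simulate a damaged block (reflectance change)

# Flag pixels whose reflectance changed beyond an assumed threshold.
change = np.abs(post - pre) > 0.2
damaged_fraction = change.mean()
print(change.sum())  # 100 flagged pixels
```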
Financial loss data for disasters is primarily proprietary and inaccessible to researchers, limiting the development of comprehensive disaster impact assessment models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes to improve disaster response and planning.
Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.
Data Gap Type
Data Gap Details
S4: Sufficiency > Timeliness
Both pre- and post-disaster imagery are needed, but pre-disaster imagery is sometimes outdated and does not reflect conditions immediately before the disaster.
S3: Sufficiency > Granularity
Accurate damage assessment requires high-resolution images, but the resolution of current publicly open datasets is inadequate for this purpose. Some commercial high-resolution images should be made available for research purposes at no cost.
The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is no differentiation of grades of damage. More granular information about the severity of damage is needed for more precise assessments.
S2: Sufficiency > Coverage
Data is heavily biased towards North America. Similar data from other parts of the world is urgently needed.
Accelerating the design of new carbon-absorbing materials
Carbon sequestration through absorption methods can effectively reduce CO2 levels in the atmosphere. Engineered molecules, known as carbon sorbents, can be designed to bind selectively to CO2. Traditionally, developing these molecules requires in-lab experimentation, which can be time- and resource-intensive because replicated trials are needed to characterize adsorbent properties. Additionally, the search space of possible molecules is very large and non-trivial to explore directly through experiment.
Machine learning can significantly accelerate materials discovery by systematically generating and evaluating candidate molecule properties based on structure, thereby facilitating rapid iteration.
There is a lack of openly-accessible lab measurements to train ML simulation models.
Multiple initiatives could be taken to close this gap, including creating industry-research data sharing initiatives or establishing mandatory data sharing requirements for scientific publications.
The major challenge is that data is not shared with the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data related to carbon absorption materials is often not readily accessible to the public, as it is typically withheld until commercial products are developed. While it is possible to scrape data from published literature, this approach can be cumbersome, especially for large datasets. To advance research and innovation in this field, establishing mandatory data sharing as a requirement for publication is essential. When a paper is published, authors should be required to provide their data in open, machine-readable formats to facilitate accessibility and usability.
Creating open initiatives where companies and institutions recognize the mutual benefits of data sharing is also vital. Until such initiatives demonstrate clear advantages for all stakeholders, private companies may be hesitant to share proprietary data. Initiatives like OpenDAC are promising steps toward fostering collaboration and transparency in the field.
Assessment of climate impacts on public health
Climate change has major implications for public health. ML can help analyze the relationships between climate variables and health outcomes to assess how changes in climate conditions affect public health.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
There are, in general, few datasets that cover the full spectrum of population characteristics (age, gender, economic status, etc.). To make good use of available data, there should be more effort to integrate data from disparate sources, such as through the creation of data repositories and open community data standards.
U4: Usability > Documentation
Some data repositories are available, but the data is not always accompanied by the source code that created it or by other forms of good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data is usually in raster files or gridded format, whereas health data is usually in tabular format. Mapping climate data to the same geospatial entity of health data is also computationally expensive.
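The raster-to-tabular mapping problem described above can be illustrated with a toy aggregation. Below, a small gridded temperature field is averaged over two hypothetical districts (the grid, labels, and values are illustrative assumptions), producing one value per spatial unit of the health data; real pipelines would use area-weighted zonal statistics over actual administrative boundaries.

```python
import numpy as np

# Toy gridded climate field (e.g., daily max temperature in degC).
temp_grid = np.array([[30.0, 31.0, 25.0],
                      [29.0, 32.0, 24.0],
                      [28.0, 27.0, 23.0]])

# Region label for each grid cell (two hypothetical districts: 0 and 1).
regions = np.array([[0, 0, 1],
                    [0, 0, 1],
                    [1, 1, 1]])

# Average the gridded field over each district so it can be joined
# against tabular health data keyed by district.
district_means = {r: float(temp_grid[regions == r].mean()) for r in (0, 1)}
print(district_means)  # {0: 30.5, 1: 25.4}
```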
Processing climate data and integrating it with health data is a major challenge.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform covering all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
Automating individual re-identification for wildlife
Identifying individual animals within wildlife populations is critical for monitoring endangered species, understanding their behaviors, and developing effective conservation strategies for biodiversity preservation.
Computer vision and machine learning techniques enable automatic individual identification at scale, helping researchers track specific animals over time without invasive tagging methods.
The scarcity of publicly available and well-annotated datasets poses a significant challenge for applying ML in wildlife identification, with the most valuable data scattered across individual research labs or organizations rather than centralized repositories.
Addressing this requires fostering a culture of data sharing in the ecological community through incentives like financial rewards and recognition for data collectors, while establishing standardized pipelines and infrastructures to aggregate existing annotated data for model training.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in collected data, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a problem intertwined with gaps in taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
A significant challenge for eDNA-based monitoring is the incomplete barcoding reference databases, limiting the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Incomplete barcoding reference databases limit the identification of many species from eDNA samples, particularly in biodiverse regions.
Bias-correction of climate projections
Details (click to expand)
Climate projections provide essential information about future climate conditions, guiding critical mitigation and adaptation efforts such as disaster risk assessments and power grid optimization.
ML enhances the accuracy of these projections by bias-correcting forecasts from physics-based climate models like CMIP6, learning relationships between historical simulations and observed ground truth data.
Large uncertainties in climate projections and inconsistent data formats across models create significant barriers for developing robust ML bias-correction methods.
Improved model ensemble techniques and standardized data formats can enhance projection reliability and enable more effective climate risk planning.
Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Large uncertainties in future projections - Model evaluation frameworks and ensemble weighting methods can help quantify and reduce uncertainties
U1: Usability > Structure
Inconsistent formats across models - Standardized naming conventions and preprocessing pipelines can enable seamless multi-model integration
U6: Usability > Large Volume
Massive computational requirements - Cloud-based platforms and data subsetting tools can improve accessibility
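As one concrete illustration, a common starting point for bias-correcting projections is empirical quantile mapping, which the ML-based methods above typically extend. This is a minimal sketch, not a production implementation: each future model value is placed at its quantile within the model's historical distribution, and the observed value at that same quantile is returned.

```python
import numpy as np

def quantile_map(model_future, model_hist, obs_hist):
    """Empirical quantile mapping for bias correction.

    model_future: model projections to correct.
    model_hist:   model output over a historical reference period.
    obs_hist:     observations over the same reference period.
    """
    mh = np.sort(np.asarray(model_hist))
    oh = np.sort(np.asarray(obs_hist))
    q_model = np.linspace(0.0, 1.0, len(mh))
    q_obs = np.linspace(0.0, 1.0, len(oh))
    # Quantile of each future value within the model-historical CDF.
    fq = np.interp(model_future, mh, q_model)
    # Read off the observed value at that same quantile.
    return np.interp(fq, q_obs, oh)
```

For a model with a constant +2 bias relative to observations, this mapping subtracts 2 from future values; for more realistic, distribution-dependent biases it corrects each quantile separately, which is why it is preferred over a simple mean shift.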
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from Copernicus Climate Data Store - Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days/months to hours
U6: Usability > Large Volume
Massive storage and processing requirements - Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Point measurements require gridding - Statistical interpolation methods and geostatistical techniques can convert station data to regular grids
S2: Sufficiency > Coverage
Sparse coverage in remote regions - Expanded observation networks and satellite-derived proxies can fill spatial gaps
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
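As an example of the statistical interpolation mentioned above, inverse-distance weighting is one of the simplest ways to turn scattered station readings into values on a regular grid. The coordinates and values below are made up for illustration; operational gridding typically uses geostatistical methods such as kriging.

```python
def idw(stations, grid_points, power=2):
    """Inverse-distance-weighted interpolation of station data.

    stations:    list of (x, y, value) tuples for point measurements.
    grid_points: list of (x, y) locations to interpolate onto.
    Returns one interpolated value per grid point.
    """
    out = []
    for gx, gy in grid_points:
        num = den = 0.0
        for sx, sy, v in stations:
            d2 = (gx - sx) ** 2 + (gy - sy) ** 2
            if d2 == 0.0:  # grid point coincides with a station
                num, den = v, 1.0
                break
            w = 1.0 / d2 ** (power / 2)
            num += w * v
            den += w
        out.append(num / den)
    return out
```

Nearby stations dominate each grid value, so the sparse-coverage problem noted above translates directly into large interpolation errors far from any station.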
Bias-correction of weather forecasts
Details (click to expand)
ML can be used to improve the fidelity of high-impact weather forecasts by post-processing outputs from physics-based numerical forecast models and by learning to correct the systematic biases associated with physics-based numerical forecasting models.
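A minimal form of such post-processing is a linear correction fit on past forecast/observation pairs. The sketch below is illustrative only; operational post-processing uses far richer statistical and ML models, but the idea of learning a mapping from model output to observed truth is the same.

```python
def fit_linear_correction(forecasts, observations):
    """Fit corrected = a * forecast + b by least squares on historical pairs."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_o = sum(observations) / n
    cov = sum((f - mean_f) * (o - mean_o)
              for f, o in zip(forecasts, observations))
    var = sum((f - mean_f) ** 2 for f in forecasts)
    a = cov / var
    b = mean_o - a * mean_f
    return a, b

def correct(forecast, a, b):
    """Apply the learned correction to a new forecast value."""
    return a * forecast + b
```

If the numerical model has a systematic bias (say, forecasts consistently running too high), the fitted slope and intercept absorb it, and the correction transfers to new forecasts as long as the bias is stable.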
As with HRES, the biggest challenge with ENS is that only a portion of it is freely available to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
U1: Usability > Structure
Point measurements require gridding - Statistical interpolation methods and geostatistical techniques can convert station data to regular grids
S2: Sufficiency > Coverage
Sparse coverage in remote regions - Expanded observation networks and satellite-derived proxies can fill spatial gaps
Earth observation for climate-related applications
Details (click to expand)
Many climate-related applications suffer from a lack of real-time and/or on-the-ground data. ML can be used to analyze satellite imagery at scale in order to fill some of these gaps, via applications such as land cover classification, footprint detection for buildings, solar panel detection, deforestation detection, and emissions monitoring.
Satellite images are used intensively for Earth system monitoring. One of the two biggest challenges of using satellite imagery is the sheer volume of data, which makes downloading, transferring, and processing difficult; the other is the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Publicly available datasets often lack sufficient granularity. This is particularly challenging for the Global South, which typically lacks the funding for high-resolution commercial satellite imagery.
U6: Usability > Large Volume
The sheer volume of data now poses one of the biggest challenges for satellite imagery. When data reaches the terabyte scale, downloading, transferring, and hosting become extremely difficult. Those who create these datasets often lack the storage capacity to share the data. This challenge can potentially be addressed by one or more of the following strategies:
Data compression: Compress the data while retaining lower-dimensional information.
Lightweight models: Build models with fewer features selected through feature extraction.
Large foundation models for remote sensing data: Purposefully construct large models (e.g., foundation models) that can handle vast amounts of data. This requires changes to research infrastructure, such as modifications to preprocessing architectures.
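As a toy illustration of the compression strategy above, spatial block-averaging reduces data volume while retaining coarse information. Real pipelines would use purpose-built codecs or learned compression; this sketch only shows the volume/information trade-off.

```python
def block_average(image, k):
    """Downsample a 2D image (list of lists) by averaging k x k blocks.

    Reduces storage by roughly a factor of k*k while keeping a
    lower-resolution summary of the scene.
    """
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - h % k, k):
        row = []
        for j in range(0, w - w % k, k):
            block = [image[i + di][j + dj]
                     for di in range(k) for dj in range(k)]
            row.append(sum(block) / (k * k))
        out.append(row)
    return out
```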
O2: Obtainability > Accessibility
Very high-resolution satellite images (e.g., finer than 10 meters) typically come from commercial satellites and are not publicly available. One exception is the NICFI dataset, which offers high-resolution, analysis-ready mosaics of the world’s tropics.
U5: Usability > Pre-processing
Satellite images often contain a lot of redundant information, such as large amounts of data over the ocean that do not always contain useful information. It is usually necessary to filter out some of this data during model training.
U2: Usability > Aggregation
Due to differences in orbits, instruments, and sensors, imagery from different satellites can vary in projection, temporal and spatial coverage, and cloud blockage, each with its own pros and cons. To overcome data gaps (e.g., cloud blockage) or errors, multiple satellite images are often assimilated. Harmonizing these differences is challenging, and sometimes arbitrary decisions must be made.
U5: Usability > Pre-processing
The lack of annotated data presents another major challenge for satellite imagery. It is suggested that collaboration and coordination at the sector level should be organized to facilitate annotation efforts across multiple sectors and use cases. Additionally, the granularity of annotations needs to be increased. For example, specifying crop types instead of just “crops” and detailing flood damage levels rather than general “damaged” are necessary for more precise analysis.
M: Misc/Other
Cloud cover presents a major technical challenge for satellite imagery, significantly reducing its usability. To obtain information beneath the clouds, pixels from clear-sky images captured by other satellites are often used. However, this method can introduce noise and errors.
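The gap-filling approach described above can be sketched as a per-pixel composite that takes, for each pixel, the first cloud-free observation available across a stack of co-registered images. This is a simplified illustration: real compositing must also handle radiometric differences between sensors and acquisition dates, which is the noise source noted above.

```python
def composite(stack, cloud_masks):
    """Fill cloudy pixels from other images in a co-registered stack.

    stack:       list of 2D images (lists of lists), same shape.
    cloud_masks: matching list of 2D boolean masks, True = cloudy pixel.
    Returns a single image taking, per pixel, the first cloud-free value;
    pixels cloudy in every image stay None.
    """
    h, w = len(stack[0]), len(stack[0][0])
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for img, mask in zip(stack, cloud_masks):
                if not mask[i][j]:
                    out[i][j] = img[i][j]
                    break
    return out
```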
M: Misc/Other
There is also a lack of technical capacity in the Global South to effectively utilize satellite imagery.
Enabling 2D to 3D shape recovery and pose estimation of animals
Details (click to expand)
3D shape recovery and pose estimation refer to the reconstruction of the 3D shapes and poses of animals from 2D images. This information can provide non-invasive insights into animals’ health, age, or reproductive status in their natural environment, which are important for biodiversity monitoring.
ML-based computer vision techniques have been used to construct more accurate estimations of 3D animal shapes and poses.
However, there is a lack of open annotated datasets to train models.
More efforts going into the curation and release of such datasets could be pivotal towards unlocking this use case.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in collected data, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a problem intertwined with gaps in taxonomy. There is now increasing work on insect camera traps, but this field is still in its infancy and data remains limited.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Enabling non-intrusive electricity load monitoring
Details (click to expand)
Non-intrusive load monitoring (NILM) is critical for disaggregating building electricity consumption into individual appliance profiles, enabling targeted energy efficiency strategies, demand response, and better supply/demand matching to reduce carbon emissions and maintain grid stability.
AI techniques can analyze patterns in aggregate electricity data to identify individual appliance signatures without requiring separate meters for each device, providing cost-effective insights for both consumers and utilities.
The effectiveness of AI-based NILM is hindered by insufficient training data that represents diverse appliance types, usage patterns, and building characteristics across different regions, limiting model accuracy and generalizability in real-world settings.
Utilities, researchers, and manufacturers can collaborate to create standardized, privacy-preserving datasets through controlled data collection campaigns and by developing synthetic data generation techniques that capture the diversity of appliance signatures and usage patterns.
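A minimal event-based sketch of the disaggregation idea: detect step changes in the aggregate power signal, then match them against known appliance signatures. The appliance names, wattages, and thresholds below are hypothetical; real NILM models learn far richer temporal and spectral features from the kind of diverse training data discussed above.

```python
def detect_events(power, threshold=50.0):
    """Flag step changes in an aggregate power signal (watts) that exceed
    threshold; returns (index, delta) pairs as candidate appliance events."""
    events = []
    for t in range(1, len(power)):
        delta = power[t] - power[t - 1]
        if abs(delta) >= threshold:
            events.append((t, delta))
    return events

def match_appliance(delta, signatures, tol=20.0):
    """Match an event's step size against typical appliance wattages.

    signatures: dict of appliance name -> typical step size in watts.
    Returns the first matching appliance, or None.
    """
    for name, size in signatures.items():
        if abs(abs(delta) - size) <= tol:
            return name
    return None
```

Positive deltas correspond to an appliance switching on and negative deltas to it switching off; the main failure mode, which richer training data helps with, is distinguishing appliances with similar wattages.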
Pecan Street DataPort requires both non-academic and academic users to purchase access via licensing, which varies depending on the building data features requested. Data coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of the older historical buildings that may most need energy-efficient upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, the annual socio-demographic survey data may be too coarse to provide insight into the behavioral effects of household members on consumption profiles over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Usage rights vary depending on the agreed-upon licensing agreement.
S6: Sufficiency > Missing Components
The data does not track real-time occupancy of individuals in the household, which could provide insight into behavioral effects on energy consumption. Adding this data could enable improved consumption-based customer segmentation models, as patterns change with the time and day of the week. The data would also be amenable to consumer-in-the-loop energy management studies that consider comfort based on customers' habitual activity, location in the house, and number of occupants.
S3: Sufficiency > Granularity
Disaggregated data may provide greater granular context for customer segmentation studies than aggregate data alone. However, such segmentation studies ultimately depend on the number of household members who may be using appliances at a given time. Pecan Street data contains annual survey responses on household demographics and home features, which may be too coarse in granularity to track how customer segments change over time as members move in or out of a building. Jointly collecting occupancy data could address this granularity gap, but may limit volunteer engagement, as privacy concerns would need to be evaluated.
S2: Sufficiency > Coverage
Data coverage primarily focuses on Texas, with limited coverage in New York and California. Though there are efforts to include Puerto Rico, data collection hinges on volunteer participation. This could introduce self-selection bias, as households that participate are likely more interested in energy conservation than the general population. Furthermore, a majority of the dataset covers the Mueller community in Austin, a planned community developed after 1999 with modern building types. Enrolling homes from older built environments and from different climate regions, within the United States and globally, could provide greater insight into household appliance usage and generation patterns, which vary with climate as well as appliance age. Identifying high-consumption older appliances can assist in identifying upgrades.
O2: Obtainability > Accessibility
Data is downloadable as a static file or accessible via the DataPort API. Depending on the licensing agreement, a small dataset is available free of charge to academic users, with pricing for larger datasets. Commercial use requires paid access based on requested features, ranging from the standard to the unlimited customer tier and plan.
For accurate NILM studies, benchmark datasets must include not only consumption but also local power generation (e.g., from rooftop solar), as it affects the overall aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both a load and a generator, such as electric vehicles or stationary batteries, are typically not included. The majority of building types are single-family housing units, limiting the diversity of representation. Furthermore, most datasets are no longer maintained after their study closes.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Sub-metered data relies heavily on the sensor network installed to monitor the building. Depending on the technology used, some sensors require calibration or are prone to malfunctions and delays. Additionally, interference from other devices can be present in the aggregate building-level readings, as experienced by REFIT, and needs to be addressed manually to enhance the usability of the dataset. These issues vary depending on the sub-metered dataset used, requiring a clear understanding of the metadata and documentation specific to the testbed the study was built upon. Exploratory data analysis of the time series may help identify outliers resulting from sensor drift.
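The exploratory-analysis step suggested above can be as simple as a rolling z-score scan. This is a minimal sketch under the assumption that sensor glitches show up as large deviations from a trailing window; the function name and thresholds are illustrative, not part of any dataset's tooling:

```python
import numpy as np

def rolling_zscore_outliers(series, window=60, threshold=4.0):
    """Flag points that deviate strongly from a trailing rolling window,
    a simple first pass for spotting glitches in sub-metered data."""
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        win = series[i - window:i]
        mu, sigma = win.mean(), win.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flags[i] = True
    return flags

# Steady ~150 W fridge-like signal with one injected 5 kW glitch
signal = np.full(500, 150.0) + np.random.default_rng(0).normal(0, 5.0, 500)
signal[300] = 5000.0
outliers = rolling_zscore_outliers(signal)
```

Slow sensor drift would call for detrending (e.g., comparing long- and short-window means) rather than a point-wise z-score, but the same windowed approach applies.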
U1: Usability > Structure
When retrieving NILM data from a variety of sources, from pre-existing studies as well as custom data collection, the structure of the received data can vary. Testbed design, hardware, and the variables monitored depend on sensor availability, which ultimately influences schemas and data formats. Data structure may also differ based on the level of disaggregation, at the plug level or the individual appliance level. When building future testbeds for data collection, it may help to follow the standards set by APIs such as NILMTK, which has successfully integrated multiple datasets from different sources. Using the REDD dataset format as inspiration, the toolkit developers created a standard energy disaggregation data format (NILMTK-DF) with common metadata and structure, which requires manual dataset-specific preprocessing. When working with non-standardized data that requires aggregation, machine-learning-based data fusion strategies may help automate schema matching and data integration.
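For the non-standardized case, a first pass at schema matching can be done with fuzzy string matching before manual review. The canonical field names below are hypothetical stand-ins, loosely inspired by the idea of a common NILM format rather than NILMTK's actual schema:

```python
import difflib

# Hypothetical canonical schema for illustration only
CANONICAL_FIELDS = ["timestamp", "aggregate_power_w", "appliance", "power_w"]

def map_columns(source_columns, cutoff=0.5):
    """Suggest a mapping from dataset-specific column names to the
    canonical fields via fuzzy string matching. A manual review pass
    is still expected, as real converters are dataset-specific."""
    mapping = {}
    for col in source_columns:
        matches = difflib.get_close_matches(
            col.lower().replace(" ", "_"), CANONICAL_FIELDS, n=1, cutoff=cutoff)
        mapping[col] = matches[0] if matches else None
    return mapping

suggested = map_columns(["Timestamp", "Aggregate Power W", "Appliance Name"])
```

Unmapped columns (`None` entries) flag exactly the fields that need a human decision, which keeps the manual preprocessing effort focused.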
S6: Sufficiency > Missing Components
While sub-metered data provides a means of verifying non-intrusive load monitoring techniques, it does not capture the hidden human motivators driving appliance usage (such as comfort, utility cost, and daily activities) as well as other important factors contributing to the aggregate load seen at the building level meter. The key to improving these studies is to provide greater context to the sub-metered data by taking additional joint measurements such as rooftop solar power production, electric vehicle load, occupancy related information, and battery storage. Some dataset-specific missing data components are highlighted below.
None of the datasets mentioned include electric vehicle loads. REDD, AMPds2, COMBED, DEDDIAG, DRED, GREEND, iAWE, and UK-DALE do not include generation from rooftop solar. REFIT contains solar from three homes, but these were not the focus of the study and were treated as solar interference with the aggregate load. The UMass smart home dataset includes only one home with solar and wind generation, and that home has significantly larger square footage and a different build than the other two homes featured.
DRED provided occupancy information collected from wearable devices within the home, and ECO and IDEAL provided it through self-reporting and an infrared entryway sensor; the other studies did not capture occupancy.
Due to this lack of representation, the majority of datasets are not amenable to human-in-the-loop analysis of user behavior with respect to consumption patterns, response to feedback, and the effectiveness of load shifting in promoting energy-conserving behaviors.
While AMPds2 includes some utility data, most datasets do not incorporate billing or real-time pricing. This type of data would be beneficial, as it varies with time, season, region, and utility.
Battery storage was not taken into account in any of the building consumption datasets.
S2: Sufficiency > Coverage
Gaps in dataset coverage are specific to each sub-metered dataset. These gaps may be due to unaccounted loads, the level of disaggregation (e.g., circuit level, plug level, or individual appliance level), or limited appliance types. The diversity of building types is limited, as most studies take place in single-family residences. Some dataset-specific gaps are detailed below; these may be addressed by collecting new data on existing testbeds or by augmenting already collected data with synthetic information. Future data collection efforts should be mindful of avoiding the kinds of gaps associated with existing datasets.
In the AMPds2 data, some electricity, water, and natural gas readings were missing. Additionally, some un-metered household loads were not accounted for in the aggregate building-level readings, and dishwasher water consumption was not directly metered. REFIT did not monitor appliances that could not be accessed through wall plugs, such as electric ovens. Depending on the built environment and building type, larger loads may not be connectable to building-level meters. For example, in the GREEND dataset, electric boilers in Austria were connected to separate external meters, and in the UMass smart home dataset, gas furnaces, exhaust fans, and recirculator pump loads could not be monitored.
AMPds2, DEDDIAG, DRED, iAWE, REDD, REFIT, and the UMass smart home dataset all gather data in single-family homes, which may not represent the diversity of buildings in terms of age, location, construction, and household demographics. REFIT covers different single-family home types, such as detached, semi-detached, and mid-terrace homes ranging from 2 to 6 bedrooms and built between the 1850s and 2005. GREEND covers apartments in addition to single-family homes, but included only 9 households, while AMPds2, DRED, and iAWE each cover a single household. Additionally, datasets are specific to the location where the measurements were taken and are thus shaped by the environmental conditions of the region as well as the culture of the population. For example, REDD consists of data from 10 monitored homes, whose appliances may not be representative of those contributing to the overall load of the broader population outside of Boston.
COMBED contains complex load types that may rely on variable-speed drives as well as multi-state devices, which the other datasets do not contain. This may be due to the difference in building type, but could also reflect the lack of diversity in appliance representation.
ECO relied on smart plugs for disaggregated load consumption measurements, with appliance coverage varying between households. For all households, the total consumption was not equal to the sum of the consumption measured from the plugs alone, indicating a high proportion of non-attributed consumption.
R1: Reliability > Quality
In the AMPds2 data, the sum of the sub-metered consumption did not add up to the whole-house consumption due to rounding error in the meter measurements, highlighting the importance not only of sub-metered ground truth for NILM studies but also of the type of building-level meter used. Future data collection efforts may want to retrieve not only utility-side building meter data but also supplemental aggregate meter data to detect mismatches in measurements.
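A minimal consistency check of the kind described, comparing the summed sub-metered channels against the whole-house reading, might look like this (the function name and tolerance are illustrative assumptions):

```python
def submeter_mismatch(aggregate_w, channels_w, tol_frac=0.05):
    """Return time indices where the summed sub-metered channels disagree
    with the whole-house reading by more than a relative tolerance,
    a simple sanity check before using sub-metered data as ground truth."""
    mismatches = []
    for i, total in enumerate(aggregate_w):
        sub_sum = sum(ch[i] for ch in channels_w)
        if total > 0 and abs(total - sub_sum) / total > tol_frac:
            mismatches.append(i)
    return mismatches

# Toy readings (watts): the third interval has a large unattributed residual
aggregate = [1000.0, 1200.0, 1500.0]
channels = [[600.0, 700.0, 700.0], [390.0, 480.0, 300.0]]
bad = submeter_mismatch(aggregate, channels)
```

Flagged intervals can then be attributed to rounding, un-metered loads, or meter faults before the data is used for model training.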
Datasets or studies that rely on self-reporting by customers may introduce participant bias, as the frequency with which households update voluntary information can vary. For example, if the number of household members, occupancy schedule, and addition of new plug loads are self-reported, update frequency depends on volunteer engagement. Additionally, volunteers who participate in NILM studies may have a particular propensity for energy-efficient actions and may not be representative of the general population. For example, some participants in UK-DALE were Imperial College graduate students motivated to participate to advance their own projects. To ensure that recorded electricity usage represents the general population, future case studies can recruit volunteer communities with diverse socioeconomic backgrounds and locations.
Enhancing digital reconstructions of the environment
Details (click to expand)
Digital reconstruction of the environment using remote sensing data is crucial for understanding habitat conditions and their impacts on wildlife, enabling more effective conservation strategies in the face of climate change.
ML enhances this process by efficiently analyzing large volumes of data from multiple sources, producing more detailed and accurate environmental reconstructions.
A key data gap is the limited availability of high-resolution imagery, with most high-quality data being commercial and not freely accessible, particularly affecting studies that require detailed environmental monitoring.
Fostering a data-sharing culture through incentives for collectors, creating standardized annotation pipelines, and making commercial high-resolution satellite imagery more accessible would significantly advance ML-enabled environmental monitoring for biodiversity conservation.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets encompassing a wide array of species remains limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g., bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed, but most are not freely open to the public.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
The resolution of publicly open satellite images is not sufficient for some environmental reconstruction studies.
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough. High-resolution images are usually commercial and not freely available.
Enhancing energy policy and market analysis
Details (click to expand)
Energy transition policies require comprehensive data on generation, emissions, and financial performance across power systems, but fragmented government datasets make evidence-based policymaking challenging.
AI and data fusion techniques can integrate scattered regulatory data from utilities and energy companies to create analysis-ready datasets that inform carbon pricing, renewable incentives, and grid modernization policies.
Inconsistent data formats, missing identifiers, and poor documentation across government agencies create significant barriers for automated data processing and analysis.
Standardized reporting formats, improved documentation, and centralized data platforms could enable more effective AI-driven policy analysis and accelerate evidence-based energy transitions.
Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Source data format changes (e.g., FERC’s shift from PDF to XBRL) and semi-structured formats require extensive preprocessing. PDF-based data extraction faces OCR challenges due to scan quality and inconsistent formatting. Standardized reporting formats and machine-readable data standards across agencies could reduce preprocessing burden.
U4: Usability > Documentation
Documentation updates lag behind source data changes, requiring continuous monitoring by maintainers. Proactive documentation standards and change notification systems from data providers could improve maintenance efficiency.
U3: Usability > Usage Rights
While PUDL uses Creative Commons licensing, some utility operator data has unclear public use rights despite being provided to regulatory agencies. Explicit public use licensing statements from government agencies could clarify usage permissions.
U2: Usability > Aggregation
Varying schema and naming conventions across agencies complicate data joining. Probabilistic entity matching helps but requires manual verification. Universal relational database standards and common identifiers across agencies could streamline aggregation.
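As a sketch of the entity-matching step, the toy function below pairs plant names across two reporting sources by string similarity. PUDL's actual probabilistic record linkage is considerably more sophisticated; the names and cutoff here are purely illustrative:

```python
import difflib

def match_entities(left_names, right_names, cutoff=0.8):
    """Pair entity names (e.g., plant names reported to different
    agencies) by string similarity, keeping pairs above the cutoff.
    Results still need the manual verification the text describes."""
    pairs = []
    for left in left_names:
        best, best_score = None, 0.0
        for right in right_names:
            score = difflib.SequenceMatcher(
                None, left.lower(), right.lower()).ratio()
            if score > best_score:
                best, best_score = right, score
        if best_score >= cutoff:
            pairs.append((left, best, round(best_score, 2)))
    return pairs

candidates = match_entities(["Comanche Peak Unit 1"],
                            ["COMANCHE PEAK 1", "Four Corners"])
```

Scores near the cutoff are exactly the cases a human reviewer should inspect; common identifiers across agencies would remove the need for this matching entirely.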
U1: Usability > Structure
Source data structures vary significantly between reporting years and agencies, with inconsistent plant identification systems. Standardized data schemas and versioning practices could improve structural consistency.
S6: Sufficiency > Missing Components
Weather model data and transmission/congestion information from grid operators would enhance analysis capabilities. Integration partnerships with weather services and grid operators could expand dataset utility.
S3: Sufficiency > Granularity
Temporal resolution varies from hourly to annual across sources, requiring interpolation techniques. More frequent and standardized reporting intervals could improve data granularity.
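Aligning mixed temporal resolutions onto a common axis is often done with simple linear interpolation, as sketched below. Note that interpolation adds no information beyond the coarse series; more frequent reporting intervals remain the real fix. The function and values are illustrative:

```python
import numpy as np

def upsample_linear(values, factor):
    """Linearly interpolate a coarse series onto a grid `factor` times
    finer, e.g. hourly data onto 15-minute steps."""
    coarse_x = np.arange(len(values), dtype=float)
    fine_x = np.linspace(0.0, len(values) - 1.0, (len(values) - 1) * factor + 1)
    return np.interp(fine_x, coarse_x, np.asarray(values, dtype=float))

# Hourly generation (MW) aligned onto a 15-minute grid (factor 4)
hourly = [100.0, 120.0, 90.0]
quarter_hourly = upsample_linear(hourly, factor=4)
```

For quantities reported as totals rather than instantaneous values (e.g., annual generation), the coarse total should instead be apportioned across fine steps so that sums are preserved.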
S2: Sufficiency > Coverage
Dataset coverage is limited to US regulatory agencies and organizations. International data partnerships could expand the geographic scope for comparative analysis.
Enhancing estimations of methane emissions from rice paddies
Details (click to expand)
Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially how they vary with different agricultural practices, is crucial for addressing climate change.
ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.
Currently, there is a lack of direct observation of methane emissions from rice paddies that could be used to train ML models.
Real-world data collection is needed to unlock this use case.
There is a lack of direct observation of methane emissions from rice paddies.
Data Gap Type
Data Gap Details
W: Wish
Direct measurement of methane emissions is often expensive and labor-intensive. But this data is essential as it provides the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Enhancing marine wildlife detection and species classification
Details (click to expand)
Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These tasks involve identifying and categorizing different marine species.
ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.
Current bottlenecks related to data availability include the lack of sufficient labeled data and the lack of open data. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Although a lot of ocean data is collected, there are massive gaps in coverage, with heavy biases toward coastal regions. Collecting data from the deep ocean is technologically challenging, and financial incentives are lacking. The high seas fall outside national jurisdictions, so data collection often occurs only through mining companies, military operations, or ad hoc research expeditions. The absence of marine protected areas on the high seas and the migratory nature of species like phytoplankton further complicate data collection.
Open-source databases containing labeled data and label editors such as FathomNet can increase the amount of relevant data for training ML models. Initiatives like the Ocean Biodiversity Information System (OBIS) and the Integrated Ocean Observing System (IOOS) contribute to data availability more broadly. Data collection efforts could strategically target places where biodiversity is high but currently available data is sparse. Financial tools or regulations could incentivize data collection.
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Enhancing power grid-vegetation management for wildfire risk mitigation
Details (click to expand)
Vegetation encroachment near high-voltage transmission lines can lead to outages and pose major fire risks, compromising the safety and reliability of the power grid and potentially igniting dangerous wildfires that release stored carbon and endanger wildlife.
Machine learning, especially computer vision applied to remote sensing imagery and historic management records, can accelerate vegetation management by identifying overgrowth areas and tracking dynamic seasonal vegetation growth near grid infrastructure.
Key data gaps include limited access to proprietary utility data, sparse LiDAR captures leading to incomplete scans, insufficient temporal and spatial coverage, and preprocessing requirements for imagery from multiple sensor platforms.
Solutions include establishing partnerships with utilities for data sharing, coordinating multiple robot/UAV inspection trips for improved coverage, developing preprocessing pipelines for diverse sensor data, and implementing regular monitoring schedules to capture seasonal vegetation changes.
UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Once collected, data is private as RoWs represent critical energy infrastructure. Private partnerships may allow for extended usage rights within a predefined scope.
S4: Sufficiency > Timeliness
Measurements should be taken at multiple time periods to relate transmission line characteristics to both vegetation growth and line sag caused by overvoltage conditions.
S2: Sufficiency > Coverage
Coverage can vary depending on the RoW examined. Often, multiple datasets containing UAV image data from multiple transmission RoWs are necessary to increase the number of image examples available.
O1: Obtainability > Findability
Must be involved in an active study with a partnering utility or transmission owner to get access to pre-existing drone data or to get permission to collect drone data.
Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration, and it may be limited by camera resolution for detecting subtle degradation patterns.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data needs to be aggregated from multiple cable inspection robots for improved generalizability of detection models. Multiple robot trips over areas of interest can help identify target locations needing further inspection.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks.
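A minimal example of the thresholding step might look like the following; real pipelines would typically use learned segmentation models rather than a global intensity threshold, and the function, percentile, and synthetic frame are purely illustrative:

```python
import numpy as np

def threshold_segment(image, percentile=95):
    """Crude foreground mask: keep pixels above an intensity percentile,
    a first preprocessing pass before a proper segmentation model."""
    thr = np.percentile(image, percentile)
    return image >= thr

# 8x8 synthetic frame with a small bright patch standing in for an obstruction
img = np.zeros((8, 8))
img[2:4, 2:4] = 255.0
mask = threshold_segment(img)
```

Preprocessing like artifact removal and illumination correction matters precisely because a fixed threshold is so sensitive to ambient variation across robot trips.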
S2: Sufficiency > Coverage
Data must be supplemented with position orientation system information for accurate robot localization, potentially requiring preliminary inspections followed by detailed autonomous inspection of targets.
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized. Data from multiple multispectral imagers, drones, cable-mounted sensors, and additional robots may be employed to improve the level of detail needed for specific obstructions.
Enhancing wind power grid integration and stability
Details (click to expand)
The integration of low-inertia distributed energy resources like wind power into the grid creates critical stability and reliability challenges, particularly for maintaining system frequency at nominal levels to prevent damage and blackouts.
AI and machine learning can enhance wind power’s contribution to grid stability by optimizing synthetic inertial and primary frequency response capabilities through advanced modeling and control strategies.
Key data gaps include limited accessibility to simulation tools, insufficient temporal granularity in models that operate on hourly rather than sub-hourly scales, and reliability concerns due to the lack of real-world validation data for model outputs.
Grid operators and research institutions can collaborate to improve model accessibility, increase temporal resolution to capture sub-hourly dynamics, and validate simulations with operational data, enabling more effective AI-driven solutions for grid stability as renewable penetration increases.
Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the FESTIV model requires permission, which can be requested by contacting the group manager.
R1: Reliability > Quality
The model may not account for all real-time system dynamics and complexities, requiring verification from operational data. Scenario-based forecasting may not capture real-world uncertainties, and operating reserve values may be inaccurate without practical validation.
S3: Sufficiency > Granularity
FESTIV operates on hourly unit commitment time resolution, which cannot capture reliability impacts occurring on sub-hourly scales including frequency response, voltage magnitudes, and reactive power flows that affect system stability.
Facilitating grid reliability events analysis
Details (click to expand)
Due to rapid fluctuations in power generation, renewables introduce variability into the grid. These signals are capable of triggering safety monitoring systems related to grid stability. Power grid control centers receive multiple streams of data from these systems (e.g. alarms, sensors, and field reports) that are semi-structured and arriving at a high volume. For operators, these alarm triggers and associated data can be overwhelming to rationalize, reduce, and contextualize to diagnose grid conditions.
ML can assist in interpreting these data to better understand the sequence of events leading up to an incident as well as to identify and detect the causes behind system disturbances affecting grid reliability.
Access to grid reliability data remains limited, the amount of preprocessing needed constitutes a hurdle, and not all alarm triggers have been validated, which can introduce noise.
More open data releases and open community work regarding data preprocessing can help further advance this use case.
Access to EPRI grid alarm data is currently limited within EPRI. Data gaps with respect to usability result from redundancies in grid alarm codes, requiring significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to distinguish fault from non-fault events.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Operational alarm data volume is large, with measurements arriving every millisecond. The result is high-volume data that is tabular in nature but also unstructured with respect to the text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains locations and grid asset information, spatio-temporal analysis can be performed with respect to a single sensor and the conditions under which that sensor is operating. Indexing and mining time series data can therefore facilitate faster search over alarm data leading up to a fault event, and natural language processing and text mining techniques can likewise facilitate search over alarm text and details.
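The text-search side can be sketched with a simple inverted index over alarm descriptions; the event texts below are invented for illustration, not taken from any real alarm system.

```python
import re
from collections import defaultdict

def build_alarm_index(events):
    """Build an inverted index: token -> set of event ids containing it.

    events: iterable of (event_id, free-text description) pairs.
    """
    index = defaultdict(set)
    for eid, text in events:
        for tok in re.findall(r"[a-z0-9]+", text.lower()):
            index[tok].add(eid)
    return index

def search(index, query):
    """Return event ids whose description contains every query token."""
    toks = re.findall(r"[a-z0-9]+", query.lower())
    if not toks:
        return set()
    hits = set(index.get(toks[0], set()))
    for t in toks[1:]:
        hits &= index.get(t, set())
    return hits

events = [
    (1, "Line 12 overcurrent trip, breaker opened"),
    (2, "Transformer temperature high, fan started"),
    (3, "Line 12 breaker reclose after transient"),
]
idx = build_alarm_index(events)
# search(idx, "line breaker") -> {1, 3}
```

A production system would add timestamps and spatial keys to the index so that operators can scope searches to a sensor and time window around a fault.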
U5: Usability > Pre-processing
In addition to challenges with respect to the decoding of remote signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of text detail provided. Typically, the details cover information with respect to the grid asset and its action. For example, a text description from a line monitoring device may describe the power, temperature, and the action taken in response to the grid alarm trigger event. Often, in real-world systems, the majority of grid alarm trigger events are short circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To combat these issues, data pre-processing becomes necessary. For remote signal identification data, this includes parsing and hashing through text codes, assessing code components for redundancies, and building an associated reduced dictionary of alarm codes. For textual description fields and post-fault field reports, the use of natural language processing techniques to extract key information can provide more consistency between sensor data. Additionally, techniques like diverse sampling can account for the class imbalance with respect to the associated fault that can trigger the alarm.
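The code-reduction step can be sketched as follows; the alarm code convention here (separator and case variants, trailing numeric suffixes) is hypothetical and not EPRI's actual scheme.

```python
import re
from collections import defaultdict

def reduce_alarm_codes(raw_codes):
    """Map raw, vendor-specific alarm codes to a reduced canonical dictionary.

    Hypothetical convention: 'XFMR_OVERTEMP_HI', 'xfmr-overtemp-hi', and
    'XFMR.OVERTEMP.HI_2' all denote the same event, differing only in case,
    separators, and a numeric suffix.
    """
    canonical = {}
    groups = defaultdict(list)
    for code in raw_codes:
        # Normalize case and separators, then strip trailing numeric suffixes
        key = re.sub(r"[^A-Z0-9]+", "_", code.upper()).strip("_")
        key = re.sub(r"_\d+$", "", key)
        canonical[code] = key
        groups[key].append(code)
    return canonical, dict(groups)

raw = ["XFMR_OVERTEMP_HI", "xfmr-overtemp-hi", "XFMR.OVERTEMP.HI_2", "LINE_FAULT"]
mapping, groups = reduce_alarm_codes(raw)
```

Real alarm dictionaries would need domain review before merging codes, since trailing digits sometimes carry meaning (e.g., distinct lines) rather than being redundant.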
U4: Usability > Documentation
Remote signaling identification information from monitoring sensors and devices encodes data about the alarm trigger event in the context of fault priority. Based on the asset, line, or sensor, this identification code can vary depending on the naming conventions used. Documentation on remote signal IDs, associated with a dictionary of finite alarm code types, can facilitate pre-processing of alarm data and assessment of the diversity of fault events occurring in real-time systems (as different alarm trigger codes may correspond to redundant events similar in nature).
U3: Usability > Usage Rights
Usage rights are currently constrained to those working within EPRI at this time.
U2: Usability > Aggregation
Reports on location, asset, and time can result in false alarm triggers requiring operators to send field workers to investigate, fix, and recalibrate field sensors. The data with respect to field assessments can be incorporated into the original data to provide greater context resulting in compilation of multimodal datasets which can enhance alarm data understanding.
U1: Usability > Structure
Grid alarm codes may be non-unique across different lines and grid assets: two different codes can represent equivalent information due to differences in naming conventions, requiring significant pre-processing and analysis to identify unique labels from over 2000 code words. Additional labels expressing alarm priority (for example, a high-priority alarm type indicating events such as fire, gas, or lightning) are also encoded into the grid alarm trigger event code. Creating a standard structure for operational text data, such as those already used in operational systems by companies like General Electric or Siemens, can avoid inconsistencies in the data.
R1: Reliability > Quality
Alarm trigger events, and the corresponding actions taken in response to them, require post-hoc assessment by field workers for verification, especially in cases of actual or perceived faults.
O2: Obtainability > Accessibility
Data access is limited within EPRI due to restrictions with respect to data provided by utilities. Anonymization and aggregation of data to a benchmark or toy dataset by EPRI to the wider community can be a means of circumventing the security issues at the cost of operational context.
Facilitating disaster risk assessments
Details (click to expand)
As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. This involves better mapping of which populations and assets are exposed to given risks.
ML can be used to facilitate disaster risk assessments by helping analyze satellite imagery and geographic data in order to pinpoint vulnerable areas and produce more detailed risk maps. In this way, ML can overcome some limitations of traditional ground surveys, which are time- and cost-intensive.
There is a general lack of data from the Global South where, for many regions, collection capabilities are lower while climate impacts are forecasted to be disproportionately high. Existing data are typically incomplete, even in most high-income countries, limiting the depth of potential analyses and generating uncertainties in assessments, for example, about monetary losses due to disasters.
Closing these data gaps involves inter alia deploying ML techniques that perform well in the Global South, collecting high-quality data involving local knowledge in a variety of contexts, and making the best remote sensing and cadaster data available to these efforts.
These datasets are mainly available in rich countries from Europe, North America, and Asia, leaving large parts of the world with timely challenges involving their building stock without appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data. Research efforts in particular in Europe, including EUBUCCO (eubucco.com) or the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), are addressing several of the existing data gaps.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Certain datasets require searching and navigating websites in a foreign language.
O2: Obtainability > Accessibility
Some datasets are not publicly available and require either payment or governmental authorization. This situation is changing in Europe via the high-value dataset regulation in the European Union, which mandates member states to release their building stock data with permissive licenses (https://data.europa.eu/en/news-events/news/unlocking-potential-high-value-datasets-impact-hvd-implementing-regulation).
U1: Usability > Structure
Datasets are released under a multitude of formats. Despite the existence of standards such as CityGML, one typically needs a particular pipeline for processing every new dataset.
U2: Usability > Aggregation
Datasets are typically released by local authorities and require aggregation. Some efforts, in particular in Europe, including EUBUCCO (eubucco.com) or the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), have made this process easier, but do not yet enable seamless updates.
U3: Usability > Usage Rights
Most datasets use attribution-based licenses, but some datasets use custom licenses, unclear licenses, or restrictive licenses.
U4: Usability > Documentation
Most datasets do not provide appropriate documentation to fully understand how the dataset was created.
U5: Usability > Pre-processing
Certain fields may contain local codes that need to be translated and understood. Numerical values may contain encodings for NAs, such as -1 or 1000, that need to be cleaned.
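A minimal sketch of this cleaning step, assuming records arrive as key-value rows and that -1 and 1000 are the sentinel encodings in question; the field names (`height`, `floors`) are hypothetical.

```python
def clean_sentinels(records, sentinel_values=(-1, 1000), fields=("height", "floors")):
    """Replace sentinel encodings of missing values with None.

    records: list of dicts, one per building. Only the named fields are
    checked, so legitimate values in other fields are never touched.
    """
    cleaned = []
    for rec in records:
        rec = dict(rec)  # copy, so the caller's data is not mutated
        for f in fields:
            if rec.get(f) in sentinel_values:
                rec[f] = None
        cleaned.append(rec)
    return cleaned

rows = [{"id": "b1", "height": 12.5, "floors": 4},
        {"id": "b2", "height": -1, "floors": 1000}]
rows = clean_sentinels(rows)
```

The safe direction is always sentinel-to-missing; guessing a replacement value (e.g., a regional mean height) should be a separate, documented imputation step.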
U6: Usability > Large Volume
Precise 3D datasets can be voluminous for a city. Country-level datasets also tend to require significant computing resources.
R1: Reliability > Quality
The height estimation from LiDAR data may contain large errors, e.g., due to surrounding objects such as trees.
S2: Sufficiency > Coverage
There are very few datasets outside of rich countries from Europe, North America, and Asia. Precise 3D models and attribute-rich datasets are available for even fewer countries.
S4: Sufficiency > Timeliness
Practices vary widely, from multiple updates per year to a one-off release that may be more than ten years old. Aerial surveys with LiDAR are expensive and are rarely done more than once every ten years.
These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Some datasets have not been published as scientific datasets and lack appropriate documentation about the methodology. Users should be aware of uncertainties in case of insufficient documentation of potential errors.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent the accurate identification of buildings.
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
S3: Sufficiency > Granularity
Raster datasets provide a noisy view of the building stock.
S4: Sufficiency > Timeliness
The data depends on the availability of satellite surveys. Some datasets may mix images from different years. The surveys may be more than 5 years old, mischaracterizing fast-growing areas. In case of disasters, the imagery pre-disaster may not be representative of the current building stock.
S6: Sufficiency > Missing Components
More attributes inferred with high confidence would unlock new use cases.
Very high-resolution reference data is currently not freely open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Surface elevation data, defined by a digital elevation model (DEM), is one of the most essential types of reference data, and high-resolution elevation data is of great value for disaster risk assessment, particularly in the Global South.
Open DEM data with global coverage now reaches a resolution of 30 m, but this is still insufficient for many disaster risk assessments. Higher-resolution datasets exist, but they either have limited spatial coverage or are expensive commercial products.
The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Climate hazard data (e.g., floods, tropical cyclones, droughts) is often too coarse for effective physical risk assessments, which focus on evaluating damage to infrastructure such as buildings and power grids. While exposure data, including information on buildings and power grids, is available at resolutions ranging from 25 meters to 250 meters, climate hazard projections, especially those extending beyond a year, are typically at resolutions of 25 kilometers or more.
To provide meaningful risk assessments, more granular data is required. This necessitates downscaling efforts, both dynamical and statistical, to refine the resolution of climate hazard data. Machine learning (ML) can play a valuable role in these downscaling processes. Additionally, the downscaled data should be made publicly available, and a dedicated portal should be established to facilitate access and sharing of this refined information.
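As a minimal sketch of the statistical side, bilinear interpolation refines a coarse hazard grid onto a finer one; a real downscaling pipeline would add physical predictors (elevation, land cover) and a trained correction model on top of this purely geometric step.

```python
import numpy as np

def bilinear_downscale(coarse: np.ndarray, factor: int) -> np.ndarray:
    """Interpolate a coarse hazard field onto a grid `factor`x finer per axis."""
    ny, nx = coarse.shape
    yi = np.linspace(0, ny - 1, ny * factor)
    xi = np.linspace(0, nx - 1, nx * factor)
    # Interpolate along rows first, then along columns
    tmp = np.empty((ny, nx * factor))
    for r in range(ny):
        tmp[r] = np.interp(xi, np.arange(nx), coarse[r])
    fine = np.empty((ny * factor, nx * factor))
    for c in range(nx * factor):
        fine[:, c] = np.interp(yi, np.arange(ny), tmp[:, c])
    return fine

coarse = np.array([[0.0, 1.0], [2.0, 3.0]])  # toy hazard field on a coarse grid
fine = bilinear_downscale(coarse, factor=4)
```

Interpolation alone cannot add information below the original resolution; that is precisely where ML-based statistical downscaling earns its keep.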
R1: Reliability > Quality
Projecting future climate hazards is crucial for assessing long-term risks. Climate simulations from CMIP models are currently our primary source for future climate projections. However, these simulations come with significant uncertainties, due to uncertainties in both the models and the emission scenarios. To improve their utility for disaster risk assessment and other applications, increased funding and effort are needed to advance climate model development for greater accuracy. Additionally, machine learning methods can help mitigate some of these uncertainties by bias-correcting the simulations.
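Bias correction via empirical quantile mapping can be sketched as follows; the data below is synthetic, whereas a real application would use CMIP model output and station or reanalysis observations over a common reference period.

```python
import numpy as np

def quantile_map(model: np.ndarray, obs: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Empirical quantile mapping: adjust `target` model output so its
    distribution matches observations over a shared reference period."""
    q = np.linspace(0, 1, 101)
    model_q = np.quantile(model, q)
    obs_q = np.quantile(obs, q)
    # For each target value, find its quantile in the model distribution,
    # then read off the observed value at that same quantile.
    ranks = np.interp(target, model_q, q)
    return np.interp(ranks, q, obs_q)

rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 2.0, 5000)   # synthetic "observed" daily precipitation
model = obs * 1.5 + 1.0           # synthetic model output with a wet bias
corrected = quantile_map(model, obs, model)
```

Quantile mapping corrects the marginal distribution only; it does not repair errors in temporal sequencing or spatial structure, which is one reason model development remains necessary alongside post-processing.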
S6: Sufficiency > Missing Components
Seasonal climate hazard forecasts are crucial for disaster risk assessment, management, and preparation. However, high-resolution data at this scale is often lacking for many hazards. This challenge is likely due to the difficulty in generating accurate seasonal weather forecasts. ML has the potential to address this gap by improving forecast accuracy and granularity.
The quality of OpenStreetMap is highly variable in terms of coverage of geometries (e.g., buildings) and attributes; roads are generally better mapped than buildings. OpenStreetMap's very permissive data model enables users to provide a variety of information, but it is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
U4: Usability > Documentation
The origin of attributes is often unknown, creating uncertainty about values.
U5: Usability > Pre-processing
The flexible data model lacks type enforcement, requiring additional processing for analysis.
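A sketch of such type coercion for two commonly used OSM tags; the handled variants (a "12 m" height string, semicolon-separated level values) are illustrative, not an exhaustive treatment of real-world tag values.

```python
def coerce_tags(tags):
    """Coerce free-form OSM tag strings into typed fields, or None on failure."""
    out = {}
    h = tags.get("height")
    if h is not None:
        try:
            # Accept plain numbers and metre-suffixed strings like "12 m"
            out["height_m"] = float(h.lower().replace("m", "").strip())
        except ValueError:
            out["height_m"] = None
    lv = tags.get("building:levels")
    if lv is not None:
        try:
            # Semicolon-separated multi-values: keep the maximum level count
            out["levels"] = max(int(float(x)) for x in lv.split(";"))
        except ValueError:
            out["levels"] = None
    return out

parsed = coerce_tags({"height": "12 m", "building:levels": "3;4"})
# -> {"height_m": 12.0, "levels": 4}
```

Flagging unparseable values as None, rather than dropping the feature, preserves the geometry for analyses that do not need the attribute.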
R1: Reliability > Quality
Data quality ranges from excellent (sometimes surpassing official sources) to very low (including mapping vandalism).
S2: Sufficiency > Coverage
Street coverage is generally good, while building coverage varies widely.
S4: Sufficiency > Timeliness
Update frequency varies from multiple times per year to decades-old data, with disaster areas often updated quickly by active communities.
S6: Sufficiency > Missing Components
Most attributes remain incomplete, with completeness levels below 10%.
Accessibility and reliability are the most significant challenges with exposure data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Country-specific exposure data varies widely in availability, with some existing only as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
Population datasets show significant discrepancies, requiring validation before confident use. Some geospatial socioeconomic data from sources like UNEP are outdated or incomplete.
S3: Sufficiency > Granularity
Open global data, such as World Bank or US CIA GDP data, often lacks sufficient resolution and completeness for hazard risk assessment.
Facilitating fault detection in low voltage distribution grids
Details (click to expand)
The low-voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources and dynamic loads (such as electric vehicles), low-voltage distribution systems are susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them.
Machine learning methods can recognize patterns to automate fault diagnoses agnostic to specific line thresholds and topologies. If integrated into advanced monitoring systems, detecting and localizing faults can accelerate adaptive protection and network reconfiguration efforts to ensure reliability and stability.
Data gaps for this use case include lack of coverage (spatial and temporal), noise in the data and high data volume.
New data collection and further analyses of existing data to better understand its pitfalls have the potential to help mitigate the existing gaps for this use case.
For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
For µPMU data to be utilized for fault localization, the distribution circuit model must be provided by the partnering utility or DSO. Typically, these models lack notation of phase identification and impedance values, often providing only rough approximations, which can ultimately reduce the accuracy of localization as well as of the time series contextualization of a fault. Decreased localization accuracy can in turn affect downstream control mechanisms meant to ensure operational reliability.
U5: Usability > Pre-processing
µPMU data is sensitive to noise especially from geomagnetic storms which can induce electric currents in the atmosphere and impact measurement accuracy. Data can also be compromised by errors introduced by current and potential transformers. One way to mitigate this error is to monitor and re-calibrate transformers or deploy redundant µPMUs to verify measurements.
Depending on whether additional data from other sensors or field reports is being used to classify µPMU time series data, creation of a joint sensor dataset may improve quality based on the overall sampling rate and format of the additional non-µPMU data.
U6: Usability > Large Volume
Because of high sampling rates and continuous capture, the data volume from each individual µPMU can be challenging to manage and analyze. Coupled with the number of µPMUs required to monitor a portion of the distribution network, the amount of data can easily exceed terabytes. Automated indexing and mining of time series by transient characteristics can facilitate domain specialists' verification efforts.
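One simple way to index a continuous stream by transient characteristics is to flag outlier sample-to-sample jumps; the sketch below runs on synthetic data and is not a production µPMU pipeline.

```python
import numpy as np

def find_transients(signal: np.ndarray, z: float = 8.0) -> np.ndarray:
    """Flag sample indices whose first difference is an extreme outlier.

    Sample-to-sample steps of a smooth waveform are small; a transient shows
    up as a large jump. The flagged positions can serve as a lightweight index
    over a continuous stream instead of retaining the full recording.
    """
    d = np.diff(signal)
    # Robust standard-deviation estimate via the median absolute deviation
    sigma = np.median(np.abs(d - np.median(d))) / 0.6745
    return np.where(np.abs(d) > z * sigma)[0] + 1

rng = np.random.default_rng(2)
x = np.sin(np.linspace(0, 40 * np.pi, 4000)) + rng.normal(0, 0.01, 4000)
x[1234] += 2.0  # injected transient
hits = find_transients(x)
```

Using a robust (median-based) noise estimate keeps the threshold stable even when the stream already contains a few large transients.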
R1: Reliability > Quality
Since µPMU data is continuously captured, time series data leading up to or even identifying a fault or potential fault requires verification from other data sources.
Digital Fault Recorders (DFRs) capture high resolution event driven data such as disturbances due to faults, switching and transients. They are able to detect rapid events like lightning strikes and breaker trips while also recording the current and voltage magnitude with respect to time. Additionally, system dynamics over a longer period following a disturbance can also be captured. When used in conjunction with µPMU data, DFR data can assist in verifying significant transients found in the µPMU data which can facilitate improved analysis of both signals leading up to and after an event from the perspective of distribution-side state.
S2: Sufficiency > Coverage
Currently, µPMU installations in existing distribution grids carry significant financial costs, so most deployments have been pilot projects with utilities. Based on North American Synchrophasor Initiative (NASPI) reports, pilot studies include the Flexgrid testing facility at Lawrence Berkeley National Laboratory (LBNL), the Philadelphia Navy Yard microgrid (2016-2017), the micro-synchrophasors for distribution systems plus-up project (2016-2018), resilient electricity grids in the Philippines (2016), the GMLC 1.2.5 sensing and measurement strategy (2016), and the bi-national laboratory for the intelligent management of energy sustainability and technology education in Mexico City (2017-2018).
Coverage is also limited by acceptance of this technology, due to a pre-existing reliance on SCADA systems that measure grid conditions on a 15-minute cadence. As transients become more common, especially on the low-voltage distribution grid, a transition to higher-resolution monitoring will become necessary. Multi-objective evaluation of the value proposition of further µPMU sensor monitoring networks can provide utilities and DSOs with a framework for assessing the economic, environmental, and operational benefits of pursuing larger-scale studies.
S4: Sufficiency > Timeliness
µPMU data can suffer from multiple latencies within the grid's monitoring system, which may be unable to keep up with the high sampling rate of the continuous measurements that µPMUs generate. Latencies arise as signals are recorded, processed, sent, and received, and depend on the communication medium used, cable distance, amount of processing, and computational delay. More specifically, these latencies are measurement-, transmission-, channel-, receiver-, and algorithm-related. Identifying characteristics that precede fault events, with enough lead time to overcome potential latencies, through machine learning or other techniques can be of benefit.
Facilitating forest restoration monitoring
Details (click to expand)
Efforts are being made to restore ecosystems like forests and mangroves.
ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.
A significant data gap is the lack of standardized protocols to guide data collection for restoration projects, making it difficult to consistently assess biodiversity outcomes using ML across different restoration initiatives.
Developing standardized data collection protocols, fostering a culture of data sharing, and implementing incentives for data collectors would enable more effective ML applications, leading to better assessment of restoration successes and failures on a global scale.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
Facilitating the detection of climate-induced ecosystem changes
Climate change is causing significant alterations in ecosystems worldwide, threatening biodiversity and ecosystem services that are critical for both nature and human well-being.
Machine learning can analyze complex ecological data from multiple sources to detect climate change impacts, identify vulnerable regions, and inform targeted conservation efforts.
Key data gaps include insufficient high-resolution climate and biodiversity data, restricted access to ground survey data, and limited institutional capacity to process collected data efficiently.
Addressing these gaps requires establishing decentralized monitoring networks, improving data accessibility through legislative reforms, and developing sustainable funding models for long-term ecosystem monitoring initiatives.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also do not cover a diverse spectrum of species, a gap intertwined with insufficiencies in taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
- Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
- Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
- Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
- Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
- Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
- Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge that applies to almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and other variables important for biodiversity patterns. This is because observation systems are not dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, a task that cannot be accomplished by a single country.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Hybrid ML-physics climate models for enhanced simulations
Physics-based climate models incorporate numerous complex components that are computationally intensive, which limits the spatial resolution achievable in climate simulations.
ML models can emulate these physical processes, providing a more efficient alternative to traditional methods, enabling faster simulations and enhanced model performance.
The most significant data gaps are the enormous volume of climate data, which creates challenges for storage, transfer, and processing, and insufficient granularity in existing datasets to resolve fine-scale physical processes like turbulence.
Developing improved computational infrastructure for handling large datasets and creating ultra-high-resolution benchmark simulations would significantly enhance hybrid climate modeling capabilities.
ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
While ERA5 is widely used due to its good structure and global coverage, users face significant challenges with downloading times that can take days to months, and the sheer data volume presents processing difficulties for many users.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Massive storage and processing requirements pose a significant barrier. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
Current high-resolution simulations cannot resolve many physical processes like turbulence. Extremely high-resolution simulations (sub-kilometer or tens of meters) are needed to serve as ground truth for training ML models as they provide a more realistic representation of atmospheric processes. Creating and sharing benchmark datasets based on these simulations would facilitate model development and validation.
While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasting and climate modeling because of its high resolution and ready-for-analysis characteristics. But large volumes of observations, e.g., data from radiosondes, balloons, and weather stations, are largely under-utilized. It would be valuable to create a dataset that is as well-structured as ERA5 but built from more observations.
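As an illustration of one processing step such a dataset would involve, the sketch below grids scattered station observations onto a regular lon/lat grid using inverse-distance weighting. This is a deliberately minimal stand-in: real reanalyses like ERA5 rely on full data assimilation, and the station locations and temperature values here are hypothetical.

```python
import numpy as np

def idw_grid(lons, lats, values, grid_lon, grid_lat, power=2.0):
    """Inverse-distance-weighted interpolation of scattered station
    observations onto a regular lon/lat grid (illustrative only --
    real reanalyses use full data assimilation)."""
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    out = np.empty(glon.shape)
    for i in range(glon.shape[0]):
        for j in range(glon.shape[1]):
            d = np.hypot(lons - glon[i, j], lats - glat[i, j])
            if d.min() < 1e-9:
                # Grid point coincides with a station: take its value directly
                out[i, j] = values[d.argmin()]
            else:
                w = 1.0 / d**power
                out[i, j] = np.sum(w * values) / np.sum(w)
    return out

# Hypothetical station observations (lon, lat, 2 m temperature in K)
lons = np.array([0.0, 1.0, 0.0, 1.0])
lats = np.array([50.0, 50.0, 51.0, 51.0])
temps = np.array([280.0, 281.0, 282.0, 283.0])

grid = idw_grid(lons, lats, temps, np.linspace(0, 1, 5), np.linspace(50, 51, 5))
```

Because IDW is a weighted average, every gridded value stays within the range of the input observations, which makes sanity checks straightforward.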
Improving battery management systems
Battery storage is crucial for transitioning to renewable energy and electrifying transportation, with efficiency and lifetime directly impacting these sustainability efforts.
Machine learning can improve battery management systems by accurately estimating state of charge (SoC), state of health (SoH), and remaining useful life (RUL), and optimizing charging and discharging strategies.
Key data gaps include oversimplified battery models that don’t account for real-world operating conditions and insufficient validation data from physical battery systems in diverse operational environments.
Enhancing model complexity and collecting comprehensive real-world performance data can significantly improve battery management predictions, leading to extended battery lifetimes, more efficient energy use, and accelerated adoption of electric vehicles and renewable energy storage.
While ECMs enable real-time battery SoC predictions due to their computational efficiency, they often oversimplify real-life operating conditions, which limits the accuracy of SoH and RUL estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Due to their simplified nature and assumptions based on ideal laboratory conditions, ECMs have limited accuracy in predicting battery aging and dynamics in real systems. Verification with real-life battery system data from diverse operational environments is essential for improving state of health (SoH) and remaining useful life (RUL) predictions.
S3: Sufficiency > Granularity
The resolution of SoH and SoC predictions of ECMs is impacted by assumptions made with respect to battery performance. These include constant internal resistance assumptions that don’t account for sensitivity to complex current profiles or temperature variations, leading to inaccurate voltage and subsequent SoH/SoC calculations. ECMs also simplify electrochemical processes by ignoring electrode polarization, diffusion, and transfer kinetics, while neglecting battery aging effects like capacity fade. Linearity assumptions in simpler ECMs do not hold under high charge/discharge rates. Solutions include increasing the complexity of ECMs by adding parallel RC networks to model the internal resistance of the battery with different time constants, introducing non-linear elements for different operating conditions, incorporating adaptive hysteresis models, and integrating aging parameters.
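To make the ECM discussion concrete, the sketch below simulates a first-order Thevenin model with a single parallel RC network, the simplest of the model-complexity upgrades mentioned above. All parameter values (open-circuit voltage, R0, R1, C1) are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_ecm(current, dt, ocv=3.7, r0=0.05, r1=0.02, c1=2000.0):
    """First-order Thevenin equivalent-circuit model: terminal voltage
    V = OCV - I*R0 - V1, where the RC branch obeys
    dV1/dt = -V1/(R1*C1) + I/C1. Parameter values are hypothetical."""
    v1 = 0.0
    terminal = []
    tau = r1 * c1  # RC time constant
    for i in current:
        # Exact discrete update of the RC branch assuming constant current over dt
        v1 = v1 * np.exp(-dt / tau) + r1 * i * (1.0 - np.exp(-dt / tau))
        terminal.append(ocv - i * r0 - v1)
    return np.array(terminal)

# 1 A constant discharge for 100 s at 1 s resolution: the RC branch charges
# up and the terminal voltage relaxes downward toward OCV - I*(R0 + R1)
v = simulate_ecm(np.ones(100), dt=1.0)
```

More elaborate ECMs stack additional RC pairs with different time constants and make the parameters functions of SoC and temperature, which is exactly where the constant-resistance and linearity assumptions criticized above break down.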
Improving estimations of forest carbon stock
Forests are one of Earth’s major carbon sinks, making accurate estimation of forest carbon stocks essential for climate change mitigation efforts and carbon accounting.
ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery, LiDAR data, and ground surveys.
Ground truth data for forest carbon stock estimation is often limited in geographical coverage and temporal frequency due to the high costs and labor-intensive nature of manual data collection. Additionally, remotely sensed data (satellite, airborne LiDAR) requires significant domain expertise for proper preprocessing and interpretation.
Governments and research institutions can address these gaps by investing in more comprehensive ground survey programs, making airborne LiDAR data more widely available, and developing standardized preprocessing tools for non-experts to utilize remote sensing data effectively.
Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Ground-survey data often contains missing values, measurement errors, and duplicates that require cleaning before use. Standardizing collection protocols and developing automated quality control procedures could improve data usability.
S2: Sufficiency > Coverage
Manual collection methods limit geographical coverage and collection frequency. Integrating ground surveys with remote sensing approaches and developing citizen science initiatives could help expand coverage while maintaining data quality.
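The automated quality control described above could start very simply: deduplicate, drop records missing key fields, and apply plausibility range checks. The sketch below assumes hypothetical column names for a forest-plot survey table; real protocols would add many more checks.

```python
import pandas as pd

def clean_survey(df):
    """Basic quality control for hypothetical forest-plot survey records:
    drop exact duplicates, remove rows missing key fields, and filter out
    physically implausible measurements via simple range checks."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["plot_id", "dbh_cm", "height_m"])
    # Range checks: diameter-at-breast-height and tree height outside
    # plausible bounds are treated as measurement errors
    ok = df["dbh_cm"].between(1, 500) & df["height_m"].between(1, 120)
    return df[ok].reset_index(drop=True)

# Hypothetical raw records: one duplicate, one negative DBH, one missing ID
raw = pd.DataFrame({
    "plot_id":  ["A", "A", "B", "B", None],
    "dbh_cm":   [32.0, 32.0, -5.0, 48.0, 20.0],
    "height_m": [21.0, 21.0, 12.0, 30.0, 15.0],
})
clean = clean_survey(raw)
```

Encoding such checks in a shared, versioned function is one concrete form the "standardized collection protocols" above could take.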
Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to process raw LiDAR point clouds and generate canopy height metrics used for training ML models. Developing open-source processing tools with standardized workflows would make this data more accessible to non-experts.
S2: Sufficiency > Coverage
Airborne LiDAR provides the most accurate measurements of canopy height but is not collected everywhere due to the high costs of aircraft or drone operations. Coordinated efforts to expand coverage and make existing data publicly available would significantly improve forest carbon stock estimation capabilities.
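The point-cloud processing that currently requires domain expertise can be illustrated with a minimal sketch: rasterizing an already ground-normalized point cloud into a canopy height model by keeping the highest return per grid cell. The coordinates, heights, and 10 m cell size below are hypothetical.

```python
import numpy as np

def canopy_height_grid(x, y, z_above_ground, cell=10.0):
    """Rasterize a (hypothetical, already ground-normalized) LiDAR point
    cloud into a canopy height model: the highest return per grid cell,
    one of the standard inputs to forest carbon models."""
    ix = (x // cell).astype(int)
    iy = (y // cell).astype(int)
    nx, ny = ix.max() + 1, iy.max() + 1
    chm = np.zeros((ny, nx))
    for i, j, z in zip(ix, iy, z_above_ground):
        chm[j, i] = max(chm[j, i], z)
    return chm

# Four returns in a 20 m x 20 m area with 10 m cells -> a 2 x 2 grid
x = np.array([2.0, 12.0, 3.0, 15.0])
y = np.array([2.0, 3.0, 14.0, 16.0])
z = np.array([18.5, 25.0, 7.0, 31.2])
chm = canopy_height_grid(x, y, z)
```

Open-source tools with standardized workflows would wrap steps like this, plus ground classification and noise filtering, behind a single interface for non-experts.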
Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
GEDI data contains inherent uncertainties including geolocation errors and weak return signals in dense forests, which introduce errors into canopy height estimates and subsequent carbon calculations. Combining GEDI with other data sources like airborne LiDAR for validation and developing region-specific calibration methods could improve data reliability.
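One simple form of the calibration mentioned above is a linear bias correction of GEDI canopy heights against co-located airborne LiDAR, sketched below on synthetic data. Real region-specific calibration would need careful footprint matching and uncertainty propagation; the numbers here are illustrative.

```python
import numpy as np

def fit_bias_correction(gedi_h, als_h):
    """Least-squares linear calibration of (hypothetical) GEDI canopy
    heights against co-located airborne laser scanning (ALS) heights:
    als ~ a * gedi + b. Returns the slope a and intercept b."""
    A = np.vstack([gedi_h, np.ones_like(gedi_h)]).T
    (a, b), *_ = np.linalg.lstsq(A, als_h, rcond=None)
    return a, b

# Synthetic example: assume GEDI systematically underestimates tall canopies
gedi = np.array([10.0, 20.0, 30.0, 40.0])
als = 1.1 * gedi + 2.0  # "true" relationship for the toy data
a, b = fit_bias_correction(gedi, als)
corrected = a * gedi + b
```
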
Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to understand the raw radar data and preprocess it properly for use in ML models for forest carbon estimation. Developing standardized preprocessing pipelines and tools could make this valuable data more accessible to the broader ML and climate science communities.
Improving long-term extreme heat prediction
Extreme heat events are becoming more frequent and intense due to climate change, posing serious risks to human health, infrastructure, and ecosystems worldwide.
Machine learning can improve long-term extreme heat prediction by identifying complex patterns in climate data and enhancing the accuracy and resolution of projections beyond what traditional physics-based models can achieve.
Working with climate projection datasets presents significant challenges due to their massive size, which requires substantial computational resources for storage, transfer, and processing, limiting accessibility for many researchers and stakeholders.
Cloud computing providers, research institutions, and funding agencies can collaborate to develop accessible platforms and tools for efficiently managing large climate datasets, enabling broader use of AI for extreme heat prediction and adaptation planning.
The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The NEX-GDDP-CMIP6 dataset requires substantial computational resources for processing and analysis. While cloud platforms provide access, they involve usage costs that may be prohibitive for some researchers. Processing such large datasets requires specialized techniques like distributed computing frameworks (e.g., Dask, Spark) and occasionally large-memory computing nodes for certain statistical analyses. Many researchers and practitioners lack either the technical expertise or computational resources to effectively utilize this valuable data.
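The core idea behind the distributed frameworks mentioned above can be sketched in plain NumPy: stream over the data in chunks and accumulate statistics, so the full array never has to fit in memory. Dask and Spark automate this pattern at scale; the threshold and synthetic data below are illustrative only.

```python
import numpy as np

def chunked_exceedance_days(chunks, threshold=308.15):
    """Count days exceeding a temperature threshold (308.15 K = 35 C)
    while streaming over chunks, so the full dataset never resides in
    memory -- the pattern frameworks like Dask automate at scale."""
    count = 0
    total = 0
    for chunk in chunks:  # each chunk: a 1-D array of daily temperatures
        count += int((chunk > threshold).sum())
        total += chunk.size
    return count, total

# Simulate streaming three chunks of synthetic daily maximum temperatures (K)
rng = np.random.default_rng(0)
chunks = (rng.normal(300.0, 8.0, size=1000) for _ in range(3))
hot, n = chunked_exceedance_days(chunks)
```

Because only one chunk is held at a time, the same loop works whether each chunk comes from a local file, an object store, or a cloud-hosted archive.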
Improving offshore wind power forecasting: short- to long-term (3 hours–1 year)
Wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.
Machine learning can improve spatio-temporal forecasts at different horizons, given the availability of high-quality training data.
Current data gaps include limited coverage, noisy data, and difficulties in accessing data.
Efforts to bring more of this data out of silos, mainly those of energy companies, may help alleviate this gap.
Due to their location, FINO platform sensors are prone to failure under adverse outdoor conditions such as high winds and waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather conditions amenable to human intervention. The resulting data-quality issues can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires sign up through a login account at: https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to failures of measurement sensors. Issues with data loggers, power supplies, and effects of adverse conditions such as low aerosol concentrations can influence data quality. High wind and wave conditions impact the ability to correct or recalibrate sensors creating data gaps that can last for several weeks or seasons.
S2: Sufficiency > Coverage
Coverage is limited to the dimensions of the platform itself and the wind farm it is built in proximity to. For locations with different offshore characteristics, similar testbed platforms or buoys could be developed.
S5: Sufficiency > Proxy
Because FINO sensors are exposed to ocean conditions and storms, they often need maintenance and repair but are difficult to access physically. The resulting gaps in the data can be addressed by utilizing mesoscale wind modeling output.
The spatiotemporal coverage of the offshore windspeed mast data is restricted to the dimensions of the platform/tower itself as well as the time of construction. Depending on the data provider access to the data may require the signing of a non-disclosure agreement.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to data must be requested with different data providers having varying levels of restrictions. For data obtained from Orsted, access is only provided by signing a standard non-disclosure agreement. For more information mail R&D at datasharing@orsted.com.
S2: Sufficiency > Coverage
Spatiotemporal coverage of the dataset varies depending on the construction of the platform testbed and location but overall data is available from 2014 to the present. While measurements from LiDAR have higher resolution than wind mast data, sensor information is still restricted to the dimensions of the platform and the associated off-shore windfarm when present. Data provided by Orsted from LiDAR sensors includes 10 minute statistics.
Improving offshore wind power nowcasting (10 min)
Wind nowcasting can enable estimations of the active power generated by wind farms in the absence of curtailment and facilitate operations, potentially making them more efficient.
Machine learning can improve such very short-term spatio-temporal forecasts, given the availability of high-quality training data.
High-resolution wind data measured at wind farms currently remains limited to a few datasets.
Efforts to get such data out of silos, mainly from energy companies, may help alleviate this gap.
Data can be accessed by requesting access via the Orsted form. Sufficiency of the dataset is constrained by volume where only a finite amount of short term off-shore wind farms exist to which expanding the coverage area, volume and time granularity of data to under 10 minutes may enable transient detection from generated active power.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access must be requested via a form from Orsted.
S1: Sufficiency > Insufficient Volume
Data from multiple wind farms over a variety of regions would be required to get a more accurate comparison against simulated weather data.
S2: Sufficiency > Coverage
Coverage spans parts of Europe; offshore wind conditions vary with the local environment, so models may not scale or transfer to other temperate regions of the world.
S3: Sufficiency > Granularity
The time granularity of 10 min is too coarse to capture transients in active power generated.
S4: Sufficiency > Timeliness
Only two years' worth of data (2016–2018) is provided. Additional data collection from offshore wind farms, or simulations, would be needed.
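The granularity gap noted above can be made concrete with a small sketch (pure Python, purely illustrative, not tied to any provider's actual format): aggregating high-rate wind-speed samples into per-window statistics, as 10-minute products do, discards any transient shorter than the window.

```python
def window_stats(samples, per_window):
    """Collapse a regular high-rate series into per-window
    (mean, min, max) tuples. Whatever happens inside a window,
    e.g. a short gust driving a transient in active power,
    survives only as the min/max envelope."""
    stats = []
    for i in range(0, len(samples) - per_window + 1, per_window):
        window = samples[i:i + per_window]
        stats.append((sum(window) / per_window, min(window), max(window)))
    return stats
```

A 12 m/s gust inside an otherwise calm window, for example, is reduced to a slightly higher mean and a max value; its timing and shape are unrecoverable, which is why sub-10-minute granularity matters for transient detection.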
Improving power grid optimization
Details (click to expand)
Optimal Power Flow (OPF) is used to find the cheapest way to generate electricity while meeting demand and staying within system limits like voltage and line capacity. Traditionally, OPF is a complex math problem solved separately for AC and DC systems. As more renewable energy is added, the grid is shifting toward hybrid AC/DC systems to better handle long-distance power flow and new challenges like two-way power movement.
Changes in the grid due to renewable sources make OPF harder to solve. ML can be used to approximate OPF problems in order to allow them to be solved at greater speed, scale, and fidelity.
Data gaps for this use case are numerous, spanning mainly usability, reliability, and sufficiency.
Closing these gaps requires an array of gap-specific actions; further industry engagement may have a significant impact on many of the gaps.
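To make the surrogate idea above concrete, here is a minimal, self-contained sketch (pure Python; a toy merit-order dispatch stands in for a full OPF solver, and all names are illustrative): an expensive solver generates training pairs offline, and a cheap learned model answers new queries quickly.

```python
def merit_order_dispatch(demand_mw, generators):
    """Toy economic dispatch: serve demand from the cheapest units first.
    `generators` is a list of (capacity_mw, cost_per_mwh) tuples; the
    returned dispatch is ordered by increasing cost. Real OPF adds
    voltage, line-flow, and network constraints omitted here."""
    dispatch, remaining = [], demand_mw
    for capacity, _cost in sorted(generators, key=lambda g: g[1]):
        p = min(capacity, max(remaining, 0.0))
        dispatch.append(p)
        remaining -= p
    if remaining > 1e-9:
        raise ValueError("demand exceeds total capacity")
    return dispatch

def build_surrogate(training_demands, generators):
    """Tabulate solver outputs offline, then answer queries by nearest
    neighbour: a crude stand-in for the ML models that approximate OPF."""
    table = [(d, merit_order_dispatch(d, generators)) for d in training_demands]
    return lambda demand_mw: min(table, key=lambda row: abs(row[0] - demand_mw))[1]
```

With two 100 MW units priced at 10 and 20 $/MWh, a 150 MW demand dispatches 100 MW from the cheap unit and 50 MW from the expensive one; the surrogate simply returns the stored solution for the closest demand it was trained on, trading exactness for speed, which is the essential trade-off of ML-for-OPF.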
Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
The customization of the reward function contains several TODOs concerning the units and attributes of the redispatching-related reward. Documentation and code comments can sometimes provide conflicting information. Modularity of reward, adversary, action, environment, and backend is non-intuitive, requiring pregenerated dictionaries rather than dynamic inputs or conversion from single-agent to multi-agent functionality. Refactoring the documentation and comments to reflect updates would assist users and avoid the need to cross-reference information from the “Learning to Run a Power Network” Discord channel and GitHub issues.
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading-failure scenarios and more complex adversaries such as natural disasters. Codebase variations between versions, especially between the native and Gym-formatted frameworks, lose features present in the legacy version, including topology graphics. Open-source refactoring efforts could help update the codebase so that the latest and previous versions run without loss of features.
R1: Reliability > Quality
The grid2op framework relies on mathematically robust control laws and rewards that train the RL agent on fixed observation assumptions rather than actual system dynamics, which are susceptible to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal modeling of the grid equations, nor can it suggest which solver should be adopted to solve traditional nonlinear optimal power flow equations; specifics of modeling and solver choice require users to customize or create a new “Backend.” Additionally, such human-in-the-loop RL systems in practice require trustworthiness and quantification of risk. A library of open-source contributed “Backends” from independent projects that customize the framework, with supplemental documentation and paper references, could assist further development of the environment for different conditions. Human-in-the-loop studies can be conducted by testing the environment scenario and the system’s control response over a model of a real grid; generated observations and control actions can then be compared to historical event sequences and grid operator responses.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the grid2op environment, which are based on different IEEE bus topologies. While customization of the environment via the “Backend,” “Parameters,” and “Rules” is possible, dependent modules may still enforce game-over rules. Furthermore, since backend modeling is not the focus of grid2op, verification that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observation time series (chronics) of grid data. Furthermore, this granularity may limit the effectiveness of specific actions in the provided action space. For example, using energy storage devices in the presence of overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage. Expanding the framework, with efforts from the open-source community, to include multiple time resolutions may allow the tool to generalize across different forecasting time horizons and support action evaluation.
Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
In MATPOWER and PowerWorld, outside data may be required to simulate conditions over a specific region with a given number of DERs, generating sources, bus topology, and line limits. This requires collating pre-existing synthetic grid data with additional data to model specific scenarios.
U3: Usability > Usage Rights
Depending on whether proprietary simulators are pursued (e.g., PowerWorld), there may be licensing costs for the use of certain features.
R1: Reliability > Quality
Traditional OPF simulation software simplifies the power system and makes assumptions about the system behavior such as perfect power factor correction or constant system parameters. Simulation results may need to be verified with real-world results.
S3: Sufficiency > Granularity
In PowerWorld, bus topologies available may be simplified representations of actual grids to simplify the modeling and simulation techniques to represent overall system behavior. MATPOWER requires the user to define the bus matrix. As the number of buses in a power system increases the computational complexity of OPF increases, requiring more resources and time to solve. Additional parameters such as line limits, number of generating sources, number of DERs, and load demand also increase the complexity of the model as more constraints and assets are introduced.
While network datasets are open source, maintaining the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though such data may be hard to find without cooperative effort.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Industry engagement can assist in developing detailed and realistic networked datasets and operating conditions, limits, and constraints.
U2: Usability > Aggregation
Repository maintenance requires continuous curation of more complex networked benchmark data for more realistic AC-OPF simulation studies.
Improving short-term electricity load forecasting
Details (click to expand)
Short-term load forecasting is critical for utilities to balance power demand with supply. Utilities need accurate forecasts (e.g. on the scale of hours, days, weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions.
ML is well suited to handle large amounts of data such as historical electricity load data, weather forecasts, and continuous streams of advanced metering infrastructure (AMI) data, from which it may capture non-linearities which traditional linear models often struggle with.
Several data gaps for this use case revolve around the difficulty of accessing varied data, due among other things to privacy concerns and a lack of willingness from private actors to share data for research.
ML can help with the development of synthetic, privacy-preserving datasets that can accelerate research in this space.
AMI data is challenging to obtain without pilot study partnerships with utilities since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. Granularity of time series data can also vary depending on the level of access to the data, whether it be aggregated and anonymized or based on the resolution of readings and system. Additionally, the coverage of data will be limited to utility pilot test service areas, thereby restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult to obtain due to privacy concerns. Even when partnering with a utility, AMI data may be anonymized and aggregated to protect individual customers. Some ISOs can distribute data provided that a written records request is submitted. When requesting personal consumption data, enrollment in pricing programs may limit the temporal resolution of data that a utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
AMI data, when used jointly with other data that may influence demand (such as weather, rooftop solar availability, presence of electric vehicles, building specifications, and appliance inventory), may require significant additional data collection or retrieval. Non-intrusive load monitoring techniques can be employed to disaggregate AMI data, with some assumptions based on additional data. For example, satellite imagery over a region of interest can help identify buildings that have solar panels.
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted for evaluation by issda@ucd.ie. For data obtained through utility collaborative partnerships, usage rights may vary. Please contact the data provider for more information.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on the output formats of the installed sensor network system. When designing the testbed data format, it is recommended to develop comprehensive, well-structured metadata for the study to encourage further development.
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open-source. Further data collection for verification purposes is recommended.
S2: Sufficiency > Coverage
Coverage is limited to utility pilot test service areas, restricting the scope and scale of demand studies.
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware, ranging from 1-hour to 30-minute to 15-minute measurement intervals. Depending on the level of anonymization and aggregation, granularity may be constrained by other factors such as the cadence of time-of-use pricing and other tiered demand-response programs employed by the partnering utility. Interpolation may be used to combat resolution issues but may require uncertainty considerations when reporting results.
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gais Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change with time. Similarly, pilot programs through participating utilities are finite in duration. To address this data gap in previous pilot study locations, studies and testbeds can be reopened or revisited. For new studies in different locations, previous data can still be used to pre-train models; fine-tuning, however, would still require new data collection.
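Where meter resolutions differ across hardware, the interpolation mentioned above can align series to a common cadence. A minimal sketch (pure Python, names illustrative), returning flags so interpolated points can carry extra uncertainty downstream:

```python
def upsample_linear(readings, factor):
    """Linearly interpolate a regular meter series by an integer factor
    (e.g. factor=2 turns 30-minute readings into 15-minute estimates).
    Returns (values, synthetic_flags); flagged points are estimates,
    not measurements, and should be treated accordingly when
    reporting results."""
    values, synthetic = [], []
    for a, b in zip(readings, readings[1:]):
        for k in range(factor):
            values.append(a + (b - a) * k / factor)
            synthetic.append(k != 0)
    values.append(readings[-1])
    synthetic.append(False)
    return values, synthetic
```

Keeping the flags alongside the values is the design point: interpolation cannot recover true sub-interval behavior, so results derived from flagged points should be reported with that caveat.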
While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data was collated from 7 open-access public data sources as well as 12 privately curated datasets from facilities management at different college sites, which required manual site visits and are not included in the data repository at this time.
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, limiting the diversity of building representation. To overcome this lack of diversity, data-sharing incentives and community open-source contributions can enable expansion of the dataset.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies at higher resolution. Assumptions about conditions would have to be made prior to interpolating.
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018. While this may be adequate for pre-training models, further data collection through a reinitiation of the study may be needed to fine-tune models for more up to date periods of time.
Despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge with respect to building type, efficiency, and presence of low-carbon technologies. Furthermore, since the dataset is trained on UK building data, AMI time series generated may not be an accurate representation of load demand in regions outside the UK. Finally, since data is synthetically generated, studies will require validation and verification using real data or aggregated per substation level data to assess its effectiveness.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The Variational Autoencoder model can generate synthetic AMI data conditioned on several inputs. The presence of low-carbon technology (LCT) for a given household or property type depends on access to battery storage solutions, rooftop solar panels, and electric vehicles; this type of data may require curation of LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric based on actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
S2: Sufficiency > Coverage
Faraday is trained from utility provided AMI data from the UK which may not be representative of load demand and corresponding building type and temperate zone of other global regions. To generate similar synthetic data, custom data may be retrieved through a pilot test bed for private collection or the result of a partnership with a local utility. Additionally, pre-existing AMI data over an area of interest can be utilized to generate similar synthetic data.
Existing datasets are restricted to past pilot-study coverage areas, requiring further data collection to fine-tune models for a different coverage area.
S3: Sufficiency > Granularity
Data granularity is limited to that of the data the model was trained on. Generative modeling approaches similar to Faraday can be built using higher-resolution data, or interpolation methods could be employed.
S4: Sufficiency > Timeliness
Maintaining the timeliness of the dataset would require continuous integration and development of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy’s OpenSynth initiative, Centre for Net Zero hopes to build a global community of contributors to facilitate research.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid-modeling manner. For example, load demand at the substation level can be estimated as the sum of the individual building loads that the substation serves; this value can then be compared to actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, assessing the accuracy of a specific demand profile for a property or set of properties would require identifying a population of buildings, a connected real-world substation, and the residential low-carbon technology investment for the properties under study.
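The bottom-up check described above can be sketched as follows (pure Python, illustrative only; a real comparison would also handle time alignment, network losses, and metering error):

```python
def bottom_up_mape(building_profiles, substation_profile):
    """Sum per-building load profiles at each timestep and compare the
    aggregate against measured substation demand, returning the mean
    absolute percentage error over steps with nonzero demand."""
    aggregate = [sum(step) for step in zip(*building_profiles)]
    errors = [abs(a - m) / m
              for a, m in zip(aggregate, substation_profile) if m > 0]
    return sum(errors) / len(errors)
```

A low error suggests the synthetic profiles are plausible in aggregate; it does not validate any individual building's profile, which is exactly the limitation noted above.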
Improving solar power forecasting: long-term (>24 hours)
Details (click to expand)
Accurately forecasting solar power generation beyond 24 hours is critical for energy market pricing, investment decisions, and coordinating renewable energy sources in an increasingly decarbonized grid.
Machine learning approaches can improve longer-term solar forecasting by combining weather predictions, historical generation data, and other relevant variables to create more accurate models than traditional methods.
The primary data gaps include limited geographic coverage of existing datasets, reliance on simulated rather than measured data, and quality concerns when adapting models to specific regions.
Expanding data collection networks, validating simulated data with real measurements, and creating standardized datasets for diverse regions would enable more reliable ML-based solar forecasting systems that could significantly improve grid stability and accelerate renewable energy adoption.
While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
The dataset uses simulated outputs based on weather predictions rather than actual PV measurements, which may introduce systematic biases. Site-specific projects require additional validation with real measurements from solar power inverters. Developers can improve model accuracy by supplementing with local measurements and adapting simulation parameters to better represent specific regions.
S2: Sufficiency > Coverage
The dataset is limited to US locations based on 2006 solar conditions and is not representative of other geographic regions or more recent climate patterns. Expanding data collection to include diverse global regions and updating with more recent measurements would improve model transferability.
S4: Sufficiency > Timeliness
The dataset only covers 2006, which may not capture recent climate trends or technology improvements in PV systems. Updated datasets with more recent time periods would better represent current conditions and improve forecasting accuracy.
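One common way to reconcile simulated with measured outputs, as the validation suggestion above implies, is a simple linear bias correction fitted against real inverter measurements. A minimal least-squares sketch (pure Python, illustrative; assumes paired simulated/measured series):

```python
def fit_bias_correction(simulated, measured):
    """Fit measured ~= a * simulated + b by ordinary least squares,
    so simulated PV output can be rescaled toward local reality.
    Systematic bias beyond a scale and offset needs richer models."""
    n = len(simulated)
    sx, sy = sum(simulated), sum(measured)
    sxx = sum(x * x for x in simulated)
    sxy = sum(x * y for x, y in zip(simulated, measured))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b
```

The fitted (a, b) pair can then be applied to the full simulated series; residuals after correction indicate how much region-specific bias remains beyond what a linear adjustment can capture.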
Improving solar power forecasting: medium-term (6-24 hours)
Details (click to expand)
Medium-term solar forecasting (6-24 hours ahead) is essential for efficient grid management, especially as solar power integration increases, impacting energy markets, demand response, and microgrid operations.
Machine learning techniques can significantly improve these forecasts by integrating satellite data with weather predictions and historical patterns to provide more accurate solar irradiance estimates.
A key data gap is the inconsistency in satellite data resolutions and coverage, alongside challenges in processing multispectral data and accurately modeling how different cloud types affect ground irradiance.
Combining satellite observations with ground-based measurements and developing standardized preprocessing approaches would substantially improve forecast accuracy, enabling better grid management and renewable energy integration.
Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data from different satellite sources (both geostationary and polar-orbiting) needs to be collated and harmonized when analyzing multiple regions of interest, creating challenges in data integration and standardization.
U5: Usability > Pre-processing
Multispectral remote sensing data requires preprocessing, including atmospheric correction and band combinations in the visible and infrared spectra, before it can be effectively used for solar forecasting models.
R1: Reliability > Quality
Different cloud types affect ground-level solar irradiance in varying ways that satellite imagery alone cannot fully capture, necessitating verification and supplementation with ground-based measurements for improved model accuracy.
S3: Sufficiency > Granularity
Spatial and temporal resolution varies significantly between satellite sources, limiting the ability to capture rapid changes in cloud cover that impact solar irradiance, particularly during partly cloudy conditions which create high variability in short timeframes.
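A standard preprocessing step relevant to the gaps above (a sketch under the usual definition, not specific to any dataset named here) is the clear-sky index, which divides out deterministic solar geometry so models see only cloud-driven attenuation:

```python
def clear_sky_index(measured_ghi, clear_sky_ghi):
    """kc = measured GHI / modeled clear-sky GHI, elementwise.
    Values near 1 mean clear sky; lower values indicate cloud
    attenuation. Near-zero clear-sky values (night, very low sun)
    are masked to 0.0 to avoid division blow-ups."""
    return [m / c if c > 1.0 else 0.0
            for m, c in zip(measured_ghi, clear_sky_ghi)]
```

Forecasting kc rather than raw irradiance, then multiplying back by the clear-sky model at prediction time, is a common way to let the ML model focus on the cloud signal that satellite imagery actually observes.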
Improving solar power forecasting: nowcasting/very-short-term (0-30min)
Details (click to expand)
Very-short-term solar power forecasting is critical for grid stability and efficiency as sudden changes in solar irradiance (ramp events) can cause abrupt fluctuations in power generation.
AI techniques can analyze cloud dynamics through segmentation and classification to predict solar irradiance attenuation, enabling more accurate forecasting for real-time electricity markets, dispatch of other generating sources, and energy storage control.
Key data gaps include limited spatial coverage of ground monitoring stations, insufficient time resolution for sub-5-minute forecasting, challenges with large data volumes from sensor networks, and data quality issues related to sensor calibration.
Expanding sensor networks to diverse environments, implementing AI-based data compression and quality control, and integrating multi-source data can close these gaps, ultimately enabling more reliable integration of solar power into electricity grids.
ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. AI-based data compression and novel indexing can improve data management.
S3: Sufficiency > Granularity
Enhanced aerosol composition and ice nucleating particle measurements are needed for a better understanding of cloud dynamics and solar irradiance for DER site planning.
S2: Sufficiency > Coverage
Spatial coverage is limited to ARM sites within the United States. Collaboration with partner networks can expand coverage both within and outside the US.
R1: Reliability > Quality
Sensor data can be sensitive to noise and calibration issues, requiring automated systems to identify measurement drift.
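The automated drift identification mentioned above could, in its simplest form, be a rolling-mean check against an early baseline (pure Python, illustrative; production systems would use proper change-point detection and reference instruments):

```python
def detect_drift(values, window, threshold):
    """Flag each trailing window whose mean departs from the baseline
    (the mean of the first window) by more than `threshold`, a minimal
    automated check for slow sensor calibration drift."""
    baseline = sum(values[:window]) / window
    flags = []
    for i in range(window, len(values) + 1):
        mean = sum(values[i - window:i]) / window
        flags.append(abs(mean - baseline) > threshold)
    return flags
```

Flagged windows would then be routed for recalibration review rather than silently entering the training archive.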
The dataset has limited spatial coverage (Gaithersburg, MD only) and is no longer maintained after July 2017, limiting its usefulness for current applications.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Since testbeds are located on the NIST campus, spatial coverage is limited to the institution’s site. Similar datasets combining sensor measurements of solar irradiance conditions with the associated solar power generated at the inverter output would require investment in comparable site-specific testbeds in different regions.
S4: Sufficiency > Timeliness
The dataset has not been maintained since July 2017 and may not reflect current conditions; reinstating data collection, or comparable testbeds elsewhere, would be needed for up-to-date studies.
Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, insufficient for very-short-term forecasting.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Time resolution ranges from 5 to 60 minutes, which is insufficient for sub-5-minute forecasting needs.
S2: Sufficiency > Coverage
Coverage is limited to 33 global sites (18 tropical/subtropical, 15 temperate), requiring expansion to other regions and environmental conditions.
R1: Reliability > Quality
Significant elevation differences between ground sites and cell height affect clear-sky irradiance estimation accuracy.
O1: Obtainability > Findability
Data is only accessible through collaborating academic or research institutions.
Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and image resolution is limited to 352x288 24-bit jpeg images.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Images have limited resolution (352x288 pixels) with 10-minute capture intervals, potentially insufficient for very-short-term forecasting.
S2: Sufficiency > Coverage
Coverage is constrained by sensor network location and density. Expanded networks in diverse environments would improve coverage.
S2: Sufficiency > Coverage
The current dataset derives from sky imager datasets in Singapore, requiring similar networks in other regions or alternative data sources.
The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
The dataset needs more manually annotated cloud mask labels and is imbalanced with fewer nighttime samples.
O2: Obtainability > Accessibility
The dataset is under a Creative Commons license that prohibits commercial use, and access must be requested.
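One mitigation for the day/night imbalance described above is class weighting during model training; a minimal sketch with hypothetical label counts (not the dataset's actual proportions):

```python
# Sketch: class weighting to counter a day/night imbalance.
# Label counts are hypothetical, not the dataset's actual proportions.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 0 = daytime image, 1 = nighttime image (under-represented)
labels = np.array([0] * 900 + [1] * 100)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=labels
)
# "balanced" weights are n_samples / (n_classes * class_count),
# so the rare nighttime class gets a proportionally larger weight.
print(dict(zip([0, 1], weights)))
```

Passing such weights to the training loss upweights nighttime samples without collecting new data, though it cannot substitute for genuinely more nighttime imagery.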
Improving solar power forecasting: short-term (30 min-6 hours)
Details (click to expand)
Solar irradiance forecasting at hourly intervals is critical for managing intermittent solar energy resources and ensuring grid stability and reliability.
Machine learning approaches can enhance forecasting accuracy by leveraging multiple data sources, including measured irradiance, PV inverter outputs, and meteorological variables.
Important data gaps include limited spatial coverage, with most high-quality data concentrated in specific regions, and inconsistent temporal resolution that affects forecasting precision.
By expanding sensor networks globally and harmonizing data collection standards, forecasting models can better support real-time energy management, demand response, and grid stability across diverse geographical areas.
While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include lower-quality hourly averages compared to native-resolution data, and limited geographic coverage, with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The coverage area is constrained to nine SOLRAD network locations in the United States (Albuquerque, NM; Bismarck, ND; Hanford, CA; Madison, WI; Oak Ridge, TN; Salt Lake City, UT; Seattle, WA; Sterling, VA; Tallahassee, FL). Generalizing to other regions would require identifying locations with similar climates and temperate zones.
S3: Sufficiency > Granularity
Data quality of the hourly averages is lower than that of the native resolution data, impacting effective short-term forecasting for real-time energy management, grid stability, demand response, and market operations. To address this gap, using very short-term data or supplementing with data from sky imagers and other sensors with frequent measurement outputs would be beneficial.
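The granularity gap above can be illustrated by comparing an hourly-averaged series with its native-resolution source; the 1-minute irradiance below is synthetic, not actual SOLRAD data:

```python
# Sketch: what hourly averaging discards relative to native-resolution data.
# The 1-minute irradiance series is synthetic, not actual SOLRAD output.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2024-06-01 06:00", periods=12 * 60, freq="1min")
# Smooth diurnal curve plus cloud-like minute-scale fluctuations
ghi = 800 * np.sin(np.linspace(0, np.pi, len(idx))) + rng.normal(0, 60, len(idx))
ghi = pd.Series(ghi.clip(0), index=idx, name="ghi_wm2")

hourly_mean = ghi.resample("1h").mean()  # what an hourly product retains
hourly_std = ghi.resample("1h").std()    # within-hour variability lost
print(pd.DataFrame({"mean": hourly_mean, "lost_variability": hourly_std}))
```

The within-hour standard deviation is exactly the ramping signal that short-term forecasting needs and that hourly products discard, which is why supplementing with sky imagers or other fast sensors helps.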
While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Data derived from satellite imagery requires pre-processing to account for pixel variability, parallax effects, and additional modeling using radiative transfer to improve solar radiation estimates.
S4: Sufficiency > Timeliness
Data flow from satellite imagery to solar radiation measurement output from FARMS needs to be recalculated and updated to expand beyond the current coverage years of the represented global regions.
R1: Reliability > Quality
Satellite-based estimation of solar resource information for sites susceptible to cloud cover, snow, and bright surfaces may not be accurate, requiring verification from ground-based measurements.
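Ground-based verification of satellite estimates, as suggested above, typically reduces to computing bias statistics between co-located series; a sketch with synthetic stand-ins for both:

```python
# Sketch: verifying satellite-derived irradiance against a ground station
# by computing bias statistics. Both daily series are synthetic stand-ins.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2024-01-01", periods=365, freq="D")
ground = pd.Series(5.0 + rng.normal(0.0, 0.5, len(idx)), index=idx)
# Satellite estimate with a systematic positive offset (e.g., snow or
# bright surfaces misread by the retrieval)
satellite = ground + 0.6 + rng.normal(0.0, 0.3, len(idx))

residual = satellite - ground
mbe = residual.mean()                    # mean bias error
rmse = np.sqrt((residual ** 2).mean())   # root-mean-square error
print(f"MBE = {mbe:.2f}, RMSE = {rmse:.2f}")
```

A persistent nonzero MBE at a snowy or frequently cloudy site would flag exactly the retrieval problem the gap description warns about.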
While NREL’s SRRL BMS provides real-time joint-variable data from ground-based sensors, its coverage is limited to a single location: Golden, CO, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Instrument malfunctions and calibration drift require human intervention; if detection is delayed, the affected measurements introduce inaccuracies that degrade solar forecast accuracy. The dataset nonetheless continues to be actively maintained.
S2: Sufficiency > Coverage
Coverage is restricted to Golden, CO. Other locations would benefit from similar sensor monitoring systems, especially those with variations in weather patterns that could affect solar irradiance forecasting and energy harvesting.
The SMA PV monitoring system requires user profile creation and specific system access requests, with documentation primarily in German creating potential language barriers. Data representation is geographically unbalanced with stronger coverage in Germany, Netherlands, and Australia despite its global presence. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
Documentation is primarily in German and lacks the same detail in the English version of the website. Companion research utilizing the data is not readily cited or linked. Language barriers can challenge the interpretation of displayed data values when accessed through the portal interface.
S2: Sufficiency > Coverage
Coverage varies significantly by country, with representation ranging from single systems to over 43,000 systems per country. Systems in Germany, the Netherlands, and Australia are more comprehensively represented than other regions. Additionally, battery storage information is inconsistently available across monitored systems. This gap could be addressed by increasing private user-contributed system data from diverse regions.
O2: Obtainability > Accessibility
Users must utilize the web interface or create a user profile to request access to additional data or preferred formats. Data cannot be freely downloaded in bulk or raw format and must be scraped from the web portal. Contact with SMA is required for membership or extended usage rights.
While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several sufficiency gaps limit its application. The dataset’s 15-month temporal coverage doesn’t capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analysis of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing, which would enhance realistic system dynamic modeling. Supplementing with external data sources or simulation could address these limitations.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
SOLETE lacks maintenance schedule data and system downtime information. Retroactively supplementing this data through simulation or SYSLAB records would improve system forecasting to account for scheduled maintenance uncertainties.
S3: Sufficiency > Granularity
Varying resolution and sampling rates (seconds to hours) can impact analysis precision, particularly when fusing data of different temporal resolutions. Aggregating second-level data to hourly intervals may affect joint short-term solar and wind forecasting outcomes.
S2: Sufficiency > Coverage
The 15-month temporal coverage is insufficient to capture long-term seasonal variations in joint wind and irradiance patterns.
S1: Sufficiency > Insufficient Volume
The dataset covers only a single wind turbine and PV array, limiting insights into coordination between multiple generation sources. This gap could be addressed by physically expanding the network or combining SOLETE with external datasets from utility and energy technology companies to enable larger grid control studies.
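A common way to handle mixed sampling rates like SOLETE's before joint forecasting is to resample the fast channel onto the slow channel's grid and then join; the series below are synthetic stand-ins, not SOLETE's actual channels:

```python
# Sketch: fusing channels logged at different rates before joint forecasting.
# Both series are synthetic stand-ins, not SOLETE's actual channels.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
sec_idx = pd.date_range("2024-03-01", periods=6 * 3600, freq="1s")  # 6 h at 1 s
wind_kw = pd.Series(10 + rng.normal(0, 1, len(sec_idx)), index=sec_idx)

hr_idx = pd.date_range("2024-03-01", periods=6, freq="1h")
pv_kw = pd.Series([0.0, 0.5, 2.0, 3.5, 3.0, 1.5], index=hr_idx)

# Downsample the fast channel onto the slow channel's grid, then join.
fused = pd.DataFrame({
    "wind_kw": wind_kw.resample("1h").mean(),
    "pv_kw": pv_kw,
})
print(fused)
```

The cost of this alignment is exactly the gap noted above: second-level wind dynamics are averaged away, which can blur short-term joint forecasts.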
Improving terrestrial wildlife detection and species classification
Details (click to expand)
Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems.
ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.
The primary data gaps include insufficient publicly available annotated datasets and challenges with sharing large-volume bioacoustic data due to storage limitations and high costs.
Solutions include developing affordable data hosting platforms, incentivizing data sharing through recognition and funding, and establishing standardized protocols for data integration.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased toward more abundant species.
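One standard mitigation for this geographic concentration is spatial thinning, capping the number of records per grid cell; a sketch on synthetic coordinates (the 0.5-degree cell size and cap of 5 records are arbitrary illustrative choices):

```python
# Sketch: grid-based spatial thinning to damp geographic concentration in
# community-science records. Coordinates are synthetic; the 0.5-degree cell
# and cap of 5 records per cell are arbitrary illustrative choices.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# 950 records clustered near one city, 50 scattered across a wider region
lat = np.concatenate([rng.normal(40.0, 0.05, 950), rng.uniform(35, 45, 50)])
lon = np.concatenate([rng.normal(-105.0, 0.05, 950), rng.uniform(-110, -100, 50)])
obs = pd.DataFrame({"lat": lat, "lon": lon})

# Assign each record to a 0.5-degree cell, then keep at most 5 per cell
cell = (obs["lat"] // 0.5).astype(str) + "_" + (obs["lon"] // 0.5).astype(str)
thinned = obs.groupby(cell).head(5)
print(len(obs), "records ->", len(thinned), "after thinning")
```

Thinning reduces sampling bias in downstream models but cannot create coverage where none exists, so it complements rather than replaces targeted data collection.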
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Museum specimens only become valuable to ML studies when they are digitized. Many museum specimens remain to be digitized, and this task presents significant challenges. Much of the information about these specimens, such as species traits and occurrence data, is often recorded in handwritten notes, making parsing and recognizing this information a complex and error-prone process.
Digitizing these specimens has become a priority for many museums. To support this effort, adequate funding and technical and scientific assistance should be provided. Machine learning itself can support some of these efforts, e.g., in parsing and transcribing handwritten notes.
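As one illustration of how structured records might be extracted from transcribed labels, here is a toy rule-based parser; the label format and regexes are hypothetical, and real labels are far messier, needing OCR plus a much more robust parser or an ML sequence model:

```python
# Sketch: pulling structured fields out of a transcribed specimen label with
# simple rules. The label format and regexes are hypothetical; real labels
# are far messier and would need OCR plus far more robust handling.
import re

label = "Danaus plexippus, coll. 12 Jun 1954, Boulder Co., Colorado, leg. A. Smith"

species = re.match(r"([A-Z][a-z]+ [a-z]+)", label)     # binomial at label start
date = re.search(r"(\d{1,2} [A-Z][a-z]{2} \d{4})", label)
locality = re.search(r"\d{4}, ([^,]+, [^,]+)", label)  # text after the year

record = {
    "species": species.group(1) if species else None,
    "date": date.group(1) if date else None,
    "locality": locality.group(1) if locality else None,
}
print(record)
```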
The first and foremost challenge of bioacoustic data is its sheer volume, which makes its data sharing especially difficult due to limited storage options and high costs. Urgent solutions are needed for cheaper and more reliable data hosting and sharing platforms.
Additionally, there’s a significant shortage of large and diverse annotated datasets, much more severe compared to image data like camera trap, drone, and crowd-sourced images.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges in bioacoustic data lies in its sheer volume, stemming from continuous monitoring processes. Researchers face significant hurdles in sharing and hosting this data, as existing online platforms often don’t provide sufficient long-term storage capacity or are very expensive. Solutions are needed to provide cheaper and more reliable hosting options. Moreover, accessing these extensive datasets demands advanced computing infrastructure and solutions. The availability of more funding sources may push more people to start sharing their bioacoustic data.
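One way to ease the volume problem is to archive compact spectral features rather than raw waveforms; a sketch on synthetic audio (real pipelines might instead use lossless FLAC or learned embeddings):

```python
# Sketch: archiving compact spectral features instead of raw waveforms to
# cut storage. The audio is a synthetic tone standing in for field
# recordings; real pipelines might use lossless FLAC or learned embeddings.
import io
import numpy as np
from scipy import signal

fs = 22050
t = np.arange(60 * fs) / fs                      # one minute of "audio"
audio = np.sin(2 * np.pi * 3000 * t).astype(np.float32)

# STFT magnitude on non-overlapping windows: far fewer values than raw audio
freqs, times, sxx = signal.spectrogram(audio, fs=fs, nperseg=1024, noverlap=0)

buf = io.BytesIO()
np.savez_compressed(buf, freqs=freqs.astype(np.float32), sxx=sxx.astype(np.float16))
print(f"raw: {audio.nbytes} B, compressed features: {buf.getbuffer().nbytes} B")
```

The trade-off is that spectral features are lossy: they suit species classification, but tasks needing the raw waveform still require full-resolution archives, which is where cheaper hosting platforms come in.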
Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The resolution of publicly available satellite images is not high enough. High-resolution images are usually commercial and not freely available.
Modeling effects of soil processes on soil organic carbon
Details (click to expand)
Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies.
ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses.
The insufficient data coverage and granularity of soil organic carbon measurements severely limit the development of well-generalized ML models for accurately predicting soil carbon dynamics.
Expanding monitoring networks and developing cost-effective measurement technologies, combined with better data standardization across different collection efforts, would enable more effective ML applications for soil carbon management and climate-smart agriculture.
Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Soil carbon values generated by simulators are not reliable, because the underlying process-based models may be outdated or carry systematic biases that are reflected in the simulated variables. Moreover, ML scientists who use those simulated variables usually lack the domain knowledge needed to calibrate these process-based models.
The biggest challenges common to use cases involving soil organic carbon are insufficient data and the lack of high-granularity data.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Soil carbon data is available only as quarterly values, which is not enough to capture the weekly changes in soil carbon that occur when fertilizer amounts or tilling practices change.
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is collected by different farmers on different farms, leading to consistency issues and a need to better structure the data.
S3: Sufficiency > Granularity
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
S2: Sufficiency > Coverage
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
Optimizing electrified bus fleet in urban vehicle-to-grid systems
Details (click to expand)
Diesel-powered school buses contribute significant carbon emissions and air pollution in urban areas, while electric bus adoption faces high upfront costs that challenge school district budgets.
AI-powered optimization systems can manage electric school bus charging and discharging schedules to create virtual power plants, offsetting electrification costs through grid services revenue.
Key data gaps include inconsistent bus fleet reporting across states, limited access to proprietary charging profiles, and fragmented charge station data that prevent comprehensive fleet optimization modeling.
Standardizing state-level fleet reporting, fostering manufacturer partnerships for charging data access, and creating centralized charge station databases can enable scalable AI solutions for urban transit electrification.
Critical gaps include limited findability of station-specific usage data due to proprietary restrictions, and scattered data sources requiring aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Charging station usage profiles and vehicle-specific load data are often proprietary. Solution: Establish manufacturer partnerships and utility pilot programs to access detailed charging profiles.
U2: Usability > Aggregation
Charging data is scattered across multiple providers and systems. Solution: Create standardized APIs and data sharing agreements between charging network operators.
The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Inconsistent state-level reporting creates varying data structures and fields, with some states excluding contractor-owned buses. Solution: Develop federal reporting standards for consistent data collection across all states.
S2: Sufficiency > Coverage
Inconsistent state-level reporting creates varying data structures and fields, with some states excluding contractor-owned buses. Solution: Develop federal reporting standards for consistent data collection across all states.
S4: Sufficiency > Timeliness
Dataset maintenance discontinued after November 2022. Solution: Establish ongoing federal or industry-supported data collection mechanisms.
Optimizing smart inverter management for distributed energy resources
Details (click to expand)
Solar panels and batteries are part of new power systems that don’t use traditional spinning generators. They use inverters to convert DC to AC power. Smart inverters can do more than just convert power—they help manage changes in energy supply and keep the grid stable by adjusting voltage and power levels. This prevents issues like sudden drops or spikes in voltage when solar and other sources are added to the grid.
Machine learning can help better monitor and control smart inverters, with the potential to make efficiency gains.
One key data gap towards unlocking this use case is the access to relevant data.
Partnerships between research labs, utilities, and smart inverter manufacturers may help alleviate this bottleneck.
There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models within the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulation and hardware-in-the-loop facilities requires submitting a user access proposal for NREL's Energy Systems Integration Facility; similar testing laboratories may require access requests and funding.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Contact NREL at precise@nrel.gov for access to the PRECISE model.
Submit an Energy Systems Integration Facility (ESIF) laboratory request form to userprogram.esif@nrel.gov to gain access to hardware in the loop inverter simulation systems. Access to particular hardware may require collaboration with inverter manufacturers which may have additional permission requirements.
R1: Reliability > Quality
The optimization routine of the simulation model may face challenges in determining the precise balance between grid operation criteria and impacts on customer PV generation. Generation may still require curtailment by the utility to prioritize grid stability. To circumvent this gap, external data on distribution-side operating conditions, load demand, solar generation, and utility-initiated generation curtailment can be collected and introduced into expanded simulation studies.
Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant manufacturers and smart inverter models that can be contacted for research partnerships. In terms of coverage, while California and Hawaii are moving towards standardizing smart inverter technology in their power systems, researchers in regions outside the United States may locate similar manufacturers through partnerships and collaborations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Particularly for the CEC database, one will need to contact the CEC or manufacturer to receive additional information for a particular smart inverter. Detailed studies using smart inverter hardware may require collaboration with a utility and research organization to perform advanced research studies.
U2: Usability > Aggregation
To retrieve additional data beyond the single entry model and manufacturer of a particular smart inverter, one may need to contact a variety of manufacturers to get access to datasets and specifications for operational smart inverter data, laboratories to get access to hardware in the loop test centers, and utilities or local energy commissions for smart inverter safety compliance and standards.
S2: Sufficiency > Coverage
New grid support functions defined by UL 1741-SA and UL 1741-SB are optional nationally but are now required in California and Hawaii; public manufacturer data is available only via the CEC website. Collaborations with manufacturers outside the US may be necessary to compile a similar database, and contact with utilities can provide a better understanding of UL 1741-SB criteria adoption elsewhere.
Scaling identification and mapping of climate policy
Details (click to expand)
Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions.
ML can be employed to identify climate-related policies and categorize them according to different focus areas.
Law corpora are published in various languages and formats by a variety of actors, including cities, national governments and other agencies. They are not all digitized, may be hard to access and require ample harmonization work.
These data gaps may be addressed through aggregation initiatives and ML may be a key component by automating lengthy processes such as translation or screening for relevance.
Laws and regulations for climate action are published in various formats by national and subnational governments, and most are not labeled as a “climate policy”. A number of initiatives take on the challenge of selecting, aggregating, and structuring these laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and constant updating, and the resulting datasets are not complete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be structured into a machine-readable format. Much of the data is also in the original language of the publishing country and needs to be translated into English.
U2: Usability > Aggregation
Legislation data is published through national and subnational governments, and often is not explicitly labeled as “climate policy”. Determining whether it is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
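As a rough illustration of the screening step such initiatives automate, a first-pass relevance filter can be sketched with a simple keyword score. The seed terms, threshold, and scoring below are illustrative assumptions; production systems rely on trained multilingual classifiers rather than fixed keyword lists:

```python
# Illustrative seed terms; a real pipeline would use a trained classifier
# and handle documents in their original languages.
CLIMATE_TERMS = [
    "climate change", "greenhouse gas", "emission", "carbon",
    "renewable energy", "adaptation", "mitigation", "net zero",
]

def climate_relevance_score(text: str) -> float:
    """Fraction of seed terms that appear in the document text."""
    lowered = text.lower()
    hits = sum(1 for term in CLIMATE_TERMS if term in lowered)
    return hits / len(CLIMATE_TERMS)

def screen_documents(docs: list[str], threshold: float = 0.2) -> list[str]:
    """Return documents whose score passes the threshold, for human review."""
    return [d for d in docs if climate_relevance_score(d) >= threshold]
```

Even a crude filter like this can cut the volume of documents that human analysts or heavier ML models must examine, which is why screening is a natural first stage in policy-aggregation pipelines.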
Scaling methane emission detection
Details (click to expand)
Methane is a highly potent greenhouse gas and the second-largest contributor to climate change, with emissions from the oil and gas industry accounting for 20% of global methane emissions.
Advanced machine learning techniques applied to satellite imagery enable the detection, quantification, and monitoring of methane emissions at scale, supporting more effective mitigation efforts across global oil and gas operations.
The primary data gap for methane detection is insufficient spatial resolution in widely available satellite data, making it difficult to pinpoint smaller or localized emission sources and accurately quantify their contribution.
Developing higher-resolution satellite systems like MethaneSAT and creating benchmark datasets with synthetic methane plume data can significantly improve detection capabilities, enabling more targeted mitigation efforts and potentially reducing a substantial portion of global methane emissions.
Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Images of methane plumes in hyperspectral satellite data are very rare, leading to insufficient data for developing and training robust detection algorithms. Consequently, researchers often use synthetic data, transposing high-resolution methane plume images from other sources such as Sentinel-2 onto hyperspectral images from platforms like PRISMA. Expanding the collection of actual hyperspectral methane plume observations or developing more sophisticated methods for generating realistic synthetic data would significantly improve detection capabilities.
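The transposition approach described above can be sketched in simplified form: a synthetic concentration map modulates the radiance of a background scene in absorption-sensitive bands. The plume shape, band count, and attenuation model below are illustrative assumptions, not the parameters of any specific pipeline:

```python
import numpy as np

def gaussian_plume(h, w, cy, cx, sigma):
    """Toy 2D Gaussian used as a stand-in for a methane concentration map."""
    y, x = np.mgrid[0:h, 0:w]
    return np.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))

def inject_plume(cube, plume, band_weights, strength=0.2):
    """Attenuate radiance in absorption bands proportionally to concentration.

    cube:         (bands, h, w) hyperspectral radiance array
    plume:        (h, w) synthetic concentration map in [0, 1]
    band_weights: (bands,) per-band methane absorption weights in [0, 1]
    """
    attenuation = 1.0 - strength * band_weights[:, None, None] * plume[None, :, :]
    return cube * attenuation

rng = np.random.default_rng(0)
cube = rng.uniform(0.5, 1.0, size=(8, 64, 64))      # toy 8-band scene
plume = gaussian_plume(64, 64, cy=32, cx=40, sigma=6)
weights = np.array([0, 0, 0, 0, 0, 0.5, 1.0, 0.8])  # only SWIR-like bands absorb
augmented = inject_plume(cube, plume, weights)
```

The plume mask doubles as a pixel-level training label, which is why this style of augmentation is attractive when real annotated hyperspectral plumes are scarce.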
Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Many current satellites have limited spatial resolution, making it challenging to detect smaller or localized methane sources. This low resolution can result in inaccurate assessments, potentially missing smaller leaks or misidentifying emission sources. Higher resolution is necessary for accurately identifying and quantifying methane emissions from specific facilities or small-scale sources.
Scaling solar photovoltaics site assessments
Details (click to expand)
Statistical analysis on solar photovoltaic (PV) systems with respect to pricing, logistics, planning, and site capacity studies is an important part of the process for siting solar PV systems.
Spatiotemporal generation forecasting using pre-existing site data can be used to inform future site recommendations, policy, and decision-making with respect to new developments.
Existing data exhibits coverage gaps that limit the applicability and generalization capacity of ML models across regions.
The availability of open datasets in more regions would help alleviate these gaps.
The solar PV system dataset excluded third-party-owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap: it includes historical data that may not reflect current pricing and costs of PV systems.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excluded third-party-owned systems, systems with battery backup, self-installed systems, and data that was missing installation prices. Data was self-reported and may be inconsistent based on the reporting of component costs. Furthermore, some state markets were underrepresented or missing, which can be alleviated by new data collection jointly with simulation studies.
S4: Sufficiency > Timeliness
The dataset includes historical data that may not reflect current pricing for PV systems. To alleviate this, updated pricing may be incorporated in the form of external data or as additional synthetic data from simulation.
Only the US is covered in this dataset. Supplementing the data with international large-scale photovoltaic satellite imagery could expand its coverage.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data may be accessed through the USGS’s designated USPVDB mapper or downloaded as GIS shapefiles, tabular data, or XML metadata. Data is open and easily obtainable.
S2: Sufficiency > Coverage
Coverage is over the US and specifically over densely populated regions that may or may not correlate to areas of low cloud cover and high solar irradiance. Representation of smaller scale private PV systems could expand the current dataset to less populated areas as well as regions outside the US.
Weather forecasting: Short-to-medium term (1-14 days)
Details (click to expand)
Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.
The biggest challenge of ENS is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only and are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from Copernicus Climate Data Store - Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days/months to hours
U6: Usability > Large Volume
Massive storage and processing requirements - Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
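One practical way to reduce transfer volume is to request only the variables, region, and time span actually needed. Below is a minimal sketch using the official `cdsapi` client; the dataset name follows the CDS “reanalysis-era5-single-levels” convention, but exact request keys should be verified against the current Climate Data Store documentation:

```python
def build_era5_request(variables, year, months, area):
    """Build a CDS request for a spatial/temporal subset of ERA5 single levels.

    area is [north, west, south, east] in degrees; restricting the bounding
    box avoids downloading the full global grid. Requesting all 31 day values
    is common practice, but behavior for nonexistent dates should be checked
    against the current CDS documentation.
    """
    return {
        "product_type": "reanalysis",
        "format": "netcdf",
        "variable": variables,
        "year": str(year),
        "month": [f"{m:02d}" for m in months],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "area": area,
    }

if __name__ == "__main__":
    # Requires `pip install cdsapi` and CDS credentials in ~/.cdsapirc.
    import cdsapi

    request = build_era5_request(
        ["2m_temperature", "total_precipitation"],
        year=2020, months=[6, 7, 8],
        area=[60, -10, 35, 30],  # bounding box over Europe
    )
    client = cdsapi.Client()
    client.retrieve("reanalysis-era5-single-levels", request, "era5_subset.nc")
```

Subsetting at request time trades one large transfer for many small, resumable ones, which also makes cloud-hosted mirrors and regional caches more effective.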
The biggest challenge of HRES is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only and are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
WeatherBench 2 is based on ERA5, so it inherits ERA5’s issues; in particular, the data has biases over regions where there are no observations.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
Weather forecasting: Subseasonal horizon
Details (click to expand)
High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from Copernicus Climate Data Store - Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days/months to hours
U6: Usability > Large Volume
Massive storage and processing requirements - Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads
R1: Reliability > Quality
Inherent biases limit ground truth applications - ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy
More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on this data.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Larger models generally offer improved performance for developing data-driven sub-seasonal forecast models. However, with only a limited number of models contributing to the SubX dataset, there is a scarcity of training data. To enhance ML model performance, more SubX data generated by physics-based numerical weather forecast models is required.
Weather forecasting: Subseasonal-to-seasonal horizon
Details (click to expand)
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station measurements. Large biases occur over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
Resolution is 0.5 degrees (roughly 50 km), which is not sufficiently fine for many applications.
Camera trap wildlife image collections
Details (click to expand)
Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. This medium offers close-range monitoring over long-time scales. The image sequences can be used to not only classify species but to identify specifics about the individual, e.g. sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density.
In general, the raw images from camera traps need to be annotated before they can be used to train ML models. Some of the available annotated camera trap images are shared via Wildlife Insights (www.wildlifeinsights.org) and LILA BC (www.lila.science), while others are listed on GBIF (https://www.gbif.org/dataset/search?q=). However, the majority of camera trap data is likely scattered across individual research labs or organizations and not publicly available. Sharing such images could provide significant progress towards filling the gaps associated with the lack of annotated data that currently hinders the progress of efficiently using ML in biodiversity studies. This is what initiatives like Wildlife Insights are looking to do.
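Several datasets shared via LILA BC use a COCO-style annotation format. As a sketch (assuming the common “categories”/“annotations” JSON layout; field names should be checked against each dataset’s documentation), per-species image counts can be tallied as follows:

```python
import json
from collections import Counter

def species_counts(annotation_path):
    """Count annotations per species label in a COCO-style camera trap file.

    Assumes a top-level "categories" list (id -> name) and an "annotations"
    list whose entries carry a "category_id", as in the COCO Camera Traps
    layout used by several LILA BC datasets.
    """
    with open(annotation_path) as f:
        coco = json.load(f)
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])
```

Simple summaries like this are often the first step in diagnosing the taxonomic and geographic imbalances discussed below, since they reveal which species dominate a dataset before any model is trained.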
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. It is worth mentioning that the lack of annotated datasets is a common and major challenge applied to almost every modality of biodiversity data and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open, well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise needed to accurately identify species. This holds even for relatively well-studied taxa such as birds, and is more pronounced still for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
Collected data exhibits a significant geographic imbalance, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a gap intertwined with taxonomic incompleteness. Work on insect camera traps is growing, but the field is still in its infancy and data remains limited.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: standardized tools and workflows are lacking for converting raw data into analysis-ready form and analyzing it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: standardized tools and workflows are lacking for converting raw data into analysis-ready form and analyzing it consistently across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
Collected data exhibits a significant geographic imbalance, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a gap intertwined with taxonomic incompleteness.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
- Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
- Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
- Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
- Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
- Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
- Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open, well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise needed to accurately identify species. This holds even for relatively well-studied taxa such as birds, and is more pronounced still for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this gap requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open, well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise needed to accurately identify species. This holds even for relatively well-studied taxa such as birds, and is more pronounced still for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy further compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
Collected data exhibits a significant geographic imbalance, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a gap intertwined with taxonomic incompleteness.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training, and would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Academic literature databases
Academic literature databases, such as OpenAlex, Web of Science, and Scopus.
Active fire data – satellite-derived
Active fire data derived from images taken by satellites such as MODIS, VIIRS, and Landsat at different spatial resolutions and temporal frequencies. These datasets provide near real-time detection of active fires globally and can be downloaded from https://firms.modaps.eosdis.nasa.gov/active_fire.
Advanced metering infrastructure data
Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter systems that collect, store, and analyze per-building energy consumption.
AMI data can be retrieved through public data portals, individual data collection, or research partnerships with local utilities. Examples of utility research partnerships include the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) and the smart meter pilot test from the Sacramento Municipal Utility District (SMUD). An example of publicly available data that is aggregated and anonymized is the Commission for Energy Regulation (CER) Smart Metering Project hosted by the Irish Social Science Data Archive (ISSDA).
AMI data is challenging to obtain without pilot-study partnerships with utilities, since data collection on individual buildings' consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time-series data can also vary with the level of access granted (aggregated and anonymized versus raw) and with the resolution of the readings and metering system. Additionally, coverage is limited to utility pilot-test service areas, restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult to obtain due to privacy concerns. Even when partnered with a utility, the AMI data may undergo anonymization and aggregation to protect individual customers. Some ISOs are able to distribute data provided that a written records request is submitted. When requesting personal consumption data, enrollment in a pricing program may limit the temporal resolution of the data a utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
Using AMI data jointly with other data that may influence demand, such as weather, availability of rooftop solar, presence of electric vehicles, building specifications, and appliance inventory, may require significant additional data collection or retrieval. Non-intrusive load monitoring (NILM) techniques may be employed to disaggregate AMI data, given some assumptions based on additional data. For example, satellite imagery over a region of interest can assist in identifying buildings that have solar panels.
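To make the disaggregation idea concrete, here is a minimal event-based NILM sketch: it detects step changes in an aggregate meter series and matches them against rough appliance signatures. The signatures, threshold, and readings are illustrative assumptions only; real NILM systems use far richer features and probabilistic models.

```python
# Minimal sketch of event-based load disaggregation (NILM). All appliance
# wattages and readings below are illustrative assumptions, not real data.

# Illustrative appliance signatures in watts (assumed values).
SIGNATURES = {"fridge": 150, "kettle": 2000, "ev_charger": 7000}

def detect_events(series, threshold=100):
    """Return (index, delta) for step changes larger than `threshold` watts."""
    events = []
    for i in range(1, len(series)):
        delta = series[i] - series[i - 1]
        if abs(delta) >= threshold:
            events.append((i, delta))
    return events

def label_events(events, signatures=SIGNATURES, tolerance=0.2):
    """Match each step change to the closest signature within a relative
    tolerance; unmatched events are silently dropped in this sketch."""
    labels = []
    for i, delta in events:
        magnitude = abs(delta)
        best = min(signatures, key=lambda k: abs(signatures[k] - magnitude))
        if abs(signatures[best] - magnitude) <= tolerance * signatures[best]:
            labels.append((i, best, "on" if delta > 0 else "off"))
    return labels

# Aggregate readings (watts) with a kettle switching on, then off.
readings = [300, 310, 2320, 2330, 320, 310]
events = detect_events(readings)
print(label_events(events))  # -> [(2, 'kettle', 'on'), (4, 'kettle', 'off')]
```

In practice the additional data mentioned above (appliance inventories, solar presence from imagery) is what supplies plausible signatures for such matching.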
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted to issda@ucd.ie for evaluation. For data obtained through collaborative utility partnerships, usage rights may vary; contact the data provider for more information.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on the formats of the data streams output by the installed sensor network. When designing the testbed data format, it is recommended to develop comprehensive, well-structured metadata for the study to encourage further development.
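As a sketch of what structured testbed metadata might look like, the record below uses hypothetical field names and values (not a published standard), with a small check that the minimum fields are present:

```python
import json

# Hedged sketch of study metadata for a testbed AMI data stream; every field
# name and value here is an illustrative assumption, not a real schema.
metadata = {
    "study": "residential-ami-testbed",            # hypothetical study name
    "sampling": {"interval_minutes": 15, "timezone": "UTC"},
    "units": {"power": "kW", "energy": "kWh"},
    "anonymization": "household IDs pseudonymized",
    "collection_period": {"start": "2024-01-01", "end": "2024-12-31"},
}

def validate(record, required=("study", "sampling", "units")):
    """Check that a metadata record carries the minimum required fields."""
    missing = [k for k in required if k not in record]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    return True

validate(metadata)
print(json.dumps(metadata, indent=2))
```

Shipping such a record alongside each data release makes downstream cleaning and reuse far easier than inferring conventions from the raw streams.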
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open-source. Further data collection for verification purposes is recommended.
S2: Sufficiency > Coverage
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware, ranging from 1-hour to 30-minute to 15-minute measurement intervals. Depending on the level of anonymization and aggregation, granularity may also be constrained by factors such as the cadence of time-of-use pricing and other tiered demand-response programs employed by the partnering utility. Interpolation may be used to address resolution issues, but uncertainty must then be considered when reporting results.
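As a concrete illustration of that interpolation caveat, the sketch below (pure Python, illustrative timestamps in minutes and readings in kW) upsamples 30-minute readings onto a 15-minute grid and flags which points are interpolated, so downstream analyses can track the added uncertainty:

```python
# Sketch: linearly interpolate irregular or coarse meter readings onto a
# regular grid, flagging interpolated points for uncertainty reporting.

def interpolate_readings(times, values, step):
    """Interpolate (time, value) samples onto a regular `step` grid.

    Returns (grid_time, value, is_interpolated) triples; observed samples
    that fall on the grid are passed through unflagged.
    """
    out = []
    t = times[0]
    j = 0
    while t <= times[-1]:
        # Advance j so that times[j] is the latest observation <= t.
        while j + 1 < len(times) and times[j + 1] <= t:
            j += 1
        if t == times[j]:
            out.append((t, values[j], False))
        else:
            # Linear interpolation between the bracketing observations.
            frac = (t - times[j]) / (times[j + 1] - times[j])
            v = values[j] + frac * (values[j + 1] - values[j])
            out.append((t, v, True))
        t += step
    return out

# 30-minute readings (kW) upsampled to a 15-minute grid.
print(interpolate_readings([0, 30, 60], [1.0, 2.0, 1.5], 15))
# -> [(0, 1.0, False), (15, 1.5, True), (30, 2.0, False), (45, 1.75, True), (60, 1.5, False)]
```

Keeping the `is_interpolated` flag in the output is one simple way to make the uncertainty considerations mentioned above reportable.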
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gáis Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change over time. Similarly, pilot programs through participating utilities are finite in nature. To address this gap, studies and testbeds at previous pilot locations can be reopened or revisited. For new studies in different locations, previous data can still be used to pre-train models; however, fine-tuning would still require new data collection.
Aerial power line corridor inspection data
LiDAR and image data collected from unmanned aerial vehicles (UAVs) for power line right-of-way (RoW) inspection can be accessed from private providers such as LUMA Energy and COR3, as well as sources like China Southern Power Grid, with datasets from Yunnan RoW-1, Yunnan RoW-2, and Hubei RoW 4. Open-source EPRI distribution inspection imagery is also available, labeled with information on conductors, poles, crossarms, insulators, and other infrastructure components. These datasets pair images with geolocated GIS data to identify priority vegetation management areas near transmission lines.
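A simplified sketch of that pairing step: flagging geolocated vegetation points within a buffer of a corridor centerline, assuming coordinates are already projected to planar meters. The geometry and points below are invented; real pipelines would use GIS tooling and the utility's actual RoW geometry.

```python
import math

# Sketch: flag vegetation points within `buffer_m` meters of a transmission
# line polyline. Assumes planar (projected) coordinates in meters.

def point_segment_distance(p, a, b):
    """Shortest distance from point p to segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamping to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def priority_points(points, line, buffer_m=10.0):
    """Return points within `buffer_m` of any segment of the polyline `line`."""
    flagged = []
    for p in points:
        d = min(point_segment_distance(p, line[i], line[i + 1])
                for i in range(len(line) - 1))
        if d <= buffer_m:
            flagged.append(p)
    return flagged

line = [(0, 0), (100, 0), (100, 100)]   # simplified RoW centerline (meters)
trees = [(50, 5), (50, 40), (105, 50)]  # hypothetical detected vegetation
print(priority_points(trees, line))     # -> [(50, 5), (105, 50)]
```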
UAV imagery for vegetation management near power lines requires partnerships with private companies and utilities for access. LiDAR data is often sparse, with partial line scans resulting in poor data quality. Coverage is typically limited to specific rights-of-way, requiring continuous monitoring to track vegetation growth over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Once collected, data is private as RoWs represent critical energy infrastructure. Private partnerships may allow for extended usage rights within a predefined scope.
S4: Sufficiency > Timeliness
Measurements should be taken at multiple time periods to examine transmission line characteristics with respect to both vegetation growth and line sag caused by overvoltage conditions.
S2: Sufficiency > Coverage
Coverage can vary depending on the RoW examined. Often multiple datasets that contain multiple transmission RoW UAV image data would be necessary to increase the number of image examples in the dataset.
O1: Obtainability > Findability
Must be involved in an active study with a partnering utility or transmission owner to get access to pre-existing drone data or to get permission to collect drone data.
Automated surface observation system (ASOS)
This dataset contains one- and five-minute observations from automated surface observation system stations in the US. The ASOS network provides near real-time surface weather measurements including wind speed and direction, dew point, air temperature, station pressure, precipitation, visibility, and cloud characteristics. See https://madis.ncep.noaa.gov/madis_OMO.shtml
Data volume is large and only data specific to the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
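One partial mitigation on the processing side is to stream observations rather than load them wholesale. The sketch below computes per-station mean temperatures in a single pass with constant memory; the column names are assumptions for illustration, not the actual MADIS/ASOS schema.

```python
import csv
import io

# Sketch: stream a large station-observation CSV row by row instead of
# loading it whole, accumulating running sums and counts per station.

def station_means(lines):
    """One-pass, constant-memory mean temperature per station."""
    sums, counts = {}, {}
    for row in csv.DictReader(lines):
        sid, temp = row["station"], float(row["temp_c"])
        sums[sid] = sums.get(sid, 0.0) + temp
        counts[sid] = counts.get(sid, 0) + 1
    return {sid: sums[sid] / counts[sid] for sid in sums}

# Tiny in-memory stand-in for a multi-gigabyte observation file.
sample = io.StringIO("station,temp_c\nKNYC,20.0\nKNYC,22.0\nKBOS,18.0\n")
print(station_means(sample))  # -> {'KNYC': 21.0, 'KBOS': 18.0}
```

The same pattern works over a file handle or a chunked remote download, which is what makes remote, compute-adjacent access to the data so valuable.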
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Benchmark datasets for building energy modeling
Building energy modeling datasets provide measurements of energy demand profiles for a sample of buildings, as well as relevant input variables for traditional and ML-based models, enabling benchmarking of different models on energy prediction tasks. For example, the US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states covering 7 climate zones and 11 different building types (https://bbd.labworks.org/dataset-search). The data covers energy consumption, indoor air quality, occupancy, environmental conditions, HVAC, and lighting, among other variables. Datasets are organized by name and points of contact.
All data featured on the platform is open access, with standardization on metadata format to allow for ease of use and information specific to buildings based on type, location, and climate zone. Data quality and guidance on curation and cleaning, in addition to access restrictions, are specified in the metadata of each hosted dataset. Licensing information for each individual featured dataset is provided.
Benchmark datasets for building energy modeling are few, are mostly available in the US, and cover a limited range of building types. The variables provided in such datasets are not always precise and comprehensive enough to test models adequately.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Most of the energy demand data is not freely available. Reasons include the reluctance of private companies to share the data and privacy concerns with respect to the residents of the buildings. Such data may be obtained for research via non-disclosure agreements, often after lengthy bureaucratic approval. This situation makes the development of open-access benchmark datasets complex. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
U2: Usability > Aggregation
The different variables needed may not always be available together. One may need to match energy demand with building stock information and climatic data. Reusable open-source tools may ease this process.
S2: Sufficiency > Coverage
Most datasets are from test beds, buildings, and contributing households from the United States. Similar data from other regions would require data collection as household usage behavior may differ depending on culture, location, building age, and weather. Targeted stakeholder engagement via data collection projects is required to overcome this situation.
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected. To overcome this gap, interpolation techniques may be employed, with the interpolated values recorded as such.
S6: Sufficiency > Missing Components
Certain detailed variables about building design and occupancy may not be recorded, and such data points are difficult to obtain without new data collection. Building data typically does not include grid-interactive data or utility-side signals related to control or demand-side management; such data can be difficult to obtain or require special permissions. Enabling the collection of utility-side signals would allow utility-initiated automated demand response (auto-DR) and load shifting to be better assessed.
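As a concrete illustration of the aggregation gap above (matching energy demand with building stock and climatic data), the following is a minimal Python sketch of a timestamp join. The record layout and field names (kwh, temp_c) are hypothetical, not drawn from any of the hosted datasets.

```python
# Sketch: aligning building energy demand with weather observations by
# timestamp, one step of the aggregation this gap describes.
# Field names ("kwh", "temp_c") are illustrative placeholders.

def merge_by_timestamp(demand, weather):
    """Join two lists of (timestamp, value) records on matching timestamps."""
    weather_by_ts = {ts: v for ts, v in weather}
    return [
        {"timestamp": ts, "kwh": kwh, "temp_c": weather_by_ts[ts]}
        for ts, kwh in demand
        if ts in weather_by_ts  # drop hours lacking a weather observation
    ]

demand = [("2018-01-01T00:00", 12.4), ("2018-01-01T01:00", 11.9)]
weather = [("2018-01-01T00:00", -3.2), ("2018-01-01T01:00", -3.5)]
merged = merge_by_timestamp(demand, weather)
```

Real pipelines must additionally reconcile time zones, units, and sampling intervals across sources before such a join is meaningful, which is where reusable open-source tooling would help.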
Biodiversity images and recordings – community science data
Details (click to expand)
Images and recordings contributed by volunteers represent another significant source of data on biodiversity and ecosystems. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased towards more densely populated species.
Building data genome project (hourly building-level metered data)
Details (click to expand)
The Building Data Genome Project 2 dataset contains hourly building-level data from 3,053 energy meters in 1,636 non-residential buildings, covering two years' worth of metered data for electricity, water, and solar, in addition to metadata on area, primary building use category, floor area, time zone, weather, and smart meter type. The goal of the dataset is to allow for the development of generalizable building models for energy efficiency analysis studies. The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data from universities and higher-education institutions.
While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other types of buildings, as well as expanding coverage to areas and times beyond those currently available.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data was collated from 7 open-access public data sources as well as 12 privately curated datasets obtained from facilities management at different college sites via manual site visits; the latter are not included in the data repository at this time.
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, limiting the diversity of building representation. To overcome this lack of diversity, data-sharing incentives and community open-source contributions could allow the dataset to expand.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies at higher resolution. Assumptions about conditions would have to be made prior to interpolating.
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018. While this may be adequate for pre-training models, further data collection through a reinitiation of the study may be needed to fine-tune models for more recent periods.
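Where the hourly granularity noted above must be harmonized with finer- or coarser-grained sources, simple interpolation is one option. A minimal Python sketch follows, handling interior gaps only and flagging imputed values so they are not mistaken for observations; the series values are invented for illustration.

```python
def interpolate_gaps(series):
    """Linearly fill None gaps in an evenly spaced series (e.g., hourly
    meter readings), recording which indices were imputed rather than
    observed. Assumes gaps are interior (bounded by real values)."""
    filled, imputed = list(series), []
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1  # find the end of the gap
            lo, hi = filled[i - 1], filled[j]
            step = (hi - lo) / (j - i + 1)
            for k in range(i, j):
                filled[k] = lo + step * (k - i + 1)
                imputed.append(k)
            i = j
        i += 1
    return filled, imputed

vals, flagged = interpolate_gaps([10.0, None, None, 16.0])
# vals -> [10.0, 12.0, 14.0, 16.0]; flagged -> [1, 2]
```

Keeping the imputed-index list alongside the filled series makes downstream efficiency studies aware of which readings are synthetic.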
Building stock – from cadaster and aerial imagery
Details (click to expand)
Building stock maps enable a geolocalized understanding of where and which kind of buildings stand, relevant both to climate change mitigation and adaptation. Building stock data from cadasters and aerial imagery provide the most precise available data. In addition to precise building footprints, the 3D geometry of walls and roofs may be available thanks to LiDAR aerial surveys. Further high-quality information from the cadaster may be available as attributes, such as the current usage or the construction year of the building.
These datasets are mainly available in rich countries in Europe, North America, and Asia, leaving large parts of the world that face pressing building stock challenges without appropriate data for detailed assessments. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data. Research efforts, in particular in Europe, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), are addressing several of the existing data gaps.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Certain datasets require searching and navigating websites in a foreign language.
O2: Obtainability > Accessibility
Some datasets are not publicly available and require either payment or governmental authorization. This situation is changing in Europe via the high-value dataset regulation of the European Union, which mandates member states to release their building stock data with permissive licenses (https://data.europa.eu/en/news-events/news/unlocking-potential-high-value-datasets-impact-hvd-implementing-regulation).
U1: Usability > Structure
Datasets are released under a multitude of formats. Despite the existence of standards such as CityGML, one typically needs a particular pipeline for processing every new dataset.
U2: Usability > Aggregation
Datasets are typically released by local authorities and require aggregation. Some efforts, in particular in Europe, including EUBUCCO (eubucco.com) and the Digital Building Stock Model by the Joint Research Centre of the European Commission (https://data.jrc.ec.europa.eu/collection/id-00382), have made this process easier, but without yet enabling seamless updates.
U3: Usability > Usage Rights
Most datasets use attribution-based licenses, but some datasets use custom licenses, unclear licenses, or restrictive licenses.
U4: Usability > Documentation
Most datasets do not provide appropriate documentation to fully understand how the dataset was created.
U5: Usability > Pre-processing
Certain fields may contain local codes that need to be translated and understood. Numerical values may contain encodings for NAs, such as -1 or 1000, that need to be cleaned.
U6: Usability > Large Volume
Precise 3D datasets can be voluminous for a city. Country-level datasets also tend to require significant computing resources.
R1: Reliability > Quality
The height estimation from LiDAR data may contain large errors, e.g., due to surrounding objects such as trees.
S2: Sufficiency > Coverage
There are very few datasets outside of rich countries from Europe, North America, and Asia. Precise 3D models and attribute-rich datasets are available for even fewer countries.
S4: Sufficiency > Timeliness
Practices vary widely, from multiple updates per year to a one-off release that may be more than 10 years old. Aerial surveys with LiDAR are expensive and are rarely done more than once every ten years.
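The pre-processing gap above (local codes and NA encodings such as -1 or 1000) can be handled with a small cleaning pass. A hedged Python sketch follows; the field names and the sentinel set are illustrative rather than taken from any specific cadaster, which documents its own codes that must be checked field by field.

```python
# Sketch: replacing sentinel encodings of missing values before analysis.
# The sentinel set (-1, 1000) mirrors the examples in the entry above;
# real datasets define their own codes per field.

SENTINELS = {-1, 1000}

def clean_record(record, numeric_fields):
    """Return a copy of the record with sentinel-encoded numeric values
    replaced by None, leaving genuine values untouched."""
    cleaned = dict(record)
    for field in numeric_fields:
        if cleaned.get(field) in SENTINELS:
            cleaned[field] = None
    return cleaned

row = {"height_m": -1, "floors": 3}
assert clean_record(row, ["height_m", "floors"]) == {"height_m": None, "floors": 3}
```

Applying such a pass before model training prevents sentinel codes from being learned as if they were physical measurements.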
Building stock – satellite-derived
Details (click to expand)
Building stock maps enable a geolocalized understanding of where and which kind of buildings stand, relevant both to climate change mitigation and adaptation. Satellite-derived datasets, which often use ML for processing satellite imagery, can provide such maps on a global scale. Coarser-resolution maps come as raster data at resolutions varying from 10 to more than 100 m, while the maps with the highest resolution provide details on building footprint geometries as vector data. Some of these datasets may have a temporal resolution and some inferred attributes describing the building characteristics.
These datasets are typically released at a scale that makes their validation complex and partial, implying potentially large uncertainties in the data. The lack of attributes beyond the location and shape of the buildings also limits the breadth of applications. ML may help by inferring missing attributes, given the availability of sufficient training data.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Some datasets have not been published as scientific datasets and lack appropriate documentation about the methodology. Users should be aware of uncertainties in case of insufficient documentation of potential errors.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent the accurate identification of buildings.
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
S3: Sufficiency > Granularity
Raster datasets provide a noisy view of the building stock.
S4: Sufficiency > Timeliness
The data depends on the availability of satellite surveys. Some datasets may mix images from different years. The surveys may be more than 5 years old, mischaracterizing fast-growing areas. In case of disasters, the imagery pre-disaster may not be representative of the current building stock.
S6: Sufficiency > Missing Components
More attributes inferred with high confidence would unlock new use cases.
CMIP6 (earth system model intercomparison data)
Details (click to expand)
CMIP6 (Coupled Model Intercomparison Project Phase 6) provides climate simulations from a consortium of state-of-the-art global climate models, covering historical periods and future scenarios through 2100. The dataset includes multiple climate variables at various spatial and temporal resolutions from modeling centers worldwide. Data can be found at https://pcmdi.llnl.gov/CMIP6/.
The dataset faces three key challenges: its large volume makes access and processing difficult with standard computational infrastructure; lack of uniform structure across models complicates multi-model analysis; and inherent biases and uncertainties in the simulations affect reliability.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Massive computational requirements - Cloud-based platforms and data subsetting tools can improve accessibility
U1: Usability > Structure
Inconsistent formats across models - Standardized naming conventions and preprocessing pipelines can enable seamless multi-model integration
R1: Reliability > Quality
Large uncertainties in future projections - Model evaluation frameworks and ensemble weighting methods can help quantify and reduce uncertainties
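The standardized naming suggested under U1 above can be sketched as a rename table applied before multi-model analysis. In the Python sketch below, the per-model spellings are illustrative: CMIP6 output largely follows CMOR names such as tas and pr, but auxiliary fields and derived products often deviate, and the mapping here is a hypothetical example rather than an official vocabulary.

```python
# Sketch: mapping each model's variable labels onto one canonical
# vocabulary before multi-model analysis. The alias list is illustrative.

CANONICAL = {
    "tas": "tas", "t2m": "tas", "air_temperature": "tas",
    "pr": "pr", "precip": "pr", "precipitation_flux": "pr",
}

def harmonize(variables):
    """Rename a model's variables to canonical names, flagging unknowns
    instead of silently dropping or guessing them."""
    renamed, unknown = {}, []
    for name, data in variables.items():
        if name in CANONICAL:
            renamed[CANONICAL[name]] = data
        else:
            unknown.append(name)
    return renamed, unknown

renamed, unknown = harmonize({"t2m": [280.1], "precip": [1e-5], "zos": [0.2]})
```

Flagging unknown names explicitly keeps the preprocessing pipeline auditable, which matters when dozens of models feed one analysis.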
Large uncertainties in future climate projections limit confidence in bias-correction applications. The massive data volume and inconsistent formats across models—including variable naming conventions, resolutions, and file structures—hinder effective multi-model analysis. Improved model evaluation frameworks and data standardization efforts can enhance projection reliability and streamline ML model development.
CPC Precipitation (global unified daily precipitation)
Details (click to expand)
The CPC Global Unified gauge-based analysis of daily precipitation provides gridded daily precipitation estimates derived from rain gauge observations: https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station data, and there are large biases over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
Resolution is 0.5 deg (roughly 50km) and not sufficiently fine for many applications.
ClimSim (benchmark data for hybrid ML-physics research)
Details (click to expand)
ClimSim is an ML-ready benchmark dataset designed for hybrid ML-physics research, for example, for emulating subgrid clouds and convection processes in climate models.
ClimSim faces challenges with its large data volume, making downloading and processing difficult for most ML practitioners, and its resolution is insufficient to resolve some fine-scale physical processes critical for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, like large-eddy simulations, are needed.
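One mitigation for the volume gap above is streaming computation that never materializes the full dataset. The Python sketch below accumulates a mean over chunks; the chunk source is a stand-in for files or lazily loaded array blocks, and the numbers are illustrative.

```python
# Sketch: streaming statistics over a dataset too large to hold in
# memory. Each chunk could be one file or one lazily loaded array block.

def streaming_mean(chunks):
    """Accumulate a mean over an iterable of numeric chunks without
    materializing the full dataset."""
    total, count = 0.0, 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count

def chunk_source():
    # Generator: yields pieces one at a time instead of one big list.
    yield [1.0, 2.0]
    yield [3.0, 4.0, 5.0]

assert streaming_mean(chunk_source()) == 3.0
```

The same pattern extends to variances, histograms, and per-variable summaries, keeping peak memory proportional to one chunk rather than the whole archive.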
Climate-related laws and regulations
Details (click to expand)
Laws and regulations for climate action are published by national and subnational governments. There are some centralized databases, such as Climate Policy Radar, the International Energy Agency, and the New Climate Institute, that have selected, aggregated, and structured these data into comprehensive resources.
Laws and regulations for climate action are published in various formats through national and subnational governments, and most are not labeled as a “climate policy”. There are a number of initiatives that take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and permanent updating, and the resulting datasets are not complete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be structured into a machine-readable format. Much of the data is in the original languages of the publishing countries and needs to be translated into English.
U2: Usability > Aggregation
Legislation data is published through national and subnational governments, and often is not explicitly labeled as “climate policy”. Determining whether a given law is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
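Determining whether a law is climate-related, as the aggregation gap above notes, often starts from a crude first-pass filter that later stages refine. The naive Python sketch below flags text by keyword match; the keyword list is illustrative, and real aggregation efforts use multilingual ML classifiers rather than exact word matching.

```python
# Naive sketch: flagging legislation text as potentially climate-related
# via keyword matching. Exact word matching misses inflections
# ("emissions" vs "emission"); this is a first-pass filter only.

KEYWORDS = {"climate", "emission", "renewable", "carbon", "adaptation"}

def maybe_climate_related(text):
    """Return True if any climate keyword appears as a word in the text."""
    words = set(text.lower().split())
    return bool(words & KEYWORDS)

assert maybe_climate_related("An act on carbon pricing") is True
assert maybe_climate_related("An act on road tolls") is False
```

Such a filter trades recall for simplicity; its misses are one reason the resulting policy datasets remain incomplete.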
ClimateBench v1.0 (benchmark dataset for earth system models)
Details (click to expand)
ClimateBench v1.0 is a benchmark dataset derived from the NorESM2 Earth System Model (a participant in CMIP6) designed specifically for evaluating machine learning methods that emulate key climate variables. The dataset is publicly available at https://zenodo.org/records/7064308.
The dataset currently includes simulations from only one Earth system model, limiting the diversity of training data and potentially affecting the robustness and generalizability of ML emulators trained on it.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
ClimateSet (ML-ready earth system model inputs/outputs)
Details (click to expand)
ClimateSet is an ML-ready benchmark dataset compiled from inputs and outputs of the Input4MIPS and CMIP6 archives, structured for various machine learning tasks including climate model emulation, downscaling, and prediction. More information is available at https://arxiv.org/pdf/2311.03721.pdf.
Computational fluid dynamics simulation for building energy models
Details (click to expand)
Computational fluid dynamics (CFD) simulation output from building energy models is a means of precisely assessing the thermal (e.g., insulation of the walls) and ventilation (e.g., natural ventilation or HVAC) properties of a building. Given the building geometry, terrain, presence of neighboring buildings, and boundary conditions, the Navier-Stokes equations are typically solved. Datasets including precise building inputs and outputs from CFD would help build ML surrogate models. Surrogate models, such as GANs or physics-constrained deep neural network architectures, have shown promising results, though further research on turbulence representation is needed.
Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate inputs on material properties that may not be documented for traditional urban building types. Model outputs require domain knowledge to interpret, and the large volumes of synthetic data produced for different wind directions become challenging to manage. Future data collection aimed at verifying simulation outputs would benefit surrogate approaches to the computationally expensive Navier-Stokes equations. Coverage is also often restricted to modern construction approaches, leaving passive building techniques from indigenous communities, known as vernacular architecture, out of design consideration.
Data Gap Type
Data Gap Details
W: Wish
Such datasets do not exist and require dedicated work to gather inputs, generate the data via simulations, and ensure that the simulations are reliable by verifying them with real-world data. Licensing and privacy issues may also be important aspects of such efforts.
DOE Atmospheric Radiation Measurement research facility data products
Details (click to expand)
The DOE Atmospheric Radiation Measurement (ARM) dataset comprises ground-based measurements from various field programs sponsored by the US Department of Energy, including sun-tracking photometers, radiometers, and spectrometer data useful for solar radiation time series forecasting and solar potential assessment.
ARM data presents challenges with data volume management, measurement verification (especially for aerosol composition), limited spatial coverage (ARM sites only), and sensor calibration issues. Solutions include AI-based data compression, enhanced aerosol composition measurements, collaboration with partner networks to expand coverage, and automated quality control.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. AI-based data compression and novel indexing can improve data management.
S3: Sufficiency > Granularity
Enhanced aerosol composition and ice nucleating particle measurements are needed for a better understanding of cloud dynamics and solar irradiance for DER site planning.
S2: Sufficiency > Coverage
Spatial coverage is limited to ARM sites within the United States. Collaboration with partner networks can expand coverage both within and outside the US.
R1: Reliability > Quality
Sensor data can be sensitive to noise and calibration issues, requiring automated systems to identify measurement drift.
DYAMOND (global atmospheric circulation model intercomparison data)
Details (click to expand)
DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains) is an intercomparison of global storm-resolving model simulations at 5 km resolution or less, used as targets for climate model emulators.
DYAMOND faces similar challenges to ClimSim: its large volume creates processing difficulties, and its resolution, while high, remains insufficient for resolving fine-scale atmospheric processes needed for accurate climate modeling.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help on optimizing code for large volumes of data would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, like large-eddy simulations, are needed.
Digital elevation model
Details (click to expand)
Surface elevation data, often called digital elevation model or terrain surface model, provide a 3D representation of the bare surface of the Earth. These topographic inputs are important for disaster risk assessments and modeling to assess risks due to floods, sea level rise, or landslides, where the elevation of a given location determines whether it is at risk. These digital models are typically estimated from remote sensing data, for example, the Shuttle Radar Topography Mission. They are often provided as raster but may also be provided as points (vector).
Very high-resolution reference data is currently not freely open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Surface elevation data, defined by a digital elevation model (DEM), is one of the most essential types of reference data, and high-resolution elevation data has great value for disaster risk assessment, particularly in the Global South.
Open DEM data with global coverage now reaches a resolution of 30 m, but this is still insufficient for many disaster risk assessments. Higher-resolution datasets exist, but they either have limited spatial coverage or are commercial products that are very expensive to obtain.
Direct measurement of methane emission of rice paddies
Details (click to expand)
With sampling systems placed in rice paddies, methane concentrations can be directly measured in the air above the fields or in the soil.
There is a lack of direct observation of methane emissions from rice paddies.
Data Gap Type
Data Gap Details
W: Wish
Direct measurement of methane emissions is often expensive and labor-intensive. But this data is essential as it provides the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Distribution system simulators
Details (click to expand)
Distribution system simulators such as OpenDSS and GridLab-D enable analysis of hosting capacity for distribution-level substation feeders by simulating how various factors affect grid stability and reliability. These open-source tools allow researchers to model voltage limits, thermal capabilities, control parameters, and fault currents under different scenarios, providing insights into how distribution grids can safely accommodate distributed energy resources like solar panels. These simulators serve as critical alternatives when real circuit feeder data from utilities is unavailable.
While OpenDSS and GridLab-D provide valuable simulation capabilities, their utility is limited by challenges in obtaining verification data from real distribution circuits, aggregating necessary input data from multiple sources, and navigating usage rights for proprietary utility data. Closing these gaps through improved utility-researcher partnerships and data sharing protocols would significantly enhance the accuracy of hosting capacity assessments, enabling greater renewable energy integration in distribution networks.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Realistic distribution system studies require aggregating and collating data from multiple external sources regarding network topology, load profiles, and DER penetration for the specific region of interest.
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities and/or the Distribution System Operator (DSO) to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
Simulator studies require real deployment data from substations for verification, as actual hosting capacity may vary based on load conditions, environmental factors, and DER penetration levels in the service area.
Drone imagery
Details (click to expand)
Drone imagery provides high-resolution, close-range visual data for species identification, individual tracking, and environmental reconstruction. These images offer detailed insights into habitats and wildlife populations, similar to camera traps but with greater flexibility in coverage. Currently, most drone imagery data is scattered across disparate sources, with some collections hosted on platforms like www.lila.science.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with the incompleteness of current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
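The volunteer-driven annotation described above ultimately requires reconciling conflicting labels from multiple annotators. A minimal sketch of one common approach, majority-vote consensus (the image IDs and species labels are illustrative):

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote consensus from volunteer annotations.

    annotations: dict mapping item id -> list of labels from different
    annotators. Returns item id -> (consensus_label, agreement_fraction).
    """
    consensus = {}
    for item, labels in annotations.items():
        (label, votes), = Counter(labels).most_common(1)
        consensus[item] = (label, votes / len(labels))
    return consensus

# Illustrative camera-trap labels from three volunteers per image.
votes = {
    "img_001": ["lion", "lion", "leopard"],
    "img_002": ["zebra", "zebra", "zebra"],
}
result = aggregate_labels(votes)
```

The agreement fraction is useful downstream: low-agreement items can be routed to domain experts rather than trusted blindly as training labels.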
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with the incompleteness of current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes of restoration projects. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with the incompleteness of current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered on the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
ECMWF ENS (global 9-km 15-day ahead weather model ensemble)
Details (click to expand)
Ensemble forecast up to 15 days ahead, generated by the ECMWF numerical weather prediction model; used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
As with HRES, the biggest challenge of ENS is that only a portion of it is available to the public for free.
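When ENS serves as a baseline for ML forecasts, ensemble skill is commonly summarized with the continuous ranked probability score (CRPS). A minimal sketch of the standard empirical CRPS estimator on a toy ensemble (member values are illustrative):

```python
def ensemble_crps(members, obs):
    """Empirical CRPS for an ensemble forecast against one observation:
    CRPS = mean_i |x_i - y| - 0.5 * mean_{i,j} |x_i - x_j|.
    Lower is better; 0 means a perfect, fully confident forecast."""
    m = len(members)
    spread_to_obs = sum(abs(x - obs) for x in members) / m
    internal_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return spread_to_obs - 0.5 * internal_spread

# Toy 3-member ensemble verified against an observed value of 2.0.
score = ensemble_crps([1.0, 2.0, 3.0], 2.0)
```

In practice the score is computed per grid point and aggregated over space and lead time; libraries provide vectorized versions of this same estimator.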
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available only for purchase, and obtaining it is expensive.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
ECMWF ERA5 Atmospheric Reanalysis
Details (click to expand)
ERA5 is a comprehensive atmospheric reanalysis dataset covering 1940 to present that integrates in-situ and remote sensing observations from weather stations, satellites, and radar into a global, hourly gridded product at 31 km resolution. The dataset is continuously updated and available for download through the Copernicus Climate Data Store.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking anywhere from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from the Copernicus Climate Data Store. Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days or months to hours.
U6: Usability > Large Volume
Massive storage and processing requirements. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
R1: Reliability > Quality
Inherent biases limit ground-truth applications. ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy.
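One way to work around both the download delays and the volume is to request only the spatial and temporal subset needed via the CDS API rather than pulling full global fields. A sketch that builds such a subset request (the region, variable, and helper function are illustrative assumptions; actual retrieval requires a free CDS account and the `cdsapi` client):

```python
def build_era5_request(variables, year, months, area_nwse, grid=None):
    """Build a CDS request for the 'reanalysis-era5-single-levels'
    dataset that asks only for the region and period needed, instead of
    the full global archive."""
    request = {
        "product_type": "reanalysis",
        "variable": variables,
        "year": str(year),
        "month": [f"{m:02d}" for m in months],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "area": area_nwse,      # [North, West, South, East] in degrees
        "format": "netcdf",
    }
    if grid:
        request["grid"] = grid  # e.g. [0.5, 0.5] coarsens on the server side
    return request

# Illustrative: hourly 2m temperature over Europe for June 2020.
req = build_era5_request(["2m_temperature"], 2020, [6], [72, -25, 34, 45])

# With a configured CDS account, the request would then be submitted via:
#   import cdsapi
#   cdsapi.Client().retrieve("reanalysis-era5-single-levels", req, "era5_subset.nc")
```

Requesting a bounding box and a single month this way can shrink a retrieval by several orders of magnitude relative to a full-archive download.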
While ERA5 is widely used due to its good structure and global coverage, users face significant challenges: download times can run from days to months, and the sheer data volume presents processing difficulties for many users.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Massive storage and processing requirements. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
ERA5 is widely used due to its high resolution and global coverage, but faces significant accessibility and reliability challenges. Download times from the Copernicus Climate Data Store can take days to months due to high demand and data storage on tape systems. ERA5’s own biases and uncertainties, particularly in precipitation fields, limit its effectiveness as ground truth for ML bias correction. Enhanced download infrastructure and improved reanalysis methods incorporating ML-based data assimilation can address these limitations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Download delays from the Copernicus Climate Data Store. Enhanced server infrastructure, regional mirror sites, and cloud-based access platforms can reduce download times from days or months to hours.
U6: Usability > Large Volume
Massive storage and processing requirements. Cloud computing platforms with pre-loaded datasets and data subsetting tools can enable analysis without full downloads.
R1: Reliability > Quality
Inherent biases limit ground-truth applications. ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving precipitation and cloud field accuracy.
ECMWF HRES (global 9-km 10-day ahead high-resolution weather forecast)
Details (click to expand)
Single high-resolution forecast up to 10 days ahead, generated by the ECMWF numerical weather prediction model, the Integrated Forecasting System (IFS). It is usually used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available only for purchase, and obtaining it is expensive.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
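A routine volume-reduction step for large gridded archives like HRES (or its WeatherBench 2 repackaging) is to map a geographic bounding box onto array index slices so that only the needed sub-array is read from storage. A minimal sketch for a regular latitude/longitude grid (the grid spacing and region are illustrative):

```python
def region_slices(lats, lons, south, north, west, east):
    """Map a geographic bounding box onto index slices of a regular
    lat/lon grid, so only the needed sub-array is read from storage."""
    lat_idx = [i for i, v in enumerate(lats) if south <= v <= north]
    lon_idx = [i for i, v in enumerate(lons) if west <= v <= east]
    return slice(lat_idx[0], lat_idx[-1] + 1), slice(lon_idx[0], lon_idx[-1] + 1)

# Illustrative 1-degree global grid with descending latitudes, as in many archives.
lats = [90 - i for i in range(181)]
lons = list(range(360))
lat_sl, lon_sl = region_slices(lats, lons, south=34, north=72, west=0, east=45)
# A chunked reader would then fetch only data[..., lat_sl, lon_sl].
```

Chunk-aware formats such as Zarr (used by WeatherBench 2) make this kind of sliced access efficient, since only the chunks overlapping the slices are transferred.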
EPRI10 (transmission control center alarm and operational data set)
Details (click to expand)
Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format, which includes semi-structured text descriptions of individual alarm events.
Often, the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.
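The record layout described above can be illustrated with a small parser. A sketch assuming a pipe-delimited encoding of the fields (the delimiter, field order, and example values are illustrative assumptions; real SCADA exports vary by utility):

```python
from datetime import datetime, timezone

def parse_alarm_row(row):
    """Parse one semi-structured alarm record of the illustrative form:
    epoch_ms|station|signal_id|location|description|action."""
    ts_ms, station, signal_id, location, description, action = row.split("|")
    return {
        "time": datetime.fromtimestamp(int(ts_ms) / 1000, tz=timezone.utc),
        "station": station,
        "signal_id": signal_id,
        "location": location,
        "description": description.strip(),
        "action": action.strip(),
    }

rec = parse_alarm_row(
    "1700000000123|SUB_A|BRKR_52-1_TRIP|Feeder 7|Overcurrent phase B|OPEN"
)
```

Keeping millisecond timestamps as timezone-aware datetimes up front avoids the ordering bugs that plague downstream event-sequence analysis.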
Access to EPRI10 grid alarm data is currently limited within EPRI. Usability gaps result from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events need field verification to assess whether a fault or non-fault event occurred.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
Operational alarm data volume is large, given the millisecond cadence of measurements in the system. The result is high-volume data that is tabular in nature but unstructured with respect to the text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains locations and grid asset information, spatio-temporal analysis can be performed with respect to a single sensor and the conditions under which that sensor is operating. Indexing and mining the time series data can therefore facilitate faster search over alarm data leading up to a fault event. Additionally, natural language processing and text mining techniques can be utilized to facilitate search over alarm text and details.
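The time-series indexing idea can be sketched as bucketing alarms into fixed windows so that the events leading up to a fault are retrievable without a full scan (timestamps, window sizes, and alarm codes here are illustrative):

```python
from collections import defaultdict

def build_window_index(events, window_ms=60_000):
    """Bucket (timestamp_ms, alarm_code) events into fixed time windows
    so any interval can be queried without scanning the full stream."""
    index = defaultdict(list)
    for ts, code in events:
        index[ts // window_ms].append((ts, code))
    return index

def alarms_before(index, fault_ts, lookback_ms=120_000, window_ms=60_000):
    """Return alarms in [fault_ts - lookback_ms, fault_ts), sorted by time."""
    start = fault_ts - lookback_ms
    hits = []
    for w in range(start // window_ms, fault_ts // window_ms + 1):
        hits.extend((ts, c) for ts, c in index.get(w, []) if start <= ts < fault_ts)
    return sorted(hits)

events = [(10_000, "OC_TRIP"), (70_000, "VOLT_DIP"), (130_000, "BRKR_OPEN")]
idx = build_window_index(events)
pre_fault = alarms_before(idx, fault_ts=130_000)
```

Production systems would back this with a time-series database, but the access pattern — window lookup rather than full scan — is the same.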
U5: Usability > Pre-processing
In addition to challenges with respect to the decoding of remote signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of text detail provided. Typically, the details cover information with respect to the grid asset and its action. For example, a text description from a line monitoring device may describe the power, temperature, and the action taken in response to the grid alarm trigger event. Often, in real-world systems, the majority of grid alarm trigger events are short circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To combat these issues, data pre-processing becomes necessary. For remote signal identification data, this includes parsing and hashing through text codes, assessing code components for redundancies, and building an associated reduced dictionary of alarm codes. For textual description fields and post-fault field reports, the use of natural language processing techniques to extract key information can provide more consistency between sensor data. Additionally, techniques like diverse sampling can account for the class imbalance with respect to the associated fault that can trigger the alarm.
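The reduced dictionary of alarm codes mentioned above can be sketched as a normalization pass that collapses naming-convention variants of the same alarm (the normalization rules and example codes are illustrative assumptions, not EPRI's actual conventions):

```python
import re

def normalize_alarm_code(code):
    """Collapse naming-convention variants of the same alarm into one
    canonical key: uppercase, unify separators, drop numeric suffixes."""
    code = code.upper()
    code = re.sub(r"[\s\-/]+", "_", code)
    code = re.sub(r"_\d+$", "", code)  # per-device numeric suffixes
    return code

def reduced_dictionary(codes):
    """Map each raw code to its canonical form; redundant codes collapse."""
    return {c: normalize_alarm_code(c) for c in codes}

# Three raw variants that all denote the same overcurrent trip alarm.
mapping = reduced_dictionary(["OC-TRIP-1", "OC_TRIP_2", "oc trip 3"])
```

On a real corpus the rules would be tuned against the code dictionary and validated with operators, since an over-aggressive normalization can merge genuinely distinct alarms.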
U4: Usability > Documentation
Remote signaling identification information from monitoring sensors and devices encodes data with respect to the alarm trigger event in the context of fault priority. Based on the asset, line, or sensor, this identification code can vary depending on the naming conventions used. Documentation on remote signal ids associated with a dictionary of finite alarm code types can facilitate pre-processing of alarm data and assessment on the diversity of fault events occurring in real-time systems (as different alarm trigger codes may correspond to redundant events similar in nature).
U3: Usability > Usage Rights
Usage rights are currently constrained to those working within EPRI at this time.
U2: Usability > Aggregation
Reports on location, asset, and time can result in false alarm triggers requiring operators to send field workers to investigate, fix, and recalibrate field sensors. The data with respect to field assessments can be incorporated into the original data to provide greater context resulting in compilation of multimodal datasets which can enhance alarm data understanding.
U1: Usability > Structure
Grid alarm codes may be non-unique across lines and grid assets: two different codes can represent equivalent information due to differing naming conventions, requiring significant pre-processing and analysis to identify unique labels from over 2000 code words. Additional labels expressing alarm priority (for example, a high alarm type indicating events such as fire, gas, or lightning) are also encoded into the grid alarm trigger event code. Creating a standard structure for operational text data, such as those already used in operational systems by companies like General Electric or Siemens, can avoid inconsistencies in the data.
R1: Reliability > Quality
Alarm trigger events, and the corresponding actions taken in response, require post-event assessment by field workers for verification, especially in cases of faults or perceived faults.
O2: Obtainability > Accessibility
Data access is limited within EPRI due to restrictions on data provided by utilities. Anonymizing and aggregating the data into a benchmark or toy dataset that EPRI releases to the wider community could circumvent these security issues, at the cost of operational context.
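One common building block for such an anonymized release is a salted one-way hash of asset identifiers, so records stay linkable within the dataset without exposing the utility's naming scheme. This is a generic sketch, not an EPRI method; the identifier format and salt handling are illustrative.

```python
import hashlib

def anonymize_asset_id(asset_id: str, salt: str) -> str:
    """Replace an asset identifier with a salted one-way hash.
    The same (salt, asset_id) pair always yields the same token,
    preserving linkability within one release."""
    digest = hashlib.sha256((salt + asset_id).encode("utf-8")).hexdigest()
    return digest[:12]  # truncated for readability; the salt must stay secret

anon = anonymize_asset_id("LN12-FEEDER-07", salt="keep-this-secret")
```

Truncation and salt rotation policies would need review against the actual re-identification risk of the released data.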
Electric vehicle charge station data
Details (click to expand)
Electric vehicle charging station datasets typically include location, charger specifications, energy delivery amounts, charge duration, costs, and usage patterns for both AC slow charging (depot-based) and DC fast charging (en-route) stations, though specific datasets vary by provider and region.
Critical gaps include limited findability of station-specific usage data due to proprietary restrictions and scattered data sources requiring aggregation from multiple providers. Manufacturer partnerships and utility collaboration can improve data access, while standardized reporting frameworks can consolidate fragmented datasets to enable comprehensive fleet optimization.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Charging station usage profiles and vehicle-specific load data are often proprietary. Solution: Establish manufacturer partnerships and utility pilot programs to access detailed charging profiles.
U2: Usability > Aggregation
Charging data scattered across multiple providers and systems. Solution: Create standardized APIs and data sharing agreements between charging network operators.
Emission dataset compiled from FAO statistics
Details (click to expand)
Dataset Introduction: This dataset comprises agricultural emissions data compiled from Food and Agriculture Organization (FAO) statistics and spatially extrapolated to provide geospatial coverage. It includes estimates of greenhouse gas emissions related to agricultural practices across different regions worldwide and is periodically updated as new FAO statistics become available.
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Environmental DNA (eDNA)
Details (click to expand)
Environmental DNA (eDNA) datasets consist of genetic material obtained from environmental samples, like soil and water, after being shed by living or dead organisms. By analyzing this genetic material, researchers can detect and monitor species present in a non-invasive and efficient manner, aiding biodiversity studies, conservation efforts, and environmental monitoring. Some eDNA data can be found in GBIF (the Global Biodiversity Information Facility). BIOSCAN-5M is another relevant, comprehensive dataset containing multi-modal information, including DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.
A significant challenge for eDNA-based monitoring is the incomplete barcoding reference databases, limiting the ability to accurately identify species from genetic material. Initiatives like the BIOSCAN project are actively working to address this gap by expanding reference collections for diverse taxonomic groups, particularly for understudied regions and species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Incomplete barcoding reference databases limit the identification of many species from eDNA samples, particularly in biodiverse regions.
Equivalent circuit models
Details (click to expand)
Equivalent circuit models (ECMs) are simplified representations of batteries that use networks of resistors and capacitors to model battery behavior arising from electrochemical reactions. Due to their ease of use, they integrate readily into battery management control systems and can be customized to model a variety of battery chemistries and conditions. Types of equivalent circuit models include the Rint model, hysteresis models, Randles models, and Thevenin models. These models differ in complexity with respect to the extent to which battery behavior is captured: for example, the simplest model, the Rint model, is static, while other models vary in their representation of dynamic properties such as state of charge and battery lifetime.
While ECMs enable real-time battery SoC predictions due to their computational efficiency, they often oversimplify real-life operating conditions which limits the accuracy of SoH and RUL estimates. Additionally, verification with data from physical battery systems is required to validate simulated outcomes and improve prediction reliability across diverse operational environments.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Due to their simplified nature and assumptions based on ideal laboratory conditions, ECMs have limited accuracy in predicting battery aging and dynamics in real systems. Verification with real-life battery system data from diverse operational environments is essential for improving state of health (SoH) and remaining useful life (RUL) predictions.
S3: Sufficiency > Granularity
The resolution of SoH and SoC predictions from ECMs is impacted by assumptions made about battery performance. These include constant internal resistance assumptions that do not account for sensitivity to complex current profiles or temperature variations, leading to inaccurate voltage and subsequent SoH/SoC calculations. ECMs also simplify electrochemical processes by ignoring electrode polarization, diffusion, and transfer kinetics, while neglecting battery aging effects like capacity fade. Linearity assumptions in simpler ECMs do not hold under high charge/discharge rates. Solutions include increasing the complexity of ECMs by adding parallel RC networks to model the internal resistance of the battery with different time constants, introducing non-linear elements for different operating conditions, incorporating adaptive hysteresis models, and integrating aging parameters.
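To make the RC-network idea concrete, a first-order Thevenin model (series resistance R0 plus one parallel RC pair) can be simulated in discrete time. This is a textbook sketch; the parameter values and the linear open-circuit-voltage curve are placeholder assumptions, not measured characteristics of any real cell.

```python
import math

def simulate_thevenin(current, dt, capacity_ah, r0, r1, c1, soc0=1.0):
    """First-order Thevenin ECM: terminal voltage = OCV(soc) - I*R0 - V_RC,
    where the RC-pair voltage V_RC relaxes with time constant R1*C1.
    `current` lists discharge currents in A (positive = discharge);
    SoC is tracked by simple coulomb counting."""
    def ocv(soc):
        return 3.0 + 1.2 * soc  # assumed linear open-circuit voltage [V]

    alpha = math.exp(-dt / (r1 * c1))  # per-step RC relaxation factor
    soc, v_rc = soc0, 0.0
    socs, volts = [], []
    for i in current:
        # RC branch: exponential decay plus excitation by the load current
        v_rc = alpha * v_rc + r1 * (1.0 - alpha) * i
        soc -= i * dt / (capacity_ah * 3600.0)  # coulomb counting
        volts.append(ocv(soc) - i * r0 - v_rc)
        socs.append(soc)
    return socs, volts

# Constant 2 A discharge of an assumed 2 Ah cell for 60 seconds.
socs, volts = simulate_thevenin([2.0] * 60, dt=1.0, capacity_ah=2.0,
                                r0=0.05, r1=0.02, c1=500.0)
```

Adding more RC pairs with different time constants, or making R0/R1 functions of temperature and SoC, is exactly the complexity increase the text describes.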
Faraday (Synthetic smart meter data)
Details (click to expand)
Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open up smart meter data, Octopus Energy's Centre for Net Zero has generated a synthetic dataset conditioned on the presence of low carbon technologies, energy efficiency, and property type from a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier. Faraday is currently accessible through the Centre for Net Zero's API.
Despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge with respect to building type, efficiency, and presence of low-carbon technologies. Furthermore, since the dataset is trained on UK building data, AMI time series generated may not be an accurate representation of load demand in regions outside the UK. Finally, since data is synthetically generated, studies will require validation and verification using real data or aggregated per substation level data to assess its effectiveness.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The variational autoencoder model can generate synthetic AMI data conditioned on several inputs. The presence of low carbon technology (LCT) for a given household or property type depends on access to battery storage solutions, solar rooftop panels, and electric vehicles; this type of data may require curation of LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric based on actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
S2: Sufficiency > Coverage
Faraday is trained on utility-provided AMI data from the UK, which may not be representative of the load demand, building types, and climate zones of other global regions. To generate similar synthetic data, custom data may be retrieved through a pilot test bed for private collection or through a partnership with a local utility. Additionally, pre-existing AMI data over an area of interest can be used to generate similar synthetic data.
Datasets are restricted to past pilot study coverage areas, requiring further data collection to fine-tune models for a different coverage area.
S3: Sufficiency > Granularity
Data granularity is limited to that of the data the model was trained on. Generative modeling approaches similar to Faraday can be built using higher-resolution data, or interpolation methods can be employed.
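As a simple example of the interpolation option, a coarse half-hourly profile can be upsampled to 5-minute resolution with linear interpolation. The profile values below are synthetic, and this approach only smooths between samples; it cannot recover sub-half-hour spikes the meter never recorded.

```python
import numpy as np

# Half-hourly synthetic AMI readings (kW) over 2 hours: samples at t = 0..120 min.
t_coarse = np.arange(0, 121, 30)
load_coarse = np.array([0.4, 0.6, 1.1, 0.9, 0.5])

# Upsample to 5-minute resolution by linear interpolation between readings.
t_fine = np.arange(0, 121, 5)
load_fine = np.interp(t_fine, t_coarse, load_coarse)
```

For genuinely finer dynamics, retraining a generative model on higher-resolution data is the more faithful route.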
S4: Sufficiency > Timeliness
Keeping the dataset timely would require continuous integration and development of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy's OpenSynth initiative, the Centre for Net Zero hopes to build a global community of contributors to facilitate research.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid modeling manner. For example, load demand at the substation level can be estimated as the sum of the individual building loads that the substation serves; this value can then be compared to the actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, verifying a specific demand profile for a property or set of properties would require identifying a population of buildings, a connected real-world substation, and the residential low carbon technology investments for the properties under study.
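The bottom-up check described above can be sketched as follows. The synthetic building profiles and the noisy "measured" substation profile are randomly generated stand-ins, and the error metrics are illustrative choices rather than an established acceptance criterion.

```python
import numpy as np

def validate_bottom_up(building_loads, substation_load):
    """Compare the sum of synthetic per-building load profiles against a
    measured substation profile on the same time base.
    Returns absolute RMSE and RMSE normalized by mean demand."""
    aggregate = building_loads.sum(axis=0)           # sum over buildings
    rmse = float(np.sqrt(np.mean((aggregate - substation_load) ** 2)))
    nrmse = rmse / float(np.mean(substation_load))   # relative to mean demand
    return rmse, nrmse

rng = np.random.default_rng(0)
synthetic = rng.uniform(0.2, 1.5, size=(100, 48))    # 100 buildings, 48 half-hours
measured = synthetic.sum(axis=0) + rng.normal(0, 2.0, size=48)  # noisy "truth"
rmse, nrmse = validate_bottom_up(synthetic, measured)
```

In a real study, `measured` would come from DNO substation telemetry, and the acceptable error level would depend on the downstream application.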
FathomNet (marine wildlife annotated imagery)
Details (click to expand)
FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. The data can be used to train, test, and validate ML algorithms to help us understand our ocean and its inhabitants.
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Financial loss datasets related to the impacts of disasters
Details (click to expand)
Financial loss datasets related to disasters track the economic impacts of catastrophic events, including insurance claims and damages to infrastructure. They help assess financial repercussions and guide risk management and preparedness strategies.
Financial loss data for disasters is primarily proprietary and inaccessible to researchers, limiting the development of comprehensive disaster impact assessment models.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes to improve disaster response and planning.
Financial loss data is typically proprietary and not publicly accessible.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Most consistent loss data is produced by the insurance industry and remains proprietary.
O2: Obtainability > Accessibility
Collecting robust, homogeneous loss data even for a single event presents significant challenges.
U4: Usability > Documentation
Loss data frequently lacks metadata, making it difficult to determine data completeness.
Grid2Op and PandaPower (power systems simulation outputs)
Details (click to expand)
Grid2Op is a power systems simulation framework to perform reinforcement learning for electricity network operation that focuses on the use of topology to control the flows of the grid.
Grid2Op allows users to control voltages by manipulating shunts or changing generator setpoint values, to influence active generation through redispatching, and to operate storage units such as batteries or pumped storage to produce or absorb energy from the grid when needed. The grid is represented as a graph whose nodes are buses and whose edges correspond to power lines and transformers. Grid2Op provides several environments with different network topologies, as well as variables that can be monitored as observations. The environment is designed for reinforcement learning agents to act upon through a variety of actions, some binary and some continuous, including topology changes such as changing bus, changing line status, setting storage, curtailment, redispatching, setting bus values, and setting line status. Multiple reward functions are also available in the platform for experimentation with different agents. It is important to note that Grid2Op has no internal modeling of the grid equations, nor does it prescribe which solver to adopt. Data on how the power grid evolves is represented by the “Chronics,” and the solver that computes the state of the grid is represented by the “Backend,” which uses PandaPower to compute power flows.
Grid2Op faces several data gaps related to usability, reliability, and coverage. Key issues include poor documentation, limited customization options (especially for reward functions and cascading failure scenarios), and a lack of support for multi-agent setups. The framework also lacks realistic system dynamics, fine time resolution, and flexible backend modeling, making it challenging to use for advanced research or real-world grid simulations without significant modification. These gaps can hinder the framework’s ability to accurately train reinforcement learning agents and simulate real-world power grid behavior.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
Several TODOs remain in the customization of the reward function concerning its units and attributes related to redispatching. Documentation and code comments sometimes provide conflicting information. The modularity of reward, adversary, action, environment, and backend is non-intuitive, requiring pre-generated dictionaries rather than dynamic inputs or conversion from single-agent to multi-agent functionality. Refactoring documentation and comments to reflect updates would assist users and avoid the need to cross-reference information from the “Learning to Run a Power Network” Discord channel and GitHub issues.
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading failure scenarios or more complex adversaries such as natural disasters. Code base variations between versions, especially between the native and Gym-formatted frameworks, lose features present in the legacy version, including topology graphics. Open-source refactoring efforts could help update the code base so that the latest and previous versions run without loss of features.
R1: Reliability > Quality
The Grid2Op framework relies on mathematically robust control laws and rewards that train the RL agent on fixed observation assumptions rather than actual system dynamics, which are susceptible to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal modeling of the grid equations, nor can it suggest which solver should be adopted to solve traditional nonlinear optimal power flow equations; specifics of modeling and preferred solvers require users to customize or create a new “Backend.” Additionally, such human-in-the-loop RL systems in practice require trustworthiness and quantification of risk. A library of open-source contributed “Backends” from independent projects that customize the framework, with supplemental documentation and paper references, could assist further development of the environment for different conditions. Human-in-the-loop studies can be completed by testing environment scenarios and the control response of the system over a model of a real grid; generated observations and control actions can then be compared to historical event sequences and grid operator responses.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the Grid2Op environment, which are based on different IEEE bus topologies. While customization of the “Backend,” “Parameters,” and “Rules” is possible, dependent modules may still enforce game-over rules. Furthermore, since backend modeling is not the focus of Grid2Op, verification that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observation time series grid data, or chronics. Furthermore, the granularity may limit the effectiveness of specific actions in the provided action space: for example, the use of energy storage devices in the presence of overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage. Expanding the framework, with efforts from the open-source community, to include multiple time resolutions may allow generalization of the tool to different forecasting time horizons and action evaluations.
Ground survey of land use and land management
Details (click to expand)
Ground surveys collect direct field observations on land use practices and management approaches, providing critical ground-truth data that complements remote sensing. This information is essential for understanding human impacts on ecosystems and validating satellite-derived land cover classifications.
Access to comprehensive ground survey data is restricted due to institutional barriers and privacy concerns, limiting its availability for ecosystem change analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
Ground-Based Weather Station Observations
Details (click to expand)
Ground-based weather station data provides point measurements of atmospheric variables including temperature, precipitation, and humidity from meteorological networks worldwide. These observations serve as ground truth for validating and bias-correcting climate model outputs, though spatial coverage varies significantly by region and is particularly sparse in developing countries.
Irregular spatial distribution and point-based measurements require extensive preprocessing to create gridded datasets suitable for ML applications. Limited station density in many regions, especially over oceans and remote areas, constrains bias-correction accuracy. Enhanced observation networks and improved interpolation techniques can provide more comprehensive spatial coverage for model validation.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Point measurements require gridding: statistical interpolation methods and geostatistical techniques can convert station data to regular grids.
S2: Sufficiency > Coverage
Sparse coverage in remote regions: expanded observation networks and satellite-derived proxies can fill spatial gaps.
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
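As one concrete baseline for the gridding step, inverse-distance weighting (IDW) interpolates station values onto a regular grid; kriging or spline-based methods are common, more sophisticated alternatives. The station coordinates and temperatures below are made up for illustration.

```python
import numpy as np

def idw_grid(st_lon, st_lat, st_val, grid_lon, grid_lat, power=2.0):
    """Inverse-distance-weighted interpolation of station values onto a
    regular lon/lat grid. Uses planar distances, which is only a
    reasonable approximation over small domains."""
    glon, glat = np.meshgrid(grid_lon, grid_lat)
    out = np.empty(glon.shape)
    for idx in np.ndindex(*glon.shape):
        d = np.hypot(st_lon - glon[idx], st_lat - glat[idx])
        if d.min() < 1e-9:                  # grid node coincides with a station
            out[idx] = st_val[d.argmin()]
        else:
            w = 1.0 / d ** power            # closer stations weigh more
            out[idx] = (w * st_val).sum() / w.sum()
    return out

stations_lon = np.array([0.0, 1.0, 0.0])
stations_lat = np.array([0.0, 0.0, 1.0])
temps = np.array([10.0, 14.0, 12.0])        # e.g., temperatures in deg C
grid = idw_grid(stations_lon, stations_lat, temps,
                np.linspace(0, 1, 5), np.linspace(0, 1, 5))
```

IDW never extrapolates outside the range of station values, which is why sparse networks in remote regions cap what any interpolation can recover.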
Ground-survey based forest inventory data
Details (click to expand)
Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models. See https://research.fs.usda.gov/programs/fia#data-and-tools
Manual collection results in data quality issues and limited spatial coverage, requiring improved collection protocols and integration with remote sensing to expand usability.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Ground-survey data often contains missing values, measurement errors, and duplicates that require cleaning before use. Standardizing collection protocols and developing automated quality control procedures could improve data usability.
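An automated quality-control pass might look like the following sketch. The field names (`plot_id`, `tree_id`, `height_m`, `dbh_cm`) and the plausibility ranges are illustrative assumptions, not an established inventory protocol.

```python
def quality_control(records, height_range=(0.0, 120.0), dbh_range=(0.0, 400.0)):
    """Flag ground-survey tree records with missing values, out-of-range
    measurements (height in m, diameter at breast height in cm), or
    duplicate plot/tree IDs. Returns (clean, flagged) lists."""
    seen, clean, flagged = set(), [], []
    for rec in records:
        key = (rec.get("plot_id"), rec.get("tree_id"))
        h, d = rec.get("height_m"), rec.get("dbh_cm")
        if (None in (h, d)
                or not height_range[0] < h <= height_range[1]
                or not dbh_range[0] < d <= dbh_range[1]
                or key in seen):
            flagged.append(rec)
        else:
            seen.add(key)
            clean.append(rec)
    return clean, flagged

records = [
    {"plot_id": 1, "tree_id": 1, "height_m": 22.5, "dbh_cm": 31.0},
    {"plot_id": 1, "tree_id": 1, "height_m": 22.5, "dbh_cm": 31.0},   # duplicate
    {"plot_id": 1, "tree_id": 2, "height_m": 250.0, "dbh_cm": 28.0},  # implausible
    {"plot_id": 1, "tree_id": 3, "height_m": None, "dbh_cm": 12.0},   # missing
]
clean, flagged = quality_control(records)
```

Flagged records would normally be routed back for field re-measurement rather than silently dropped.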
S2: Sufficiency > Coverage
Manual collection methods limit geographical coverage and collection frequency. Integrating ground surveys with remote sensing approaches and developing citizen science initiatives could help expand coverage while maintaining data quality.
Health data
Details (click to expand)
Health data refers to information related to individuals’ physical and mental well-being. This can include a wide range of data, such as medical records, health surveys, healthcare utilization, and epidemiological data.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
There are, in general, few datasets one can use to cover the spectrum of population, age, gender, economic status, etc. To make good use of available data, there should be more effort to integrate data from disparate sources, such as creating data repositories and an open community data standard.
U4: Usability > Documentation
Some data repositories are available. The existing issue is that the data is not always accompanied by the source code that created it or by other good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data is usually in raster files or gridded format, whereas health data is usually in tabular format. Mapping climate data to the same geospatial entity of health data is also computationally expensive.
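Once health-data administrative boundaries are rasterized onto the climate grid, the mapping reduces to zonal aggregation: averaging each gridded variable over the cells belonging to each geospatial entity. The toy temperature grid and two-zone map below are illustrative.

```python
import numpy as np

def zonal_mean(raster, zone_ids):
    """Average a gridded climate variable over each zone of a rasterized
    region map of the same shape, yielding one tabular value per
    geospatial entity (e.g., a health-reporting district)."""
    return {int(z): float(raster[zone_ids == z].mean())
            for z in np.unique(zone_ids)}

# 4x4 temperature grid and a matching map of two administrative zones.
temperature = np.arange(16, dtype=float).reshape(4, 4)
zones = np.array([[1, 1, 2, 2]] * 4)
means = zonal_mean(temperature, zones)  # one mean temperature per zone
```

For real rasters, the expensive step is the boundary rasterization and handling of cells that straddle zone borders (area-weighted averaging), which this sketch omits.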
High-Resolution Rapid Refresh (HRRR) weather forecast
Details (click to expand)
The High-Resolution Rapid Refresh (HRRR) dataset contains near-term weather forecasts produced at 3-km resolution with hourly updates. It is a cloud-resolving, convection-allowing atmospheric model that assimilates radar data every 15 minutes over a 1-hour period. See https://rapidrefresh.noaa.gov/hrrr/
Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Historical climate observations
Details (click to expand)
Historical climate observations provide essential baseline data for tracking ecosystem changes over time. This dataset includes both global reanalysis products like ERA5 that offer comprehensive but coarse-resolution data, and more granular observations aggregated from local weather stations that provide detailed climate information at specific locations.
Processing climate data and integrating it with health data are significant challenges.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform covering all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and so on, which are important for biodiversity patterns. This is because observation systems are not dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, which cannot be done by any single country.
Lab measurements of material property and carbon absorption
Details (click to expand)
Lab measurements of material properties (such as chemical composition and physical properties) and their performance on carbon absorption (such as absorption capacity).
The major challenge is that data is not shared with the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data related to carbon absorption materials is often not readily accessible to the public, as it is typically withheld until commercial products are developed. While it is possible to scrape data from published literature, this approach can be cumbersome, especially for large datasets. To advance research and innovation in this field, establishing mandatory data sharing as a requirement for publication is essential. When a paper is published, authors should be required to provide their data in open, machine-readable formats to facilitate accessibility and usability.
Creating open initiatives where companies and institutions recognize the mutual benefits of data sharing is also vital. Until such initiatives demonstrate clear advantages for all stakeholders, private companies may be hesitant to share proprietary data. Initiatives like OpenDAC are promising steps toward fostering collaboration and transparency in the field.
Large-eddy simulations (atmospheric processes)
Details (click to expand)
Large-eddy simulations are very high-resolution atmospheric simulations (finer than 150 m) where atmospheric turbulence is explicitly resolved in the model, providing detailed insights into small-scale atmospheric processes.
These simulations are essential for resolving turbulent processes that current climate models cannot capture, but they require significant computational resources and are not readily available as benchmark datasets for the wider research community.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
Current high-resolution simulations cannot resolve many physical processes like turbulence. Extremely high-resolution simulations (sub-kilometer or tens of meters) are needed to serve as ground truth for training ML models as they provide a more realistic representation of atmospheric processes. Creating and sharing benchmark datasets based on these simulations would facilitate model development and validation.
LiDAR point cloud – airborne
Details (click to expand)
Airborne LiDAR (Light Detection and Ranging) collects high-resolution, three-dimensional point clouds of forest structure using sensors mounted on aircraft or drones. This technology captures precise data about forest canopies, enabling detailed assessment of biomass and carbon stocks at local to regional scales.
Limited geographical coverage due to high collection costs, combined with the need for domain expertise to process the complex point cloud data, restricts the use of this high-value data source.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to process raw LiDAR point clouds and generate canopy height metrics used for training ML models. Developing open-source processing tools with standardized workflows would make this data more accessible to non-experts.
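As an illustration of the kind of standardized workflow such tools could provide, the sketch below rasterizes a synthetic, height-normalized point cloud into a canopy height model by taking the maximum return per grid cell. A real pipeline (e.g., built on laspy or PDAL) would also handle ground classification and noise filtering; all values here are synthetic:

```python
import numpy as np

# Sketch: rasterize a synthetic LiDAR point cloud into a canopy height
# model (CHM) by taking the highest return in each grid cell. Heights
# are assumed to be already normalized above ground.
rng = np.random.default_rng(0)
n_points = 10_000
x = rng.uniform(0, 100, n_points)        # metres, easting
y = rng.uniform(0, 100, n_points)        # metres, northing
z = rng.uniform(0, 30, n_points)         # height above ground, metres

cell = 10.0                              # 10 m output resolution
nx, ny = int(100 / cell), int(100 / cell)
ix = np.minimum((x / cell).astype(int), nx - 1)
iy = np.minimum((y / cell).astype(int), ny - 1)

chm = np.zeros((ny, nx))
# Max height per cell; np.maximum.at handles repeated indices correctly.
np.maximum.at(chm, (iy, ix), z)

print(chm.shape)   # (10, 10) grid of canopy heights
```

From such a grid, standard canopy metrics (maximum height, cover fraction above a height threshold) follow with simple array operations.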
S2: Sufficiency > Coverage
Airborne LiDAR provides the most accurate measurements of canopy height but is not collected everywhere due to the high costs of aircraft or drone operations. Coordinated efforts to expand coverage and make existing data publicly available would significantly improve forest carbon stock estimation capabilities.
Micro-synchrophasors (µPMU data)
Details (click to expand)
Micro-phasor measurement units (µPMUs) provide synchronized voltage and current measurements with higher accuracy, precision, and sampling rates than conventional PMUs, making them ideal for distribution network monitoring.
For example, µPMUs have an angle accuracy allowance of 0.01 degrees and a total vector error allowance of 0.05%, in contrast to the 1-degree and 1% total vector error allowances for classic PMUs. With sampling rates of 10-120 samples per second, µPMUs can capture dynamic and transient states within the low-voltage distribution network, allowing for improved event and fault detection and localization. Today, most µPMU datasets are accessed through manual field deployments in test beds, collaborative research studies, or publicly available datasets.
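The total vector error (TVE) figures above can be made concrete: TVE compares the measured and true phasors as complex numbers, |X_meas − X_true| / |X_true| (per IEEE C37.118). The sketch below, with synthetic values, shows that a 0.01-degree angle error alone stays well inside the 0.05% µPMU allowance:

```python
import numpy as np

# Total vector error (TVE) combines magnitude and angle error of a
# phasor measurement into a single fraction. Values here are synthetic.
def tve(measured: complex, true: complex) -> float:
    """Return TVE as a fraction (0.0005 == 0.05 %)."""
    return abs(measured - true) / abs(true)

true_phasor = 1.0 * np.exp(1j * 0.0)                 # 1.0 p.u. at 0 degrees
# Measurement with a 0.01-degree angle error and no magnitude error:
meas_phasor = 1.0 * np.exp(1j * np.deg2rad(0.01))

error = tve(meas_phasor, true_phasor)
print(f"TVE = {error * 100:.4f} %")   # well inside the 0.05 % µPMU allowance
```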
For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Typically, the distribution circuit model lacks annotation of phase identification and impedance values, often providing only rough approximations, which can ultimately influence the accuracy of fault localization as well as the time series contextualization of a fault. Decreased localization accuracy can then affect downstream control mechanisms meant to ensure operational reliability. For µPMU data to be utilized for fault localization, the distribution circuit model must be provided by the partnering utility or DSO.
U5: Usability > Pre-processing
µPMU data is sensitive to noise especially from geomagnetic storms which can induce electric currents in the atmosphere and impact measurement accuracy. Data can also be compromised by errors introduced by current and potential transformers. One way to mitigate this error is to monitor and re-calibrate transformers or deploy redundant µPMUs to verify measurements.
If additional data from other sensors or field reports is used to classify µPMU time series, creating a joint sensor dataset may improve quality, depending on the overall sampling rate and format of the additional non-µPMU data.
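One way to realize the redundant-µPMU cross-check mentioned above is a simple disagreement flag between two co-located measurement streams. The sketch below uses synthetic per-unit voltage magnitudes and an illustrative tolerance; real deployments would also account for legitimate fast dynamics:

```python
import numpy as np

# Sketch: cross-check two redundant µPMU voltage-magnitude streams and
# flag samples where they disagree beyond a tolerance, a simple way to
# catch transformer-induced errors. All data is synthetic.
rng = np.random.default_rng(1)
n = 1200                                    # 10 s at 120 samples/s
v_a = 1.0 + 0.001 * rng.standard_normal(n)  # p.u., sensor A
v_b = v_a + 0.0005 * rng.standard_normal(n) # p.u., redundant sensor B
v_b[600:650] += 0.02                        # inject a calibration drift on B

tolerance = 0.005                           # p.u. disagreement threshold
disagree = np.abs(v_a - v_b) > tolerance
print(int(disagree.sum()), "samples flagged for review")
```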
U6: Usability > Large Volume
Due to high sampling rates and continuous capture, the data volume from each individual µPMU can be challenging to manage and analyze. Coupled with the number of µPMUs required to monitor a portion of the distribution network, the amount of data can easily exceed terabytes. Automated indexing and mining of time series by transient characteristics can facilitate verification efforts by domain specialists.
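A minimal sketch of such automated indexing, on a synthetic 120-samples-per-second stream: a threshold on the first difference flags the samples bounding an injected voltage sag, so only those windows need full-resolution review or long-term storage:

```python
import numpy as np

# Sketch: index a continuous µPMU stream by transient characteristics,
# so only windows with candidate events need full-resolution review.
# The signal and the sag are synthetic.
rng = np.random.default_rng(2)
fs = 120                                              # samples per second
signal = 1.0 + 0.0005 * rng.standard_normal(fs * 60)  # 1 minute of data
signal[3000:3010] -= 0.05                             # inject a voltage sag

diff = np.abs(np.diff(signal))
threshold = diff.mean() + 6 * diff.std()
event_idx = np.flatnonzero(diff > threshold)
print("candidate transient samples:", event_idx)
```

The flagged indices bracket the injected sag; a production indexer would store window metadata (timestamp, magnitude, duration) rather than raw samples.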
R1: Reliability > Quality
Since µPMU data is continuously captured, time series data leading up to or even identifying a fault or potential fault requires verification from other data sources.
Digital Fault Recorders (DFRs) capture high resolution event driven data such as disturbances due to faults, switching and transients. They are able to detect rapid events like lightning strikes and breaker trips while also recording the current and voltage magnitude with respect to time. Additionally, system dynamics over a longer period following a disturbance can also be captured. When used in conjunction with µPMU data, DFR data can assist in verifying significant transients found in the µPMU data which can facilitate improved analysis of both signals leading up to and after an event from the perspective of distribution-side state.
S2: Sufficiency > Coverage
Currently, installing µPMUs in existing distribution grids carries significant financial costs, so most deployments have been pilot projects with utilities. Based on North American Synchrophasor Initiative (NASPI) reports, pilot studies include the Flexgrid testing facility at Lawrence Berkeley National Laboratory (LBNL), the Philadelphia Navy Yard microgrid (2016-2017), the micro-synchrophasors for distribution systems plus-up project (2016-2018), resilient electricity grids in the Philippines (2016), the GMLC 1.2.5 sensing and measurement strategy (2016), and the bi-national laboratory for the intelligent management of energy sustainability and technology education in Mexico City (2017-2018).
Coverage is also limited by acceptance of this technology, due to a pre-existing reliance on SCADA systems that measure grid conditions on a 15-minute cadence. As transients become more common, especially on the low-voltage distribution grid, a transition to higher-resolution monitoring will become necessary. Multi-objective evaluation of the value proposition of further µPMU sensor networks can provide utilities and DSOs with a framework for assessing the economic, environmental, and operational benefits of pursuing larger-scale studies.
S4: Sufficiency > Timeliness
µPMU data can suffer from multiple latencies within the grid monitoring system, which cannot keep up with the high sampling rate of the continuous measurements that µPMUs generate. Latencies arise as signals are recorded, processed, sent, and received, and depend on the communication medium used, cable distance, amount of processing, and computational delay. More specifically, the named latencies are measurement, transmission, channel, receiver, and algorithm related. Identifying characteristics that precede fault events, with lead times sufficient to overcome these latencies, through machine learning or other techniques would be of benefit.
Museum specimens
Details (click to expand)
Museum specimens contain detailed biological records documenting species’ characteristics, including morphological traits. Data on where and when they were collected is also often recorded. This offers documentation on the occurrence of species in both space and time. Museum specimens are valuable resources for various applications, such as species classification and species distribution modeling.
The majority of the world’s museum specimens remain undigitized, creating a significant barrier to using these records in machine learning applications for biodiversity monitoring and climate change research.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Museum specimens only become valuable to ML studies when they are digitized. Many museum specimens remain to be digitized, and this task presents significant challenges. Much of the information about these specimens, such as species traits and occurrence data, is often recorded in handwritten notes, making parsing and recognizing this information a complex and error-prone process.
Digitizing these specimens has become a priority for many museums. To support this effort, adequate funding and technical and scientific assistance should be provided. Machine learning itself can support some of these efforts, e.g., in digitizing handwritten notes.
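As a toy example of the parsing step, the sketch below turns one transcribed label into a structured occurrence record with a regular expression. The label format and field names are hypothetical; real digitization pipelines pair handwriting recognition with far more robust parsers:

```python
import re

# Hypothetical transcribed specimen-label text (pipe-delimited format
# is an assumption for illustration, not a museum standard).
label = "Quercus alba L. | Coll. J. Smith | 12 Jun 1932 | Madison Co., Iowa"

pattern = re.compile(
    r"(?P<species>[A-Z][a-z]+ [a-z]+)[^|]*\|"     # binomial name
    r"\s*Coll\.\s*(?P<collector>[^|]+)\|"         # collector
    r"\s*(?P<date>\d{1,2} [A-Za-z]{3} \d{4})\s*\|" # collection date
    r"\s*(?P<locality>.+)$"                       # locality string
)
m = pattern.match(label)
record = {k: v.strip() for k, v in m.groupdict().items()} if m else {}
print(record)
```

Structured records like this (species, collector, date, locality) are what make specimens usable for species distribution modeling.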
NEX-GDDP-CMIP6 downscaled climate projections
Details (click to expand)
The NEX-GDDP-CMIP6 dataset provides high-resolution, bias-corrected global climate projections derived from Coupled Model Intercomparison Project Phase 6 (CMIP6) across four greenhouse gas emissions scenarios (Shared Socioeconomic Pathways). It includes daily climate variables such as temperature, precipitation, humidity, and radiation from 2015 to 2100 at approximately 25 km resolution, enabling detailed analysis of climate change impacts sensitive to local topography and fine-scale climate gradients. For more information, see https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6.
The dataset’s massive size (petabytes of data) creates significant barriers for access, transfer, and analysis, requiring specialized computing infrastructure and technical expertise that many researchers lack. Additionally, efficiently extracting relevant extreme heat information from this comprehensive climate dataset presents computational and methodological challenges.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The NEX-GDDP-CMIP6 dataset requires substantial computational resources for processing and analysis. While cloud platforms provide access, they involve usage costs that may be prohibitive for some researchers. Processing such large datasets requires specialized techniques like distributed computing frameworks (e.g., Dask, Spark) and occasionally large-memory computing nodes for certain statistical analyses. Many researchers and practitioners lack either the technical expertise or computational resources to effectively utilize this valuable data.
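A minimal sketch of one mitigation, chunked processing, is shown below: an extreme-heat statistic is accumulated one block of days at a time so the full record never sits in memory at once. Plain numpy arrays stand in here for a lazy reader such as xarray with Dask over the NEX-GDDP-CMIP6 NetCDF files, and the grid, values, and 35 °C threshold are illustrative:

```python
import numpy as np

# Sketch: chunk-wise extraction of an extreme-heat statistic. Synthetic
# numpy blocks stand in for lazily loaded daily-maximum temperature
# (tasmax) chunks; in practice xarray + Dask would supply these.
rng = np.random.default_rng(3)
n_days, ny, nx = 365, 60, 60              # one year on a small grid
threshold_k = 308.15                      # 35 degrees C in kelvin

hot_days = np.zeros((ny, nx), dtype=int)
for start in range(0, n_days, 73):        # process 73-day chunks
    chunk = 290 + 25 * rng.random((73, ny, nx))  # stand-in for tasmax [K]
    hot_days += (chunk > threshold_k).sum(axis=0)

print("max hot days per cell:", hot_days.max())
```

The key design point is that only per-cell counters persist between chunks, so memory use is independent of the record length.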
NIST campus photovoltaic arrays and weather station data
Details (click to expand)
This dataset contains measurements from PV arrays at the National Institute of Standards and Technology campus from August 2014-July 2017, including electrical, temperature, meteorological, and radiation data sampled at high frequency with one-minute averages.
The dataset has limited spatial coverage (Gaithersburg, MD only) and is no longer maintained after July 2017, limiting its usefulness for current applications.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Since the testbeds are located on the NIST campus, spatial coverage is limited to the institution’s site. Similar datasets combining sensor measurements of solar irradiance conditions with the associated solar power generated at the inverter output would require investment in comparable site-specific testbeds in other regions.
S4: Sufficiency > Timeliness
The dataset is no longer maintained after July 2017, so it does not capture recent weather conditions or improvements in PV technology; renewed or continued data collection would improve its usefulness for current applications.
NOAA's SOLRAD Network Solar Radiation Data
Details (click to expand)
The National Oceanic and Atmospheric Administration’s SOLRAD Network monitors surface radiation at nine locations across the United States. The data includes high-precision measurements from various instruments, including pyrheliometers, pyranometers, and UV radiometers that collect minute-interval measurements of incoming solar radiation. These measurements characterize the Earth’s surface radiation budget and can be used to accurately forecast solar energy generation for grid planning and management.
While NOAA’s SOLRAD is an excellent data source for long term solar irradiance and climate studies, it has limitations for short-term solar forecasting applications. Key gaps include lower quality hourly averages compared to native resolution data, and limited geographic coverage with only nine monitoring stations across the United States. These constraints impact the effectiveness of forecasting for real-time energy management, grid stability, and market operations.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The coverage area is constrained to nine SOLRAD network locations in the United States (Albuquerque, NM; Bismarck, ND; Hanford, CA; Madison, WI; Oak Ridge, TN; Salt Lake City, UT; Seattle, WA; Sterling, VA; Tallahassee, FL). For generalization to other regions, locations with similar climates and temperate zones would need to be identified.
S3: Sufficiency > Granularity
Data quality of the hourly averages is lower than that of the native resolution data, impacting effective short-term forecasting for real-time energy management, grid stability, demand response, and market operations. To address this gap, using very short-term data or supplementing with data from sky imagers and other sensors with frequent measurement outputs would be beneficial.
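The loss of information in hourly averaging can be illustrated directly: resampling a synthetic minute-interval irradiance series to hourly means removes most of the short-term variability (e.g., cloud-driven ramps) that matters for intra-hour forecasting:

```python
import numpy as np
import pandas as pd

# Sketch: hourly averaging of minute-interval irradiance smooths out
# the short-lived ramps relevant to short-term forecasting. Synthetic.
rng = np.random.default_rng(4)
idx = pd.date_range("2024-06-01 10:00", periods=120, freq="min")
ghi = pd.Series(800 + 150 * rng.standard_normal(120), index=idx)  # W/m^2

hourly = ghi.resample("h").mean()   # two hourly averages
print("native 1-min std:", round(ghi.std(), 1))
print("hourly-mean std: ", round(hourly.std(), 1))
```

The hourly means vary far less than the native series, which is exactly the variability a short-term forecaster needs to see.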
NREL Physical Solar Model Solar Radiation Database
Details (click to expand)
The National Renewable Energy Laboratory (NREL)’s National Solar Radiation Database (NSRDB) provides hourly and half-hourly solar radiation data modeled using NREL’s Physical Solar Model (PSM). The data is derived from multiple satellite sources including NOAA’s Geostationary Operational Environmental Satellites (GOES), the Interactive Multisensor Snow and Ice Mapping System (IMS), MODIS, and MERRA-2 reanalysis. The PSM derives cloud and aerosol properties as inputs for the Fast All-sky Radiation Model for Solar applications (FARMS), enabling users to access spectral irradiance data based on time, location, and PV orientation.
While NSRDB offers global coverage using satellite-derived data, several challenges exist. The dataset requires periodic recalculation and updating to remain current, with unbalanced temporal coverage favoring the United States. Satellite-based estimations may be inaccurate in regions with frequent cloud cover, snow, or bright surfaces, requiring ground-based verification. Additionally, data derived from satellite imagery may need preprocessing to account for parallax effects and field-of-view issues that aren’t fully addressed in the higher-level FARMS products.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Data derived from satellite imagery requires pre-processing to account for pixel variability, parallax effects, and additional modeling using radiative transfer to improve solar radiation estimates.
S4: Sufficiency > Timeliness
Data flow from satellite imagery to solar radiation measurement output from FARMS needs to be recalculated and updated to expand beyond the current coverage years of the represented global regions.
R1: Reliability > Quality
Satellite-based estimation of solar resource information for sites susceptible to cloud cover, snow, and bright surfaces may not be accurate, requiring verification from ground-based measurements.
NREL SRRL Baseline Measurement System for Multi-Variable Solar Research
Details (click to expand)
The NREL Solar Radiation Research Laboratory’s Baseline Measurement System (SRRL BMS) provides 130 variables at 60-second intervals for site-specific environmental factors at its Golden, Colorado facility. This comprehensive dataset includes co-located measurements of temperature, pressure, precipitation, wind parameters, humidity, UV index, aerosol optical depth, albedo, and cloud cover categorized as opaque, thin, and clear. This multi-variable dataset supports photovoltaic potential studies and renewable resource climatology research.
While NREL’s SRRL BMS provides real-time joint-variable data from ground-based sensors, its coverage is limited to a single location in Golden, CO, in the United States. The diverse sensor network requires regular maintenance, and instrument malfunctions or calibration issues may lead to data inaccuracies if not promptly detected and addressed, affecting the reliability of solar forecasting applications.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Instrument malfunctions or calibration issues require human intervention; if detection is delayed, the resulting inaccuracies in measured quantities can degrade solar forecast accuracy. Despite this, the dataset continues to be maintained.
S2: Sufficiency > Coverage
Coverage is limited to Golden, CO. Other locations would benefit from similar sensor monitoring systems, especially those with variations in weather patterns that could affect solar irradiance forecasting and energy harvesting.
NREL Wind Active Power Control Simulation Tools
Details (click to expand)
NREL has developed simulation tools to understand the effects of wind power on interconnection system frequency, including the Flexible Energy Scheduling Tool for Integrating Variable Generation (FESTIV) and Multi-Area Frequency Response Integration Tool (MAFRIT). These tools use traditional commercial software and custom-developed models to perform dynamic simulations and wind generation studies for active power control of the grid.
Access to NREL’s FESTIV model requires special permission, limiting broader research applications. The model’s hourly temporal resolution cannot capture sub-hourly dynamics critical for frequency response and system stability. Additionally, the simulation-based approach requires validation with real-world operational data to ensure accuracy for practical grid applications.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the FESTIV model requires permission, obtained by contacting the group manager.
R1: Reliability > Quality
The model may not account for all real-time system dynamics and complexities, requiring verification from operational data. Scenario-based forecasting may not capture real-world uncertainties, and operating reserve values may be inaccurate without practical validation.
S3: Sufficiency > Granularity
FESTIV operates on hourly unit commitment time resolution, which cannot capture reliability impacts occurring on sub-hourly scales including frequency response, voltage magnitudes, and reactive power flows that affect system stability.
NREL solar power data for integration studies
Details (click to expand)
The NREL Solar Power Data for Integration Studies provides one year (2006) of 5-minute solar power data and hourly day-ahead forecasts for 6,000 simulated PV plants across the United States. The dataset was created using sub-hour irradiance algorithms and Numerical Weather Prediction simulations, covering both utility-scale (with single-axis tracking) and distributed-scale (fixed-tilt) PV systems.
While valuable for renewable energy integration studies, this dataset has limitations in geographic coverage (limited to the US), temporal scope (only 2006 data), and relies on simulated rather than measured PV outputs. Addressing these gaps would enable more accurate and globally applicable ML-based solar forecasting models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
The dataset uses simulated outputs based on weather predictions rather than actual PV measurements, which may introduce systematic biases. Site-specific projects require additional validation with real measurements from solar power inverters. Developers can improve model accuracy by supplementing with local measurements and adapting simulation parameters to better represent specific regions.
S2: Sufficiency > Coverage
The dataset is limited to US locations based on 2006 solar conditions and is not representative of other geographic regions or more recent climate patterns. Expanding data collection to include diverse global regions and updating with more recent measurements would improve model transferability.
S4: Sufficiency > Timeliness
The dataset only covers 2006, which may not capture recent climate trends or technology improvements in PV systems. Updated datasets with more recent time periods would better represent current conditions and improve forecasting accuracy.
Natural hazards forecasts
Details (click to expand)
Natural hazard data used for risk assessments can usually be modeled with characteristics derived from, and statistically consistent with, the observational record. Some hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse, as well as from the Risk Data Library of the World Bank.
The resolution of current natural hazard forecast data is not sufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Climate hazard data (e.g., floods, tropical cyclones, droughts) is often too coarse for effective physical risk assessments, which focus on evaluating damage to infrastructure such as buildings and power grids. While exposure data, including information on buildings and power grids, is available at resolutions ranging from 25 meters to 250 meters, climate hazard projections, especially those extending beyond a year, are typically at resolutions of 25 kilometers or more.
To provide meaningful risk assessments, more granular data is required. This necessitates downscaling efforts, both dynamical and statistical, to refine the resolution of climate hazard data. Machine learning (ML) can play a valuable role in these downscaling processes. Additionally, the downscaled data should be made publicly available, and a dedicated portal should be established to facilitate access and sharing of this refined information.
R1: Reliability > Quality
Projecting future climate hazards is crucial for assessing long-term risks. Climate simulations from CMIP models are currently our primary source for future climate projections. However, these simulations come with significant uncertainties, stemming both from the models themselves and from the emission scenarios. To improve their utility for disaster risk assessment and other applications, increased funding and effort are needed to advance climate model development for greater accuracy. Additionally, machine learning methods can help mitigate some of these uncertainties by bias-correcting the simulations.
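One widely used statistical baseline for such bias correction is empirical quantile mapping, sketched below on synthetic temperatures: each simulated value is replaced by the observed value at the same quantile, which removes the bulk of the distributional bias. ML-based methods generalize this idea:

```python
import numpy as np

# Sketch: empirical quantile mapping for bias correction. All data
# here is synthetic; real applications map model output onto
# station or reanalysis observations over a calibration period.
rng = np.random.default_rng(5)
obs = rng.normal(15.0, 4.0, 5000)   # "observed" temperatures, degrees C
sim = rng.normal(17.5, 5.0, 5000)   # biased model output

def quantile_map(x, sim_ref, obs_ref, n_q=101):
    """Map values x from the model distribution onto the observed one."""
    q = np.linspace(0, 1, n_q)
    sim_q = np.quantile(sim_ref, q)
    obs_q = np.quantile(obs_ref, q)
    return np.interp(x, sim_q, obs_q)   # piecewise-linear transfer function

corrected = quantile_map(sim, sim, obs)
print("bias before:", round(sim.mean() - obs.mean(), 2))
print("bias after: ", round(corrected.mean() - obs.mean(), 2))
```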
S6: Sufficiency > Missing Components
Seasonal climate hazard forecasts are crucial for disaster risk assessment, management, and preparation. However, high-resolution data at this scale is often lacking for many hazards. This challenge is likely due to the difficulty in generating accurate seasonal weather forecasts. ML has the potential to address this gap by improving forecast accuracy and granularity.
Ocean observations from floating infrastructure (FINO3)
Details (click to expand)
FINO3 is an offshore research platform whose wind mast provides wind speed and wind direction datasets, along with time series of temperature, air pressure, relative humidity, global radiation, and precipitation. Images taken from the platform's perspective provide a direct snapshot of environmental conditions.
The platform is located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Wind measurements are taken between 32 and 102 meters above sea level, with wind speed measured every 10 meters. Data has been collected from August 2009 to the present day.
Due to its location, the FINO platform's sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather conditions amenable to human intervention. The resulting degradation in data quality can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires signing up for a login account at https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to measurement sensor failures. Issues with data loggers, power supplies, and adverse conditions such as low aerosol concentrations can influence data quality. High wind and wave conditions hamper correcting or recalibrating sensors, creating data gaps that can last for several weeks or a season.
S2: Sufficiency > Coverage
Coverage is limited to the platform itself and the wind farm it is built in proximity to. For locations with different offshore characteristics, similar testbed platforms or buoys could be developed.
S5: Sufficiency > Proxy
Because the sensors are exposed to ocean conditions and storms, FINO sensors often need maintenance and repair but are difficult to physically access. The resulting gaps in the data can be addressed by utilizing mesoscale wind modeling output.
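A minimal sketch of filling such gaps with mesoscale model output, on synthetic data: the mast-versus-model offset is estimated where the two series overlap, and the bias-adjusted model series fills the outage. Real workflows would use more careful bias adjustment than a single mean offset:

```python
import numpy as np

# Sketch: fill a sensor outage in a mast wind-speed series with
# bias-adjusted mesoscale model output. All data here is synthetic.
rng = np.random.default_rng(6)
n = 500
model = 8 + 2 * np.sin(np.linspace(0, 20, n))      # mesoscale proxy, m/s
obs = model + 0.8 + 0.3 * rng.standard_normal(n)   # mast observations, m/s
obs[200:320] = np.nan                              # simulated sensor outage

valid = ~np.isnan(obs)
offset = np.mean(obs[valid] - model[valid])        # mast-vs-model bias
filled = np.where(valid, obs, model + offset)      # gap-filled series

print("gap samples filled:", int((~valid).sum()))
print("estimated offset:", round(offset, 2))
```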
Offshore wind data from masts and LiDAR
Details (click to expand)
Offshore wind data from mast measurements and LiDAR can be found from several providers.
LiDAR-based wind mapping has advantages over traditional wind mast tower measurements, namely higher resolution, larger coverage, and improved data quality. This is because LiDAR can measure wind speeds at various heights above the ground, reducing the impact of turbulence on measurements that would typically affect mast readings. Furthermore, LiDAR-based wind mapping can provide near-real-time wind data suitable for control optimization and load forecasting applications.
The spatiotemporal coverage of the offshore wind speed mast data is restricted to the dimensions of the platform/tower itself as well as its time of construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to data must be requested, with different data providers having varying levels of restriction. For data obtained from Orsted, access is only provided by signing a standard non-disclosure agreement. For more information, email R&D at datasharing@orsted.com.
S2: Sufficiency > Coverage
Spatiotemporal coverage of the dataset varies depending on the construction and location of the platform testbed, but overall data is available from 2014 to the present. While measurements from LiDAR have higher resolution than wind mast data, sensor information is still restricted to the dimensions of the platform and the associated offshore wind farm when present. Data provided by Orsted from LiDAR sensors includes 10-minute statistics.
Offshore wind farm operation data (Orsted)
Details (click to expand)
The offshore operation data from the Danish energy company Orsted provides two years' worth of 10-minute Supervisory Control and Data Acquisition (SCADA) information for nacelle wind speed, electrical power, rotor speed, yaw position, and pitch angle for turbines, with on-site wave buoy data and ground-based LiDAR from different offshore wind farm sites.
For one site, the Anholt Westermost Rough offshore wind farm, data is collected from 111 Siemens SWT-120-3.6 MW wind turbines arranged in a 20 km by 8 km layout, with internal spacing between turbines of 5-7 rotor diameters and a water depth of 15-19 m. Another site, northeast of Withernsea off the Holderness coast in the North Sea, England, has a wind farm with a 35 km by 35 km spatial coverage area.
Data can be accessed by requesting access via the Orsted form. Sufficiency of the dataset is constrained by volume, as only a finite number of such offshore wind farm datasets exist; expanding the coverage area, volume, and time granularity of the data to under 10 minutes may enable transient detection from generated active power.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access requests are needed via a form from Orsted.
S1: Sufficiency > Insufficient Volume
Data from multiple wind farms over a variety of regions would be required to get a more accurate comparison against simulated weather data.
S2: Sufficiency > Coverage
The coverage is over parts of Europe; offshore wind conditions vary depending on the environment and cannot scale or transfer to other temperate regions of the world.
S3: Sufficiency > Granularity
The time granularity of 10 min is too coarse to capture transients in active power generated.
S4: Sufficiency > Timeliness
Only two years' worth of data (2016–2018) is provided. Additional data collection from offshore wind farms or simulations is needed.
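To see why 10-minute statistics can hide transients, consider a toy 1 Hz active-power trace with a short spike (all numbers below are made up, not Orsted data):

```python
import numpy as np

# One hour of 1 Hz active power around 3 MW, with a 30 s transient spike.
rng = np.random.default_rng(1)
power = 3.0 + 0.05 * rng.standard_normal(3600)
power[1000:1030] += 1.5  # 30 s transient of +1.5 MW

raw_peak = power.max()
# Averaging to 10-minute resolution: six blocks of 600 samples each.
ten_min_means = power.reshape(6, 600).mean(axis=1)
print(raw_peak, ten_min_means.max())
```

The 30-second excursion dominates the raw trace but contributes only 30/600 of one averaging window, so it almost vanishes at 10-minute resolution.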
OpenStreetMap (land use map)
Details (click to expand)
OpenStreetMap is an open-source map database providing worldwide geographic features such as buildings, roads, and land uses, maintained by a community of mappers who add objects manually or trace them from remote sensing imagery.
The quality of OpenStreetMap is highly variable in terms of coverage of geometries (e.g., buildings) and attributes. Roads are generally better mapped than buildings. OpenStreetMap's very permissive data model enables users to provide a variety of information, but it is often not well harmonized. Recent corporate editing efforts have dramatically increased coverage in previously poorly mapped regions.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
When working at a large geographical scale, e.g. continental scale, the data volume requires significant computational resources for the processing.
U4: Usability > Documentation
The origin of attributes is often unknown, creating uncertainty about values.
U5: Usability > Pre-processing
The flexible data model lacks type enforcement, requiring additional processing for analysis.
R1: Reliability > Quality
Data quality ranges from excellent (sometimes surpassing official sources) to very low (including mapping vandalism).
S2: Sufficiency > Coverage
Street coverage is generally good, while building coverage varies widely.
S4: Sufficiency > Timeliness
Update frequency varies from multiple times per year to decades-old data, with disaster areas often updated quickly by active communities.
S6: Sufficiency > Missing Components
Most attributes remain incomplete, with completeness levels below 10%.
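Attribute completeness like the figure above can be measured directly from feature tags. The snippet below uses a handful of hand-written building features for illustration; a real analysis would stream an .osm.pbf extract with a parser such as pyosmium.

```python
# Hand-written OSM-style building features (tag dictionaries), illustrative only.
buildings = [
    {"building": "yes"},
    {"building": "house", "building:levels": "2"},
    {"building": "yes"},
    {"building": "apartments", "building:levels": "5", "height": "15"},
    {"building": "yes"},
]

def completeness(features, key):
    """Share of features that carry a value for the given tag."""
    return sum(key in f for f in features) / len(features)

print(completeness(buildings, "building:levels"))  # 0.4
print(completeness(buildings, "height"))           # 0.2
```

Running such a check per region makes the coverage and completeness gaps quantifiable rather than anecdotal.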
Optimal power flow simulation outputs
Details (click to expand)
PowerWorld Simulator and MATPOWER are software tools used for optimizing power systems and include representation of both alternating current (AC) and direct current (DC) systems. PowerWorld Simulator models, analyzes, and optimizes power systems for a wide range of configurations and scenarios with the ability to model small distribution networks as well as transmission systems.
MATPOWER is an open-source alternative that also solves both the AC and DC versions of optimal power flow (OPF). Its DC OPF is simplified into a quadratic program using DC modeling assumptions, reducing polynomial costs to second order and expressing real power flows as a function of voltage angles (thereby eliminating voltage magnitudes and reactive power). PowerWorld Simulator uses iterative algorithms (e.g., Newton-Raphson) with traditional power flow equations.
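The DC assumption described above, real line flow as a function of voltage angle differences only, can be illustrated on a hypothetical 3-bus network (all values are made up; this is a plain DC power flow solve, not a full OPF):

```python
import numpy as np

# Lines (from-bus, to-bus, reactance x in p.u.); bus 0 is the slack.
lines = [(0, 1, 0.1), (0, 2, 0.2), (1, 2, 0.25)]
n = 3
P = np.array([0.0, -0.6, -0.4])  # net injections (p.u.): two load buses

# Build the susceptance matrix B' (DC approximation: b = 1/x).
B = np.zeros((n, n))
for i, j, x in lines:
    b = 1.0 / x
    B[i, i] += b
    B[j, j] += b
    B[i, j] -= b
    B[j, i] -= b

# Fix the slack-bus angle to zero and solve B'[1:,1:] theta = P[1:].
theta = np.zeros(n)
theta[1:] = np.linalg.solve(B[1:, 1:], P[1:])

# Real power flow on each line: f_ij = (theta_i - theta_j) / x_ij.
flows = {(i, j): (theta[i] - theta[j]) / x for i, j, x in lines}
print(flows)
```

Note how voltage magnitudes and reactive power never appear: the linear system in angles is what makes DC OPF a tractable quadratic program.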
MATPOWER is open source, and PowerWorld Simulator has several options for industry practitioners as well as academic users. Demo software licensed for educational use includes simulator features such as available transfer capability, optimal power flow, security-constrained OPF, OPF reserves, the PV/QV curve tool, transient stability, and geomagnetically induced current. In terms of topology, the free version supports up to 13 buses while the full version of the simulator can handle 250,000 buses.
Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
In MATPOWER and PowerWorld, outside data may be required to simulate conditions over a specific region with a given amount of DERs, generating sources, bus topology, and line limits. This requires collating pre-existing synthetic grid data with additional data to model specific scenarios.
U3: Usability > Usage Rights
Depending on whether proprietary simulators are pursued (e.g., PowerWorld), there may be licensing costs for use of certain features.
R1: Reliability > Quality
Traditional OPF simulation software simplifies the power system and makes assumptions about the system behavior such as perfect power factor correction or constant system parameters. Simulation results may need to be verified with real-world results.
S3: Sufficiency > Granularity
In PowerWorld, bus topologies available may be simplified representations of actual grids to simplify the modeling and simulation techniques to represent overall system behavior. MATPOWER requires the user to define the bus matrix. As the number of buses in a power system increases the computational complexity of OPF increases, requiring more resources and time to solve. Additional parameters such as line limits, number of generating sources, number of DERs, and load demand also increase the complexity of the model as more constraints and assets are introduced.
Outputs from distribution connected inverter systems simulations
Details (click to expand)
There is a need to enhance existing simulation tools to study inverter-based rather than traditional machine-based power systems. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing.
NREL’s PREconfiguring and Controlling Inverter SEt-points (PRECISE) tool can identify the interconnection location on the network based on a PV customer’s address, model the distribution feeder, and preconfigure advanced inverter modes to provide grid support and minimize energy curtailment. The tool allows utilities to perform power flow analysis and analyze inverter modes.
Furthermore, NREL’s Energy Systems Integration Facility (ESIF) has real-time simulation connected with power hardware that allows for smart inverter manufacturers to test operational control with simulated dynamics and scenarios.
There is a need to enhance existing simulation tools to study inverter-based rather than traditional machine-based power systems. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models with the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulations and hardware-in-the-loop facilities requires submitting a user access proposal to NREL’s Energy Systems Integration Facility. Similar testing laboratories may require access requests and funding.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Contact NREL at precise@nrel.gov for access to the PRECISE model.
Submit an Energy Systems Integration Facility (ESIF) laboratory request form to userprogram.esif@nrel.gov to gain access to hardware-in-the-loop inverter simulation systems. Access to particular hardware may require collaboration with inverter manufacturers, which may have additional permission requirements.
R1: Reliability > Quality
The optimization routine of the simulation model may face challenges in determining the precise balance between grid operation criteria and impacts on customer PV generation. Generation may still require curtailment by the utility to prioritize grid stability. To address this gap, external data on distribution-side operating conditions, load demand, solar generation, and utility-initiated generation curtailment can be collected and introduced into expanded simulation studies.
Passive acoustic monitoring for biodiversity assessment
Details (click to expand)
Passive acoustic recording provides continuous monitoring of both the environment and species vocalizations. While some annotated datasets are available through repositories like ARBIMON (https://arbimon.org/), Macaulay Library (www.macaulaylibrary.org), and Xeno-canto (www.xeno-canto.org), there remains a general lack of robust, large, and diverse annotated bioacoustic datasets for machine learning applications.
There is a lack of standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn the raw data into analysis-ready data and analyze the data in a consistent way across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for standardized protocols to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready data and analyze it consistently across projects.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
The first and foremost challenge of bioacoustic data is its sheer volume, which makes its data sharing especially difficult due to limited storage options and high costs. Urgent solutions are needed for cheaper and more reliable data hosting and sharing platforms.
Additionally, there’s a significant shortage of large and diverse annotated datasets, far more severe than for image data such as camera trap, drone, and crowd-sourced images.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge for almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets encompassing a wide array of species remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a gap intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges in bioacoustic data lies in its sheer volume, stemming from continuous monitoring processes. Researchers face significant hurdles in sharing and hosting this data, as existing online platforms often don’t provide sufficient long-term storage capacity or are very expensive. Solutions are needed to provide cheaper and more reliable hosting options. Moreover, accessing these extensive datasets demands advanced computing infrastructure and solutions. The availability of more funding sources may push more people to start sharing their bioacoustic data.
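The volume problem is easy to quantify with back-of-envelope arithmetic; the recording parameters below (48 kHz mono, 16-bit uncompressed PCM) are illustrative assumptions, not a statement about any particular recorder.

```python
# Storage required for one continuously recording acoustic sensor.
sample_rate = 48_000        # samples per second (assumed)
bytes_per_sample = 2        # 16-bit PCM, mono (assumed)
seconds_per_day = 86_400

gb_per_day = sample_rate * bytes_per_sample * seconds_per_day / 1e9
tb_per_recorder_year = gb_per_day * 365 / 1000
print(round(gb_per_day, 2), round(tb_per_recorder_year, 1))  # 8.29 3.0
```

At roughly 3 TB per recorder per year before compression, a modest network of stations quickly outgrows typical research data-sharing platforms, which is the hosting problem described above.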
Pecan Street (appliance-level consumption data)
Details (click to expand)
Pecan Street DataPort began as a Smart Grid Demonstration program through the Pecan Street energy research nonprofit organization, which worked closely with the University of Texas at Austin. Funded by the DOE in 2014, the project signed up 1,000 research participants from the Mueller community in Austin, Texas to share Green Button, smart meter, and home energy management system (HEMS) data in 750 homes and 25 commercial properties. Financial incentivization of plug-in electric vehicle use and rooftop solar installation by Austin Energy encouraged residential lifestyle shifts. In addition to providing access to sub-metered appliance-level consumption data, Pecan Street includes electric vehicle charging, rooftop solar, heating, cooling, and water usage data. Data coverage has expanded to volunteer households in California, New York, and Colorado. Previously open for use, Pecan Street has since been privatized, and data access and products are now available for commercial and academic purchase depending on the level of access requested.
Pecan Street DataPort requires non-academic and academic users to purchase access via licensing, which varies depending on the building data features requested. The coverage area of the data is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may be in need of energy-efficient upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual socio-demographic survey data may be too coarse to provide insight into the behavioral effects of household members on consumption profiles over time.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Usage rights vary depending on the agreed upon licensing agreement.
S6: Sufficiency > Missing Components
The data does not track real-time occupancy of individuals in the household, which could provide insight into behavioral effects on energy consumption. Adding this data could enable improved consumption-based customer segmentation models, as patterns change with respect to time and day of the week. The data would also be amenable to consumer-in-the-loop energy management studies with respect to comfort, based on customers' habitual activity, location in the house, and number of occupants.
S3: Sufficiency > Granularity
Disaggregated data may provide greater granular context for customer segmentation studies than aggregate data alone. However, such segmentation studies ultimately depend on the number of household members using appliances at a given time. Pecan Street data contains annual survey responses on household demographics and home features, which may be too coarse to track how customer segments change over time as members move in or out of a building. Jointly collecting occupancy data could address the granularity gap but could limit volunteer engagement, as privacy concerns would need to be evaluated.
S2: Sufficiency > Coverage
Data coverage primarily focuses on Texas, with limited coverage in New York and California. Though there are efforts to include Puerto Rico, data coverage hinges on volunteer participation. This could introduce self-selection bias, as participating households are likely more interested in energy conservation than the general population. Furthermore, a majority of the dataset covers the Mueller community in Austin, a planned community developed after 1999 with modern building types. Enrolling homes from older built environments and from different temperate regions, within the United States and globally, could provide greater insight into household appliance usage and generation patterns, which vary with region and appliance age. Identifying high-consumption older appliances can assist in identifying upgrades.
O2: Obtainability > Accessibility
Data is downloadable as a static file or accessible via the DataPort API. Based on the licensing agreement, a small dataset is available for free for academic individuals with pricing for larger datasets. Commercial use requires paid access based on requested features ranging from the standard to unlimited customer tier and plan.
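As a sketch of the consumption-based segmentation discussed above, the following clusters synthetic 24-hour load profiles into morning-peaked and evening-peaked groups with a minimal k-means loop; the profiles are generated for illustration, not Pecan Street data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 24-hour load profiles (arbitrary units) for 20 homes:
# ten morning-peaked and ten evening-peaked, plus small noise.
hours = np.arange(24)
morning = np.exp(-((hours - 7) ** 2) / 8.0)
evening = np.exp(-((hours - 19) ** 2) / 8.0)
profiles = np.vstack(
    [morning + 0.05 * rng.standard_normal(24) for _ in range(10)]
    + [evening + 0.05 * rng.standard_normal(24) for _ in range(10)]
)

# Minimal 2-means clustering, seeding one centroid from each group.
centroids = profiles[[0, 10]].copy()
for _ in range(10):
    dists = np.linalg.norm(profiles[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([profiles[labels == k].mean(axis=0) for k in (0, 1)])

print(labels)
```

Real studies would cluster many months of smart-meter data and typically use a library implementation, but the mechanic is the same: households group by the shape of their daily profile.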
Population and asset exposure to natural hazards
Details (click to expand)
Exposure is defined as the representative value of populations and assets potentially exposed to a natural hazard occurrence, such as population, physical assets (e.g., buildings), economic output (e.g., measured by GDP), or agricultural output, depending on the risk in question.
There are open datasets with global coverage, for example, the Global Exposure Model, as well as proprietary data with more detailed information from well-established insurance companies.
Accessibility and reliability are the most significant challenges with exposure data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Country-specific exposure data varies widely in availability, with some existing only as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
Population datasets show significant discrepancies, requiring validation before confident use. Some geospatial socioeconomic data from sources like UNEP are outdated or incomplete.
S3: Sufficiency > Granularity
Open global data, such as World Bank or US CIA GDP data, often lacks sufficient resolution and completeness for hazard risk assessment.
Power Grid Lib (optimal power flow benchmark library)
Details (click to expand)
The Power Grid Library (PGLib-OPF) is a collection of git repositories that house benchmark data for validating power system simulations.
It contains 36 networks with 3 to 13,659 buses sourced from the IEEE Power Flow Test Cases, IEEE Dynamic Test Cases, IEEE Reliability Test System, Polish Test Cases, PEGASE Test Cases, and RTE Test Cases, which have been modified to raise optimality gaps to values between 1% and 10%, thereby creating more challenging AC-OPF test cases.
By curating and collecting this data, users who want to study more realistic AC-OPF simulation scenarios can directly retrieve compiled bus IDs, branch IDs, generator IDs, power demand, shunt admittance, voltage magnitude ranges for buses, power injection ranges for generators, quadratic active power cost function coefficients for generators, and branch parameters such as series admittance, line charging, transformer parameters, thermal limits, and branch voltage angle difference ranges. All parameters are conveniently standardized to the MATPOWER data file format for direct use. PGLib-OPF is open source.
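Since PGLib-OPF files follow the MATPOWER case format, fields can be pulled out with ordinary text parsing. The case fragment below is a tiny hand-written example in that format, not a real PGLib network.

```python
import re

# Minimal MATPOWER-style case fragment: a two-row bus matrix
# (columns: BUS_I, TYPE, PD, QD, GS, BS, AREA, VM, VA, BASE_KV, ZONE, VMAX, VMIN).
case_text = """
mpc.bus = [
    1  3  0.0   0.0  0 0 1 1.0 0.0 230 1 1.1 0.9;
    2  2  21.7  12.7 0 0 1 1.0 0.0 230 1 1.1 0.9;
];
"""

# Extract the bus matrix block and parse each row into floats.
match = re.search(r"mpc\.bus\s*=\s*\[(.*?)\];", case_text, re.S)
rows = [
    [float(tok) for tok in line.strip().rstrip(";").split()]
    for line in match.group(1).strip().splitlines()
]
bus_ids = [int(r[0]) for r in rows]
print(bus_ids)  # [1, 2]
```

In practice, tools such as MATPOWER itself or converters in other power system libraries load these files directly, but a text-level parse like this is enough to feed bus and branch tables into a custom ML pipeline.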
While the network datasets are open source, maintenance of the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though without cooperative effort such data may be hard to find.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Industry engagement can assist in developing detailed and realistic networked datasets and operating conditions, limits, and constraints.
U2: Usability > Aggregation
Repository maintenance requires continuous curation of more complex networked benchmark data for more realistic AC-OPF simulation studies.
Power line robot inspection imagery
Details (click to expand)
Cable inspection robot data includes LiDAR and image captures of Specific Power Line (SPL) components such as dampers, insulators, broken strands, and attachments that may have degraded due to exposure to natural elements. The data also focuses on assessing risk at the lowest part of power lines near trees, roofs, and other crossing power lines. Since the robots physically traverse the lines, this data is particularly valuable for degradation detection of high voltage transmission lines and for maintenance scheduling.
Grid inspection robot imagery requires coordination with local utilities for access, multiple robot trips for complete coverage, image preprocessing to remove ambient artifacts, and position and location calibration, and may be limited by camera resolution for detecting subtle degradation patterns.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data needs to be aggregated from multiple cable inspection robots for improved generalizability of detection models. Multiple robot trips over areas of interest can help identify target locations needing further inspection.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks.
S2: Sufficiency > Coverage
Data must be supplemented with position orientation system information for accurate robot localization, potentially requiring preliminary inspections followed by detailed autonomous inspection of targets.
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized. Data from multiple multispectral imagers, drones, cable-mounted sensors, and additional robots may be employed to improve the level of detail needed for specific obstructions.
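The thresholding step mentioned under the pre-processing gap above can be sketched in a few lines. This is a generic illustration (Otsu's method producing a binary mask from a grayscale inspection image), not the pipeline any particular utility or robot vendor uses:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the gray level that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    cum_count = np.cumsum(hist)
    cum_sum = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum_count[t - 1]          # pixels below the candidate threshold
        w1 = total - w0                # pixels at or above it
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_sum[t - 1] / w0
        mu1 = (cum_sum[255] - cum_sum[t - 1]) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def segment(gray: np.ndarray) -> np.ndarray:
    """Binary mask separating bright component pixels from darker background."""
    return gray >= otsu_threshold(gray)
```

In a real pipeline this global threshold would be only a first pass before a learned segmentation model; it illustrates why significant preprocessing is needed before masks are usable as training labels.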
Regularly gridded high-resolution atmospheric observations
Details (click to expand)
Though a great deal of data is available, a set of regularly gridded 3D high-resolution observations of the atmospheric state (like a higher-resolution version of ERA5) is still needed. Such a dataset is essential both for an improved understanding of atmospheric processes and for the development of ML-based weather forecast models and climate models.
An enhanced version of ERA5 with higher granularity and fidelity is needed. Many surface observations and remote sensing data are available but underutilized for developing such a dataset.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasts and climate modeling because of its high resolution and ready-for-analysis characteristics. However, large volumes of observations from radiosondes, balloons, and weather stations are largely underutilized. Creating a well-structured dataset like ERA5 but with more observational data would be valuable.
While conceptually needed, this dataset does not exist in the form required. An enhanced version of ERA5 with higher resolution and fidelity would significantly improve ML model training and validation.
Data Gap Type
Data Gap Details
W: Wish
ERA5 is currently widely used in ML-based weather forecasting and climate modeling because of its high resolution and ready-for-analysis characteristics. However, large volumes of observations, e.g., data from radiosondes, balloons, and weather stations, are largely underutilized. Creating a dataset that is as well-structured as ERA5 but built from more observations would be valuable.
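As a toy sketch of the gridding such a dataset requires, the snippet below interpolates scattered station observations onto a regular latitude/longitude grid with naive inverse-distance weighting. This stands in for the far more sophisticated variational data assimilation a real reanalysis uses; the function and variable names are illustrative only:

```python
import numpy as np

def idw_regrid(obs_lat, obs_lon, obs_val, grid_lat, grid_lon, power=2.0):
    """Inverse-distance-weighted interpolation of scattered station
    observations onto a regular lat/lon grid."""
    obs_lat, obs_lon, obs_val = map(np.asarray, (obs_lat, obs_lon, obs_val))
    glat, glon = np.meshgrid(grid_lat, grid_lon, indexing="ij")
    out = np.empty(glat.shape)
    for i in np.ndindex(out.shape):
        d2 = (obs_lat - glat[i]) ** 2 + (obs_lon - glon[i]) ** 2
        if np.any(d2 == 0):            # grid point coincides with a station
            out[i] = obs_val[np.argmin(d2)]
        else:
            w = d2 ** (-power / 2)     # weight ~ 1 / distance**power
            out[i] = np.sum(w * obs_val) / np.sum(w)
    return out
```

A production reanalysis additionally propagates a physical model between observation times and weights observations by instrument error, which is precisely why building an enhanced ERA5-like product is a substantial undertaking rather than a simple interpolation exercise.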
Residential daylight performance metric (DPM) data
Details (click to expand)
The amount of daylight that buildings receive through windows is an important parameter for heating demand (via heat gains from solar radiation) and electricity demand for lighting (via the illumination of indoor spaces by natural light). Architects can optimize daylight access by adjusting window placement and window-to-wall ratios.
Daylight performance metrics (DPMs) have been developed by building researchers and architects based on daylight access simulation output to quantify the illumination of indoor spaces by natural light.
Residential daylight performance metric (DPM) data with respect to daylight autonomy (DA), continuous daylight autonomy (cDA), spatial daylight autonomy (sDA), and useful daylight illuminance (UDI) can be generated using physics-based ray-tracing simulations that calculate illuminances over a prototype building layout. Simulation software available to calculate DPMs includes the IES Virtual Environment (IESVE), DesignBuilder, the VELUX Daylight Visualizer, and the open-source RADIANCE 5.0. To generate synthetic data from these simulation frameworks, the user must provide a geometric model of the building, climate data for the building location, reflectance and transmittance values for materials, desired radiance parameters, an occupancy schedule, and a virtual sensor grid over which the incident illuminance is to be calculated. Strategies based on the simulation output can assist architects in optimizing window placement and size, incorporating shading devices, and designing floor plans to control direct and diffuse natural light in the building.
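As a minimal illustration of how two of these metrics are derived from simulation output, the sketch below computes DA and cDA from an hourly illuminance series at a single sensor point. The 300 lux threshold is a common commercial-oriented choice and, as the coverage gaps below note, may not suit residential spaces:

```python
import numpy as np

LUX_THRESHOLD = 300.0  # common office target; residential thresholds may differ

def daylight_autonomy(lux: np.ndarray, occupied: np.ndarray) -> float:
    """DA: fraction of occupied hours meeting the illuminance threshold."""
    occ = lux[occupied]
    return float(np.mean(occ >= LUX_THRESHOLD))

def continuous_daylight_autonomy(lux: np.ndarray, occupied: np.ndarray) -> float:
    """cDA: like DA, but hours below the threshold earn partial credit
    proportional to how close they come to it."""
    occ = lux[occupied]
    return float(np.mean(np.clip(occ / LUX_THRESHOLD, 0.0, 1.0)))
```

sDA then aggregates point-wise DA over the sensor grid (e.g., the fraction of sensors with DA above 50%), which is why the simulation must output illuminance at every virtual sensor, not just a room average.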
While daylight performance metric (DPM) evaluation is an important step in the planning of commercial buildings, residential buildings do not receive a similar focus, which is surprising given that most new building construction occurs within the residential sector. Residential DPMs often lack metrics associated with direct sunlight access, rely on annual averages rather than seasonal values, and use fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds used in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyle. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in traditional urban residential spaces and its effects on thermal comfort and operational consumption; vernacular architecture, which is specific to a local region and culture, may not share these objectives, favoring indoor-outdoor transitional spaces, earthen materials, and less emphasis on windows and incident natural sunlight.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Simulation software is typically available for purchase, with cost depending on the package selected, the intended use, and the number of features requested.
S2: Sufficiency > Coverage
Vernacular architecture, characterized by traditional building styles and techniques specific to a local region or culture, is not covered by simulation tools. In fact, most simulation outputs focus on residential areas in primarily urban regions, aiming to minimize future operational costs under assumed illuminance thresholds that may not be universal. By adding the ability to evaluate passive design strategies adapted to a specific climate, and by expanding the available materials to include high-thermal-inertia walls and roofs such as those of earthen or thatched construction, additional thermal comfort studies could be performed for a given incident illuminance. Considering cultural attitudes toward outdoor spaces in relation to indoor spaces would provide even greater context for simulation studies and their usefulness in new construction across diverse regions.
S3: Sufficiency > Granularity
Simulations use fixed occupancy schedules, which work well in the context of commercial buildings but are overly prescriptive for residential buildings, where occupancy may vary with the number of occupants, time of day, day of week, and season. Residential buildings are multipurpose, and occupants may spend more time in some areas than others depending on activity. This gap can be alleviated by adapting and expanding simulation inputs to take diverse occupancy scenarios into consideration.
Current DPMs rely on annual averages rather than granular information about seasonal variations in daylight availability. While some advances have been made to incorporate this information through tools like Daysim, which defines new DPMs for residential buildings, further work is needed for regions where occupants may want to minimize direct light access and rely more on diffuse lighting. Expanding studies for clients in warmer, more arid climates may yield different thresholds and comfort parameters depending on preferences and lifestyle, and may also need to account for daylight oversupply, glare, and thermal discomfort.
Materials used in the construction process of the building may change after initial simulation development depending on availability. Finalized building materials and interior absorption and reflectance may diverge from those simulated. Use of dynamic shading devices could also decrease indoor temperature due to incident irradiance. Simulated results could be provided over a range.
More data is needed to take advantage of large ML models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Currently available data is not sufficient for training large ML models. More data is needed.
SMA Solar Technology PV System Performance database
Details (click to expand)
PV Anlage-Reinhart System provides hourly photovoltaic power, energy production, CO2 emissions avoided, and system configuration information for publicly available PV installations worldwide. SMA, a leading German manufacturer of solar inverters, has compiled data from their international deployments across multiple countries including Germany, the US, Chile, Brazil, Mexico, Canada, Spain, Italy, France, China, Australia, Belgium, India, Poland, Japan, UK, South Africa, Türkiye, and the UAE. This dataset, which includes inverter specifications, module information, and sometimes battery data, supports microgrid studies and distributed energy resource forecasting.
The SMA PV monitoring system requires user profile creation and specific system access requests, with documentation primarily in German creating potential language barriers. Despite SMA's global presence, data representation is geographically unbalanced, with stronger coverage in Germany, the Netherlands, and Australia. Additionally, only a subset of systems includes energy storage data, which would be valuable for comprehensive distributed energy resource load forecasting studies.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
Documentation is primarily in German and lacks the same detail in the English version of the website. Companion research utilizing the data is not readily cited or linked. Language barriers can challenge the interpretation of displayed data values when accessed through the portal interface.
S2: Sufficiency > Coverage
Coverage varies significantly by country, with representation ranging from single systems to over 43,000 systems per country. Systems in Germany, the Netherlands, and Australia are more comprehensively represented than other regions. Additionally, battery storage information is inconsistently available across monitored systems. This gap could be addressed by increasing private user-contributed system data from diverse regions.
O2: Obtainability > Accessibility
Users must utilize the web interface or create a user profile to request access to additional data or preferred formats. Data cannot be freely downloaded in bulk or raw format and must be scraped from the web portal. Contact with SMA is required for membership or extended usage rights.
SOLETE Hybrid Solar-Wind Generation dataset
Details (click to expand)
SOLETE, developed by the Energy System Integration Lab (SYSLAB) at the Technical University of Denmark, provides 15 months of measurements at multiple resolutions (seconds to hours) from June 2018 to September 2019. The dataset includes timestamps, meteorological data (temperature, humidity, pressure, wind speed and direction), solar irradiance measurements (global horizontal and plane of array), and active power generated by an 11 kW Gaia wind turbine and a 10 kW PV inverter. This comprehensive dataset supports time-series forecasting for hybrid solar-wind distributed energy resource systems.
While SOLETE offers valuable data for joint wind-solar distributed energy resource forecasting, several sufficiency gaps limit its application. The dataset's 15-month temporal coverage does not capture long-term seasonal variations, and it monitors only a single wind turbine and PV array, limiting analyses of multi-source generation coordination. Additionally, maintenance schedule and system downtime data are missing, which would enhance realistic modeling of system dynamics. Supplementing with external data sources or simulation could address these limitations.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
SOLETE lacks maintenance schedule data and system downtime information. Retroactively supplementing this data through simulation or SYSLAB records would improve system forecasting to account for scheduled maintenance uncertainties.
S3: Sufficiency > Granularity
Varying resolution and sampling rates (seconds to hours) can impact analysis precision, particularly when fusing data of different temporal resolutions. Aggregating second-level data to hourly intervals may affect joint short-term solar and wind forecasting outcomes.
S2: Sufficiency > Coverage
The 15-month temporal coverage is insufficient to capture long-term seasonal variations in joint wind and irradiance patterns.
S1: Sufficiency > Insufficient Volume
The dataset covers only a single wind turbine and PV array, limiting insights into coordination between multiple generation sources. This gap could be addressed by physically expanding the network or combining SOLETE with external datasets from utility and energy technology companies to enable larger grid control studies.
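The granularity gap above — fusing second-level generation data with hourly meteorological channels — typically forces an aggregation choice. The sketch below uses a synthetic series (not actual SOLETE data) and assumes pandas ≥ 2.2 for the lowercase frequency aliases; keeping an intra-hour spread statistic alongside the mean preserves some of the short-term variability that plain averaging discards:

```python
import numpy as np
import pandas as pd

# Synthetic second-resolution PV power series standing in for a SOLETE channel
idx = pd.date_range("2018-06-01", periods=7200, freq="s")  # two hours of data
power_kw = pd.Series(np.random.default_rng(0).uniform(0, 10, len(idx)), index=idx)

# Mean-aggregate to hourly for fusion with hourly meteorological data;
# the per-hour standard deviation retains a summary of sub-hourly variability
hourly = power_kw.resample("1h").agg(["mean", "std"])
```

For joint short-term solar and wind forecasting, whether to aggregate up or interpolate the coarse channels down is a modeling decision that directly affects the outcomes noted in the granularity gap.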
SRRL TSI-880 (sky imagery)
Details (click to expand)
The SRRL TSI-880 contains data from ground-based sky imagers that provide high temporal and spatial resolution (<1 km) information at single locations to support cloud detection and solar forecasting.
Data coverage is limited by camera locations, temporal resolution is restricted to 10-minute increments, and image resolution is limited to 352×288-pixel 24-bit JPEG images.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Images have limited resolution (352×288 pixels) and 10-minute capture intervals, which may be insufficient for very-short-term forecasting.
S2: Sufficiency > Coverage
Coverage is constrained by sensor network location and density. Expanded networks in diverse environments would improve coverage.
S2: Sufficiency > Coverage
The current dataset derives from sky imager datasets in Singapore, requiring similar networks in other regions or alternative data sources.
SWINySEG (sky imagery)
Details (click to expand)
SWINySEG (Singapore whole sky Nychthemeron image SEGmentation database) contains 6,768 daytime and nighttime sky/cloud images with corresponding binary ground truth maps taken in Singapore over 12 months in 2016, with annotations by the Singapore Meteorological Services.
The dataset provides valuable annotated data for cloud detection and segmentation but is limited to Singapore, has an insufficient volume of samples (especially nighttime images), and restricts commercial use.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
The dataset needs more manually annotated cloud mask labels and is imbalanced with fewer nighttime samples.
O2: Obtainability > Accessibility
The dataset is under a Creative Commons license that prohibits commercial use, and access must be requested.
Satellite imagery
Details (click to expand)
This category encompasses satellite imagery of various spatial and spectral resolutions with global coverage captured at different time intervals. Open-access options include Sentinel-1/2, MODIS, VIIRS, and Landsat (resolution down to 5m), while commercial providers like Maxar offer higher resolutions (down to 30cm). Planet NICFI provides free high-resolution mosaics of the world’s tropics for non-commercial use.
Satellite imagery for disaster assessment faces challenges with temporal currency and spatial resolution, with public datasets having insufficient resolution for accurate damage assessment and commercial high-resolution options being prohibitively expensive.
Data Gap Type
Data Gap Details
S4: Sufficiency > Timeliness
Both pre- and post-disaster imagery are needed, but pre-disaster imagery is sometimes outdated and does not reflect conditions immediately before the disaster.
S3: Sufficiency > Granularity
Accurate damage assessment requires high-resolution images, but the resolution of current publicly open datasets is inadequate for this purpose. Some commercial high-resolution images should be made available for research purposes at no cost.
Satellite images are used intensively for Earth system monitoring. The two biggest challenges in using satellite images are the sheer volume of data, which makes downloading, transferring, and processing difficult, and the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Publicly available datasets often lack sufficient granularity. This is particularly challenging for the Global South, which typically lacks the funding for high-resolution commercial satellite imagery.
U6: Usability > Large Volume
The sheer volume of data now poses one of the biggest challenges for satellite imagery. When data reaches the terabyte scale, downloading, transferring, and hosting become extremely difficult. Those who create these datasets often lack the storage capacity to share the data. This challenge can potentially be addressed by one or more of the following strategies:
Data compression: Compress the data while retaining lower-dimensional information.
Lightweight models: Build models with fewer features selected through feature extraction.
Large foundation models for remote sensing data: Purposefully construct large models (e.g., foundation models) that can handle vast amounts of data. This requires changes to the research architecture, such as modifications to the preprocessing pipeline.
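The compression strategy above can be illustrated with a spectral PCA sketch: project the band dimension of an image cube onto a few principal components, keeping a low-dimensional representation that can later be approximately reconstructed. This is a generic technique, not any specific mission's pipeline:

```python
import numpy as np

def compress_bands(cube: np.ndarray, k: int):
    """Project an (H, W, B) image cube onto its top-k spectral principal
    components, shrinking B bands to k channels per pixel."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(float)
    mean = flat.mean(axis=0)
    centered = flat - mean
    # right-singular vectors = principal directions in band space
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = vt[:k]                       # (k, B) component matrix
    scores = centered @ comps.T          # (H*W, k) compressed representation
    return scores.reshape(h, w, k), comps, mean

def decompress(scores: np.ndarray, comps: np.ndarray, mean: np.ndarray):
    """Approximate reconstruction of the original cube from the k channels."""
    h, w, k = scores.shape
    return (scores.reshape(-1, k) @ comps + mean).reshape(h, w, comps.shape[1])
```

Because neighboring spectral bands are highly correlated, a handful of components often retains most of the variance, which is what makes "compress while retaining lower-dimensional information" a viable answer to terabyte-scale archives.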
O2: Obtainability > Accessibility
Very high-resolution satellite images (e.g., finer than 10 meters) typically come from commercial satellites and are not publicly available. One exception is the NICFI dataset, which offers high-resolution, analysis-ready mosaics of the world’s tropics.
U5: Usability > Pre-processing
Satellite images often contain a lot of redundant information, such as large amounts of data over the ocean that do not always contain useful information. It is usually necessary to filter out some of this data during model training.
U2: Usability > Aggregation
Due to differences in orbits, instruments, and sensors, imagery from different satellites can vary in projection, temporal and spatial zones, and cloud blockage, each with its own pros and cons. To overcome data gaps (e.g. cloud blocking) or errors, multiple satellite images are often assimilated. Harmonizing these differences is challenging, and sometimes arbitrary decisions must be made.
U5: Usability > Pre-processing
The lack of annotated data presents another major challenge for satellite imagery. Collaboration and coordination at the sector level should be organized to facilitate annotation efforts across multiple sectors and use cases. Additionally, the granularity of annotations needs to be increased: for example, specifying crop types instead of just “crops”, and detailing flood damage levels rather than a general “damaged” label, is necessary for more precise analysis.
M: Misc/Other
Cloud cover presents a major technical challenge for satellite imagery, significantly reducing its usability. To obtain information beneath the clouds, pixels from clear-sky images captured by other satellites are often used. However, this method can introduce noise and errors.
M: Misc/Other
There is also a lack of technical capacity in the Global South to effectively utilize satellite imagery.
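The pre-processing gap above — filtering out redundant ocean imagery before training — can be sketched as a simple land-fraction filter over training tiles. Tile and mask shapes and the 10% cutoff are illustrative assumptions:

```python
import numpy as np

def drop_ocean_tiles(tiles: np.ndarray, land_masks: np.ndarray,
                     min_land_frac: float = 0.1):
    """Keep only tiles whose land-pixel fraction meets a minimum, so
    training compute is not spent on mostly open-ocean imagery.

    tiles:      (N, H, W, C) image tiles
    land_masks: (N, H, W) binary masks, 1 = land pixel
    """
    frac = land_masks.reshape(len(land_masks), -1).mean(axis=1)
    keep = frac >= min_land_frac
    return tiles[keep], keep
```

For ocean-focused applications the filter would of course be inverted; the point is that a cheap mask-based pass can shrink the training set substantially before any expensive processing runs.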
Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
The resolution of publicly open satellite images is not sufficient for some environment reconstruction studies.
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough. High-resolution images are usually commercial and not open for free.
Satellite remote sensing data for solar forecasting faces several challenges: variability in spatial and temporal resolution across different satellite sources, complex preprocessing requirements for multispectral data, and the need to accurately translate cloud observations into ground-level irradiance predictions. Improving granularity through supplementation with ground-based measurements and developing standardized preprocessing pipelines would significantly enhance forecast accuracy for grid management applications.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data from different satellite sources (both geostationary and polar-orbiting) needs to be collated and harmonized when analyzing multiple regions of interest, creating challenges in data integration and standardization.
U5: Usability > Pre-processing
Multispectral remote sensing data requires preprocessing, including atmospheric correction and band combinations in the visible and infrared spectra, before it can be effectively used for solar forecasting models.
R1: Reliability > Quality
Different cloud types affect ground-level solar irradiance in varying ways that satellite imagery alone cannot fully capture, necessitating verification and supplementation with ground-based measurements for improved model accuracy.
S3: Sufficiency > Granularity
Spatial and temporal resolution varies significantly between satellite sources, limiting the ability to capture rapid changes in cloud cover that impact solar irradiance, particularly during partly cloudy conditions which create high variability in short timeframes.
Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough. High-resolution images are usually commercial and not open for free.
Satellite imagery – GEDI LiDAR
Details (click to expand)
The Global Ecosystem Dynamics Investigation (GEDI) is a NASA/University of Maryland mission that uses LiDAR to create detailed 3D maps of forest canopy height and structure. By measuring forests in 3D, GEDI data enables accurate estimation of forest biomass and carbon storage across global scales.
Quality uncertainties in GEDI data affect carbon stock estimation reliability, requiring validation methods and calibration procedures to improve measurement accuracy.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
GEDI data contains inherent uncertainties including geolocation errors and weak return signals in dense forests, which introduce errors into canopy height estimates and subsequent carbon calculations. Combining GEDI with other data sources like airborne LiDAR for validation and developing region-specific calibration methods could improve data reliability.
Satellite imagery – Hyperspectral
Details (click to expand)
This dataset consists of hyperspectral satellite imagery from platforms such as PRISMA and EnMAP, which capture hundreds of narrow spectral bands across the electromagnetic spectrum, providing detailed spectral information for detecting methane plumes with greater sensitivity than multispectral systems.
Very few actual hyperspectral images of methane plumes exist, creating a significant data volume limitation for training robust detection algorithms.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Images of methane plumes in hyperspectral satellite data are very rare, leading to insufficient data for developing and training robust detection algorithms. Consequently, researchers often use synthetic data, transposing high-resolution methane plume images from other sources such as Sentinel-2 onto hyperspectral images from platforms like PRISMA. Expanding the collection of actual hyperspectral methane plume observations or developing more sophisticated methods for generating realistic synthetic data would significantly improve detection capabilities.
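The synthetic-data workaround described above can be caricatured as follows: composite a plume concentration map onto a clean background cube via Beer-Lambert-style per-band attenuation. The band count and absorption spectrum here are placeholders, not PRISMA's or any instrument's actual spectral response:

```python
import numpy as np

def composite_plume(background: np.ndarray, plume: np.ndarray,
                    absorption: np.ndarray) -> np.ndarray:
    """Attenuate each band of a clean hyperspectral cube (H, W, B) under a
    plume column-enhancement map (H, W), using per-band absorption
    coefficients (B,): radiance drops where the plume sits, more strongly
    in bands where methane absorbs."""
    return background * np.exp(-plume[..., None] * absorption)
```

Real pipelines transpose plume shapes observed by other sensors (e.g., Sentinel-2) and use laboratory methane absorption spectra for the per-band coefficients, but the compositing step is essentially this multiplication.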
Satellite imagery – Multi-Radar/Multi-Sensor System
Details (click to expand)
The Multi-Radar Multi-Sensor (MRMS) system combines data from multiple radars, satellites, surface observations, lightning reports, rain gauges, and numerical weather prediction models to produce decision-support products every two minutes. It provides detailed depictions of high-impact weather events such as heavy rain, snow, hail, and tornadoes, enabling forecasters to issue more accurate and earlier warnings. See https://www.nssl.noaa.gov/projects/mrms/
Obtaining and integrating radar data from various sources is challenging due to access restrictions, format inconsistencies, and limited global coverage.
Data Gap Type
Data Gap Details
U3: Usability > Usage Rights
Much radar data is restricted to academic and research purposes only.
O2: Obtainability > Accessibility
Radar data from many countries are not open to the public. They must be purchased or formally requested. Different agencies apply differing quality control protocols, making global-scale analysis challenging.
U1: Usability > Structure
Radar data from different sources vary in format, spatial resolution, and temporal resolution, making data assimilation difficult.
S2: Sufficiency > Coverage
There is insufficient data or no data available from the Global South.
Satellite imagery – Multispectral
Details (click to expand)
This dataset contains images captured by spectrometer-equipped satellites that record data at specific wavelengths to detect the spectral signatures associated with methane. Notable missions include the Sentinel-5P TROPOMI instrument and the upcoming MethaneSAT, which provide global coverage of methane concentrations in the atmosphere.
Current multispectral satellite data has insufficient spatial resolution to detect smaller methane leaks.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Many current satellites have limited spatial resolution, making it challenging to detect smaller or localized methane sources. This low resolution can result in inaccurate assessments, potentially missing smaller leaks or misidentifying emission sources. Higher resolution is necessary for accurately identifying and quantifying methane emissions from specific facilities or small-scale sources.
Satellite imagery – PALSAR radar images
Details (click to expand)
PALSAR (Phased Array type L-band Synthetic Aperture Radar) provides radar imagery that can capture the 3D structure of forests by penetrating cloud cover and forest canopies. This technology enables consistent monitoring regardless of weather conditions or time of day, making it valuable for continuous forest carbon stock estimation.
Domain expertise is needed to preprocess this data, limiting its accessibility to researchers and practitioners without specialized knowledge in radar imagery interpretation.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Domain expertise is required to understand the raw radar data and preprocess it properly for use in ML models for forest carbon estimation. Developing standardized preprocessing pipelines and tools could make this valuable data more accessible to the broader ML and climate science communities.
Simulated variables from process-based models of soil organic carbon dynamics
Details (click to expand)
This dataset contains soil data generated by physics-based or process-based soil models that simulate soil organic carbon dynamics based on environmental and management inputs. These simulations provide alternatives to direct measurements where field data collection is prohibitively expensive or impractical.
Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Soil carbon values generated by simulators can be unreliable: the underlying process-based models may be outdated or may carry systematic biases, which are reflected in the simulated variables. Moreover, the ML scientists who use these simulated variables usually lack the domain knowledge needed to properly calibrate the process-based models.
Smart inverter devices database
Details (click to expand)
The California Energy Commission keeps a list of smart inverters that meet strict standards for safety and communication. These inverters must pass extra tests to show they can handle things like voltage, frequency, timing, and how they connect or disconnect from the grid, along with other technical functions to keep the power system safe and stable.
Those include: CEC Grid Support Solar Inverters, CEC Grid Support Battery Inverters, CEC Grid Support Solar/Battery Inverters, CEC Inverters with Power Control Systems functionality.
Additional vendors can also be contacted for smart inverter information:
Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant manufacturers and smart inverter models that can then be contacted for research partnerships. In terms of coverage area, while California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, other regions outside of the United States may locate similar manufacturers through partnerships and collaborations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Particularly for the CEC database, one will need to contact the CEC or manufacturer to receive additional information for a particular smart inverter. Detailed studies using smart inverter hardware may require collaboration with a utility and research organization to perform advanced research studies.
U2: Usability > Aggregation
To retrieve additional data beyond the single entry model and manufacturer of a particular smart inverter, one may need to contact a variety of manufacturers to get access to datasets and specifications for operational smart inverter data, laboratories to get access to hardware in the loop test centers, and utilities or local energy commissions for smart inverter safety compliance and standards.
S2: Sufficiency > Coverage
The new grid support functions defined by UL 1741-SA and UL 1741-SB are optional in most jurisdictions but are now required in California and Hawaii; public manufacturing data is available only via the CEC website. Collaborations with manufacturers outside the US may be necessary to compile a similar database, and contact with utilities can provide a better understanding of UL 1741-SB adoption elsewhere.
Soil Survey Geographic Database (SSURGO)
Details (click to expand)
The Soil Survey Geographic Database (SSURGO) contains soil organic carbon data collected through field observations and laboratory analysis of soil samples. It provides comprehensive soil information for the United States, including physical and chemical soil properties.
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is collected by different farmers on different farms, leading to consistency issues and a need to better structure the data.
S3: Sufficiency > Granularity
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
S2: Sufficiency > Coverage
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
Soil measurements – North Wyke Farm Platform
Details (click to expand)
The NorthWyke Farms platform data is a collection of soil measurements from the UK’s North Wyke Farm Platform, providing quarterly soil organic carbon values along with other environmental parameters. The dataset covers experimental farm plots under different management practices and is continuously updated with new measurements.
The biggest common challenges for use cases involving soil organic carbon are insufficient data volume and the lack of high-granularity data.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Soil carbon values are reported quarterly, which is not frequent enough to capture the weekly changes in soil carbon that follow adjustments to fertilizer amounts or tilling practices.
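Where only quarterly values exist, a simple linear interpolation to a weekly grid is sometimes used as a stopgap; note that this is purely a gap-filler and cannot recover genuine week-scale responses to management changes. A minimal sketch:

```python
def interpolate_weekly(quarterly, weeks_per_quarter=13):
    """Linearly interpolate successive quarterly soil-carbon values to a
    weekly series. This is a crude gap-filler: it cannot recover real
    week-scale responses to fertilizer or tillage changes."""
    weekly = []
    for q0, q1 in zip(quarterly, quarterly[1:]):
        for w in range(weeks_per_quarter):
            weekly.append(q0 + (q1 - q0) * w / weeks_per_quarter)
    weekly.append(quarterly[-1])  # close the series on the last observation
    return weekly

series = interpolate_weekly([20.0, 22.6])
print(len(series), series[0], series[-1])  # 14 20.0 22.6
```

A model trained on such interpolated targets will inherit the smoothing; the gap description above argues for collecting genuinely higher-frequency measurements instead.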
Solar panel PV system dataset
Details (click to expand)
The Solar Panel PV System Dataset (https://www.kaggle.com/datasets/arnavsharmaas/solar-panel-pv-system-dataset/data) is a tabular dataset from the National Renewable Energy Laboratory that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery-rated capacity.
The solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes.
The solar panel PV system dataset excluded third-party-owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a data gap in timeliness as some of the data includes historical data, which may not reflect current pricing and costs of PV systems.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excluded third-party-owned systems, systems with battery backup, self-installed systems, and data that was missing installation prices. Data was self-reported and may be inconsistent based on the reporting of component costs. Furthermore, some state markets were underrepresented or missing, which can be alleviated by new data collection jointly with simulation studies.
S4: Sufficiency > Timeliness
The dataset includes historical data that may not reflect current pricing for PV systems. To alleviate this, updated pricing may be incorporated in the form of external data or as additional synthetic data from simulation.
Solcast (global solar forecasting and historical solar irradiance data)
Details (click to expand)
Solcast is a global solar forecasting and historical solar irradiance data provider that combines satellite imagery from Himawari 8, GOES-16, and GOES-17 with Numerical Weather Prediction models to deliver 10-15 minute scale solar irradiance data products.
Solcast data is only accessible through academic or research institutions, uses coarse elevation models, has limited coverage (33 global sites), and provides data at 5-60 minute intervals, insufficient for very-short-term forecasting.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Time resolution ranges from 5 to 60 minutes, which is insufficient for sub-5-minute forecasting needs.
S2: Sufficiency > Coverage
Coverage is limited to 33 global sites (18 tropical/subtropical, 15 temperate), requiring expansion to other regions and environmental conditions.
R1: Reliability > Quality
Significant elevation differences between ground sites and the corresponding model grid cells affect clear-sky irradiance estimation accuracy.
O1: Obtainability > Findability
Data is only accessible through collaborating academic or research institutions.
Sub-metered appliance-level data
Details (click to expand)
This collection includes multiple international datasets of sub-metered building electricity consumption, primarily from residential buildings across North America, Europe, and Asia collected between 2011-2020. These datasets provide granular appliance-level energy consumption data at varying sampling frequencies (1Hz-15kHz) along with aggregate building-level measurements. Some datasets include additional measurements such as occupancy information, environmental conditions, and utility billing data. The datasets vary in coverage from single households to hundreds of homes, with monitoring periods ranging from two months to several years.
- Almanac of Minutely Power dataset (AMPds2): A single building electricity, water, and natural gas consumption dataset from a home in Burnaby, British Columbia, Canada from 2012-2014 which includes environment and utility billing data as well.
- Commercial building energy dataset (COMBED): A dataset of 6 commercial buildings on the Indraprastha Institute of Information Technology (IIIT-Delhi) campus from August 2013 to the present, containing total power consumption as well as sub-metered data for elevators, air handling units (AHUs), uninterruptible power supplies (UPS), and central campus heating, ventilation, and air conditioning (HVAC) pumps and chillers at a 30-second cadence.
- DEDDIAG: A dataset comprised of aggregate and disaggregated power consumption from 15 southern German homes monitored at 1Hz containing 50 appliances including dishwashers, washing machines, refrigerators and dryers over a span of 3.5 years (2016-2020). Aggregated data includes three-phase measurements. This dataset also contains event start and stop timestamps for 14 appliances.
- Dutch Residential Energy Dataset (DRED): Consists of data collected from a single household in the Netherlands, containing appliance-level and total energy consumption over two months. Appliances measured were a refrigerator, washing machine, central heating, microwave, oven, cooker, blender, toaster, television, fan, living room outlets, and a laptop, recorded at a sampling frequency of 1 Hz. DRED additionally has occupancy data based on WiFi and Bluetooth signals received from occupant smartphones and wearable devices, allowing consumers to be located without outfitting the home with more intrusive monitoring devices. DRED can be accessed by request.
- Electricity Consumption and Occupation (ECO): A dataset collected from June 2012-January 2013 covering 6 homes in Switzerland, where 6-10 smart plugs were deployed in each household. Aggregate consumption at the building level was measured in three phases to capture voltage, current, and phase shifts. Occupancy data was tracked manually by residents and via a passive infrared entry door sensor.
- GREEND: A dataset of 9 households in Austria and Italy covering December 2013-April 2014. It includes aggregated and sub-metered appliance-level data, which varied depending on the appliance inventory of each household, with active power measurements taken at a frequency of 1Hz. GREEND can be requested by form.
- HIPE: A dataset from October 2017-December 2017 recording smart meter measurements from 10 machines and the main terminal of an electronics production site operated by the Institute of Data Processing and Electronics (IPE) at Karlsruhe Institute of Technology (KIT) in Germany at a cadence of 5 seconds with measurements with respect to active power, reactive power, voltage, frequency, and distortion.
- Indian data for Ambient Water and Electricity Sensing (iAWE): Total consumption, appliance-level, and circuit-panel-level data from a single-family home in New Delhi, India, collected in the summer of 2013 over the course of 73 days. Additional quantities, such as water usage from an overhead tank and network strength based on packet loss, were also jointly measured.
- IDEAL: A joint electricity, gas, temperature, humidity, and light dataset for 255 homes in the UK from August 2016 to June 2018. Aggregate and sub-metered consumption was measured at 1 second intervals, while temperature, humidity and light were measured at 12 second intervals. Household occupancy was measured through initial surveys with respect to socio-demographic data and self-reported updates to the data in the event that there was a change in occupancy.
- Reference Energy Disaggregation Dataset (REDD): Contains 119 days' worth of aggregate consumption taken in 2011 from 10 residential buildings located in the greater Boston area. The data includes meter-level power phases and voltage recorded at 15kHz, as well as 24 sub-metered circuits labeled by appliance category and measured at a cadence of 0.5Hz and 1Hz for large and small plug-level appliances respectively.
- REFIT: A dataset containing aggregate and individual-appliance-monitor sub-meter data taken every 8 seconds from 20 UK households from September 2013 to September 2015. Six of the households had rooftop solar panels; however, 3 were rewired to remove the effect of generation.
- UMass Smart Home data set: This dataset is comprised of metered and sub-metered data from three homes in western Massachusetts taken over a period of three years. Measurements included average household load, circuit-level load, and plug load per second. Accompanying generation data from solar panels and wind turbines is available for one of the three homes. Environmental data on outdoor weather and indoor temperature and humidity is provided, as well as occupancy information through wall switch data, doors, and motion sensors. HVAC trigger events and corresponding temperature settings and operational status are also provided.
- UK Domestic Appliance-Level Electricity data set (UK-DALE): A dataset comprised of measurements of aggregate as well as individual appliance-level consumption recorded every 6 seconds from 5 UK homes, collected by researchers at Imperial College. Continuous coverage varied per house, ranging from 39 to 786 days spanning 2012 to 2015. Data included whole-house active power, apparent power, and RMS voltage. Appliance-level measurements were taken every 6 seconds using individual appliance monitors for up to 54 appliances per residence.
For accurate NILM studies, benchmark datasets are required to include not only consumption but local power generation (e.g., from rooftop solar), as it can affect the overall aggregate load observed at the building level. While some datasets may include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave both as a load and generator such as electric vehicles or stationary batteries were also not included. The majority of building types are single family housing units limiting the diversity of representation. Furthermore, most datasets are no longer maintained following study close.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Sub-metered data relies heavily on the sensor network installed to monitor the building. Depending on the technology used, some sensors require calibration or are prone to malfunctions and delays. Additionally, interference from other devices can be present in the aggregate building-level readings, such as that experienced by REFIT, which needs to be addressed manually to enhance the usability of the dataset. These issues vary depending on the sub-meter dataset used, requiring a clear understanding of the metadata and documentation specific to the testbed the study was built upon. Exploratory data analysis of the time series may assist in identifying outliers resulting from sensor drift.
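For such a first-pass screen, a robust (median/MAD-based) outlier check is often preferable to mean/standard-deviation rules, since a large spike inflates the standard deviation and can mask itself. A minimal sketch, not tied to any particular dataset:

```python
from statistics import median

def flag_outliers(readings, z=3.5):
    """Flag indices whose value lies far from the median in robust
    (MAD-scaled) units. This is a first-pass screen for sensor spikes;
    slow sensor drift needs windowed or trend-based checks instead."""
    med = median(readings)
    mad = median(abs(r - med) for r in readings)
    if mad == 0:
        return []  # more than half the readings equal the median
    # 1.4826 scales MAD to be comparable to a standard deviation
    return [i for i, r in enumerate(readings)
            if abs(r - med) / (1.4826 * mad) > z]

print(flag_outliers([100, 101, 99, 100, 5000, 100, 98]))  # [4]
```

Flagged indices would then be cross-checked against the testbed's metadata (sensor swaps, calibration events) rather than dropped blindly.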
U1: Usability > Structure
When retrieving NILM data from a variety of sources, both from pre-existing studies and through custom data collection, the structure of the received data can vary. Testbed design, hardware, and the variables monitored depend on sensor availability, which can ultimately influence schemas and data formats. Data structure may also differ based on the level of disaggregation, at the plug level or the individual-appliance level. When building future testbeds for data collection, it may help to follow the standards set by APIs such as NILMTK, which has successfully integrated multiple datasets from different sources. Using the REDD dataset format as inspiration, the toolkit developers created a standard energy disaggregation data format (NILMTK-DF) with common metadata and structure, which requires manual dataset-specific preprocessing. When working with non-standardized data that may require aggregation, machine-learning-based data fusion strategies may help automate schema matching and data integration.
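The dataset-specific converter step can be sketched as a per-dataset column mapping onto a shared schema, in the spirit of NILMTK's converters. The dataset names and column names below are hypothetical, purely for illustration:

```python
import csv
import io

# Hypothetical per-dataset column mappings onto a shared
# (timestamp, watts) schema, in the spirit of NILMTK's converters.
COLUMN_MAPS = {
    "dataset_a": {"ts": "timestamp", "active_power_W": "watts"},
    "dataset_b": {"time": "timestamp", "P": "watts"},
}

def to_common_schema(raw_csv, dataset):
    """Rename a dataset's columns to the shared schema, dropping
    any columns the mapping does not cover."""
    mapping = COLUMN_MAPS[dataset]
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        rows.append({mapping[k]: v for k, v in row.items() if k in mapping})
    return rows

sample = "time,P\n2015-01-01T00:00:00,132.5\n"
print(to_common_schema(sample, "dataset_b"))
```

Real converters also handle unit conversion, timezone normalization, and metadata (sampling rate, appliance labels), which is where most of the manual per-dataset effort goes.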
S6: Sufficiency > Missing Components
While sub-metered data provides a means of verifying non-intrusive load monitoring techniques, it does not capture the hidden human motivators driving appliance usage (such as comfort, utility cost, and daily activities) as well as other important factors contributing to the aggregate load seen at the building level meter. The key to improving these studies is to provide greater context to the sub-metered data by taking additional joint measurements such as rooftop solar power production, electric vehicle load, occupancy related information, and battery storage. Some dataset-specific missing data components are highlighted below.
All datasets mentioned do not include electric vehicle loads. REDD, AMPds2, COMBED, DEDDIAG, DRED, GREEND, iAWE, UK-DALE do not include generation from rooftop solar. REFIT contains solar from three homes but they were not the focus of the study and were treated as solar interference to the aggregate load. The UMass smart home dataset only had representation of one home with solar and wind generation, though at a significantly larger square footage and build compared to the other two homes that were featured.
While DRED provided occupancy information through data collected from wearable devices with respect to the home and ECO and IDEAL through self-reporting and an infrared entryway sensor, all other studies did not.
The majority of datasets are not amenable to human-in-the-loop analysis of user behavior (consumption patterns, response to feedback, and the effectiveness of load shifting in promoting energy-conserving behaviors) because such behavioral context is not represented.
While AMPds2 includes some utility data, most datasets do not incorporate billing or real-time pricing. This type of data would be beneficial, as it varies by time, season, region, and utility.
Battery storage was not taken into account in any of the building consumption datasets.
S2: Sufficiency > Coverage
Gaps in dataset coverage are specific to the sub-metered dataset. These gaps may be due to unaccounted loads, level of disaggregation (e.g. circuit level, plug level, or individual appliance level), or limited appliance types. Diversity of building types are limited as most studies take place in single family residences. Some dataset specific gaps are detailed below that may be addressed by collecting new data on existing testbeds or by augmenting already collected data with synthetic information. Future data collection efforts should be mindful of avoiding the kinds of gaps associated with existing datasets.
In the AMPds2 data, there was some missing data in the electricity, water, and natural gas readings. Additionally, some un-metered household loads were not accounted for in the aggregate building-level readings, and dishwasher consumption lacked direct water-meter monitoring. REFIT did not monitor appliances that could not be accessed through wall plugs, such as electric ovens. Depending on the built environment and building type, larger loads may not be connectable to building-level meters. For example, in the GREEND dataset, electric boilers in Austria were connected to separate external meters, and in the UMass smart home dataset, gas furnace, exhaust fan, and recirculator pump loads could not be monitored.
AMPds2, DEDDIAG, DRED, iAWE, REDD, REFIT, and the UMass smart home dataset all gather data in single-family homes, which may not be representative of the diversity of buildings in terms of age, location, construction, and household demographics. REFIT covers different single-family home types (detached, semi-detached, and mid-terrace) ranging from 2-6 bedrooms with builds from the 1850s to 2005. GREEND covers apartments in addition to single-family homes, but spans only 9 households; AMPds2, DRED, and iAWE each cover a single household. Additionally, each dataset is specific to the location where the measurements were taken and is thus shaped by the environmental conditions of the region as well as the culture of the population. For example, REDD consists of data from 10 monitored homes, which may not be representative of the appliances contributing to overall load in populations outside of Boston.
COMBED contains complex load types that may rely on variable-speed drives as well as multi-state devices, which the other datasets do not contain. This may be due to the difference in building type, but could also stem from the limited diversity of appliance representation.
ECO data relied on smart plugs for disaggregated load consumption measurements which varied between households depending on the smart plug appliance coverage. For all households the total consumption was not equal to the sum of the consumption measured from the plugs alone, indicative of a high proportion of non-attributed consumption.
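The size of this non-attributed consumption can be quantified directly by comparing the building-level meter against the sum of the plug readings over a common interval; a minimal sketch:

```python
def non_attributed_fraction(aggregate_kwh, plug_kwh):
    """Fraction of whole-building consumption not explained by the sum
    of sub-metered plug readings over the same interval. Clamped at
    zero, since meter rounding can make the plug sum exceed the total."""
    residual = aggregate_kwh - sum(plug_kwh)
    return max(residual, 0.0) / aggregate_kwh

# e.g. plugs explain 7.5 of 10 kWh, leaving 25% unattributed
print(non_attributed_fraction(10.0, [3.0, 2.5, 2.0]))  # 0.25
```

Tracking this fraction per household and per interval is a simple way to decide whether a dataset's plug coverage is sufficient for a given NILM study.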
R1: Reliability > Quality
In the AMPds2 data, the sum of the sub-metered consumption data did not add up to the whole house consumption due to some rounding error in the meter measurements, highlighting not only the need for NILM studies with sub-metered data as ground truth, but also the type of building level meter. Future data collection efforts may want to not only focus on retrieval of utility-side building meter data but also supplemental aggregate meter data to detect mismatches in measurements.
Datasets or studies that require self-reporting by customers may introduce participant bias, as the diligence with which households update voluntary information may vary. For example, if the number of household members, occupancy schedule, and the addition of new plug loads are self-reported, the frequency of updates varies with volunteer engagement. Additionally, volunteers who participate in NILM studies may have a particular propensity for energy-efficient actions and may not be representative of the general population; for example, some participants in UK-DALE were Imperial College graduate students who were motivated to participate to advance their own projects. To ensure that measured electricity usage represents the general population, future case studies can recruit volunteer communities with diverse socioeconomic backgrounds and locations.
The Public Utility Data Liberation (PUDL)
Details (click to expand)
The Public Utility Data Liberation (PUDL) project, maintained by Catalyst Cooperative, integrates and standardizes energy sector data from US government agencies including EIA, FERC, EPA, and system operators into analysis-ready formats. This continuously updated database covers power generation, fuel consumption, emissions, and financial data from 2009 to present across the United States.
Government energy datasets suffer from inconsistent formats, missing documentation, and aggregation challenges that prevent ready analysis. Key gaps include complex pre-processing requirements due to format changes, limited documentation maintenance, and missing weather and transmission data. Standardized reporting formats across agencies, improved documentation practices, and expanded data collection could significantly enhance the utility of integrated energy datasets for policy analysis.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Source data format changes (e.g., FERC’s shift from PDF to XBRL) and semi-structured formats require extensive preprocessing. PDF-based data extraction faces OCR challenges due to scan quality and inconsistent formatting. Standardized reporting formats and machine-readable data standards across agencies could reduce preprocessing burden.
U4: Usability > Documentation
Documentation updates lag behind source data changes, requiring continuous monitoring by maintainers. Proactive documentation standards and change notification systems from data providers could improve maintenance efficiency.
U3: Usability > Usage Rights
While PUDL uses Creative Commons licensing, some utility operator data has unclear public use rights despite being provided to regulatory agencies. Explicit public use licensing statements from government agencies could clarify usage permissions.
U2: Usability > Aggregation
Varying schema and naming conventions across agencies complicate data joining. Probabilistic entity matching helps but requires manual verification. Universal relational database standards and common identifiers across agencies could streamline aggregation.
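A rough sketch of such fuzzy matching using Python's standard-library difflib is shown below; real pipelines (including PUDL's) use more sophisticated probabilistic record linkage, and low-scoring matches are flagged for manual review rather than accepted automatically. The plant names are illustrative:

```python
import difflib

def match_plant(name, candidates, threshold=0.85):
    """Return (best_candidate, score) if the best fuzzy match clears the
    threshold, else (None, score) to flag the record for manual review."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = difflib.SequenceMatcher(None, name.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return (best, best_score) if best_score >= threshold else (None, best_score)

eia_names = ["Barry", "Comanche Peak", "Big Bend"]
print(match_plant("comanche peak", eia_names))  # ('Comanche Peak', 1.0)
```

Universal common identifiers across agencies, as the gap description suggests, would make this matching step unnecessary.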
U1: Usability > Structure
Source data structures vary significantly between reporting years and agencies, with inconsistent plant identification systems. Standardized data schemas and versioning practices could improve structural consistency.
S6: Sufficiency > Missing Components
Weather model data and transmission/congestion information from grid operators would enhance analysis capabilities. Integration partnerships with weather services and grid operators could expand dataset utility.
S3: Sufficiency > Granularity
Temporal resolution varies from hourly to annual across sources, requiring interpolation techniques. More frequent and standardized reporting intervals could improve data granularity.
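One simple alternative to interpolation, appropriate when annual values represent reported totals or regulatory states rather than smoothly varying quantities, is to repeat each value across the finer grid; a minimal sketch:

```python
def forward_fill_annual(annual, periods=12):
    """Expand an annual series to monthly resolution by repeating each
    annual value. A simple alternative to interpolation when values are
    reported totals or states rather than smoothly varying quantities."""
    return [v for v in annual for _ in range(periods)]

monthly = forward_fill_annual([5.0, 6.0])
print(len(monthly), monthly[0], monthly[12])  # 24 5.0 6.0
```

Whether to forward-fill, interpolate, or disaggregate with an auxiliary signal (e.g., monthly fuel receipts) depends on what the annual figure actually measures; mixing strategies silently is a common source of error when joining sources of different resolution.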
S2: Sufficiency > Coverage
Dataset coverage limited to US regulatory agencies and organizations. International data partnerships could expand geographic scope for comparative analysis.
US large-scale solar photovoltaic database
Details (click to expand)
The US Large-scale Solar Photovoltaic Database (USPVDB) contains polygon representation of large-scale photovoltaic installations, associated with facility-specific data attributes.
They were mined from the US Energy Information Administration (EIA) form 860 and facility type designation by the US Environmental Protection Agency (EPA). The dataset also has information on whether the large-scale PV installations are for agrivoltaic purposes. Overall, 3,699 US ground mounted facilities with capacity greater than or equal to 1MWdc are represented. The USPVDB data must be accessed through the United States Geological Survey (USGS) mapper browser application or for download as GIS data in the form of shapefiles or geojsons. Tabular data and metadata are provided in CSV and XML format.
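Once downloaded, the GeoJSON form of the data can be aggregated with only the Python standard library. The attribute name `p_cap_dc` used here is an assumption about the capacity field and should be checked against the field list shipped with the actual USGS download; the inline sample stands in for a real file:

```python
import json

def total_capacity_mwdc(geojson_text, field="p_cap_dc"):
    """Sum installed capacity from a USPVDB-style GeoJSON download.
    The attribute name "p_cap_dc" (DC capacity, MWdc) is an assumption;
    verify it against the metadata shipped with the dataset."""
    fc = json.loads(geojson_text)
    return sum(f["properties"].get(field, 0.0) for f in fc["features"])

# Tiny inline stand-in for a downloaded USPVDB GeoJSON file
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"p_cap_dc": 2.5}, "geometry": None},
        {"type": "Feature", "properties": {"p_cap_dc": 10.0}, "geometry": None},
    ],
})
print(total_capacity_mwdc(sample))  # 12.5
```

For the shapefile form, or for spatial joins against other layers, a GIS library such as geopandas would be used instead.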
Only the US is covered in this dataset. Supplementing it with international large-scale photovoltaic satellite imagery could expand its coverage.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data may be accessed through the USGS's designated USPVDB mapper or downloaded as GIS shapefiles, tabular data, or XML metadata. Data is open and easily obtainable.
S2: Sufficiency > Coverage
Coverage is over the US and specifically over densely populated regions that may or may not correlate to areas of low cloud cover and high solar irradiance. Representation of smaller scale private PV systems could expand the current dataset to less populated areas as well as regions outside the US.
US school bus fleet dataset
Details (click to expand)
The US school bus fleet dataset compiled by the World Resources Institute contains information on school district, model year, fuel type, manufacturer, seating capacity, and ownership mode for over 450,000 buses from 46 states and the District of Columbia, covering data collected from March to November 2022.
The dataset suffers from inconsistent state-level reporting structures and missing data from 4 US states, limiting comprehensive national analysis. Standardizing reporting formats and expanding state participation could enable more robust AI models for fleet electrification planning across diverse geographic and operational contexts.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Inconsistent state-level reporting creates varying data structures and fields, with some states excluding contractor-owned buses. Solution: Develop federal reporting standards for consistent data collection across all states.
S2: Sufficiency > Coverage
Four US states are absent from the dataset, and some participating states exclude contractor-owned buses, leaving parts of the national fleet unrepresented. Solution: Expand state participation and require reporting of all ownership modes.
S4: Sufficiency > Timeliness
Dataset maintenance discontinued after November 2022. Solution: Establish ongoing federal or industry-supported data collection mechanisms.
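The federal reporting standard proposed above amounts to mapping each state's idiosyncratic fields onto one common schema. The sketch below illustrates that harmonization step; the field names, state formats, and records are invented for illustration and are not taken from the WRI dataset.

```python
# Target schema a federal reporting standard might define (illustrative).
STANDARD_FIELDS = ["state", "district", "model_year", "fuel_type", "owner"]

# Each state maps its own column names onto the standard schema
# (hypothetical examples of inconsistent state-level reporting).
STATE_FIELD_MAPS = {
    "NY": {"district": "school_district", "model_year": "yr",
           "fuel_type": "fuel", "owner": "ownership"},
    "TX": {"district": "district_name", "model_year": "model_year",
           "fuel_type": "fuel_type", "owner": "operator"},
}

def normalize(state, record):
    """Map one raw state record onto the standard schema; missing fields become None."""
    mapping = STATE_FIELD_MAPS.get(state, {})
    row = {"state": state}
    for field in STANDARD_FIELDS[1:]:
        # Fall back to the standard name if the state has no custom mapping.
        row[field] = record.get(mapping.get(field, field))
    return row

ny_raw = {"school_district": "Albany CSD", "yr": "2015",
          "fuel": "diesel", "ownership": "district"}
print(normalize("NY", ny_raw))
```

Once every state's records pass through such a mapping, fleet-wide analyses (e.g., electrification planning models) can run on a single consistent table.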
WeatherBench 2 is based on ERA5, so it inherits ERA5's issues: in particular, the data is biased over regions with no observations.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Inherent biases limit its use as ground truth. ML-enhanced data assimilation techniques and ensemble reanalysis approaches can reduce model-dependent biases, particularly improving the accuracy of precipitation and cloud fields.
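One of the simplest forms of the bias reduction mentioned above is mean-bias removal against station observations. The sketch below is a toy illustration on synthetic numbers: real reanalysis bias correction (let alone ML-enhanced data assimilation) is far more sophisticated, and nothing here reflects actual ERA5 or WeatherBench 2 values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" temperature field and a warm-biased reanalysis of it.
truth = 15 + 5 * rng.standard_normal(500)
reanalysis = truth + 1.5 + 0.5 * rng.standard_normal(500)  # +1.5 degC bias

# Stations observe only ~10% of grid points, mimicking sparse coverage.
observed = rng.random(500) < 0.1

# Estimate the bias where observations exist, then subtract it everywhere.
bias = np.mean(reanalysis[observed] - truth[observed])
corrected = reanalysis - bias

print(round(bias, 2))  # close to the injected +1.5
```

The catch this toy example hides is exactly the gap described above: in regions with no observations, there is nothing to estimate the bias against, so any correction there is an extrapolation.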
More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on it.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Larger training datasets generally improve the performance of data-driven sub-seasonal forecast models. However, with only a limited number of models contributing to the SubX dataset, training data is scarce. To enhance ML model performance, more SubX data generated by physics-based numerical weather forecast models is required.
xBD Dataset (pre- and post-disaster satellite imagery)
Details (click to expand)
xBD is an annotated benchmark dataset containing pre- and post-disaster satellite imagery, used for training and evaluating ML models for disaster damage assessment. The dataset is publicly available at https://paperswithcode.com/dataset/xbd.
The xBD dataset has two significant limitations: it is geographically biased toward North America and lacks granular damage severity classification, limiting its global applicability and assessment precision.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is no differentiation of grades of damage. More granular information about the severity of damage is needed for more precise assessments.
S2: Sufficiency > Coverage
Data is highly biased towards North America. Similar data from other parts of the world is urgently needed.