Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.
In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.
This project is currently in its beta phase, with ongoing improvements to content and usability. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.
Analysis of grid reliability events
Due to rapid fluctuations in power generation, renewables introduce variability into the grid. These fluctuations can trigger safety monitoring systems related to grid stability. Power grid control centers receive multiple streams of data from these systems (e.g., alarms, sensors, and field reports) that are semi-structured and arrive at high volume. For operators, these alarm triggers and the associated data can be overwhelming to rationalize, reduce, and contextualize when diagnosing grid conditions. ML can assist in interpreting these data, both to better understand the sequence of events leading up to an incident and to identify and detect the causes behind system disturbances affecting grid reliability.
Access to EPRI grid alarm data is currently limited to within EPRI. Data gaps with respect to usability result from redundancies in grid alarm codes, requiring significant pre-processing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events require field verification to distinguish fault from non-fault events.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data access is limited to within EPRI due to restrictions on data provided by utilities. Anonymizing and aggregating the data into a benchmark or toy dataset that EPRI releases to the wider community could circumvent these security issues, at the cost of operational context.
U1: Usability > Structure
Grid alarm codes may be non-unique across different lines and grid assets: two different codes can represent equivalent information because of differing naming conventions, requiring significant pre-processing and analysis of the alarm data to identify unique labels from over 2,000 code words. Additional labels expressing alarm priority (for example, a high-priority alarm type indicating events such as fire, gas, or lightning) are also encoded into the grid alarm trigger event code. Creating a standard structure for operational text data, such as those already used in operational systems by companies like General Electric or Siemens, would avoid inconsistencies in the data.
U3: Usability > Usage Rights
Usage rights are currently restricted to those working within EPRI.
U4: Usability > Documentation
Remote-signal identification information from monitoring sensors and devices encodes data about the alarm trigger event in the context of fault priority. Depending on the asset, line, or sensor, this identification code can vary with the naming convention used. Documentation associating remote-signal IDs with a dictionary of finite alarm code types would facilitate pre-processing of alarm data and assessment of the diversity of fault events occurring in real-time systems (since different alarm trigger codes may correspond to redundant events similar in nature).
U5: Usability > Pre-processing
In addition to challenges with decoding remote-signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of textual detail provided. Typically, the details cover the grid asset and its action; for example, a text description from a line monitoring device may describe the power, the temperature, and the action taken in response to the alarm trigger event. In real-world systems, the majority of grid alarm trigger events are often short-circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To address these issues, data pre-processing becomes necessary. For remote-signal identification data, this includes parsing and hashing text codes, checking code components for redundancies, and building a reduced dictionary of alarm codes (a sketch of this workflow appears after this table). For textual description fields and post-fault field reports, natural language processing techniques that extract key information can provide more consistency across sensor data. Additionally, techniques like diverse sampling can account for the class imbalance in the fault types that trigger alarms.
U6: Usability > Large Volume
Operational alarm data volume is large, given that measurements are made in the system every millisecond. The result is high-volume data that is tabular in nature but also unstructured with respect to the text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains locations and grid asset information, spatio-temporal analyses can be performed for a single sensor and the conditions under which that sensor is operating. Indexing and mining the time series is therefore one approach to enabling faster search over alarm data leading up to a fault event. Natural language processing and text mining techniques can likewise facilitate search over alarm text and details.
R1: Reliability > Quality
Alarm trigger events and the corresponding actions taken require post-hoc assessment by field workers, especially in cases of faults or perceived faults, for verification.
U2: Usability > Aggregation
Reports on location, asset, and time can include false alarm triggers, requiring operators to send field workers to investigate, fix, and recalibrate field sensors. Data from these field assessments can be incorporated into the original data to provide greater context, resulting in multimodal datasets that enhance understanding of the alarm data.
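As an illustration of the pre-processing and indexing described under U5 and U6 above, here is a minimal Python sketch; the column names ("raw_code", "timestamp", "location", "priority") and the prefix-stripping rules are hypothetical stand-ins, not EPRI's actual schema.

```python
import re
import pandas as pd

# Load alarm trigger events; column names here are assumptions.
alarms = pd.read_csv("alarm_events.csv", parse_dates=["timestamp"])

def normalize_code(raw_code: str) -> str:
    """Collapse asset/line-specific variants of a code to one canonical label."""
    code = raw_code.upper().strip()
    code = re.sub(r"^(LN|XFMR|BRKR)\d+[-_]", "", code)  # drop hypothetical asset prefixes
    code = re.sub(r"\d+$", "", code)                    # drop trailing numbering
    return code

# Build a reduced dictionary: canonical code -> the raw variants it covers.
alarms["canonical_code"] = alarms["raw_code"].map(normalize_code)
code_dict = alarms.groupby("canonical_code")["raw_code"].agg(set)
print(f"{alarms['raw_code'].nunique()} raw codes -> {len(code_dict)} canonical codes")

# Time-index the events so the window preceding a fault can be searched quickly.
alarms = alarms.set_index("timestamp").sort_index()
fault_time = pd.Timestamp("2023-07-14 06:32:00")
window = alarms.loc[fault_time - pd.Timedelta("15min"): fault_time]
print(window[window["location"] == "SUBSTATION_A"][["canonical_code", "priority"]])
```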
Assessing forest restoration outcomes
Efforts are being made to restore ecosystems like forests and mangroves. ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.
There is a lack of a standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized structure for the data collected on biodiversity and other complex ecological outcomes, so that measurements can be compared across projects.
Turning raw data into usable insights is also a significant challenge: there is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized, clearly written protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes, including guidance on which variables to collect and how to collect them.
Assessment of climate impacts on public health
Climate change has major implications for public health. ML can help analyze the relationships between climate variables and health outcomes to assess how changes in climate conditions affect public health.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
In general, few datasets cover the full spectrum of population characteristics such as age, gender, and economic status. To make good use of available data, more effort should go into integrating data from disparate sources, for example by creating data repositories and an open community data standard.
U4: Usability > Documentation
Some data repositories are available, but the data is not always accompanied by the source code that created it or by other good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data usually comes as raster files or in gridded formats, whereas health data is usually tabular. Mapping climate data onto the same geospatial entities as the health data is also computationally expensive.
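As an illustration of this aggregation challenge, here is a minimal sketch of joining a gridded climate field to tabular health records. The file names, the variable name "t2m", and the health-table columns are assumptions; a production pipeline would typically use area-weighted zonal statistics over region polygons rather than a nearest-grid-cell lookup.

```python
import pandas as pd
import xarray as xr

climate = xr.open_dataset("temperature.nc")   # dims: time, lat, lon (assumed)
health = pd.read_csv("health_by_region.csv")  # columns: region, lat, lon, cases (assumed)

# Nearest-grid-cell lookup at each region centroid.
temps = climate["t2m"].sel(
    lat=xr.DataArray(health["lat"], dims="region"),
    lon=xr.DataArray(health["lon"], dims="region"),
    method="nearest",
).mean("time")

health["mean_temperature"] = temps.values
print(health.head())
```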
Processing climate data and integrating it with health data is a major challenge.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform covering all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
Automatic individual re-identification for wildlife
Identification of individuals in wildlife (e.g., individual animals) refers to the process of recognizing and confirming the identity of an animal during subsequent encounters. It is crucial for identifying and monitoring endangered species, to better understand their needs and threats and to aid conservation efforts. Computer-vision-based ML techniques are widely used for automatic individual identification.
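For intuition, here is a minimal sketch of one common embedding-based approach to re-identification: embed images with a pretrained backbone and match a new sighting to the most similar known individual. The file paths are illustrative, and in practice the backbone would be fine-tuned on the target species rather than used off the shelf.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained backbone; the final classification layer is replaced so the
# network outputs a feature embedding instead of ImageNet class scores.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(backbone(x), dim=1)

# Match a new sighting against a gallery of known individuals by cosine similarity.
gallery = {"individual_01": embed("known/ind01.jpg"),
           "individual_02": embed("known/ind02.jpg")}
query = embed("sightings/new.jpg")
best = max(gallery, key=lambda k: float(query @ gallery[k].T))
print("closest known individual:", best)
```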
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation of highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a problem intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
The same measures listed under S1 above apply here as well: fostering a culture of data sharing at the individual and cross-institutional levels, instituting effective incentives and data attribution practices, and establishing standardized pipelines, protocols, and computational infrastructures.
One data gap is the incompleteness of barcoding reference databases.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
eDNA is an emerging technique in biodiversity monitoring, and a number of issues still impede the application of eDNA-based tools. One gap is the incompleteness of barcoding reference databases. However, considerable attention and effort are being devoted to filling this gap, for example through the BIOSCAN project. Notably, BIOSCAN-5M is a comprehensive dataset containing multi-modal information, including DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.
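To make the classification task concrete, here is a hypothetical sketch of species classification from DNA barcodes using k-mer counts and logistic regression. The "barcodes.csv" layout (columns "sequence" and "species") is an assumed format for illustration, not BIOSCAN-5M's actual schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("barcodes.csv")  # assumed columns: sequence, species

# Treat each barcode sequence as a "document" of overlapping 6-mers.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(6, 6), lowercase=False)
X = vectorizer.fit_transform(df["sequence"])
X_train, X_test, y_train, y_test = train_test_split(X, df["species"], test_size=0.2)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```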
Bias-correction of climate projections
Climate projection provides essential information about future climate conditions, guiding efforts in mitigation and adaptation, such as disaster risk assessments and power grid optimization. ML enhances the accuracy of these projections by bias-correcting forecasts generated by physics-based climate models (e.g., CMIP6). ML achieves this by learning the relationship between historical climate simulations (e.g., CMIP6 data) and observed ground truth data (such as ERA5 or weather station observations).
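As a concrete illustration of this learning setup, here is a minimal sketch of empirical quantile mapping, one standard bias-correction technique; the arrays below are synthetic stand-ins for co-located model output and observations.

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_future):
    """Map future model values through historical model-to-observation quantiles."""
    quantiles = np.linspace(0.0, 1.0, 101)
    model_q = np.quantile(model_hist, quantiles)
    obs_q = np.quantile(obs_hist, quantiles)
    # Find each future value's quantile in the historical model distribution,
    # then read off the observed value at that same quantile.
    future_q = np.interp(model_future, model_q, quantiles)
    return np.interp(future_q, quantiles, obs_q)

rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 2.0, 5000)   # stand-in for observed precipitation
model = obs * 1.3 + 0.5           # stand-in model series with systematic bias
corrected = quantile_map(model, obs, model)
print("mean bias before:", model.mean() - obs.mean(),
      "after:", corrected.mean() - obs.mean())
```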
The large uncertainties in future climate projections are a major problem with CMIP6. The large volume of data and the lack of uniform structure, such as inconsistent variable names, data formats, and resolutions across different CMIP6 models, also make it challenging to use data from multiple models effectively.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The sheer volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes at a high cost. To address these challenges, it would be beneficial to make computational resources available alongside the stored data.
U1: Usability > Structure
Data from different models comes with different resolutions and variable names, which makes combining data from multiple models challenging (see the harmonization sketch after this table).
R1: Reliability > Quality
There are large biases and uncertainties in the data, which can be reduced by improving the climate models used to generate the simulations.
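As referenced under U1 above, here is a minimal sketch of harmonizing two CMIP6-style files with xarray: unify variable names, then interpolate onto a common grid. File and variable names are illustrative; real CMIP6 workflows often use dedicated regridding tools such as xESMF.

```python
import numpy as np
import xarray as xr

ds_a = xr.open_dataset("model_a.nc")  # precipitation stored as "pr" (assumed)
ds_b = xr.open_dataset("model_b.nc")  # precipitation stored as "precip" (assumed)

ds_b = ds_b.rename({"precip": "pr"})  # unify variable naming

# Interpolate both datasets onto a common 1-degree grid.
target = dict(lat=np.arange(-89.5, 90, 1.0), lon=np.arange(0.5, 360, 1.0))
ds_a, ds_b = ds_a.interp(**target), ds_b.interp(**target)

# Stack along a new "model" dimension and form a simple ensemble mean.
ensemble = xr.concat([ds_a["pr"], ds_b["pr"]], dim="model").mean("model")
print(ensemble.sizes)
```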
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape. (A minimal download sketch follows this table.)
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
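As referenced under O2 above, here is a minimal sketch of a scoped ERA5 request through the Copernicus Climate Data Store's cdsapi client (which requires a free CDS account and an API key in ~/.cdsapirc). Requesting narrow variable and time slices, rather than bulk years, tends to keep queue times down; the request fields shown are one plausible configuration.

```python
import cdsapi

client = cdsapi.Client()
client.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": ["2m_temperature"],
        "year": "2020",
        "month": "01",
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": ["00:00", "06:00", "12:00", "18:00"],
        "format": "netcdf",
    },
    "era5_t2m_202001.nc",  # local output file
)
```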
Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
U1: Usability > Structure
The data is not regularly gridded and requires preprocessing before being used in an ML model (see the gridding sketch after this table). In regions with dense station coverage, decisions about how to handle overlapping data can be somewhat arbitrary; machine learning can assist in optimizing this process.
S2: Sufficiency > Coverage
There is insufficient data, and in some cases no data at all, from the Global South.
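As referenced under U1 above, here is a minimal sketch of interpolating scattered station observations onto a regular grid with SciPy; the station locations and values are synthetic stand-ins, and the interpolation method is exactly the kind of somewhat arbitrary choice the text mentions.

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(1)
station_lon = rng.uniform(0, 10, 200)
station_lat = rng.uniform(40, 50, 200)
station_t2m = 15 + 0.5 * station_lat + rng.normal(0, 0.3, 200)  # synthetic temps

# Target regular 0.25-degree grid.
grid_lon, grid_lat = np.meshgrid(np.arange(0, 10, 0.25), np.arange(40, 50, 0.25))

gridded = griddata(
    points=np.column_stack([station_lon, station_lat]),
    values=station_t2m,
    xi=(grid_lon, grid_lat),
    method="linear",  # "nearest" or "cubic" are equally defensible choices
)
print(gridded.shape, np.nanmean(gridded))
```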
Bias-correction of weather forecasts
ML can be used to improve the fidelity of high-impact weather forecasts by post-processing outputs from physics-based numerical forecast models and by learning to correct the systematic biases associated with physics-based numerical forecasting models.
As with HRES, the biggest challenge with ENS is that only a portion of it is freely available to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
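For reference, WeatherBench 2 datasets are published as cloud-hosted Zarr stores that can be read lazily with xarray, so only the needed slice is downloaded. The bucket path and variable name below are placeholder assumptions; consult the WeatherBench 2 documentation for the current dataset URIs.

```python
import xarray as xr

# Placeholder path: substitute a real store URI from the WeatherBench 2 docs.
path = "gs://weatherbench2/datasets/<dataset>.zarr"
ds = xr.open_zarr(path, storage_options={"token": "anon"})

# Lazy selection: only the requested slice is fetched from the bucket.
t2m = ds["2m_temperature"].sel(time="2020-01-01")
print(t2m)
```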
Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
U1: Usability > Structure
The data is not regularly gridded and requires preprocessing before being used in an ML model. In regions with dense station coverage, decisions about how to handle overlapping data can be somewhat arbitrary; machine learning can assist in optimizing this process.
S2: Sufficiency > Coverage
There is insufficient data, and in some cases no data at all, from the Global South.
Data-driven generation of climate simulations
Generating climate simulations by running physics-based climate models is time consuming. ML can be used to more quickly generate climate simulations corresponding to different greenhouse gas emissions scenarios. Specifically, ML can be used to learn a surrogate model that approximates computationally-intensive climate simulations generated via Earth system models.
The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
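One way to set this up in practice is to pool simulations from several models into a single training set, so the emulator sees inter-model spread. The sketch below is illustrative only: the file layout, the "model" dimension, and the presence of a "time" dimension are all assumptions.

```python
import xarray as xr

models = ["model_a", "model_b", "model_c"]
runs = [xr.open_dataset(f"{m}/outputs.nc").expand_dims(model=[m]) for m in models]

# Stack along a new "model" dimension so inter-model spread is represented.
targets = xr.concat(runs, dim="model")

# Flatten (model, time) pairs into individual training samples for a surrogate.
samples = targets.stack(sample=("model", "time"))
print(samples.sizes)
```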
The large data volume and the lack of uniform structure (no consistent variable names, data structure, or resolution across models) make it difficult to use data from more than one CMIP6 model.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The sheer volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes at a high cost. To address these challenges, it would be beneficial to make computational resources available alongside the stored data.
U1: Usability > Structure
Data from different models comes with different resolutions and variable names, which makes combining data from multiple models challenging.
R1: Reliability > Quality
There are large biases and uncertainties in the data, which can be reduced by improving the climate models used to generate the simulations.
Detection of climate-induced ecosystem changes
Climate change is inducing significant changes in ecosystems. ML can be used to assess the impact of climate change on biodiversity and identify critical areas for conservation.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Funding presents a major bottleneck for ecosystem monitoring initiatives. While most funding allocations are short-term, there is a critical need for sustained and adequate funding to support ongoing monitoring efforts and maintain data processing capabilities.
Data access is restricted due to institutional barriers and other restrictions.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data capturing the spatial heterogeneity of climate, hydrology, soil, and related variables that shape biodiversity patterns. This stems from a lack of observation systems dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, an effort that no single country can undertake alone.
Development of hybrid-climate models
Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.
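As a toy illustration of this emulation setup, the sketch below fits a small neural network to map a column's state (e.g., temperature and humidity profiles) to a subgrid tendency. The data is synthetic; in a real hybrid-modeling workflow the inputs and targets would come from climate model output, as described for the datasets below.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_samples, n_levels = 20_000, 30

# Synthetic stand-ins: column state profiles and the subgrid tendencies a
# parameterization would produce for them.
state = rng.normal(size=(n_samples, n_levels))
tendency = np.tanh(state @ (0.1 * rng.normal(size=(n_levels, n_levels))))

emulator = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=200)
emulator.fit(state[:15_000], tendency[:15_000])
print("held-out R^2:", emulator.score(state[15_000:], tendency[15_000:]))
```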
An ML-ready benchmark dataset designed for hybrid ML-physics research is needed, e.g., for emulation of subgrid cloud and convection processes.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
Intercomparison of global storm-resolving (5 km or finer) model simulations, used as the target of the emulator.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
The resolution of current high-resolution simulations is still insufficient for resolving many physical processes, such as turbulence. To address this, extremely high-resolution simulations, like large-eddy simulations (with sub-kilometer or even tens of meter resolution), are needed. By explicitly resolving those turbulent processes, these simulations represent a more realistic realization of the atmosphere and therefore theoretically give better model results. These simulations may serve as ground truth for training machine learning models and offer a more accurate basis for understanding and predicting climate phenomena. Long-term climate simulations at this ultra-high resolution would significantly enhance both hybrid climate modeling and climate emulation, providing deeper insights into global warming scenarios.
Given the high computational cost of running such simulations, creating and sharing benchmark datasets based on these simulations is essential for the research community. This would facilitate model development and validation, promoting more accurate and efficient climate studies.
An enhanced version of ERA5 with higher resolution and fidelity is needed.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
ML models are currently trained on high-resolution climate model simulations, so their skill is limited by the performance of those climate models. True observations of the atmosphere at high resolution (< 5 km) are needed to train and validate the ML models and hence improve their performance.
Digital reconstruction of the environment
Modeling digital representations of environmental conditions and habitats using remote sensing data, such as satellite images, is crucial for understanding how environmental factors impact animal behavior and conservation efforts. This approach provides valuable insights into habitat conditions and changes, which are essential for effective wildlife conservation and management. ML can enhance this process by efficiently processing large volumes of data from various sources, leading to more detailed and accurate environmental reconstructions.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well-annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy compounds the issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation of highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a problem intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
The same measures listed under S1 above apply here as well: fostering a culture of data sharing at the individual and cross-institutional levels, instituting effective incentives and data attribution practices, and establishing standardized pipelines, protocols, and computational infrastructures.
One data gap is the incompleteness of barcoding reference databases.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
eDNA is an emerging technique in biodiversity monitoring, and a number of issues still impede the application of eDNA-based tools. One gap is the incompleteness of barcoding reference databases. However, considerable attention and effort are being devoted to filling this gap, for example through the BIOSCAN project. Notably, BIOSCAN-5M is a comprehensive dataset containing multi-modal information, including DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.
Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
The resolution of publicly open satellite images is not sufficient for some environmental reconstruction studies.
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough, and high-resolution images are usually commercial and not freely available.
Disaster risk assessment
As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. ML can be used within these efforts to analyze satellite imagery and geographic data, in order to pinpoint vulnerable areas and produce comprehensive risk maps.
More information, such as the age of each building, should be included in the dataset.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Building footprint datasets usually come in formats and coordinate systems other than those used by the government. To make these datasets usable for local government applications, it would help to align them with the government's preferred format and coordinate system (see the reprojection sketch after this table).
S6: Sufficiency > Missing Components
More information about each building, such as its age and the source of the data, should be included in the dataset.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent accurate identification of buildings.
Country-specific exposure data can range from extensive and detailed to almost completely unavailable, even if they exist as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
For some data, e.g. population data, several datasets are available and they differ substantially from one another. Validation is needed before the data can be used with confidence.
Some data, e.g. geospatial socioeconomic data provided by the UNEP Global Resource Information Database, are not always current or complete.
S3: Sufficiency > Granularity
For open global data, the resolution and completeness are usually not sufficient for desired purposes, e.g. GDP data from the World Bank or US CIA is not sufficiently detailed for assessing risks from natural hazards.
The financial loss data is usually proprietary and not open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Data tends to be proprietary, as the most consistent loss data is produced by the insurance industry.
O2: Obtainability > Accessibility
Even for a single event, collecting a robust set of homogeneous loss data poses a significant challenge.
U4: Usability > Documentation
With existing data, determining whether the data is complete can be a challenge as it is common that little or no metadata is associated with the loss data.
The resolution of current hazard data is not sufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Climate hazard data (e.g., floods, tropical cyclones, droughts) is often too coarse for effective physical risk assessments, which focus on evaluating damage to infrastructure such as buildings and power grids. While exposure data, including information on buildings and power grids, is available at resolutions ranging from 25 meters to 250 meters, climate hazard projections, especially those extending beyond a year, are typically at resolutions of 25 kilometers or more.
To provide meaningful risk assessments, more granular data is required. This necessitates downscaling efforts, both dynamical and statistical, to refine the resolution of climate hazard data. Machine learning (ML) can play a valuable role in these downscaling processes. Additionally, the downscaled data should be made publicly available, and a dedicated portal should be established to facilitate access and sharing of this refined information.
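To illustrate the statistical downscaling route in the simplest terms, the sketch below fits a regression model that maps coarse-grid precipitation plus static covariates (elevation, slope) to fine-grid values. All data is synthetic, and the covariates, variable names, and choice of a random forest are illustrative assumptions rather than a recommended pipeline.

```python
# Minimal sketch of statistical downscaling: learn a mapping from
# coarse-resolution climate fields (plus static covariates such as
# elevation) to fine-resolution hazard values. All data here is
# synthetic; in practice the targets would come from a high-resolution
# reanalysis or a dynamically downscaled training period.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000  # fine-grid cells with paired coarse/fine training data

coarse_precip = rng.gamma(2.0, 10.0, n)   # coarse-cell precipitation (mm)
elevation = rng.uniform(0, 2000, n)       # static covariate (m)
slope = rng.uniform(0, 30, n)             # static covariate (degrees)
# Synthetic "true" fine-scale precipitation: orographic enhancement + noise
fine_precip = coarse_precip * (1 + 0.0004 * elevation) + rng.normal(0, 2, n)

X = np.column_stack([coarse_precip, elevation, slope])
X_train, X_test, y_train, y_test = train_test_split(X, fine_precip, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out cells:", model.score(X_test, y_test))
```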
R1: Reliability > Quality
Projecting future climate hazards is crucial for assessing long-term risks. Climate simulations from CMIP models are currently our primary source of future climate projections. However, these simulations come with significant uncertainties, stemming both from the models themselves and from the emission scenarios. To improve their utility for disaster risk assessment and other applications, increased funding and effort are needed to advance climate model development for greater accuracy. Additionally, machine learning methods can help mitigate some of these uncertainties by bias-correcting the simulations.
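One widely used bias-correction technique of this kind is empirical quantile mapping, sketched below on synthetic data: simulated values are mapped through the historical model CDF onto the observed CDF, and the same mapping is applied to projections. The distributions here are invented for illustration.

```python
# Minimal sketch of quantile-mapping bias correction for climate model
# output. The empirical CDF of the simulation over a historical period
# is mapped onto the observed CDF; the same mapping is then applied to
# projections. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 8.0, 10000)                # observed historical values
model_hist = rng.gamma(2.0, 10.0, 10000) + 3    # biased model, same period
model_future = rng.gamma(2.2, 10.0, 10000) + 3  # biased model projection

def quantile_map(x, model_ref, obs_ref):
    """Map values x through model quantiles onto observed quantiles."""
    quantiles = np.interp(x, np.sort(model_ref),
                          np.linspace(0, 1, len(model_ref)))
    return np.quantile(obs_ref, quantiles)

corrected_future = quantile_map(model_future, model_hist, obs)
print("raw future mean:", model_future.mean())
print("bias-corrected future mean:", corrected_future.mean())
```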
S6: Sufficiency > Missing Components
Seasonal climate hazard forecasts are crucial for disaster risk assessment, management, and preparation. However, high-resolution data at this scale is often lacking for many hazards. This challenge is likely due to the difficulty in generating accurate seasonal weather forecasts. ML has the potential to address this gap by improving forecast accuracy and granularity.
The dataset lacks metadata on when infrastructure (e.g., a building) was built, even though this information is important for identifying building age, which in turn characterizes exposure to hazards.
Socioeconomic data presents challenges in availability, usability, and reliability. In general, there is a notable scarcity of data from the Global South, while more granular data is often missing for the Global North. Where data does exist, it often lacks consistency across sources.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is scattered and not easily findable in a single place.
U4: Usability > Documentation
The accompanying documentation lacks clarity or consistency over time.
U5: Usability > Pre-processing
There is a lack of standardized and coherent structures for data entry in socioeconomic datasets, resulting in users spending significant time on pre-processing tasks such as translating, standardizing, and harmonizing data. Integrating socioeconomic data with other types of data, such as environmental and health data for comprehensive analysis, can be even more challenging due to the differences in data formats and standards.
R1: Reliability > Quality
There is no single unified data source for many types of socioeconomic data, e.g. population and human settlement. Each organization and company produces its own dataset, and these differ substantially from one another. Data validation (e.g., contacting the government for more information) is needed before using the data, but such validation is hard to do at scale because of the time, financial, and technical effort required.
To improve the reliability and usability of data, data entry methods need to be standardized and automated, and a unified platform for data entry is also wished for.
S2: Sufficiency > Coverage
There is often limited socio-economic data availability in developing regions due to inadequate data collection infrastructure, financial constraints, and political instability.
S3: Sufficiency > Granularity
In developed regions (Global North), there is a lack of granular data for precise analysis. For example, neither the GDP data from the World Bank nor the US CIA is sufficiently detailed for assessing risks from natural hazards. There is a lack of asset-level data for studying the physical risk of infrastructure.
Very high-resolution reference data, for example digital elevation models (DEMs), are currently not freely open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Surface elevation data defined by a digital elevation model (DEM) is one of the most essential types of reference data. The high-resolution elevation data has huge value for disaster risk assessment, particularly for the Global South.
Open DEM data with global coverage currently reaches a resolution of 30 m, which is still insufficient for many disaster risk assessments. Higher-resolution datasets exist, but they either have limited spatial coverage or are commercial products that are very expensive to obtain.
Distribution-side hosting capacity estimation
Details (click to expand)
Historically, the power grid has been designed for unidirectional flow from carbon-based generating sources to consumers. However, in the effort to lower greenhouse gas emissions, the transition to and integration of renewable generation has become increasingly important in all aspects (e.g. transmission and distribution) of the grid, from large-scale generation farms to consumer-level rooftop solar and community wind turbine installations. The transition necessitates a restructuring of the grid from a unidirectional to a bidirectional energy network, thereby stressing pre-existing systems, especially at the low-voltage distribution level. Due to its intermittent behavior, renewable integration at the low-voltage consumer level depends on the hosting capacity of the nearest substation feeder circuit. The hosting capacity determines the amount of generation from distributed energy resources (DERs) that a circuit can safely accommodate without setting off safety equipment. Safety equipment can trip when generation exceeds consumption, leading to overvoltage conditions, or when sudden peaks in demand cause voltage sags; faults may also lead to voltage sags. Operationally, distribution-level substation feeders must surmount these conditions to ensure power quality. Traditional methods of assessing the hosting capacity of low-voltage distribution networks involve power flow analysis simulations, which can be computationally expensive and difficult to perform under real-time operating conditions for large distribution circuits. For example, to analyze a particular feeder circuit, scenarios must be built by varying loads, DER generation, environmental conditions, power equipment availability, and human activity. Violations must then be identified with respect to voltage limits, thermal loads, and protection equipment to estimate hosting capacity. Machine learning models can serve as surrogates for traditional models by capturing the spatio-temporal patterns of multiple streams of data for each node in the distribution network, enabling real-time estimation. Additionally, reinforcement learning can enable accelerated scenario building and online control strategy evaluation. One such strategy, for example, may utilize inverter technology to modulate generation to match the larger power system's needs and protect it from faults and overloads.
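As a rough illustration of the surrogate idea, the sketch below trains a regressor on a synthetic table of feeder scenarios of the kind a power-flow simulator might label offline, then queries it for a new operating point. The features, their ranges, and the label formula are hypothetical stand-ins, not a real feeder model.

```python
# Minimal sketch of an ML surrogate for hosting capacity estimation:
# a regressor is trained on labeled scenarios produced offline by a
# power-flow simulator (e.g., varying load and DER generation), then
# queried in near-real time. The scenario table here is synthetic and
# the feature names are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 2000  # simulated feeder scenarios

load_kw = rng.uniform(50, 500, n)       # aggregate feeder load
pv_kw = rng.uniform(0, 400, n)          # installed DER capacity
feeder_len_km = rng.uniform(1, 20, n)   # rough proxy for impedance
# Synthetic label: capacity shrinks with feeder length, grows with load
hosting_capacity_kw = (300 + 0.5 * load_kw - 8 * feeder_len_km
                       - 0.2 * pv_kw + rng.normal(0, 10, n))

X = np.column_stack([load_kw, pv_kw, feeder_len_km])
surrogate = GradientBoostingRegressor().fit(X, hosting_capacity_kw)

# Fast query for a new operating point (load, DER capacity, length)
print(surrogate.predict([[250.0, 120.0, 7.5]]))
```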
While OpenDSS and GridLab-D are free to use as an alternative when real distribution substation circuit feeder data is unavailable, to perform site-specific or scenario studies, data from different sources may be needed to verify simulation results. Actual hosting capacity may vary from simulation results due to differences in load, environmental conditions, and the level of DER penetration.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
To perform a realistic distribution system-level study for a particular region of interest, data concerning topology, loads, and penetration of DERs needs to be aggregated and collated from external sources.
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities and/or the Distribution System Operator (DSO) to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
OpenDSS and GridLab-D studies require real deployment data for verification of results from substations. Additionally, distribution level substation feeder hosting capacity may vary based on load, environmental conditions, and the level of DER penetration in a service area.
Early detection of fire
Details (click to expand)
Climate change is expected to increase both the frequency and intensity of wildfires, as well as lengthen the fire season due to rising temperatures and shifting precipitation patterns. ML can play a crucial role in wildfire detection and monitoring by synthesizing data from various sources in order to provide more timely and precise information. For instance, ML algorithms can analyze satellite imagery from different regions to detect early signs of fires and track their progression. Additionally, ML can enhance automatic fire detection systems, improving their accuracy and responsiveness.
Thermal images captured by drones are highly valuable, but high-resolution sensors are costly.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Thermal images are highly valuable, but their resolution is often too low (commonly 120x90 pixels) and their field of view is limited. Commercially available sensors can achieve 640x480 pixels, but they are much more expensive (~$10K). Even higher-resolution sensors exist but are currently restricted to military use due to security, ethical, and privacy concerns. Those seeking such high-resolution sensors should carefully weigh the benefits and drawbacks of their request.
U6: Usability > Large Volume
Data volume is a concern for those collecting drone images and seeking to share them with the public. Finding a platform that offers adequate storage for hosting the data is challenging, as it must ensure that users can download the data efficiently without issues.
Earth observation for climate-related applications
Details (click to expand)
Many climate-related applications suffer from a lack of real-time and/or on-the-ground data. ML can be used to analyze satellite imagery at scale in order to fill some of these gaps, via applications such as land cover classification, footprint detection for buildings, solar panel detection, deforestation detection, and emissions monitoring.
Satellite images are intensively used for Earth system monitoring. One of the two biggest challenges of using satellite images is the sheer volume of data which makes downloading, transferring, and processing data all difficult. The other one is the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Publicly available datasets often lack sufficient granularity. This is particularly challenging for the Global South, which typically lacks the funding for high-resolution commercial satellite imagery.
U6: Usability > Large Volume
The sheer volume of data now poses one of the biggest challenges for satellite imagery. When data reaches the terabyte scale, downloading, transferring, and hosting become extremely difficult. Those who create these datasets often lack the storage capacity to share the data. This challenge can potentially be addressed by one or more of the following strategies:
Data compression: Compress the data while retaining lower-dimensional information (see the sketch after this list).
Lightweight models: Build models with fewer features selected through feature extraction.
Large foundation models for remote sensing data: Purposefully construct large models (e.g., foundation models) that can handle vast amounts of data. This requires changes to the research architecture, such as modifications to preprocessing pipelines.
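As a concrete instance of the data-compression strategy in the list above, the sketch below projects multispectral pixels onto a few principal components and stores those instead of all bands. The synthetic image and band count are assumptions for illustration.

```python
# Minimal sketch of the "data compression" strategy: project
# multispectral pixels onto a few principal components, retaining most
# of the variance at a fraction of the storage. The image is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
h, w, bands = 256, 256, 12               # e.g., a Sentinel-2-like stack
image = rng.normal(size=(h, w, bands)).astype(np.float32)

pixels = image.reshape(-1, bands)        # (n_pixels, n_bands)
pca = PCA(n_components=3).fit(pixels)
compressed = pca.transform(pixels)       # store this instead of all bands

print("explained variance:", pca.explained_variance_ratio_.sum())
print("compression ratio:", bands / compressed.shape[1])
```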
O2: Obtainability > Accessibility
Very high-resolution satellite images (e.g., finer than 10 meters) typically come from commercial satellites and are not publicly available. One exception is the NICFI dataset, which offers high-resolution, analysis-ready mosaics of the world’s tropics.
U5: Usability > Pre-processing
Satellite images often contain a lot of redundant information, such as large amounts of data over the ocean that do not always contain useful information. It is usually necessary to filter out some of this data during model training.
U2: Usability > Aggregation
Due to differences in orbits, instruments, and sensors, imagery from different satellites can vary in projection, temporal and spatial zones, and cloud blockage, each with its own pros and cons. To overcome data gaps (e.g. cloud blocking) or errors, multiple satellite images are often assimilated. Harmonizing these differences is challenging, and sometimes arbitrary decisions must be made.
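A first step in such harmonization is resampling scenes onto a common grid. The sketch below, assuming rasterio is available and using synthetic arrays in place of real scenes, reprojects a 10 m array onto a 30 m target grid; the coordinate system, transforms, and array contents are illustrative.

```python
# Minimal sketch of harmonizing imagery from two satellites: resample
# one array onto the grid (CRS and transform) of the other so the
# scenes can be stacked. Arrays here are synthetic; real scenes would
# be read with rasterio.open().
import numpy as np
from rasterio.transform import from_origin
from rasterio.warp import reproject, Resampling

# Scene A: 10 m grid in UTM zone 32N; target grid: 30 m, same extent
src = np.random.rand(300, 300).astype(np.float32)
src_transform = from_origin(500000, 5200000, 10, 10)
dst = np.zeros((100, 100), dtype=np.float32)
dst_transform = from_origin(500000, 5200000, 30, 30)

reproject(
    source=src, destination=dst,
    src_transform=src_transform, src_crs="EPSG:32632",
    dst_transform=dst_transform, dst_crs="EPSG:32632",
    resampling=Resampling.bilinear,  # one of several defensible choices
)
print(dst.shape, float(dst.mean()))
```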
U5: Usability > Pre-processing
The lack of annotated data presents another major challenge for satellite imagery. It is suggested that collaboration and coordination at the sector level should be organized to facilitate annotation efforts across multiple sectors and use cases. Additionally, the granularity of annotations needs to be increased. For example, specifying crop types instead of just “crops” and detailing flood damage levels rather than general “damaged” are necessary for more precise analysis.
M: Misc/Other
Cloud cover presents a major technical challenge for satellite imagery, significantly reducing its usability. To obtain information beneath the clouds, pixels from clear-sky images captured by other satellites are often used. However, this method can introduce noise and errors.
M: Misc/Other
There is also a lack of technical capacity in the Global South to effectively utilize satellite imagery.
Energy data fusion for policy and market analysis in energy systems
Details (click to expand)
Data collected from public utilities, energy companies, and government agencies by energy regulatory committees can provide detailed information with respect to generation, fuel consumption, emissions, and financial reports that better inform domestic policies to enforce and promote reduction of gas emissions through carbon pricing and renewable incentives, grid modernization and resilience planning for severe weather events, and equitable energy transitions. Continuously updated, well-curated, analysis-ready energy system data would give climate advocates better quantitative tools to influence political and administrative processes, thereby encouraging the energy transition.
Public datasets from government agencies such as the EIA, EPA, FERC, and PHMSA are not ready for use in analysis-ready data products. Data is often tabular, distributed as zip files in different file formats that may not share common identifiers or schemas that would allow datasets to be readily joined. Collating, collecting, and merging these datasets can often provide greater context on the state of the energy system and the effectiveness of policy measures. Data can also be missing due to reporting gaps and redacted per-plant pricing information. While PUDL seeks to overcome these gaps by merging datasets based on entity matching and interpolation, challenges remain in terms of maintenance, as usability can be sensitive to changes in source data formats, updates, and new initiatives. The data gaps experienced in the maintenance of this dataset are highlighted below with respect to the source data that PUDL mines.
Data Gap Type
Data Gap Details
U1: Usability > Structure
The structure of PUDL is maintained; however, the structure of the source material from FERC, EIA, EPA, and other data providers can vary significantly between reporting years, new initiatives, and individual reporting details. For example, individual power plant identification numbers and associated operational data may differ across sources despite referencing the same plant.
Additionally, data versioning in spreadsheet-format files is non-existent, making it difficult to track updates and content changes made to an individual file provided by the regulatory agencies between website updates.
Common standards and formatting across agency datasets and reporting with documentation can provide the open source community with improved direction and responsiveness to changes between years and forms.
U2: Usability > Aggregation
Aggregating data across publicly available agency materials is challenging, as schemas, naming conventions for similar information, and resolution can vary between sources. Probabilistic named entity recognition and interpolation techniques are utilized to join datasets where feasible. However, when aggregating public data with private data obtained through paid access, the private data may not only require the use of private APIs but may also have a significantly different schema or format. Utilizing relational database formatting standards and best practices as a universal standard across data providers can make joining data along common aliases, IDs, and correlations easier.
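To give a flavor of entity matching across agency datasets, the sketch below links plant names that differ in formatting using a simple string-similarity score with a manual-review band. It uses only the standard library and is a toy stand-in, not PUDL's actual matching pipeline; the names and thresholds are invented.

```python
# Minimal sketch of fuzzy entity matching across agency datasets:
# plant names that differ in spelling or formatting are linked by a
# string-similarity score with a manual-review band.
from difflib import SequenceMatcher

ferc_plants = ["Barry Steam Plant", "Big Bend Station", "Comanche (CO)"]
eia_plants = ["Barry", "Big Bend", "Comanche"]

def similarity(a: str, b: str) -> float:
    """Longest-common-substring length as a fraction of the shorter name."""
    a, b = a.lower(), b.lower()
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return m.size / min(len(a), len(b))

for f in ferc_plants:
    best = max(eia_plants, key=lambda e: similarity(f, e))
    score = similarity(f, best)
    status = "match" if score > 0.8 else "review" if score > 0.5 else "no match"
    print(f"{f!r:25} -> {best!r:12} score={score:.2f} [{status}]")
```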
U3: Usability > Usage Rights
PUDL uses the Creative Commons Attribution License v4.0.
While PUDL sources data from public government regulatory agencies, data provided by private utility system operators may have quasi-public licensing restrictions. Although the utility data is provided to regulatory agencies, its use by the wider public could be legally disputed. To overcome this, usage rights should be explicitly stated by the provider and/or the government agency publishing the data.
U4: Usability > Documentation
PUDL maintains and updates documentation as more datasets are incorporated into its database. Catalyst Cooperative, the group responsible for the continuous development of the project, must monitor source datasets for changes in format and requirements between years.
U5: Usability > Pre-processing
PUDL seeks to create a parsing pipeline that allows concatenation and collation of information with respect to data requested by regulatory agencies. However, this parsing pipeline can face challenges as schemas and formats change over time. For example, FERC data has changed formats from PDF to XBRL. Additionally, data sources can be semi-structured in nature, requiring significant pre-processing and expertise in a variety of data formats, including formats custom to the reporting agency.
Energy data in the form of PDFs can bottleneck data pipelines that rely on optical character recognition (OCR), since scan quality affects data extraction, especially when the data does not follow a unified reporting format.
While PUDL relies on natural language processing techniques such as probabilistic entity matching to mitigate these challenges, oftentimes these data gaps require human verification and extensive manual pre-processing. This can introduce technical debt while working with a solution provider to understand and better digest data contents.
S2: Sufficiency > Coverage
PUDL coverage is limited to regulatory agencies and third-party organizations within the United States.
S3: Sufficiency > Granularity
Granularity of data is constrained to an annual, monthly, or quarterly basis depending on the source of data compiled by PUDL. PUDL utilizes interpolation techniques to unify resolution; however, this requires continuous maintenance and examination of historical data.
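For illustration, the sketch below upsamples an annual series to monthly resolution by time-based linear interpolation so it could be joined with a monthly table. The values and the series name are invented, and this is only one of many possible interpolation choices, not PUDL's actual method.

```python
# Minimal sketch of unifying reporting resolution: upsample an annual
# series to monthly points by linear interpolation in time so it can
# be joined with a monthly table. Values are illustrative.
import pandas as pd

annual = pd.Series(
    [1200.0, 1320.0, 1280.0],
    index=pd.to_datetime(["2020-12-31", "2021-12-31", "2022-12-31"]),
    name="net_generation_mwh",  # hypothetical column name
)

# Build the monthly target index, insert it alongside the annual points,
# interpolate by elapsed time, then keep only the monthly points
monthly_idx = pd.date_range("2021-01-01", "2022-12-01", freq="MS")
combined = annual.reindex(annual.index.union(monthly_idx))
monthly = combined.interpolate(method="time").reindex(monthly_idx)
print(monthly.head())
```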
S6: Sufficiency > Missing Components
PUDL would like to incorporate open weather model data into the database to facilitate research into load demand and renewable generation. This would require down-sampling of weather model outputs in order to match pre-existing database resolution. Additionally, data with respect to transmission and congestion from grid operators would provide greater system context to the data provided by public agencies.
Energy-efficient new building design
Details (click to expand)
The built environment contributes significantly to global carbon dioxide emissions both through the embodied carbon associated with building materials and through operational emissions associated with thermal comfort, ventilation, and lighting. Detailed analysis is often applied too late into the building design process, thereby leaving out significant energy-saving potential. The integration of building performance simulation (BPS) in the initial phase can be critical to sustainable and energy efficient design thereby influencing subsequent construction as well as overall building lifecycle. However, traditional BPS relies on complex physics models with respect to fluid dynamics, thermodynamics, sunlight, and acoustics, increasing computational complexity and processing time associated with the evaluation of a candidate design. Machine learning models can significantly enhance evaluation by emulating BPS based on synthetic and real-world data enabling rapid prototyping and optimization of building topology along multiple comfort, consumption, and environmental objectives. Machine learning can also be introduced at the prototyping phase in response to evaluation, with generative and genetic algorithms based refinement of layouts.
Featured datasets vary in the types of data gaps they exhibit depending on content, coverage area, location, building type, building spatial plan, quantity measured, ambient environment, and the availability of power consumption or metered data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
U3: Usability > Usage Rights
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected. To overcome this gap, interpolation techniques may be employed and recorded.
S6: Sufficiency > Missing Components
Building data typically does not include grid interactive data, or signals from the utility side with respect to control or demand side management. Such data can be difficult to obtain or require special permissions. By enabling the collection of utility side signals, utility-initiated auto-demand response (auto-DR) and load shifting could be better assessed.
S2: Sufficiency > Coverage
Featured datasets are from test-beds, buildings, and contributing households from the United States. Similar data from other regions would require data collection as household usage behavior may differ depending on culture, location, building age, and weather.
Despite their usefulness in ventilation studies for new construction, CFD simulations are computationally expensive, making them difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties that may not be available for traditional urban building types. Model outputs require the integration of domain knowledge to interpret, and the large volumes of synthetic data generated for different wind directions become challenging to manage. Future data collection for verifying simulation output can benefit surrogate or proxy approaches to the computationally expensive Navier-Stokes equations. Additionally, coverage is often restricted to modern building approaches, leaving passive building techniques known as vernacular architecture, developed by indigenous communities, out of design consideration.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The computational overhead and runtime of CFD simulations can become prohibitive in the early stages of building design, thereby limiting their use as a tool. To overcome this, surrogate models such as GANs or physics-constrained deep neural network architectures have shown promising results, though further research on turbulence representation is needed.
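One way the physics-constrained idea can be realized, sketched below with toy tensors: augment the ordinary data-fitting loss with a penalty on the divergence of the predicted velocity field, encoding incompressibility. The network, grid, data, and loss weighting are illustrative assumptions, not a validated CFD surrogate.

```python
# Minimal sketch of a physics-constrained surrogate loss: alongside a
# data term, penalize violation of incompressibility (div u = 0),
# computed by finite differences on the predicted velocity field.
import torch

def divergence(u, v, h=1.0):
    """Central-difference divergence of a 2D field on a uniform grid."""
    du_dx = (u[:, 2:, 1:-1] - u[:, :-2, 1:-1]) / (2 * h)
    dv_dy = (v[:, 1:-1, 2:] - v[:, 1:-1, :-2]) / (2 * h)
    return du_dx + dv_dy

net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 2, 3, padding=1),  # outputs (u, v) channels
)
geometry = torch.rand(8, 1, 32, 32)   # toy boundary/geometry input
target = torch.rand(8, 2, 32, 32)     # toy "simulation" velocities

pred = net(geometry)
u, v = pred[:, 0], pred[:, 1]
data_loss = torch.mean((pred - target) ** 2)
physics_loss = torch.mean(divergence(u, v) ** 2)
loss = data_loss + 0.1 * physics_loss   # weighted physics penalty
loss.backward()
print(float(data_loss), float(physics_loss))
```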
U2: Usability > Aggregation
While the simulation framework is not difficult to acquire, aggregating and collating the input information regarding boundary conditions requires predefined material properties to model heat transfer. Typically, traditional urban building materials are represented; however, for non-traditional materials, additional data collection and simulation adaptation may be necessary.
U6: Usability > Large Volume
CFD simulations can generate large amounts of data with respect to flow rates, surface temperatures, and turbulence which can be difficult to interpret manually by domain experts. By setting clear objectives at the planning stage to recognize specific flow or thermal phenomena in the synthetic data, model outputs and their associated visualizations can be better interpreted.
R1: Reliability > Quality
As with all simulations, verification with real-world data after the building has been completed can allow architects to build a dataset that could be used to improve surrogate models as well as validate the benefit of incorporating CFD in building planning and early prototype design.
S2: Sufficiency > Coverage
CFD simulations assume static rather than dynamic conditions for doors, windows, and vents in a building layout. In reality, these objects can significantly impact room ventilation as well as thermal conditions which can in turn affect operational consumption assumptions. Furthermore, vernacular architectural techniques such as use of vegetation to protect against winds, thermal chimneys, courtyards, stilted building design, and rooftop wind catchers (badgirs), are not considered in simulation frameworks which tend to focus on walls and external wind directions. Creation or expansion of current simulation frameworks to include scenarios created by passive building strategies can be beneficial in designing buildings that have less dependence on HVAC systems for thermal comfort.
Daylight performance metrics (DPMs) have been developed by building researchers and architects based on daylight access simulation output to quantify the illumination of indoor spaces by natural light. While DPM evaluation is an important step in the planning of commercial buildings, residential buildings do not receive similar focus, which is unusual given that most new building construction occurs within the residential sector. Data gaps are described here in the context of residential DPMs, which lack metrics associated with direct sunlight access, rely on annual averages across seasons, and utilize fixed occupancy schedules that are overly simplified for residential spaces. Additionally, illuminance metrics and thresholds utilized in commercial spaces do not translate well to residential spaces, where people may prefer higher or lower illuminances depending on their location and lifestyles. Lastly, DPM optimization is based on operational metrics and assumptions about illumination in a space and its effects on the resulting thermal comfort and operational consumption of traditional urban residential spaces; vernacular architecture, which is specific to a local region and culture, may not share these objectives, preferring more indoor-outdoor transitional spaces, earthen materials, and less focus on windows and incident natural sunlight.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Simulation software is generally available for purchase, with cost depending on the package selected, the intended use, and the number of features requested.
S2: Sufficiency > Coverage
Vernacular architecture, characterized by traditional building styles and techniques specific to a local region or culture, is not covered in simulation tools. In fact, most simulation output focuses on residential areas in primarily urban regions to minimize future operational costs, with assumptions based on desired illuminance thresholds that may not be universal. By including the ability to evaluate passive design strategies adapted to a specific climate and expanding the materials representation to include high thermal inertia walls and roofs such as those of earthen or thatched construction, additional thermal comfort studies can be performed for a given incident illuminance. Cultural considerations of outdoor spaces in relation to indoor spaces can provide even greater context for simulation studies and their usefulness in new construction for diverse regions.
S3: Sufficiency > Granularity
Simulations use fixed occupancy schedules, which work well in the context of commercial buildings but are overly prescriptive in the context of residential buildings, where occupancy may vary depending on the number of occupants, time of day, day of week, and season. Residential buildings are multipurpose and can be characterized by a member spending more time in some areas than others depending on activity. This gap can be alleviated by adapting and expanding simulation inputs to take diverse occupancy scenarios into consideration (see the sketch at the end of this section).
Current DPMs rely on annual averages rather than granular information on seasonal variations in daylight availability. While some advances have been made to incorporate this information through tools like Daysim, which defines new DPMs for residential buildings, further work is needed for regions where occupants may want to minimize direct light access and focus more on diffuse lighting. Expanding studies for clients in warmer, more arid climates may yield different thresholds and comfort parameters depending on preferences and lifestyle, and may even take into account daylight oversupply, glare, and thermal discomfort.
Materials used in the construction process of the building may change after initial simulation development depending on availability. Finalized building materials and interior absorption and reflectance may diverge from those simulated. Use of dynamic shading devices could also decrease indoor temperatures caused by incident irradiance. Simulated results could therefore be provided over a range.
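Returning to the occupancy-schedule gap noted above, the sketch below generates stochastic hourly occupancy scenarios per household instead of one fixed schedule. The presence probabilities and profile shape are purely illustrative assumptions.

```python
# Minimal sketch of generating diverse residential occupancy scenarios
# instead of a fixed schedule: each simulated household gets its own
# hourly presence probabilities, sampled independently per day.
import numpy as np

rng = np.random.default_rng(4)

def household_profile(workday_prob=0.9):
    """Hourly at-home probability for one household (illustrative)."""
    p = np.full(24, 0.95)            # night: almost always home
    away = rng.uniform(0.1, 0.4)     # daytime presence varies by household
    p[9:17] = away if rng.random() < workday_prob else 0.8
    return p

# Sample 365 days of hourly boolean occupancy for 100 households
schedules = np.array([
    rng.random((365, 24)) < household_profile()
    for _ in range(100)
])
print(schedules.shape, "mean occupancy:", schedules.mean().round(2))
```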
Estimation of forest carbon stock
Details (click to expand)
Forests are one of the Earth’s major carbon sinks, absorbing carbon dioxide (CO₂) from the atmosphere through photosynthesis and storing it in biomass (trees and vegetation) and soil. Accurate estimates of carbon stock help quantify the amount of CO₂ forests are sequestering, which is essential for climate change mitigation efforts. ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery. This approach can significantly improve upon traditional, labor-intensive forest inventory surveys, making carbon stock assessments more efficient and scalable.
GEDI is globally available but has some intricacies, e.g. geolocation errors, and weak return signal if the forest is dense, which bring uncertainties and errors into the estimate of canopy height.
The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
There is a lot of missing data and duplicates.
S2: Sufficiency > Coverage
Since data is collected manually, it is hard to scale and limited to certain regions only.
Estimation of methane emissions from rice paddies
Details (click to expand)
Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially how they vary with different agricultural practices, is crucial for addressing climate change. ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.
There is a lack of direct observation of methane emissions from rice paddies.
Data Gap Type
Data Gap Details
W: Wish
Direct measurement of methane emissions is often expensive and labor-intensive. But this data is essential as it provides the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Extreme heat prediction
Details (click to expand)
Extreme heat is becoming more common in a changing climate, but predicting and accurately modeling extreme heat is difficult. ML can help by improving extreme heat prediction.
The major challenge involves managing the size of the data. While cloud platforms offer convenience, they come with costs. Additionally, handling large datasets requires specific techniques, such as distributed computing and occasionally large-memory computing nodes (for certain statistics).
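As a minimal sketch of the distributed-computing approach, the example below uses dask to compute a statistic over a large array chunk by chunk rather than loading it into memory at once. The array is synthetic; a real workflow would typically open a chunked store such as Zarr or NetCDF.

```python
# Minimal sketch of chunked, out-of-core computation on a large climate
# dataset with dask: the array is evaluated lazily, chunk by chunk.
import dask.array as da

# ~0.8 GB of float64 if materialized; dask keeps it lazy and chunked
temps = da.random.random((4000, 5000, 5), chunks=(500, 5000, 5))

# Example statistic: mean over the time axis, computed in parallel
climatology = temps.mean(axis=0)
result = climatology.compute()   # triggers the chunked execution
print(result.shape)
```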
Fault detection in low voltage distribution grids
Details (click to expand)
The low voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources (DERs) and dynamic loads (such as electric vehicles), low voltage distribution systems are susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them. Traditional fault detection and localization utilize impedance-based or traveling-wave methods. Both methods assess deviations between two points with respect to line-specific thresholds and work well in cases where faults tend to have low fault resistance values and networks are limited in the number of branches. As low voltage distribution network topologies grow increasingly complex, line parameters can vary, making it increasingly difficult for traditional methods to accurately diagnose and isolate faults. Machine learning methods can overcome these limitations as they can be trained on large amounts of data, extract relevant features, and recognize patterns to automate fault diagnoses agnostic to specific line thresholds and topologies. If integrated into advanced monitoring systems, detecting and localizing faults can accelerate adaptive protection and network reconfiguration efforts to ensure reliability and stability.
For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners or Distribution System Operators (DSOs) are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Typically, the distribution circuit model lacks annotation of phase identification and impedance values, often providing rough approximations, which can ultimately influence the accuracy of localization as well as the time series contextualization of a fault. Decreased localization accuracy can then affect downstream control mechanisms intended to ensure operational reliability. For µPMU data to be utilized for fault localization, the distribution circuit model must be provided by the partnering utility or DSO.
U5: Usability > Pre-processing
µPMU data is sensitive to noise especially from geomagnetic storms which can induce electric currents in the atmosphere and impact measurement accuracy. Data can also be compromised by errors introduced by current and potential transformers. One way to mitigate this error is to monitor and re-calibrate transformers or deploy redundant µPMUs to verify measurements.
Depending on whether additional data from other sensors or field reports is being used to classify µPMU time series data, creation of a joint sensor dataset may improve quality based on the overall sampling rate and format of the additional non-µPMU data.
U6: Usability > Large Volume
Due to the high sampling rates, data volume from each individual µPMU can be challenging to manage and analyze due to its continuous nature. Coupled with the number of µPMUs required to monitor a portion of the distribution network, the amount of data can easily exceed terabytes. Automation of indexing and mining time series by transient characteristics can facilitate domain specialist verification efforts.
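One possible form of such automation, sketched below on a synthetic waveform: flag candidate transients by detecting windows whose rolling standard deviation (after differencing away the steady waveform) rises well above a baseline. The sampling parameters, injected event, and threshold are all illustrative.

```python
# Minimal sketch of automated transient indexing for high-volume sensor
# streams: difference out the steady waveform, then flag windows whose
# rolling standard deviation exceeds a multiple of the baseline, so only
# candidate events are surfaced for specialist review.
import numpy as np

rng = np.random.default_rng(5)
fs = 512  # samples per waveform period (illustrative)
signal = np.sin(np.linspace(0, 400 * np.pi, 200 * fs))  # steady waveform
signal += rng.normal(0, 0.01, signal.size)              # sensor noise
signal[51200:51260] += rng.normal(0, 1.0, 60)           # injected transient

win = 64
detrended = np.diff(signal)   # remove the slow periodic component
sliding = np.lib.stride_tricks.sliding_window_view(detrended, win)
rolling_std = sliding.std(axis=1)

threshold = 5 * np.median(rolling_std)   # baseline-relative threshold
event_idx = np.flatnonzero(rolling_std > threshold)
print("candidate transient windows near samples:",
      event_idx.min(), "-", event_idx.max())
```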
R1: Reliability > Quality
Since µPMU data is continuously captured, time series data leading up to or even identifying a fault or potential fault requires verification from other data sources.
Digital Fault Recorders (DFRs) capture high resolution event driven data such as disturbances due to faults, switching and transients. They are able to detect rapid events like lightning strikes and breaker trips while also recording the current and voltage magnitude with respect to time. Additionally, system dynamics over a longer period following a disturbance can also be captured. When used in conjunction with µPMU data, DFR data can assist in verifying significant transients found in the µPMU data which can facilitate improved analysis of both signals leading up to and after an event from the perspective of distribution-side state.
S2: Sufficiency > Coverage
Currently µPMU installation to existing distribution grids have significant financial costs so most deployments have been in the form of pilot projects with utilities. Pilot studies include the Flexgrid testing facility at Lawrence Berkeley National Laboratory (LBNL), Philadelphia Navy Yard microgrid (2016-2017), the micro-synchrophasors for distribution systems plus-up project (2016-2018), resilient electricity grids in the Philippines (2016), the GMLC 1.2.5- sensing and measurement strategy (2016), the bi-national laboratory for the intelligent management of energy sustainability and technology education in Mexico City (2017-2018) based on North American Synchrophasor Initiative (NASPI) reports.
Coverage is also limited by acceptance of this technology due to a pre-existing reliance on SCADA systems, which measure grid conditions on a 15-minute cadence. As transients become more common, especially on the low voltage distribution grid, a transition to higher-resolution monitoring will become necessary. Multi-objective evaluation of the value proposition of further µPMU sensor monitoring networks can provide utilities and DSOs with a framework for assessing the economic, environmental, and operational benefits of pursuing larger-scale studies.
S4: Sufficiency > Timeliness
µPMU data can suffer from multiple latencies within the monitoring system of the grid that are unable to keep up with the high sampling rate of the continuous measurements that µPMUs generate. Latencies occur in the context of the system communications surrounding signals as they are being recorded, processed, sent, and received. This can be due to the communication medium used, cable distance, amount of processing, and computational delay. More specifically, the named latencies are measurement, transmission, channel, receiver, and algorithm related. Identification of characteristics preceding fault events with lead times to overcome potential latencies through machine learning or other techniques can be of benefit.
The large-scale integration of renewable resources into the pre-existing power grid is key to the transition away from carbon-based energy. While solar and wind technology has made significant improvements in efficiency and affordability, the transmission of renewable-generated power over long distances remains a major obstacle to large-scale integration efforts. This is especially true for the transport of energy from remote renewable generation facilities (e.g. wind farms, hydroelectric dams, solar farms) to populated areas. High voltage transmission lines are the most effective at transporting energy with minimal resistive losses. However, vegetation encroachment near these lines can lead to outages and pose major fire risks, compromising the safety and reliability of the grid. Furthermore, ignition due to line contact in areas such as forests, if not contained, can result in dangerous wildfires that release carbon stored in trees, endanger wildlife, and cause further grid infrastructure damage. Manual inspection of transmission lines is key to risk prevention, but requires field workers with specialized knowledge to evaluate lines and towers over large distances in remote and potentially dangerous locations. To reduce risks, inspections can be assisted with the use of remote sensing imagery as well as historic management records. Machine learning, especially computer vision, can accelerate vegetation management by identifying areas of overgrowth in close proximity to towers, lines, and other grid assets. Data-driven models can also track the dynamic seasonal growth of vegetation near infrastructure, forecasting pruning schedules to avoid encroachment into the minimum vegetation clearance distance (MVCD). Additionally, models have the added benefit of learning from diverse data, which can include different types of terrain, transmission line topologies, and vegetation.
Grid inspection robot imagery may require coordination efforts with local utilities to gain access over multiple robot trips, image preprocessing to remove ambient artifacts, position and location calibration, as well as limitations in the identification of degradation patterns based on the resolution of the robot mounted camera.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data needs to be aggregated and collated from multiple cable inspection robots for improved generalizability of detection models. To address this, multiple robot trips can be made over an area of interest. For example, an initial inspection can be used to identify target locations that need further data collection, followed by a second trip to those locations for camera capture. Additional cable inspection robots or external remote sensing data may also be compiled.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with the utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks.
S2: Sufficiency > Coverage
It is necessary to supplement data with position orientation system data to better locate the cable inspection robot. One solution is to have the robot complete two inspections–a preliminary one to identify inspection targets, followed by a more detailed autonomous inspection of targets with additional high precision image capture data from an on-board or externally mounted pan-tilt-zoom (PTZ) camera.
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized. Data from multiple multispectral imagers, drones, cable-mounted sensors, and additional robots may be employed to improve the level of detail needed for specific obstructions.
Unmanned aerial vehicle (UAV) or drone imagery for vegetation management near transmission and distribution lines may require partnerships with private companies and utilities for access and usage. LiDAR data is sparse and may partially scan power transmission lines resulting in poor data quality. Coverage area is often relegated to right of way (RoWs) of interest which may require continuous monitoring for future vegetation growth and inspection.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Researchers must be involved in an active study with a partnering utility or transmission owner to access pre-existing drone data or to obtain permission to collect drone data.
U3: Usability > Usage Rights
Once collected, data is private as RoWs represent critical energy infrastructure. Private partnerships may allow for extended usage rights within a predefined scope.
U5: Usability > Pre-processing
LiDAR data is sparse if equipment partially scans power transmission lines resulting in weak features that may make it difficult for computer vision algorithms to detect and distinguish lines from pylons or towers that support overhead power lines. Pre-processing the data to identify partial scans may be helpful. Furthermore, supplemental data from multiple trips and/or external remote sensing equipment can also assist in identifying incomplete scans.
S2: Sufficiency > Coverage
Coverage can vary depending on the RoW examined. Often, multiple datasets containing UAV image data from multiple transmission RoWs would be necessary to increase the number of image examples in the dataset.
S4: Sufficiency > Timeliness
Measurements should be taken at multiple time periods to examine transmission line characteristics with respect to both vegetation growth and/or line sag caused by overvoltage conditions.
Identification and mapping of climate policy
Details (click to expand)
Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions. ML can be employed to identify climate-related policies and categorize them according to different focus areas.
Data is not available in machine-readable formats and is limited to English-language literature from major journals.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Many data sources that should be open are not fully accessible. For instance, abstracts are generally expected to be openly available, even for proprietary data; in practice, however, only a subset of abstracts is accessible for some papers.
U1: Usability > Structure
Most of the data is in PDF format and should be converted to machine-readable formats.
S2: Sufficiency > Coverage
Research is currently limited to literature published in English (at least the abstracts) and from major journals. Many region-specific journals or literature indexed in other languages are not included. These should be translated into English and incorporated into the database.
Laws and regulations for climate action are published in various formats through national and subnational governments, and most are not labeled as a “climate policy”. There are a number of initiatives that take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work, needs to be permanently updated, and datasets are not complete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be structured into a machine-readable format. Much of the data is also in the original language of the publishing country and needs to be translated into English.
U2: Usability > Aggregation
Legislation data is published through national and subnational governments, and often is not explicitly labeled as "climate policy". Determining whether it is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
Improving battery management systems
Details (click to expand)
With the shift from carbon-based generation to renewables, energy storage becomes crucial to counter the intermittent nature of renewable energy availability. Battery efficiency and lifetime also have a direct impact on the effectiveness of transportation electrification. Machine learning can be a valuable tool in accelerating operational efficiency by estimating state of charge (SoC), state of health (SoH), and remaining useful life (RUL). Techniques such as reinforcement learning can optimize and enhance charge/discharge strategies for battery management systems (BMS). ML can also process large real-world datasets that may contain battery health parameters, charge/discharge measurements, and load demand. If the load is a vehicle, the vehicle type and driving behavior may also be available.
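For a sense of the modeling setup, the sketch below trains a regressor to estimate SoH from per-cycle features; the data is synthetic and the feature set hypothetical.

```python
# Toy SoH estimation: synthetic cycling data, hypothetical features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([
    rng.integers(1, 1000, n),        # cycle count
    rng.normal(30, 5, n),            # mean cell temperature (C)
    rng.uniform(0.2, 1.0, n),        # depth of discharge
    rng.uniform(0.5, 3.0, n),        # mean charge current (A)
])
y = 1.0 - 0.0004 * X[:, 0] + rng.normal(0, 0.02, n)  # synthetic SoH target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 3))
```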
Dataset
Data Gap Summary
Improving power grid optimization
Details (click to expand)
Traditionally, optimal power flow (OPF) seeks to minimize the cost of power generation to meet a given load (economic dispatch) such that line limits due to thermal, voltage, or stability constraints, along with generation limits, are met while maintaining power balance at each bus in the transmission system. Traditional techniques formulate OPF as a non-linear, constrained, non-convex optimization problem which can be solved for AC and DC systems separately. Traditional OPF solvers use a linear program to determine the generation needed to minimize cost and satisfy load demand while adhering to the physical constraints of the system. However, as the grid integrates more renewable generation sources, there is a trend towards hybrid AC/DC power grids that address the limitations of traditional AC transmission systems and the desire to access remote renewables. Such hybrid systems present new challenges to traditional OPF by enabling bidirectional power flow, requiring the adaptation of OPF objective functions and constraints to account for new losses, increased costs, and congestion. ML can be used to approximate OPF problems, allowing them to be solved at greater speed, scale, and fidelity.
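To make the optimization concrete, the sketch below solves the cost-minimization core of OPF, a toy economic dispatch that ignores network constraints entirely; generator costs and limits are made up.

```python
# Toy economic dispatch: minimize generation cost subject to power balance
# and generator limits (no line limits, losses, or voltage constraints).
import numpy as np
from scipy.optimize import linprog

cost = np.array([20.0, 35.0, 50.0])     # $/MWh per generator
p_max = np.array([100.0, 80.0, 60.0])   # MW upper limits
demand = 180.0                          # MW total load

res = linprog(
    c=cost,
    A_eq=np.ones((1, 3)), b_eq=[demand],  # power balance constraint
    bounds=[(0.0, pm) for pm in p_max],
)
print("dispatch (MW):", res.x, "| cost ($/h):", round(res.fun, 1))
```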
Grid2Op is a reinforcement learning framework that builds an environment based on topologies, selected grid observations, a selected reward function, and a set of actions for the agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps representing 5 minutes cannot capture complex transients and can limit the effectiveness of certain actions within the action space relative to others. Furthermore, customizing Grid2Op can be challenging, as the platform does not allow for single- to multi-agent conversion and is not a suitable environment for cascading failure scenarios due to its game-over rules.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
In the customization of the reward function, there are several TODOs concerning the units and attributes of the reward related to redispatching. Documentation and code comments can sometimes provide conflicting information. The modularity of reward, adversary, action, environment, and backend is non-intuitive, requiring pregenerated dictionaries rather than dynamic inputs or conversion from single-agent to multi-agent functionality. Refactoring documentation and comments to reflect updates would assist users and avoid the need to cross-reference information from the “Learning to Run a Power Network” Discord channel and GitHub issues.
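For orientation, the sketch below shows the documented pattern of supplying a custom reward class at environment creation; the environment name and the BaseReward call signature follow grid2op's public API as we understand it and may differ slightly across versions.

```python
# Custom reward sketch for grid2op: reward is 1 minus the worst line
# loading (rho), returning 0 when the step produced a solver error.
import grid2op
from grid2op.Reward import BaseReward

class FlowMarginReward(BaseReward):
    def __call__(self, action, env, has_error, is_done, is_illegal, is_ambiguous):
        if has_error:
            return 0.0
        return 1.0 - float(env.get_obs().rho.max())

env = grid2op.make("l2rpn_case14_sandbox", reward_class=FlowMarginReward)
obs = env.reset()
obs, reward, done, info = env.step(env.action_space({}))  # do-nothing action
print("reward:", reward)
```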
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading failure scenarios and more complex adversaries such as natural disasters. Code base variations between versions, especially between the native and Gym-formatted framework, lose features present in the legacy version, including topology graphics. Open-source refactoring efforts can assist in updating the code base so that the latest and previous versions run without loss of features.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the Grid2Op environment, which is based on different IEEE bus topologies. While customization of the environment in terms of the “Backend,” “Parameters,” and “Rules” is possible, dependent modules may still enforce game-over rules. Furthermore, since backend modeling is not the focus of Grid2Op, verification that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observation time series grid data, or chronics. Furthermore, the granularity may limit the effectiveness of specific actions in the provided action space. For example, the use of energy storage devices in the presence of overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage. Expansion of the framework by the open-source community to include multiple time resolutions may allow for generalization of the tool to different forecasting time horizons as well as action evaluation.
R1: Reliability > Quality
The Grid2Op framework relies on mathematically robust control laws and rewards that train the RL agent on fixed observation assumptions rather than actual system dynamics, which are susceptible to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal modeling of the grid equations, nor can it suggest which solver should be adopted to solve traditional nonlinear optimal power flow equations. Specifics concerning modeling and the preferred solver require users to customize or create a new “Backend.” Additionally, such RL human-in-the-loop systems in practice require trustworthiness and quantification of risk. A library of open-source contributed “Backends” from independent projects that customize the framework, with supplemental documentation and paper references, could assist in further development of the environment for different conditions. Human-in-the-loop studies can be completed by testing the environment scenario and the control response of the system over a model of a real grid. Generated observations and control actions can then be compared to historical event sequences and grid operator responses.
Traditional OPF simulation software may require the purchase of licenses for advanced features and functionalities. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output would require verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
U2: Usability > Aggregation
In MATPOWER and PowerWorld, outside data may be required to simulate conditions over a specific region with a given amount of DERs, generating sources, bus topology, and line limits. This requires collating pre-existing synthetic grid data with additional data to model specific scenarios.
U3: Usability > Usage Rights
Depending on whether proprietary simulators are pursued (e.g., PowerWorld), there may be licensing costs for the use of certain features.
R1: Reliability > Quality
Traditional OPF simulation software simplifies the power system and makes assumptions about system behavior, such as perfect power factor correction or constant system parameters. Simulation results may need to be verified against real-world results.
S3: Sufficiency > Granularity
In PowerWorld, the available bus topologies may be simplified representations of actual grids, easing modeling and simulation while representing overall system behavior. MATPOWER requires the user to define the bus matrix. As the number of buses in a power system increases, the computational complexity of OPF increases, requiring more resources and time to solve. Additional parameters such as line limits, the number of generating sources, the number of DERs, and load demand also increase the complexity of the model as more constraints and assets are introduced.
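As an open-source alternative for experimentation, pandapower ships MATPOWER-derived IEEE cases and an AC-OPF routine; the sketch below assumes the pandapower package and its bundled case30 network.

```python
# Load a bundled IEEE 30-bus case and run AC optimal power flow.
import pandapower as pp
import pandapower.networks as pn

net = pn.case30()           # MATPOWER-derived test case with cost data
pp.runopp(net)              # AC-OPF
print("objective ($/h):", net.res_cost)
print(net.res_bus.head())   # per-bus voltage and injection results
```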
While network datasets are open source, maintenance of the repository requires continuous curation and collection of more complex benchmark data to enable diverse AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though such data may be hard to find without cooperative effort.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Industry engagement can assist in developing detailed and realistic networked datasets and operating conditions, limits, and constraints.
O2: Obtainability > Accessibility
U2: Usability > Aggregation
Repository maintenance requires continuous curation of more complex networked benchmark data for more realistic AC-OPF simulation studies.
Marine wildlife detection and species classification
Details (click to expand)
Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These processes involve identifying and categorizing different marine species. ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.
Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
An API is needed to download data, but many ecologists are not familiar with scripting languages.
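The scripting burden is often just a short request loop; the sketch below shows the general shape, with a placeholder URL and parameters rather than any real biodiversity endpoint.

```python
# Generic API download sketch; endpoint and parameters are placeholders.
import requests

resp = requests.get(
    "https://example.org/api/v1/occurrences",            # hypothetical
    params={"taxon": "Delphinus delphis", "limit": 100},
    timeout=30,
)
resp.raise_for_status()
records = resp.json()
print(f"downloaded {len(records)} records")
```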
M: Misc/Other
It would be ideal if Copernicus also made biodiversity data available on its website. Having access to both biodiversity data and associated environmental ocean data on the same platform would significantly enhance efficiency and accessibility. This integration would eliminate the need to download massive datasets for local analysis, streamlining the process for users.
As with terrestrial biodiversity data, the lack of well-annotated data is the biggest bottleneck. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Data collection efforts should also be strategic, targeting places where biodiversity is high but currently available data is sparse.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
A lot of ocean data is collected, but a persistent challenge is ensuring that this data is shared and utilized by those who need it. Currently, much of the data remains siloed within individual institutions, making it difficult to access for collaborative purposes. Despite numerous initiatives, such as Ocean Biodiversity Information System (OBIS), Integrated Ocean Observing System (IOOS), and others, data accessibility continues to be the biggest hurdle.
To facilitate large-scale data sharing, there is a need for incentives, robust platforms for data storage, clear guidelines, and straightforward pipelines for data sharing.
U1: Usability > Structure
There is a lack of data format standardization across the ocean science community.
U2: Usability > Aggregation
Much of the data is siloed within individual institutions and not easily accessible for collaboration.
U5: Usability > Pre-processing
The volume of raw data (e.g., from imaging and acoustics) is massive and requires significant effort to annotate and extract insights. To accelerate this, some or all of the following solutions can be adopted (a sketch of the third follows the list):
Accelerate the data analysis pipeline, particularly for visual data, through the Ocean Vision AI initiative
Engage the broader community to participate in exploration, discovery, and annotation through initiatives like a mobile game
Target annotation efforts strategically to maximize the impact on model performance, such as focusing on rare species.
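One way to make the last point concrete is inverse-frequency sampling: pick unlabeled items for annotation with probability inversely proportional to their predicted class frequency, so rare species surface first. The class names and counts below are illustrative only.

```python
# Prioritize annotation of rare classes via inverse-frequency sampling.
import numpy as np

predicted = np.array(["anchovy"] * 900 + ["jelly"] * 90 + ["rare_squid"] * 10)
classes, counts = np.unique(predicted, return_counts=True)
freq = dict(zip(classes, counts))

weights = np.array([1.0 / freq[c] for c in predicted])
weights /= weights.sum()

rng = np.random.default_rng(0)
picked = rng.choice(predicted.size, size=30, replace=False, p=weights)
print(dict(zip(*np.unique(predicted[picked], return_counts=True))))
```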
S2: Sufficiency > Coverage
There are massive gaps in the coverage of ocean biodiversity data, with only about 7% of the upper 5 meters of the ocean being regularly monitored, and 30-60% of ocean life still unknown to science. The data is also heavily biased towards coastal regions, with much less coverage of the open ocean.
While collecting data from the deep ocean is technologically challenging, the primary issue is the lack of financial incentives. High seas fall outside national jurisdictions, so data collection often occurs only through mining companies, military operations, or ad hoc research expeditions. The absence of marine protected areas on high seas and the migratory nature of species like phytoplankton further complicate data collection. Financial tools or regulations could incentivize data collection.
Only data from 2019 to March 2022 is publicly available. Registration is required to access the data.
Modeling effects of soil processes on soil organic carbon
Details (click to expand)
Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies. ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses.
Data collection is extremely expensive for some variables, leading to the use of simulated variables. Unfortunately, simulated values have large uncertainties due to the assumptions and simplifications made within simulation models.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Soil carbon values generated from simulators are not reliable because these process-based models may be obsolete or may carry systematic biases that get reflected in the simulated variables. However, the ML scientists who use these simulated variables usually lack the domain knowledge needed to calibrate the process-based models.
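One pragmatic remedy, sketched below with synthetic numbers, is to bias-correct simulator output against the subset of sites that do have field measurements; a linear correction is a strong simplification of real calibration workflows.

```python
# Linear bias correction of simulated soil organic carbon against
# field measurements (all values synthetic).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
simulated = rng.uniform(10, 60, 200)                      # t C/ha, simulator
observed = 0.8 * simulated + 5 + rng.normal(0, 3, 200)    # field plots

fit = LinearRegression().fit(simulated.reshape(-1, 1), observed)
corrected = fit.predict(simulated.reshape(-1, 1))
print("slope:", round(fit.coef_[0], 2), "intercept:", round(fit.intercept_, 2))
```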
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity).
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is collected by different farmers on different farms, leading to consistency issues and a need to better structure the data.
S3: Sufficiency > Granularity
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
S2: Sufficiency > Coverage
In general, there is insufficient soil organic carbon data for training a well-generalized ML model (both in terms of coverage and granularity). One reason is that collecting such data is very expensive – the hardware is costly and collecting data at a high frequency is even more expensive.
Non-intrusive electricity load monitoring
Details (click to expand)
Non-intrusive load monitoring (NILM) is a strategy to disaggregate the total electricity consumption profile of a building into individual appliance load profiles. This strategy can provide insight to individual consumer behavior for the purposes of real-time electricity pricing, can help target customers who may be due for an appliance upgrade, and can enable building energy management systems (EMS) to enact demand response strategies such as load shifting for sheddable or curtailable loads. These strategies can foster energy efficiency, reduce peaks in electricity demand, and help increase the utilization of low-carbon power by enabling better supply/demand matching, thereby fostering grid decarbonization and maintaining grid stability.
Pecan Street DataPort requires non-academic and academic users to purchase access via licensing, which varies depending on the building data features requested. Data coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may need energy-efficiency upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual socio-demographic survey data may be too coarse to provide insight into the behavioral effects of household members on consumption profiles over time.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data is downloadable as a static file or accessible via the DataPort API. Under the licensing agreement, a small dataset is available for free to academic users, with pricing for larger datasets. Commercial use requires paid access based on requested features, ranging from the standard to the unlimited customer tier and plan.
U3: Usability > Usage Rights
Usage rights vary depending on the agreed upon licensing agreement.
S2: Sufficiency > Coverage
Data coverage primarily focuses on Texas, with limited coverage in New York and California. Though there are efforts to include Puerto Rico, data collection hinges on volunteer participation. This could introduce self-selection bias, as households who participate are likely more interested in energy conservation than the general population. Furthermore, the majority of the dataset covers the Mueller community in Austin, a planned community developed after 1999 with modern build types. Enrollment of homes from older built environments and different temperate regions, within the United States and globally, may provide greater insight into household appliance usage patterns as well as generation patterns, which vary with temperate region and appliance age. Identifying high-consumption older appliances can assist in identifying upgrades.
S6: Sufficiency > Missing Components
The data does not track real-time occupancy of individuals in the household, which could provide insight into behavioral effects on energy consumption. Adding this data can allow for improved consumption-based customer segmentation models, as patterns change with respect to time and day of the week. The data would also be amenable to consumer-in-the-loop energy management studies with respect to comfort, based on customer habitual activity, location in the house, and number of occupants.
S3: Sufficiency > Granularity
Disaggregated data may provide more granular context for customer segmentation studies than aggregate data alone. However, such segmentation studies ultimately depend on the number of household members who may be using appliances at a given time. Pecan Street data contains annual survey responses on household demographics and home features, which may be too coarse in granularity to track how customer segments change over time as members move in or out of a building. Jointly collecting occupancy data can address the granularity gap but can potentially limit volunteer engagement, as privacy concerns will need to be evaluated.
For accurate NILM studies, benchmark datasets must include not only consumption but also local power generation (e.g., from rooftop solar), as it affects the overall aggregate load observed at the building level. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, devices that can behave as both load and generator, such as electric vehicles or stationary batteries, are also not included. The majority of building types are single-family housing units, limiting the diversity of representation. Furthermore, most datasets are no longer maintained after study close.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
U1: Usability > Structure
When retrieving NILM data from a variety of pre-existing studies as well as through custom data collection, the structure of the received data can vary. Testbed design, hardware, and monitored variables depend on sensor availability, which ultimately influences schemas and data formats. Data structure may also differ based on the level of disaggregation, at the plug level or the individual appliance level. When building future testbeds for data collection, it may help to follow the standards set by toolkits such as NILMTK, which has successfully integrated multiple datasets from different sources. Using the REDD dataset format as inspiration, the toolkit developers created a standard energy disaggregation data format (NILMTK-DF) with common metadata and structure, which requires manual dataset-specific preprocessing. When working with non-standardized data that may require aggregation, machine-learning-based data fusion strategies may help automate schema matching and data integration.
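As a sketch of that standardization path, assuming NILMTK is installed and a local copy of the raw REDD download is available, conversion and loading look roughly like this (API details vary by NILMTK version):

```python
# Convert raw REDD data into NILMTK's common HDF5 format, then load it.
from nilmtk import DataSet
from nilmtk.dataset_converters import convert_redd

convert_redd("low_freq/", "redd.h5")   # one-time, dataset-specific step

redd = DataSet("redd.h5")
elec = redd.buildings[1].elec          # meter group for building 1
print(elec.submeters().appliances)     # appliances behind the sub-meters
```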
U5: Usability > Pre-processing
Sub-metered data relies heavily on the sensor network installed to monitor the building. Depending on the technology used, some sensors require calibration or are prone to malfunctions and delays. Additionally, interference from other devices can be present in the aggregate building-level readings, such as that experienced by REFIT, which needs to be addressed manually to enhance the usability of the dataset. These issues vary depending on the sub-meter dataset utilized, requiring a clear understanding of the metadata and documentation specific to the testbed the study was built upon. Exploratory data analysis of the time series data may assist in identifying outliers that result from sensor drift.
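A simple screening pass of that kind, on synthetic data and with an arbitrary threshold, might look like:

```python
# Flag outliers in a sub-meter series with a rolling z-score.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
s = pd.Series(rng.normal(0.5, 0.05, 1000))
s.iloc[700:710] += 1.0                     # injected drift/spike

roll = s.rolling(window=60, min_periods=30)
z = (s - roll.mean()) / roll.std()
print("flagged points:", s[z.abs() > 4].index.tolist())
```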
R1: Reliability > Quality
In the AMPds2 data, the sum of the sub-metered consumption did not add up to the whole-house consumption due to rounding errors in the meter measurements, highlighting the need not only for NILM studies with sub-metered data as ground truth but also for attention to the type of building-level meter. Future data collection efforts may want to focus not only on retrieving utility-side building meter data but also on supplemental aggregate meter data to detect mismatches in measurements.
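The consistency check itself is straightforward; the sketch below uses synthetic readings and a 0.1 kW tolerance chosen arbitrarily.

```python
# Compare the sum of sub-metered channels against the whole-house meter.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
idx = pd.date_range("2024-01-01", periods=96, freq="15min")
sub = pd.DataFrame(rng.uniform(0, 1.5, (96, 4)), index=idx,
                   columns=["fridge", "hvac", "oven", "lights"])
whole_house = sub.sum(axis=1) + rng.normal(0, 0.05, 96)  # metering noise

mismatch = (whole_house - sub.sum(axis=1)).abs()
print("intervals above 0.1 kW mismatch:", int((mismatch > 0.1).sum()))
```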
Datasets or studies that require self-reporting by customers may introduce participant bias, as the frequency with which households update voluntary information can vary. For example, if the number of household members, occupancy schedule, and the addition of new plug loads are self-reported, the frequency of updates depends on volunteer engagement. Additionally, volunteers who participate in NILM studies may have a particular propensity for energy-efficient actions and may not be representative of the general population. For example, some participants in UK-DALE were Imperial College graduate students who were motivated to participate to advance their own projects. To ensure that electricity usage represents the general population, future case studies can recruit volunteer communities with diverse socioeconomic backgrounds and locations.
S2: Sufficiency > Coverage
Gaps in dataset coverage are specific to each sub-metered dataset. These gaps may be due to unaccounted loads, the level of disaggregation (e.g., circuit level, plug level, or individual appliance level), or limited appliance types. Diversity of building types is limited, as most studies take place in single-family residences. Some dataset-specific gaps are detailed below; these may be addressed by collecting new data on existing testbeds or by augmenting already collected data with synthetic information. Future data collection efforts should be mindful of avoiding the kinds of gaps associated with existing datasets.
In the AMPds2 data there were some missing electricity, water, and natural gas readings. Additionally, there are un-metered household loads which were not accounted for in the aggregate building-level readings. With respect to dishwasher consumption, AMPds2 did not have direct water meter monitoring. REFIT did not monitor appliances that could not be accessed through wall plugs, such as electric ovens. Depending on the built environment and building type, larger loads may not be connectable to building-level meters. For example, in the GREEND dataset, electric boilers in Austria were connected to separate external meters. In the UMass smart home dataset, gas furnaces, exhaust fans, and recirculator pump loads could not be monitored.
AMPds2, DEDDIAG, DRED, iAWE, REDD, REFIT, and the UMass smart home dataset all gather data in single-family homes, which may not be representative of the diversity of buildings in terms of age, location, construction, and potential household demographics. REFIT covers different single-family home types such as detached, semi-detached, and mid-terrace homes, ranging from 2-6 bedrooms and built between the 1850s and 2005. GREEND covers apartments in addition to single-family homes, but only 9 households. AMPds2, DRED, and iAWE each cover only a single household. Additionally, datasets are specific to the location where the measurements were taken and are thus shaped by the environmental conditions of the region as well as the culture of the population. For example, REDD consists of data from 10 monitored homes, which may not be representative of the common appliances contributing to the overall load of the broader population outside of Boston.
COMBED contains complex load types that may rely on variable-speed drives as well as multi-state devices, which the other datasets do not contain; this may be due to the difference in building type but could also be due to the lack of diversity in appliance representation.
ECO relied on smart plugs for disaggregated load consumption measurements, and smart plug appliance coverage varied between households. For all households, the total consumption was not equal to the sum of the consumption measured from the plugs alone, indicating a high proportion of non-attributed consumption.
S6: Sufficiency > Missing Components
While sub-metered data provides a means of verifying non-intrusive load monitoring techniques, it does not capture the hidden human motivators driving appliance usage (such as comfort, utility cost, and daily activities) as well as other important factors contributing to the aggregate load seen at the building level meter. The key to improving these studies is to provide greater context to the sub-metered data by taking additional joint measurements such as rooftop solar power production, electric vehicle load, occupancy related information, and battery storage. Some dataset-specific missing data components are highlighted below.
None of the datasets mentioned include electric vehicle loads. REDD, AMPds2, COMBED, DEDDIAG, DRED, GREEND, iAWE, and UK-DALE do not include generation from rooftop solar. REFIT contains solar from three homes, but these were not the focus of the study and were treated as solar interference to the aggregate load. The UMass smart home dataset only had one home with solar and wind generation, and at a significantly larger square footage and build compared to the other two homes featured.
While DRED provided occupancy information through data collected from wearable devices in the home, and ECO and IDEAL did so through self-reporting and an infrared entryway sensor, all other studies did not.
The majority of datasets are not amenable to human-in-the-loop user behavior analysis with respect to consumption patterns, response to feedback, and the effectiveness of load shifting in promoting energy-conserving behaviors, due to their lack of representation.
While AMPds2 includes some utility data, most datasets do not incorporate billing or real-time pricing. This type of data would be beneficial, as it varies with time, season, region, and utility.
Battery storage was not taken into account in any of the building consumption datasets.
Offshore wind power forecasting: Long-term (3 hours-1 year)
Details (click to expand)
Long-term wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.
Due to their location, FINO platform measurement sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather conditions amenable to human intervention. This directly affects data quality, with gaps that can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires sign up through a login account at: https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to measurement sensor failures. Issues with data loggers, power supplies, and the effects of adverse conditions such as low aerosol concentrations can influence data quality. High wind and wave conditions impact the ability to correct or recalibrate sensors, creating data gaps that can last for several weeks or seasons.
S2: Sufficiency > Coverage
Coverage is limited to the dimensions of the platform itself and the wind farm it is built in proximity to. For locations with different offshore characteristics, similar testbed platforms or buoys can be developed.
S5: Sufficiency > Proxy
Because FINO sensors are exposed to environmental ocean conditions and storms, they often need maintenance and repair but are difficult to physically access. The resulting gaps in the data can be addressed by utilizing mesoscale wind modeling output.
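Operationally, gap filling of this kind can be as simple as falling back to model values wherever measurements are missing, while keeping a provenance flag; both series below are synthetic stand-ins.

```python
# Fill sensor-outage gaps in wind speed with mesoscale model output.
import numpy as np
import pandas as pd

idx = pd.date_range("2024-03-01", periods=144, freq="10min")
measured = pd.Series(np.random.default_rng(5).normal(9.0, 2.0, 144), index=idx)
measured.iloc[50:90] = np.nan            # simulated sensor outage
model = pd.Series(8.5, index=idx)        # stand-in for mesoscale output

filled = measured.fillna(model)
was_filled = measured.isna()             # keep provenance of filled points
print(f"filled {int(was_filled.sum())} of {filled.size} intervals")
```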
Spatiotemporal coverage of the offshore meteorological and wind speed platform data is restricted to the dimensions of the platform itself as well as the time of construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to data must be requested, with different data providers having varying levels of restrictions. For data obtained from Orsted, access is only provided upon signing a standard non-disclosure agreement. For more information, email R&D at datasharing@orsted.com.
S2: Sufficiency > Coverage
Spatiotemporal coverage of the dataset varies depending on the construction and location of the platform testbed, but overall data is available from 2014 to the present. While measurements from LiDAR have higher resolution than wind mast data, sensor information is still restricted to the dimensions of the platform and the associated offshore wind farm when present. Data provided by Orsted from LiDAR sensors includes 10-minute statistics.
Offshore wind power forecasting: Short-term (10 min)
Details (click to expand)
Short-term wind forecasting can enable estimation of active power generated by wind farms in the absence of curtailment.
Data is obtainable by requesting access via the Orsted form. Sufficiency of the dataset is constrained by volume, as only a finite number of offshore wind farms are available for short-term forecasting; expanding the coverage area, volume, and time granularity of the data to under 10 minutes may enable transient detection from generated active power.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Need to request access via form from Orsted.
S1: Sufficiency > Insufficient Volume
Data from multiple wind farms over a variety of regions would be required to enable a more accurate comparison against weather model data.
S2: Sufficiency > Coverage
Coverage is specifically over Europe; offshore wind conditions vary by environment and do not scale or transfer to other temperate regions of the world.
S3: Sufficiency > Granularity
The time granularity of 10 minutes is too coarse to capture transients in the active power generated.
S4: Sufficiency > Timeliness
Only two years' worth of data (2016-2018) is provided; this gap can be addressed through additional data collection from offshore wind farms or through simulation.
Post-disaster damage assessment
Details (click to expand)
Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies. ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes.
The resolution of publicly available datasets is insufficient for accurate damage assessments. To improve this, some commercial high-resolution images should be made accessible for research purposes.
Data Gap Type
Data Gap Details
S4: Sufficiency > Timeliness
Both pre- and post-disaster imagery are needed, but pre-disaster imagery is sometimes outdated and does not reflect conditions immediately before the disaster.
S3: Sufficiency > Granularity
Accurate damage assessment requires high-resolution images, but the resolution of current publicly open datasets is inadequate for this purpose. While companies offer high-resolution images, they are often prohibitively expensive. To address this, some commercial high-resolution images should be made available for research purposes at no cost.
Data is highly biased towards North America. Similar datasets focusing on other parts of the world are needed. Additionally, the dataset should include more detailed information on the severity of the damage.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is no differentiation of grades of damage. More granular information about the severity of damage is needed.
S2: Sufficiency > Coverage
Data is also highly biased towards North America. Data from other parts of the world is sorely needed.
Short-term electricity load forecasting
Details (click to expand)
Short-term load forecasting (STLF) is critical for utilities to balance demand with supply. Utilities need accurate forecasts (e.g., on the scale of hours, days, weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions. Furthermore, for grids that may have portions privatized, utilities rely on forecasts to procure (i.e., source and purchase) energy to meet demand. In peak conditions where load has been underestimated, utilities have limited options. One option is to utilize reserve capacity, or additional electric supply, to ensure reliable power to customers; this usually entails recruiting expensive fossil-fuel peaker plants in city centers to meet immediate demands over short distances. Another option is for the utility to initiate an outage to clip peaks. In the worst case, grid assets can be overloaded, resulting in system failure and unplanned blackouts. Due to the reliance on historical electricity load data, weather forecasts, time of day, week, or month, and continuous streams of advanced metering infrastructure (AMI) data, machine learning models are well suited to handle large amounts of data and capture non-linearities that traditional linear models may struggle with.
AMI data is challenging to obtain without pilot-study partnerships with utilities, since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time series data can also vary depending on the level of access, whether the data is aggregated and anonymized or limited by the resolution of the readings and metering system. Additionally, coverage is limited to utility pilot test service areas, restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult due to privacy concerns. Even when partnered with a utility, the AMI data may undergo anonymization and aggregation to protect individual customers. Some ISOs can distribute data provided that a written records request is submitted. If requesting personal consumption data, program pricing enrollment may limit the temporal resolution of the data that a utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
AMI data, when used jointly with other data that may influence demand, such as weather, availability of rooftop solar, presence of electric vehicles, building specifications, and appliance inventory, may require significant additional data collection or retrieval. Non-intrusive load monitoring techniques to disaggregate AMI data may be employed, with some assumptions based on additional data. For example, satellite imagery over a region of interest can assist in identifying buildings that have solar panels.
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted for evaluation by issda@ucd.ie. For data obtained through utility collaborative partnerships, usage rights may vary. Please contact the data provider for more information.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on the formats of the data streams output by the installed sensor network. When designing the testbed data format, it is recommended to develop and structure comprehensive metadata for the study to encourage further development.
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open-source. Further data collection for verification purposes is recommended.
S2: Sufficiency > Coverage
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware, ranging from 1-hour to 30-minute to 15-minute measurement intervals. Depending on the level of anonymization and aggregation, the granularity may also be constrained by other factors, such as the cadence of time-of-use pricing and other tiered demand response programs employed by the partnering utility. Interpolation may be used to address resolution issues but may require uncertainty considerations when reporting results.
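The interpolation step is mechanically simple, as the pandas sketch below shows; the harder part is propagating the resulting uncertainty, so interpolated points should stay flagged.

```python
# Upsample hourly AMI readings to 15-minute resolution.
import pandas as pd

hourly = pd.Series(
    [1.2, 1.5, 0.9, 1.1],  # kWh per interval, illustrative values
    index=pd.date_range("2024-01-01", periods=4, freq="h"),
)
quarter_hourly = hourly.resample("15min").interpolate(method="time")
print(quarter_hourly)
```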
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gais Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change with time. Similarly, pilot programs through participating utilities are finite in nature. To address this data gap, previous pilot studies and testbeds can be reopened or revisited at their original locations. For new studies in different locations, previous data can still be utilized for pre-training models; however, fine-tuning will still require new data collection.
The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to universities and higher-education institutions. While the dataset has clear documentation, the lack of diversity in building types necessitates further development to include other types of buildings as well as expansion of coverage areas and time periods beyond those currently available.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data was collated from 7 open-access public data sources as well as 12 privately curated datasets obtained from facilities management at different college sites, which required manual site visits; these privately curated datasets are not included in the data repository at this time.
U3: Usability > Usage Rights
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, thereby limiting the diversity of building representation. To overcome the lack of diversity in building data, data-sharing incentives and community open-source contributions can allow for the expansion of the dataset.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies at higher resolution. Assumptions about conditions would have to be made prior to interpolating.
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018. While this may be adequate for pre-training models, further data collection through a reinitiation of the study may be needed to fine-tune models for more up to date periods of time.
Faraday synthetic AMI data is a response to the bottlenecks faced in retrieving building-level demand data due to consumer privacy. However, despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and the presence of low-carbon technology. Furthermore, since the model is trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification using real data or substation-level aggregated data to assess its effectiveness. Faraday is currently accessible through the Centre for Net Zero’s API.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The variational autoencoder model can generate synthetic AMI data conditioned on several variables. The presence of low-carbon technology (LCT) for a given household or property type depends on access to battery storage solutions, rooftop solar panels, and electric vehicles; this type of data may require curation of LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric based on actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid modeling manner. For example, load demand at the substation level can be estimated as the sum of the individual building loads that the substation services. This value can then be compared to the actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, assessing the accuracy of a specific demand profile for a property or set of properties would require identifying a population of buildings, a connected real-world substation, and the residential low-carbon technology investment for the properties under study.
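A toy version of that bottom-up check, with all profiles synthetic and the "actual" substation reading simulated, looks like:

```python
# Aggregate per-building profiles and compare to a substation reading.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
idx = pd.date_range("2024-06-01", periods=48, freq="30min")
buildings = pd.DataFrame(rng.uniform(0.2, 2.0, (48, 25)), index=idx)

estimated = buildings.sum(axis=1)                # bottom-up estimate (kW)
actual = estimated * rng.normal(1.0, 0.05, 48)   # stand-in for DNO data

mape = ((estimated - actual).abs() / actual).mean()
print(f"MAPE vs. substation reading: {mape:.1%}")
```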
S2: Sufficiency > Coverage
Faraday is trained from utility provided AMI data from the UK which may not be representative of load demand and corresponding building type and temperate zone of other global regions. To generate similar synthetic data, custom data may be retrieved through a pilot test bed for private collection or the result of a partnership with a local utility. Additionally, pre-existing AMI data over an area of interest can be utilized to generate similar synthetic data.
Datasets are restricted to past pilot study coverage areas requiring further data collection for fine-tuning models to a different coverage area.
S3: Sufficiency > Granularity
Data granularity is limited to the granularity of the data the model was trained on. Generative modeling approaches similar to Faraday can be built using higher-resolution data, or interpolation methods could be employed.
S4: Sufficiency > Timeliness
Keeping the dataset timely would require continuous integration and development of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy’s OpenSynth initiative, Centre for Net Zero hopes to build a global community of contributors to facilitate research.
Smart inverter management for distributed energy resources
Details (click to expand)
Distributed energy resources (DERs) such as solar photovoltaics and energy storage systems are part of low-inertia power systems that do not rely on traditional rotating components. These DERs rely on distributed inverters to convert power from DC to AC, typically configured to unity power factor. As an alternative to unity power factor operation, inverters can be “smart,” dynamically managing the effects of intermittency before feeding power back to feeder circuits at the distribution substation level. Smart inverters can perform Volt-VAR (Voltage-VAR) and Volt-Watt (Voltage-Watt) operations, which involve adjusting the output voltage and frequency of the inverter to maintain grid stability. In other words, the DER inverter is controlled to dynamically adjust reactive power injection back into the grid. This is crucial for preventing voltage sags and swells that can occur due to the integration of DERs into the grid.
There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters that incorporate DER models within the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulations and hardware-in-the-loop facilities and systems requires submitting a user access proposal for NREL’s Energy Systems Integration Facility. Similar testing laboratories may require access requests and funding.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Contact NREL at precise@nrel.gov for access to the PRECISE model.
Submit an Energy Systems Integration Facility (ESIF) laboratory request form to userprogram.esif@nrel.gov to gain access to hardware-in-the-loop inverter simulation systems. Access to particular hardware may require collaboration with inverter manufacturers, which may have additional permission requirements.
R1: Reliability > Quality
The optimization routine of the simulation model may face challenges in determining the precise balance between grid operation criteria and impacts on customer PV generation. Generation may still require curtailment by the utility to prioritize grid stability. To address this gap, external data on distribution-side operating conditions, load demand, solar generation, and utility-initiated generation curtailment can be collected and introduced into expanded simulation studies.
Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB compliant manufacturers and smart inverter models that can then be contacted for research partnerships. In terms of coverage area, while California and Hawaii are now moving towards standardizing smart inverter technology in their power systems, other regions outside of the United States may locate similar manufacturers through partnerships and collaborations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Specific to the CEC database, one will need to contact the CEC or the manufacturer to receive additional information about a particular smart inverter. Detailed studies using smart inverter hardware may require collaboration with a utility and a research organization to perform advanced research studies.
U2: Usability > Aggregation
To retrieve additional data beyond the single entry model and manufacturer of a particular smart inverter, one may need to contact a variety of manufacturers to get access to datasets and specifications for operational smart inverter data, laboratories to get access to hardware in the loop test centers, and utilities or local energy commissions for smart inverter safety compliance and standards.
S2: Sufficiency > Coverage
New grid support functions defined by UL 1741-SA and UL 1741-SB are optional but will be required in California and Hawaii; as of now, public manufacturing data is available only via the CEC website. Collaborations and contact with manufacturers outside the US may be necessary to compile a similar database, and contact with utilities can provide a better understanding of UL 1741-SB criteria adoption elsewhere.
Solar installation site assessment
Details (click to expand)
Statistical analysis on solar PV system components for pricing, logistics, planning, and site capacity studies is an important part of the process for siting solar PV systems. Spatiotemporal generation forecasting using pre-existing site data can be used to inform future site recommendations, policy, and decision making with respect to new developments.
The LBNL solar PV system dataset excluded third-party owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap: some records are historical and may not reflect current pricing and costs of PV systems.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excluded third-party owned systems, systems with battery backup, self-installed systems, and records missing installation prices. Data was self-reported and may be inconsistent in how component costs were reported. Furthermore, some state markets were underrepresented or missing, which can be alleviated by new data collection or by using the dataset jointly with simulation studies.
S4: Sufficiency > Timeliness
The dataset includes historical data that may not reflect current pricing for PV systems. To alleviate this, updated pricing may be incorporated in the form of external data or as additional synthetic data from simulation (a minimal adjustment sketch follows this entry).
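One hedged option for the timeliness gap is to rescale self-reported historical prices to a common reference year with an external deflator series. The column names and deflator values below are hypothetical stand-ins, not the LBNL dataset's actual schema.

```python
# Minimal sketch: adjust historical system prices to a common reference year
# using a user-supplied deflator series, so older records are comparable
# with current pricing. All values are illustrative.
import pandas as pd

records = pd.DataFrame({
    "install_year": [2012, 2016, 2021],
    "price_per_watt": [4.10, 3.20, 2.80],   # illustrative $/W values
})
# Hypothetical deflator: reference-year price level / each year's level.
deflator = {2012: 1.28, 2016: 1.17, 2021: 1.04}

records["price_per_watt_real"] = (
    records["price_per_watt"] * records["install_year"].map(deflator)
)
print(records)
```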
The USPVDB data must be accessed through the United States Geological Survey (USGS) mapper browser application, or downloaded as GIS data in the form of shapefiles or GeoJSON files. Tabular data and metadata are provided in CSV and XML formats. Coverage is limited to the US, specifically densely populated regions. Supplementing the data with international large-scale photovoltaic satellite imagery can expand the coverage area of the dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data may be accessed through the USGS’s designated USPVDB mapper or downloaded as GIS shapefiles, tabular CSV data, or XML metadata. Data is open and easily obtainable (see the loading sketch after this entry).
S2: Sufficiency > Coverage
Coverage is over the US and specifically over densely populated regions that may or may not correlate to areas of low cloud cover and high solar irradiance. Representation of smaller scale private PV systems could expand the current dataset to less populated areas as well as regions outside the US.
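A minimal loading sketch for the USPVDB GIS downloads, assuming geopandas is installed; the file name is a placeholder for whichever shapefile/GeoJSON release was downloaded.

```python
# Load a USPVDB GIS download and compute facility footprints.
import geopandas as gpd

pv = gpd.read_file("uspvdb.geojson")    # placeholder path; also accepts .shp
print(pv.crs, len(pv), "facilities")

# Reproject to an equal-area CRS (CONUS Albers) before computing areas.
print(pv.to_crs("EPSG:5070").geometry.area.head())
```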
Solar power forecasting: Long-term (>24 hours)
Details (click to expand)
Longer-term solar forecasts are beneficial for energy market pricing, investment decisions, and integration with other renewable energy sources such as hydroelectric plants to allow for larger scale coordination and grid operational studies. Additionally, inclusion of energy storage systems to harvest solar energy on longer time scales can be better aligned with longer term demand forecasting and predicted solar peaks.
While the synthetic PV plant data is useful for forecasting and control simulation case studies when actual data is not available, there are limitations with respect to verification for site-specific projects, representation of coverage areas outside of the US, and modeling assumptions based on data proxies that must be taken into account when interpreting results.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
The data may not be suitable for site-specific projects because the simulation output may not be representative of a particular region, requiring additional outside data or adaptation of the simulation to better represent the spatiotemporal region of interest.
Simulated PV is based on numerical weather prediction for the day-ahead data and sub-hour irradiance algorithms for the 5-minute data. These serve as proxies for real PV data and may require verification with supplemental data or measurements from solar power inverter outputs.
S2: Sufficiency > Coverage
Synthetic data is based on solar conditions over the US for 2006 and may not be suitable for other locations.
Solar power forecasting: Medium-term (6-24 hours)
Details (click to expand)
Medium-term solar forecasts can be beneficial for simulation case studies in demand response, microgrid behavior, electricity markets, and solar site planning.
Depending on the region of interest, data can be retrieved from different open-data satellites, both geostationary and swath, which may differ in spatial and temporal resolution and coverage area. Additionally, multispectral data may pose challenges in preprocessing and preparing the data for analysis. Specifically for medium-term solar forecasting, actual ground irradiance may differ from approximations made by models that utilize satellite-derived cloud cover products, because different cloud types have different impacts on irradiance. Supplementation with ground-based measurements for verification and improvements in granularity are suggested solutions.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data can be retrieved from different open-data satellites, both geostationary and swath, requiring collation of data when multiple regions of interest are selected.
U5: Usability > Pre-processing
Multispectral remote sensing data may require preprocessing depending on the wavelength, band combinations, and satellite products chosen for data analysis and model training. Solar forecasting may require band combinations in the visible and infrared spectra (see the cloud-index sketch after this entry).
R1: Reliability > Quality
Different cloud types impact actual ground irradiance differently, requiring verification and supplementation of measurements with ground-based cloud cover sensor data.
S3: Sufficiency > Granularity
Spatial and temporal coverage can vary depending on the type of satellite selected for cloud cover and solar irradiance forecasting studies over a region of interest. Accurately forecasting changes in global irradiance during partly cloudy days can be difficult due to the variability of coverage in short time frames.
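As a hedged illustration of how satellite cloud products feed irradiance estimation, the sketch below computes a Heliosat-style cloud index from a visible-band scene. The reflectance array and the clear/overcast reference values are placeholders, not tied to any particular satellite product named here.

```python
# Heliosat-style cloud index from a visible-band reflectance scene:
# n = (rho - rho_clear) / (rho_cloud - rho_clear), then k = 1 - n as a
# simple clear-sky index proxy. All inputs are illustrative stand-ins.
import numpy as np

rho = np.random.rand(100, 100)       # stand-in visible reflectance scene
rho_clear, rho_cloud = 0.08, 0.85    # assumed clear-sky / overcast references

cloud_index = np.clip((rho - rho_clear) / (rho_cloud - rho_clear), 0.0, 1.0)
clear_sky_index = 1.0 - cloud_index  # 1 = clear, 0 = overcast
print("mean clear-sky index:", clear_sky_index.mean())
```

Because different cloud types attenuate irradiance differently, a single scalar index like this is exactly where ground-based verification (as noted under R1 above) becomes important.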
Solar power forecasting: Short-term (30 min-6 hours)
Details (click to expand)
Hourly site-specific solar forecasting can assist with solar energy estimates based on measured irradiance, photovoltaic inverter output energy, and turbine level output. Forecasting at this level can prove beneficial for joint distributed energy resource and energy storage microgrid scheduling studies, and system reliability studies.
While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, data gaps exist for the short-term solar forecasting use case (which relies on hourly averages). The data quality of hourly averages is lower than that of native-resolution data, impacting effective short-term forecasting for real-time energy management for grid stability, demand response, real-time market price predictions, and dispatch. Coverage area is also constrained to certain parts of the United States based on the SURFRAD network locations.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Coverage area is constrained to SOLRAD network locations in the United States, namely: Albuquerque, NM
Bismarck, ND
Hanford, CA
Madison, WI
Oak Ridge, TN
Salt Lake City, UT
Seattle, WA
Sterling, VA
Tallahassee, FL
For the dataset to generalize to other regions, areas with similar climates and temperate zones would have to be identified.
S3: Sufficiency > Granularity
The data quality of the hourly averages is lower than that of the native-resolution data, which can impact effective short-term forecasting for real-time energy management for grid stability, demand response, real-time market price predictions, and dispatch. To address this, it may be better either to use the native-resolution data for very-short-term horizons or to utilize additional data, such as sky imagers and other sensors with frequent measurement outputs (see the resampling sketch after this entry).
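The granularity trade-off can be seen directly by aggregating a synthetic high-frequency irradiance series to hourly means: short ramps disappear. The index and values below are illustrative, not SOLRAD's actual schema.

```python
# Aggregating 1-minute irradiance to hourly means smooths out the ramps
# that short-term forecasting needs. Data here is a synthetic stand-in.
import numpy as np
import pandas as pd

idx = pd.date_range("2024-06-01", periods=24 * 60, freq="min")
ghi = pd.Series(600 + 2.0 * np.random.randn(len(idx)).cumsum(), index=idx)

hourly = ghi.resample("1h").mean()
print("native 1-min ramp std:", ghi.diff().std())
print("hourly-mean ramp std: ", hourly.diff().std())
```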
While data coverage is global and based on satellite imagery fed into the Fast All-sky Radiation Model for Solar applications (FARMS), a radiative transfer model, the output is calculated over specific time frames and would need to be recalculated and updated for recent years. Furthermore, the data is unbalanced: the region with the longest temporal coverage is the United States. Satellite-based estimation of solar resource information may be susceptible to cloud cover, snow, and bright surfaces, which would require additional verification from ground-based measurements and collation of outside data sources. Additionally, since the data is derived from satellites, it may require preprocessing to account for parallax effects depending on the field of view of the coverage satellite and the region of interest, which may not be expressed in the FARMS higher-level tabular products.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Since the data is derived from satellite imagery, it may require pre-processing to account for pixel variability and parallax effects, as well as additional radiative transfer modeling to improve solar radiation estimates.
R1: Reliability > Quality
Satellite-based estimation of solar resource information for sites susceptible to cloud cover, snow, and bright surfaces may not be accurate, thereby requiring verification from ground-based measurements.
S4: Sufficiency > Timeliness
The data flow from satellite imagery to solar radiation output from the Fast All-sky Radiation Model for Solar applications needs to be recalculated and updated to extend beyond the coverage years of the represented global regions. Consult the dataset documentation for the coverage area and years covered.
While NREL’s SRRL BMS provides real-time joint-variable data from ground-based sensors, coverage is limited to the sensor network in Golden, CO, in the United States. Since the measurement system comprises diverse sensors, instruments may malfunction or drift out of calibration, requiring human intervention and maintenance following detection; delayed detection can lead to inaccuracies in the data.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Instrument malfunction or calibration drift requires human intervention, and inaccuracies in measured quantities can result, especially if detection is delayed, affecting solar forecast accuracy. Despite this, the dataset continues to be maintained.
S2: Sufficiency > Coverage
Coverage is limited to Golden, CO, though other locations could benefit from similar sensor monitoring systems, especially those with variations in weather patterns that affect solar irradiance forecasting and thereby energy harvesting.
The PV Anlage-Reinhart system information for PV systems, collated and compiled by SMA with PV inverter data, requires creating a user profile and requesting access to specific systems, may lack clear instructions in languages other than German, and has greater representation of systems located in Germany, the Netherlands, and Australia, despite the presence of data globally. Furthermore, a subset of the systems contains joint energy storage data, which may be valuable for DER-specific load forecasting studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Users need to utilize the web interface or create a user profile and request membership to be granted access to additional data or data in a desired format. For immediate use, data cannot be downloaded in zip or raw format and must be scraped from the web browser portal to be used for free. Contact SMA for membership and usage rights.
U4: Usability > Documentation
Documentation is primarily in German, and the English version of the website lacks the same level of detail. Companion works utilizing the data are not readily cited or linked. Since data can only be viewed via the portal unless express permission is given to download it, language barriers can make interpreting the displayed data values challenging.
S2: Sufficiency > Coverage
Coverage depends on the country of interest, with greater representation of PV system data in Europe. As of 2018, among the countries represented, coverage varies from one system per country to 43,665 systems per country. Germany, the Netherlands, and Australia have more PV system testbeds than other global regions. Furthermore, depending on the testbed selected, some systems have additional information on battery storage, though this is inconsistent across testbeds. This gap can be addressed by increasing the amount of privately contributed system data from diverse regions to supplement those already curated by SMA.
While SOLETE is advantageous for joint wind-solar DER forecasting and inverter-level generation studies, the dataset can be improved by addressing several gaps in data sufficiency, namely expanding the temporal coverage to include seasonal variations, which may be addressed with additional outside data or simulation. Outside data or simulation may also help scale the study to multiple generation sources (more than one PV array and wind turbine) and the coordination between them needed to maintain grid reliability and stability. Additionally, a data wish for SOLETE is the addition of maintenance schedules or system downtime data to more realistically model system dynamics with DERs.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
SOLETE only covers a single wind turbine and PV array, so it may not capture the considerations that arise when multiple generation sources of the same type must be coordinated. This gap can be alleviated by physically expanding the network to include multiple PV arrays and wind turbines, or by combining SOLETE with data from outside sources such as utilities, power electronics firms, and energy tech companies that may have similar datasets, in order to perform larger coordination and grid control studies.
S2: Sufficiency > Coverage
The temporal coverage of SOLETE is limited to 15 months, which cannot capture long-term seasonal variations in joint wind and irradiance data.
S3: Sufficiency > Granularity
The resolution and sampling rate of the joint dataset can impact the precision of the analysis, especially when fusing data of different temporal resolutions, from second-level to hourly. Aggregating second-level data to hourly data may affect the outcomes of joint short-term solar and wind forecasting.
S6: Sufficiency > Missing Components
SOLETE does not include maintenance schedule data or system downtimes that occurred during data gathering. Retroactively assessing and supplementing the dataset with maintenance schedule data either by simulation or actual data collection from SYSLAB records may improve system forecasting and modeling to include uncertainties with respect to scheduled system maintenance.
Solar power forecasting: Very-short-term (0-30min)
Details (click to expand)
Very-short-term solar power forecasting is critical for time series irradiance forecasting and solar ramp event identification. Solar irradiance ramp events can be defined as sudden changes in solar irradiance within a short time interval. These events are often caused by transient clouds that can lead to abrupt fluctuations in the incoming solar energy. Cloud analysis using cloud segmentation and classification as a proxy for determining solar irradiance attenuation can assist in determining solar generation for photovoltaics and concentrated solar power towers. Solar generation predictions are important for real-time electricity market and pricing studies, real-time dispatch of other generating sources, and energy storage control studies.
The ARM dataset includes data from various DOE sites, with sensor information from sun-tracking photometers, radiometers, and spectrometers, which is helpful for understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets that can be challenging to store, stream, analyze, and archive; may be sensitive to sensor noise; and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. Automating ingestion and analysis of data using artificial intelligence can alleviate volume by compressing/reducing data storage and provide novel ways to index and access the data (a chunked-storage sketch follows this entry).
R1: Reliability > Quality
Data quality from ARM site sensors can be sensitive to noise and calibration issues requiring field specialists to identify potential problems. Since data volume is large, ingestion of data and identification of measurement drift benefit from automation.
S2: Sufficiency > Coverage
Spatial coverage of radiation and associated ground-based atmospheric phenomena is limited to ARM sites within the United States. To increase spatial context, collaboration with partner sensor network sites within the DOE and ARM program can expand coverage within the United States. Similar initiatives outside the United States can enable better solar potential studies in regions with different environments.
S3: Sufficiency > Granularity
There is a need for the retrieval of enhanced aerosol composition measurements in addition to ice nucleating particle measurements for better understanding cloud and weather dynamics jointly with solar irradiance for DER site planning and solar potential surveying.
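As one generic way to tame continuous high-rate sensor streams before archiving (a sketch under assumed dimensions, not ARM's actual pipeline), chunked, compressed HDF5 storage trades a little CPU for much smaller, range-readable files.

```python
# Store one day of a 1 Hz, 16-channel sensor stream as chunked, gzip-
# compressed HDF5. Dataset name and shape are illustrative assumptions.
import h5py
import numpy as np

samples = np.random.rand(24 * 3600, 16).astype("float32")  # one day, 16 channels

with h5py.File("arm_like_day.h5", "w") as f:
    f.create_dataset(
        "irradiance",
        data=samples,
        chunks=(3600, 16),      # one-hour chunks for efficient range reads
        compression="gzip",
        compression_opts=4,
    )
```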
Data coverage is limited to the NIST campus in Gaithersburg, MD, and the dataset has not been maintained since July 2017.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Since the testbeds are located on the NIST campus, spatial coverage is limited to the institution’s site. Similar datasets combining sensor information on solar irradiance conditions with the associated solar power generated at the inverter output would require investment in comparable site-specific testbeds in other regions.
S4: Sufficiency > Timeliness
The dataset has not been maintained since July 2017; given the investment in equipment for the project, it may be worth revisiting to study the long-term changes in panel efficiency with respect to time and operational degradation.
Data from Solcast is accessible via academic or research institutions. Solcast uses coarse surface elevation models aligned with reanalysis data, leading to significant elevation differences between ground data sites and cell height. While a global dataset, coverage is limited to 33 sites, with 18 in tropical/subtropical locations and 15 in temperate locations. Time granularity is between 5 and 60 minutes.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Data is only accessible via collaborating academic or research institutions.
R1: Reliability > Quality
Significant changes in elevation between sites can impact clear-sky solar irradiance estimation due to variations in the atmosphere that radiation travels through and interacts with. Furthermore, ground stations may be above clouds, which can affect the accuracy of satellite-derived solar irradiance estimates. Solcast uses a coarse surface elevation model aligned with the reanalysis data used, which may lead to significant elevation differences between ground data sites and the cell height (see the clear-sky sketch after this entry).
S2: Sufficiency > Coverage
While data products cover global sites, only 33 sites are covered, with 18 representing tropical and subtropical regions and 15 representing temperate regions. Further work would be needed to create data products for regions beyond the 33 represented, as well as for areas with different environmental conditions.
S3: Sufficiency > Granularity
The current time granularity of the dataset ranges across products taken at time resolutions of 5 to 60 minutes. For forecasting horizons shorter than 5 minutes, supplemental data would be needed.
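The sensitivity of clear-sky irradiance to site altitude can be illustrated with pvlib, which provides clear-sky models parameterized by location and elevation. The coordinates and altitudes below are illustrative assumptions, not Solcast sites.

```python
# Compare clear-sky GHI for two sites that differ only in altitude,
# illustrating why coarse elevation models can bias irradiance estimates.
import pandas as pd
from pvlib.location import Location

times = pd.date_range("2024-06-01", periods=24, freq="1h", tz="UTC")
low_site = Location(latitude=-3.0, longitude=37.0, altitude=900)
high_site = Location(latitude=-3.0, longitude=37.0, altitude=2900)

ghi_low = low_site.get_clearsky(times)["ghi"]
ghi_high = high_site.get_clearsky(times)["ghi"]
print("max clear-sky GHI difference (W/m^2):", (ghi_high - ghi_low).max())
```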
Data coverage and granularity are limited by the location of the cameras and constrained to 10-minute increments. Resolution is also limited to 352x288 24-bit JPEG images (see device specifications).
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Coverage is constrained by the location of the sensor network, the number of sensors within the network, and the spatial distances between sensors. To improve coverage, similar sensor networks must be created in different environmental conditions with varying granularity.
S3: Sufficiency > Granularity
Image resolution is limited to 352x288 24-bit JPEG images taken at 10-minute increments based on device specifications. Studies fusing information from other sensors with multispectral capabilities or additional measured quantities may provide information that facilitates better solar irradiance predictions and modeling of the effect of water vapor (a baseline cloud-masking sketch follows below).
S2: Sufficiency > Coverage
The current dataset is derived from two previous sky imager datasets in Singapore. Studies extending beyond Singapore would require a similar sensor testbed network, or the collation and cultivation of sky image datasets over different climatic environments, in addition to the use of proxy data that may not be ground-based (remote sensing).
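A common baseline for sky-imager data (assumed here as a generic technique, not this dataset's pipeline) is thresholding the red/blue ratio to segment cloud pixels, since clouds scatter red and blue light roughly equally while clear sky is strongly blue.

```python
# Red/blue-ratio cloud masking on an RGB sky image. The file name is a
# placeholder and the threshold is typically tuned per camera and site.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("sky_image.jpg"), dtype=np.float32)  # placeholder
red, blue = img[..., 0], img[..., 2]

rb_ratio = red / np.clip(blue, 1e-6, None)
cloud_mask = rb_ratio > 0.7   # assumed threshold; tune per device
print("cloud fraction:", cloud_mask.mean())
```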
There is a need for sky image data with annotated labels for cloud detection and segmentation, to improve local and PV site-specific irradiance predictions. The data is constrained to the coverage area of Singapore and restricts users from commercial use.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The dataset is under a Creative Commons license. Commercial use is not allowed, and access must be requested via a form.
S1: Sufficiency > Insufficient Volume
There is a need for a larger dataset that contains manually annotated cloud mask labels. The current dataset is also unbalanced, with less nighttime data; though this may not directly impact solar irradiance studies, it does affect cloud dynamics modeling, which may be crucial for forecasting irradiance at different times and timescales.
Terrestrial wildlife detection and species classification
Details (click to expand)
Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems. Similarly to marine wildlife studies, ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.
The first and foremost challenge of bioacoustic data is its sheer volume, which makes its data sharing especially difficult due to limited storage options and high costs. Urgent solutions are needed for cheaper and more reliable data hosting and sharing platforms.
Additionally, there is a significant shortage of large and diverse annotated datasets, a shortage much more severe than for image data such as camera trap, drone, and crowd-sourced images.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge that applies to almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges in bioacoustic data lies in its sheer volume, stemming from continuous monitoring processes. Researchers face significant hurdles in sharing and hosting this data, as existing online platforms often do not provide sufficient long-term storage capacity or are very expensive. Urgent solutions are needed to provide cheaper and more reliable hosting options. Moreover, accessing these extensive datasets demands advanced computing infrastructure and solutions. With sufficient funding, many researchers would be willing to share their bioacoustic data.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge that applies to almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased towards species with larger, denser populations.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge that applies to almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
One data gap is the incompleteness of barcoding reference databases.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
eDNA is an emerging technique in biodiversity monitoring, and several issues still impede the application of eDNA-based tools. One data gap is incomplete barcoding reference databases; however, considerable attention and effort are being devoted to filling it, for example through the BIOSCAN project. Notably, BIOSCAN-5M is a comprehensive dataset containing multi-modal information, including DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks (a toy classification sketch follows this entry).
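To illustrate the kind of baseline such a reference library enables, here is a hedged toy sketch treating DNA barcodes as k-mer count vectors and fitting a simple classifier. The sequences and labels are invented stand-ins, far shorter than real barcodes.

```python
# Toy DNA-barcode classification: 4-mer count features + naive Bayes.
# All sequences/labels are illustrative; real barcodes are ~650 bp.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

seqs = ["ACGTACGTGG", "ACGTACGAGG", "TTGGCCAATT", "TTGGCCTATT"]
labels = ["genus_A", "genus_A", "genus_B", "genus_B"]

vec = CountVectorizer(analyzer="char", ngram_range=(4, 4))  # 4-mer counts
X = vec.fit_transform(seqs)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["ACGTACGTAG"])))
```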
While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Though GBIF provides a common standard, the species classification of the data is not always accurate or consistent; the same species may even be classified into different groups over time (see the name-resolution sketch after this entry).
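One hedged way to reconcile shifting classifications is to resolve names against the GBIF backbone taxonomy, for example with the pygbif client; the example name below is arbitrary.

```python
# Resolve a scientific name against the GBIF backbone taxonomy to get a
# canonical match, its status (accepted/synonym), and a match confidence.
from pygbif import species

match = species.name_backbone(name="Puma concolor", rank="species")
print(match.get("scientificName"), match.get("status"), match.get("confidence"))
```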
Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough. High-resolution images are usually commercial and not freely available.
Variability analysis of wind power generation
Details (click to expand)
The shift from high-inertia generation sources such as thermal plants to low-inertia, inverter-coupled generation from distributed energy resources introduces new stability and reliability issues. It is imperative to maintain the frequency of the system at a nominal level to prevent damage, instability, and blackouts. Wind generation from turbines can contribute frequency response and inertia that may benefit the grid by providing a combination of synthetic inertial and primary frequency response.
To gain access, particularly to NREL’s FESTIV model, permission must be requested. Since FESTIV is a simulation model, it may not account for all real-time system dynamics and complexities, requiring validation and verification against real-world data. Furthermore, since the granularity of the model is hourly, it may not be able to account for very-short-term impacts, frequencies, and reactive power flows that can affect power system stability.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
To gain access to the FESTIV model, contact the group manager at: rui.yang@nrel.gov
R1: Reliability > Quality
The model may not account for all real-time system dynamics and complexities and may need verification against operational data. Additionally, since the data relies on scenario-based forecasting, it may not capture real-world uncertainties. Furthermore, operating reserve values may be inaccurate and need validation in practice.
S3: Sufficiency > Granularity
FESTIV is based on hourly unit commitment time resolution, which may not capture reliability impacts that occur on a sub-hourly scale. The focus on short-term operational impacts rather than very-short-term (sub-hourly) ones omits frequency response, voltage magnitudes, and reactive power flows, which impact system stability and reliability.
Weather forecasting: Near-term (< 24 hours)
Details (click to expand)
Near-term weather forecasting (< 24 hours ahead) of temperature, precipitation, etc. at km-level spatial and minute-level temporal resolution, in an accurate and computationally-efficient manner, has implications for many climate change mitigation and adaptation applications. ML can help provide more accurate near-term weather forecasts.
Data volume is large and only data specific to the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from the high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it (a lazy-loading sketch follows this entry).
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
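One generic way to work with such volumes, assuming the data is available as a NetCDF or Zarr store, is to open it lazily with dask-backed chunks and compute only the subset needed. The path and variable names below are placeholders, not this dataset's actual schema.

```python
# Lazily open a large assimilated dataset and compute a small subset,
# avoiding a full download into memory. Names are illustrative.
import xarray as xr

ds = xr.open_dataset("assimilated_conus.nc", chunks={"time": 24})
subset = ds["temperature"].sel(time=slice("2023-07-01", "2023-07-07"))
subset.mean("time").to_netcdf("weekly_mean.nc")  # computes only what is needed
```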
Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Obtaining and integrating radar data from various sources is challenging.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Radar data from many countries are not open to the public; they must be purchased or applied for. Also, different agencies tend to apply differing quality control protocols, posing challenges to scaling data analysis up to the global scale.
U1: Usability > Structure
Radar data from different sources vary in format, spatial resolution, and temporal resolution, making assimilation challenging.
U3: Usability > Usage Rights
Many radar datasets are restricted to academic and research purposes only.
S2: Sufficiency > Coverage
Data from the Global South is insufficient or entirely absent.
An enhanced version of ERA5 with higher granularity and fidelity is needed. In fact, many surface observations and remote sensing datasets are already in place that could support developing such a dataset.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
ERA5 is currently widely used in ML-based weather forecasting and climate modeling because of its high resolution and ready-for-analysis characteristics. However, large volumes of observations, e.g., data from radiosondes, weather balloons, and weather stations, remain largely under-utilized. It would be valuable to create a dataset structured like ERA5 but built from more of these observations.
Weather forecasting: Short-to-medium term (1-14 days)
Details (click to expand)
Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.
The biggest challenge of ENS is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only and are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data (see the access sketch after this entry).
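The WeatherBench 2 route works because its cloud-hosted Zarr stores can be opened lazily with xarray. The exact store path below is a placeholder; consult the WeatherBench 2 documentation for current paths.

```python
# Lazily open a WeatherBench 2 Zarr store on Google Cloud Storage with
# anonymous access; only metadata is read until data is actually used.
import xarray as xr

ds = xr.open_zarr(
    "gs://weatherbench2/datasets/era5/some-resolution.zarr",  # placeholder path
    storage_options={"token": "anon"},
)
print(list(ds.data_vars)[:5])
```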
ERA5 is now the most widely used reanalysis data due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape (a minimal request sketch follows this entry).
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
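The O2 download bottleneck above can often be eased by requesting narrow subsets instead of full global fields. Below is a minimal sketch using the official cdsapi client, assuming a configured ~/.cdsapirc key; the variables, year, and bounding box are illustrative assumptions.

```python
import cdsapi

c = cdsapi.Client()
c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": ["2m_temperature", "total_precipitation"],
        "year": "2020",
        "month": [f"{m:02d}" for m in range(1, 13)],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": ["00:00", "06:00", "12:00", "18:00"],
        "area": [60, -10, 35, 30],  # N, W, S, E: regional subset over Europe
        "format": "netcdf",  # newer CDS deployments may use "data_format"
    },
    "era5_subset_2020.nc",  # a regional, 6-hourly subset downloads far faster
)
```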
The biggest challenge of HRES is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available only for purchase, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
WeatherBench 2 is based on ERA5, so it inherits ERA5's issues; that is, the data has biases over regions where there are no observations.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
Weather forecasting: Subseasonal horizon
Details (click to expand)
High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
More data is needed to develop more accurate and robust ML models. It is also important to note that SubX data contains biases and uncertainties, which can be inherited by ML models trained on it.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Larger models generally offer improved performance for data-driven sub-seasonal forecasting. However, with only a limited number of models contributing to the SubX dataset, there is a scarcity of training data. To enhance ML model performance, more SubX data generated by physics-based numerical weather forecast models is required.
Weather forecasting: Subseasonal-to-seasonal horizon
Details (click to expand)
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station observations. Biases are large over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
The resolution is 0.5° (roughly 50 km), which is not sufficiently fine for many applications.
More data is needed to take advantage of large ML models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Currently available data is not sufficient for training large ML models. More data is needed.
Wildfire prediction: Short-term (3-7 days)
Details (click to expand)
Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.
A huge data gap is that there is no active fire data in the afternoon (1-5 pm) when most fires ignite.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
There is currently no active fire data available for the afternoon period (1-5 pm), when most fires tend to ignite, due to a lack of satellite coverage during these hours (after 1:30 pm). Some companies are developing their own satellites to address this gap and provide crucial afternoon data.
U5: Usability > Pre-processing
Available active fire data vary in quality and contain false positives and negatives; they must be cleaned, validated, and corrected before use (a minimal cleaning sketch follows this table). Some research companies are assimilating all available satellite data and active fire products to create their own active fire datasets.
R1: Reliability > Quality
Available active fire data vary in quality and contain false positives and negatives; they must be cleaned, validated, and corrected before use. Some research companies are assimilating all available satellite data and active fire products to create their own active fire datasets.
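As one illustration of the cleaning step above, here is a minimal sketch for a FIRMS active-fire CSV export. The column names ("latitude", "longitude", "acq_date", "acq_time", "confidence") follow the MODIS FIRMS format; the confidence threshold is an illustrative assumption, not an official quality-control recipe.

```python
import pandas as pd

df = pd.read_csv("firms_modis_export.csv")  # hypothetical export filename

# Drop low-confidence detections and exact duplicates from overlapping swaths.
df = df[df["confidence"] >= 50]
df = df.drop_duplicates(subset=["latitude", "longitude", "acq_date", "acq_time"])
print(f"{len(df)} detections retained after filtering")
```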
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months.
Socioeconomic data, e.g., human behaviors, are significant predictors of fire. Beyond the inherent challenges and gaps of socioeconomic data, aggregating those datasets and harmonizing them with other fire predictors in the spatial domain is especially tricky.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Socio-economic data, e.g., population and building types, usually come in a different format and structure from other fire predictors and fire hazard data. Aggregating different kinds of socio-economic data and harmonizing them with other fire predictors and fire hazard data is challenging, especially in the spatial dimension (see the spatial-join sketch below).
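A minimal sketch of one common harmonization step: spatially joining socio-economic polygons (e.g., census units with population counts) to point records of fire ignitions with geopandas. File names and column names are assumptions.

```python
import geopandas as gpd

fires = gpd.read_file("fire_ignitions.geojson")      # point geometries (assumed)
census = gpd.read_file("census_population.geojson")  # polygon geometries (assumed)

# Reproject to a common CRS, then attach the population of the census unit
# containing each ignition point.
fires = fires.to_crs(census.crs)
joined = gpd.sjoin(fires, census[["geometry", "population"]],
                   how="left", predicate="within")
```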
Data is not available in machine-readable formats and is limited to English-language literature from major journals.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Many data sources that should be open are not fully accessible. For instance, abstracts are generally expected to be openly available, even for proprietary data; in practice, however, only a subset of abstracts is accessible for some papers.
U1: Usability > Structure
Most of the data is in PDF format and should be converted to machine-readable formats.
S2: Sufficiency > Coverage
Research is currently limited to literature published in English (at least the abstracts) and from major journals. Many region-specific journals or literature indexed in other languages are not included. These should be translated into English and incorporated into the database.
Active fire data
Details (click to expand)
Active fire data are derived from images taken by satellites such as MODIS, VIIRS, and Landsat. They come at different spatial resolutions and temporal coverages. Data can be downloaded here: https://firms.modaps.eosdis.nasa.gov/active_fire.
A huge data gap is that there is no active fire data in the afternoon (1-5 pm) when most fires ignite.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
There is currently no active fire data available for the afternoon period (1-5 pm), when most fires tend to ignite, due to a lack of satellite coverage during these hours (after 1:30 pm). Some companies are developing their own satellites to address this gap and provide crucial afternoon data.
U5: Usability > Pre-processing
Available active fire data vary in quality and contain false positives and negatives; they must be cleaned, validated, and corrected before use. Some research companies are assimilating all available satellite data and active fire products to create their own active fire datasets.
R1: Reliability > Quality
Available active fire data vary in quality and contain false positives and negatives; they must be cleaned, validated, and corrected before use. Some research companies are assimilating all available satellite data and active fire products to create their own active fire datasets.
Advanced metering infrastructure data
Details (click to expand)
Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter device systems that collect, store, and analyze per-building energy consumption.
AMI data can be retrieved through public data portals, individual data collection, or research partnerships with local utilities. Some examples of utility research partnerships include the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) and the smart meter pilot test from the Sacramento Municipal Utility District. An example of publicly available data that is aggregated and anonymized is the Commission for Energy Regulation (CER) Smart Metering Project hosted by the Irish Social Science Data Archive (ISSDA).
AMI data is challenging to obtain without pilot-study partnerships with utilities, since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time series data can also vary, depending both on the level of access to the data (e.g., whether it is aggregated and anonymized) and on the resolution of the readings and metering system. Additionally, data coverage is limited to utility pilot-test service areas, restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult due to privacy concerns. Even when partnered with a utility, AMI data may undergo anonymization and aggregation to protect individual customers. Some ISOs are able to distribute data provided that a written records request is submitted. If requesting personal consumption data, enrollment in pricing programs may limit the temporal resolution of data that a utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
AMI data, when used jointly with other data that may influence demand (such as weather, availability of rooftop solar, presence of electric vehicles, building specifications, and appliance inventory), may require significant additional data collection or retrieval. Non-intrusive load monitoring techniques to disaggregate AMI data may be employed, with some assumptions based on additional data. For example, satellite imagery over a region of interest can assist in identifying buildings that have solar panels.
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted for evaluation by issda@ucd.ie. For data obtained through utility collaborative partnerships, usage rights may vary. Please contact the data provider for more information.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on the format of the data streams output by the installed sensor network system. When designing the testbed data format, it is recommended to develop and structure comprehensive metadata with respect to the study to encourage further development.
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open-source. Further data collection for verification purposes is recommended.
S2: Sufficiency > Coverage
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware, ranging from 1-hour to 30-minute to 15-minute measurement intervals. Depending on the level of anonymization and aggregation, granularity may also be constrained by factors such as the cadence of time-of-use pricing and other tiered demand response programs employed by the partnering utility. Interpolation may be used to combat resolution issues but may require uncertainty considerations when reporting results (see the resampling sketch after this table).
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gais Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change over time. Similarly, pilot programs through participating utilities are finite in nature. To address this data gap at previous pilot study locations, studies and testbeds can be reopened or revisited. For new studies in different locations, previous data can still be utilized for pre-training models; however, fine-tuning would still require new data collection.
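For the granularity gap above, a minimal sketch of bringing AMI readings at mixed cadences onto a common 15-minute grid with pandas. The file and column names are assumptions, and interpolated values should carry an uncertainty caveat when reported.

```python
import pandas as pd

# Hypothetical AMI export with "timestamp" and "kwh" columns.
df = pd.read_csv("ami_readings.csv", parse_dates=["timestamp"])
series = df.set_index("timestamp")["kwh"]

# Upsample hourly/30-minute readings to 15-minute intervals; time-based
# interpolation fills the gaps introduced by upsampling.
aligned = series.resample("15min").interpolate(method="time")
```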
Automatic surface observation (ASOS)
Details (click to expand)
The data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Benchmark datasets for short-term wildfire prediction
Details (click to expand)
Benchmark datasets for wildfire prediction are standardized collections of data that include historical and real-time wildfire occurrences, remote sensing imagery, fuel information, and meteorological data. These datasets provide a common framework for training, validating, and testing machine learning models. By integrating various modalities and sources of data, benchmark datasets simplify the process of data collection, integration, and preprocessing, ensuring consistency and efficiency in developing and evaluating wildfire prediction models.
Use Case
Data Gap Summary
Benchmark datasets of building environmental conditions and occupancy
Details (click to expand)
The US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states, covering 7 climate zones and 11 different building types. The data covers energy consumption, indoor air quality, occupancy, environment, HVAC, and lighting, among others. Datasets are organized by name and points of contact.
All data featured on the platform is open access with standardization on metadata format to allow for ease of use and information specific to buildings based on type, location, and climate zone. Data quality and guidance on curation and cleaning in addition to access restrictions are specified in the metadata of each hosted dataset. Licensing information for each individual featured dataset is provided.
Featured datasets can vary in the types of data gaps they exhibit depending on content, coverage area, location, building type, building spatial plan, quantities measured, ambient environment, and the availability of power consumption or metered data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
U3: Usability > Usage Rights
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected. To overcome this gap, interpolation techniques may be employed and recorded.
S6: Sufficiency > Missing Components
Building data typically does not include grid interactive data, or signals from the utility side with respect to control or demand side management. Such data can be difficult to obtain or require special permissions. By enabling the collection of utility side signals, utility-initiated auto-demand response (auto-DR) and load shifting could be better assessed.
S2: Sufficiency > Coverage
Featured datasets are from test-beds, buildings, and contributing households from the United States. Similar data from other regions would require data collection as household usage behavior may differ depending on culture, location, building age, and weather.
Bioacoustic recordings
Details (click to expand)
Passive acoustic recording provides continuous monitoring of both the environment and the species.
There is a lack of standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze it consistently across projects (a minimal feature-extraction sketch follows this table).
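As one example of a raw-to-analysis-ready step, a minimal sketch of converting a recording into a log-mel spectrogram with librosa. The file name and parameter choices are illustrative assumptions, not a community standard.

```python
import librosa
import numpy as np

# Hypothetical field recording; librosa resamples to the requested rate.
y, sr = librosa.load("site01_dawn_chorus.wav", sr=32000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=320, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)
```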
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Funding presents a major bottleneck for ecosystem monitoring initiatives. While most funding allocations are short-term, there is a critical need for sustained and adequate funding to support ongoing monitoring efforts and maintain data processing capabilities.
The first and foremost challenge of bioacoustic data is its sheer volume, which makes its data sharing especially difficult due to limited storage options and high costs. Urgent solutions are needed for cheaper and more reliable data hosting and sharing platforms.
Additionally, there is a significant shortage of large and diverse annotated datasets, a shortage much more severe than for image data such as camera trap, drone, and crowd-sourced images.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. It is worth mentioning that the lack of annotated datasets is a common and major challenge applied to almost every modality of biodiversity data and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a problem intertwined with gaps in current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges with bioacoustic data lies in its sheer volume, stemming from continuous monitoring. Researchers face significant hurdles in sharing and hosting this data, as existing online platforms often do not provide sufficient long-term storage capacity or are very expensive. Urgent solutions are needed to provide cheaper and more reliable hosting options. Moreover, accessing these extensive datasets demands advanced computing infrastructure. If sufficient funding sources existed, many researchers would be willing to start sharing their bioacoustic data.
Building data genome project
Details (click to expand)
The Building Data Genome Project 2 dataset contains hourly whole-building data from 3,053 energy meters across 1,636 non-residential buildings, covering two years' worth of metered electricity, water, and solar data, in addition to logistical metadata on area, primary building use category, floor area, time zone, weather, and smart meter type. The goal of the dataset is to allow for the development of generalizable building models for energy efficiency analysis studies.
The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to universities and higher-education institutions. While the dataset has clear documentation, the lack of diversity in building types necessitates further development to include other types of buildings, as well as expansion to coverage areas and time periods beyond those currently available.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data was collated from 7 open-access public data sources as well as 12 privately curated datasets from facilities management at different college sites; the latter required manual site visits and are not included in the data repository at this time.
U3: Usability > Usage Rights
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, thereby limiting the diversity of building representation. To overcome this lack of diversity, data-sharing incentives and community open-source contributions can allow for the expansion of the dataset.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies at higher resolution. Assumptions about conditions would have to be made prior to interpolating.
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018. While this may be adequate for pre-training models, further data collection through a reinitiation of the study may be needed to fine-tune models for more recent time periods.
More information, such as the age of the building, should be included in the dataset.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Building footprint datasets usually come in formats and coordinate systems other than those used by the government. To ensure these datasets are usable for local government applications, it would be helpful to align them with the government's preferred format and coordinate system (see the reprojection sketch after this table).
S6: Sufficiency > Missing Components
More information about the building, such as its age and the source of the data, should be included in the dataset.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent accurate identification of buildings.
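For the format/coordinate-system alignment noted under U1 above, a minimal sketch of reprojecting a footprint layer with geopandas. The EPSG code is an assumed example; substitute the authority's actual CRS.

```python
import geopandas as gpd

footprints = gpd.read_file("building_footprints.geojson")  # often EPSG:4326
footprints_local = footprints.to_crs(epsg=2157)  # e.g., a national grid CRS
footprints_local.to_file("footprints_local.gpkg", driver="GPKG")
```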
CMIP6
Details (click to expand)
Climate simulations from a consortium of state-of-art climate models. Data can be found here.
The large uncertainties in future climate projections are a big problem for CMIP6. The large volume of data and the lack of uniform structure, such as inconsistent variable names, data formats, and resolutions across different CMIP6 models, make it challenging to utilize data from multiple models effectively.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The sheer volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the stored data.
U1: Usability > Structure
Data from different models comes at different resolutions and with different variable names, which makes assimilating data from multiple models challenging (see the harmonization sketch after this table).
R1: Reliability > Quality
There are large biases and uncertainties in the data, which can be improved by improving the climate models used to generate the simulations.
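A minimal sketch of harmonizing two models onto a shared grid and variable naming with xarray. The file names and the rename mapping are assumptions; "tas" is the standard CMIP6 name for near-surface air temperature, but some post-processed products rename it.

```python
import xarray as xr

ds_a = xr.open_dataset("model_a_tas.nc")
ds_b = xr.open_dataset("model_b_t2m.nc").rename({"t2m": "tas"})  # assumed mapping

# Interpolate model B onto model A's latitude/longitude grid so the two can
# be stacked along a new "model" dimension for multi-model analysis.
ds_b_regridded = ds_b.interp(lat=ds_a["lat"], lon=ds_a["lon"])
ensemble = xr.concat([ds_a["tas"], ds_b_regridded["tas"]], dim="model")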
The large data volume and lack of uniform structure (no consistent variable names, data structure, or data resolution across all models) make it difficult to use data from more than one CMIP6 model.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The sheer volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the stored data.
U1: Usability > Structure
Data from different models comes at different resolutions and with different variable names, which makes assimilating data from multiple models challenging.
R1: Reliability > Quality
There are large biases and uncertainties in the data, which can be improved by improving the climate models used to generate the simulations.
CPC Precipitation
Details (click to expand)
CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station observations. Biases are large over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
The resolution is 0.5° (roughly 50 km), which is not sufficiently fine for many applications.
Cable inspection robot data
Details (click to expand)
Cable inspection robot LiDAR data is beneficial for Specific Power Line (SPL) partitions, which include dampers, insulators, broken strands, and attachments that may have degraded due to exposure to natural elements. Specific Fitting Detection partition data focuses on assessing risk at the lowest part of the power line, near trees, roofs, and other power lines that may cross. Since the robots physically crawl on the lines, degradation detection of high-voltage transmission lines is useful for maintenance scheduling and obstruction detection at the lower levels of the power line.
Grid inspection robot imagery may require coordination with local utilities to gain access across multiple robot trips, image preprocessing to remove ambient artifacts, and position and location calibration; identification of degradation patterns is also limited by the resolution of the robot-mounted camera.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data needs to be aggregated and collated across multiple cable inspection robots to improve the generalizability of detection models. To address this, multiple robot trips can be made over an area of interest. For example, an initial inspection can identify target locations needing further data collection, followed by a second trip at those locations for camera capture. Additional cable inspection robots or external remote sensing data may also be compiled.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks (see the sketch after this table).
S2: Sufficiency > Coverage
It is necessary to supplement data with position orientation system data to better locate the cable inspection robot. One solution is to have the robot complete two inspections–a preliminary one to identify inspection targets, followed by a more detailed autonomous inspection of targets with additional high precision image capture data from an on-board or externally mounted pan-tilt-zoom (PTZ) camera.
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized. Data from multiple multispectral imagers, drones, cable-mounted sensors, and additional robots may be employed to improve the level of detail needed for specific obstructions.
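As one illustration of the U5 preprocessing step referenced above, a minimal sketch of Otsu thresholding on a robot-captured grayscale frame with OpenCV. The file name is hypothetical, and a real pipeline would add artifact removal and calibration first.

```python
import cv2

img = cv2.imread("cable_span_0042.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frame
blurred = cv2.GaussianBlur(img, (5, 5), 0)  # suppress sensor noise first

# Otsu's method picks a global threshold separating cable from background.
_, mask = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("cable_span_0042_mask.png", mask)
```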
Camera trap images
Details (click to expand)
Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. This medium offers close-range monitoring over long-time scales. The image sequences can be used to not only classify species but to identify specifics about the individual, e.g. sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density.
In general, the raw images from camera traps need to be annotated before they can be used to train ML models. Some of the available annotated camera trap images are shared via Wildlife Insights (www.wildlifeinsights.org) and LILA BC (www.lila.science), while others are listed on GBIF (https://www.gbif.org/dataset/search?q=). However, the majority of camera trap data is likely scattered across individual research labs and organizations and not publicly available. Sharing such images could go a long way toward filling the gap in annotated data that currently hinders the efficient use of ML in biodiversity studies; this is what initiatives like Wildlife Insights aim to do. (A minimal sketch of parsing shared annotations follows.)
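A minimal sketch of reading a COCO Camera Traps annotation file (the JSON convention used by many LILA BC datasets) and tallying species labels. The file name is an assumption; the "images"/"annotations"/"categories" keys are part of the COCO convention.

```python
import json
from collections import Counter

with open("camera_trap_annotations.json") as f:  # hypothetical annotation file
    coco = json.load(f)

id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
print(counts.most_common(10))  # quick view of label balance across species
```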
There is a lack of standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, there is an urgent need for clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze it consistently across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. It is worth mentioning that the lack of annotated datasets is a common and major challenge applied to almost every modality of biodiversity data and is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a problem intertwined with gaps in current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Funding presents a major bottleneck for ecosystem monitoring initiatives. While most funding allocations are short-term, there is a critical need for sustained and adequate funding to support ongoing monitoring efforts and maintain data processing capabilities.
Changes in marine ecosystems
Details (click to expand)
Annual data on changes (e.g. extent) in marine ecosystems such as mangroves, seagrasses, salt marshes, and wetlands due to various factors including coastal erosion, aquaculture, and others.
Use Case
Data Gap Summary
ClimSim
Details (click to expand)
An ML-ready benchmark dataset designed for hybrid ML-physics research, e.g. emulation of subgrid clouds and convection processes.
Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge in emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
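For working with such volumes, lazy, chunked access is a common first step, so only the slices a computation needs are ever read into memory. A minimal sketch using xarray with dask-backed chunks follows; the filename and variable name are hypothetical placeholders.

import xarray as xr

# Open lazily: chunks are only read from disk when a computation needs them.
# "climate_model_output.nc" and "state_t" are hypothetical placeholders.
ds = xr.open_dataset("climate_model_output.nc", chunks={"time": 1024})

# Reduce chunk-by-chunk instead of materializing the full array in memory.
t_mean = ds["state_t"].mean(dim="time").compute()
print(t_mean.shape)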
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g. turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
Climate-related laws and regulations
Details (click to expand)
Laws and regulations for climate action published by national and subnational governments. Some centralized databases, such as Climate Policy Radar, the International Energy Agency, and the New Climate Institute, have selected, aggregated, and structured these data into comprehensive resources.
Laws and regulations for climate action are published in various formats by national and subnational governments, and most are not labeled as a “climate policy”. A number of initiatives take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work, needs continuous updating, and the resulting datasets remain incomplete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be converted into a machine-readable structure. Much of the data is also in the original language of the publishing country and needs to be translated into English.
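As a first structuring step, the raw text can be pulled out of policy PDFs programmatically. A minimal sketch with pypdf follows; the filename is a hypothetical placeholder, and scanned documents would additionally require OCR.

from pypdf import PdfReader

# "national_climate_law.pdf" is a hypothetical placeholder.
reader = PdfReader("national_climate_law.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])  # inspect the opening of the extracted text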
U2: Usability > Aggregation
Legislation data is published by national and subnational governments and often is not explicitly labeled as “climate policy”. Determining whether a given law is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
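One way ML can help with this triage is zero-shot text classification. The sketch below uses the Hugging Face transformers pipeline with an off-the-shelf NLI model; the label set and the 0.7 confidence threshold are illustrative assumptions, not a validated policy taxonomy.

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

excerpt = ("The act establishes a national emissions trading scheme "
           "covering the power and industrial sectors.")
result = classifier(excerpt,
                    candidate_labels=["climate policy", "unrelated to climate"])

# Illustrative decision rule: top label plus a confidence threshold.
is_climate = result["labels"][0] == "climate policy" and result["scores"][0] > 0.7
print(is_climate)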
ClimateBench v1.0
Details (click to expand)
A benchmark dataset derived from a full-complexity Earth System Model (NorESM2, a participant in CMIP6) for emulation of key climate variables: https://zenodo.org/records/7064308.
The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
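A standard way to test cross-model generalization, once simulations from several models are available, is leave-one-model-out evaluation: train an emulator on all models but one and score it on the held-out model. The sketch below illustrates this with a simple ridge-regression emulator; the model list, filenames, and variable names are hypothetical placeholders.

import numpy as np
import xarray as xr
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

models = ["NorESM2", "ModelB", "ModelC"]  # hypothetical list of ESMs

def load_xy(name):
    # Hypothetical files: one set of scenario runs per Earth system model.
    ds = xr.open_dataset(f"{name}_train.nc")
    X = ds["cumulative_co2"].values.reshape(-1, 1)   # forcing input
    y = ds["tas"].mean(dim=("lat", "lon")).values    # global-mean temperature
    return X, y

for held_out in models:
    train = [load_xy(m) for m in models if m != held_out]
    X_tr = np.concatenate([x for x, _ in train])
    y_tr = np.concatenate([y for _, y in train])
    X_te, y_te = load_xy(held_out)

    emulator = Ridge().fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, emulator.predict(X_te)) ** 0.5
    print(f"held out {held_out}: RMSE = {rmse:.3f} K")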
Community science data
Details (click to expand)
Images and recordings contributed by citizen scientists and volunteers represent another significant source of data in biodiversity and ecosystem studies. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased towards species with larger or denser populations.
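A simple partial mitigation during model training is to reweight examples by inverse class frequency, so rare species contribute proportionally more to the loss. A minimal sketch, assuming a hypothetical list of species labels:

from collections import Counter

# Hypothetical species labels for a set of training images.
labels = ["robin", "robin", "robin", "warbler", "owl", "robin", "warbler"]

counts = Counter(labels)
n = len(labels)
# Rare species get proportionally larger weights during training.
class_weights = {species: n / (len(counts) * c) for species, c in counts.items()}
print(class_weights)  # e.g. {'robin': 0.58, 'warbler': 1.17, 'owl': 2.33}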
Computational fluid dynamics simulation
Details (click to expand)
Computational fluid dynamics (CFD) simulation output is a means of assessing natural ventilation for new building construction in relation to layout geometry, terrain, presence of neighboring buildings and infrastructure, as well as materials. Multi-directional CFD simulations are often run to account for different times in the year when wind can vary with season. Given the building geometry, terrain, presence of neighboring buildings, and boundary conditions, the Navier-Stokes or Reynolds-averaged Navier-Stokes equations can be solved over a lattice or grid superimposed on the layout.
Despite its usefulness in ventilation studies for new construction, CFD simulation is computationally expensive, making it difficult to include in the early phase of the design process, where building morphology can be optimized to reduce future operational consumption associated with lighting, heating, and cooling. Simulations require accurate input information on material properties that may not be available for traditional urban building types. Model outputs require domain knowledge to interpret, and the large volumes of synthetic data produced for different wind directions become challenging to manage. Future data collection aimed at verifying simulation outputs could benefit surrogate (proxy) approaches to the computationally expensive Navier-Stokes equations. Moreover, coverage is often restricted to modern building approaches, leaving passive building techniques known as vernacular architecture, developed by indigenous communities, out of design consideration.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The computational overhead and runtime of CFD simulations can become prohibitive in the early stages of building design, thereby limiting their use as a tool. To overcome this, surrogate models such as GANs or physics-constrained deep neural network architectures have shown promising results, though further research on turbulence representation is needed.
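To make the shape of such a surrogate concrete, the sketch below shows the skeleton of a CNN that maps a 2-D building-geometry mask to a predicted airflow speed field. The architecture and tensor shapes are illustrative assumptions only; a practical surrogate would be trained on paired geometry/CFD samples and would likely need physics-informed constraints to represent turbulence.

import torch
import torch.nn as nn

class FlowSurrogate(nn.Module):
    """Toy CNN mapping an occupancy mask to a predicted speed field."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),  # predicted speed field
        )

    def forward(self, geometry_mask):
        return self.net(geometry_mask)

model = FlowSurrogate()
geometry = torch.rand(1, 1, 64, 64)   # dummy 64x64 building-geometry mask
speed_field = model(geometry)         # inference in milliseconds, not hours
print(speed_field.shape)              # torch.Size([1, 1, 64, 64])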
U2: Usability > Aggregation
While the simulation framework is not difficult to acquire, aggregating and collating the input information regarding boundary conditions requires predefined material properties to model heat transfer. Typically, traditional urban building materials are represented; however, when considering non-traditional materials, additional data collection and simulation adaptation may be necessary.
U6: Usability > Large Volume
CFD simulations can generate large amounts of data with respect to flow rates, surface temperatures, and turbulence which can be difficult to interpret manually by domain experts. By setting clear objectives at the planning stage to recognize specific flow or thermal phenomena in the synthetic data, model outputs and their associated visualizations can be better interpreted.
R1: Reliability > Quality
As with all simulations, verification with real-world data after the building has been completed can allow architects to build a dataset that could be used to train better surrogate models as well as validate the benefit of incorporating CFD in building planning and early prototype design.
S2: Sufficiency > Coverage
CFD simulations assume static rather than dynamic conditions for doors, windows, and vents in a building layout. In reality, these objects can significantly impact room ventilation as well as thermal conditions which can in turn affect operational consumption assumptions. Furthermore, vernacular architectural techniques such as use of vegetation to protect against winds, thermal chimneys, courtyards, stilted building design, and rooftop wind catchers (badgirs), are not considered in simulation frameworks which tend to focus on walls and external wind directions. Creation or expansion of current simulation frameworks to include scenarios created by passive building strategies can be beneficial in designing buildings that have less dependence on HVAC systems for thermal comfort.
Copernicus Marine Data Store
Details (click to expand)
https://data.marine.copernicus.eu/products Free-of-charge state-of-the-art data on the state of the Blue (physical), White (sea ice) and Green (biogeochemical) ocean, on a global and regional scale.
Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
An API is needed to download the data, but many ecologists are not familiar with scripting languages.
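For users comfortable running a short script, the download can be reduced to a single call. A minimal sketch using the Copernicus Marine Toolbox Python client follows; the dataset ID, variable, and bounding box are illustrative, parameter names should be checked against the current toolbox documentation, and a free Copernicus Marine account is required.

import copernicusmarine  # pip install copernicusmarine

copernicusmarine.subset(
    dataset_id="cmems_mod_glo_phy_my_0.083deg_P1D-m",  # illustrative ID
    variables=["thetao"],                # sea water potential temperature
    minimum_longitude=-10, maximum_longitude=5,
    minimum_latitude=45, maximum_latitude=55,
    start_datetime="2020-01-01", end_datetime="2020-01-31",
    output_filename="glo_phy_jan2020.nc",
)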
M: Misc/Other
It would be ideal if Copernicus also made biodiversity data available on its website. Having access to both biodiversity data and associated environmental ocean data on the same platform would significantly enhance efficiency and accessibility. This integration would eliminate the need to download massive datasets for local analysis, streamlining the process for users.
DOE Atmospheric Radiation Measurement (ARM) research facility data products
Details (click to expand)
ARM represents data from various field measurement programs sponsored by the US Department of Energy with a focus on ground-based pyrheliometer and spectrometer data which is useful for solar radiation time series forecasting and solar potential assessment.
The ARM dataset includes data from various DOE sites, with sensor information from sun-tracking photometers, radiometers, and spectrometers, which is helpful for understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets that can be challenging to store, stream, analyze, and archive; may be sensitive to sensor noise; and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. Automating ingestion and analysis of data using artificial intelligence can alleviate volume by compressing/reducing data storage and provide novel ways to index and access the data.
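Even without AI in the loop, lossless compression at write time is a cheap first step toward reducing archive volume. A minimal sketch with xarray, where the filename is a hypothetical placeholder:

import xarray as xr

ds = xr.open_dataset("arm_irradiance_day.nc")  # hypothetical placeholder
# Apply lossless zlib compression to every data variable on write-out.
encoding = {var: {"zlib": True, "complevel": 4} for var in ds.data_vars}
ds.to_netcdf("arm_irradiance_day_compressed.nc", encoding=encoding)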
R1: Reliability > Quality
Data quality from ARM site sensors can be sensitive to noise and calibration issues requiring field specialists to identify potential problems. Since data volume is large, ingestion of data and identification of measurement drift benefit from automation.
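A simple automated screen for drift or calibration steps is to compare each reading against a rolling robust baseline and flag large deviations for field review. A minimal sketch on a synthetic irradiance series follows; the two-hour window and the threshold of four robust z-scores are illustrative choices, not ARM standards.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = pd.date_range("2024-06-01", periods=1440, freq="min")
irr = pd.Series(500 + 20 * rng.standard_normal(1440), index=t)
irr.iloc[900:] += 80  # inject an artificial calibration step

rolling_med = irr.rolling("2h").median()
rolling_mad = (irr - rolling_med).abs().rolling("2h").median()
z = (irr - rolling_med) / (1.4826 * rolling_mad)  # MAD scaled to ~sigma

suspect = irr[z.abs() > 4]
print(f"{len(suspect)} readings flagged for field review")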
S2: Sufficiency > Coverage
Spatial coverage of radiation and associated ground-based atmospheric phenomena is limited to ARM sites within the United States. To increase spatial context, collaboration with partner sensor networks within the DOE and the ARM program can expand coverage within the United States. Similar initiatives outside the United States can enable better solar potential studies in regions with different environments.
S3: Sufficiency > Granularity
There is a need for enhanced aerosol composition measurements, in addition to ice-nucleating particle measurements, to better understand cloud and weather dynamics jointly with solar irradiance for DER site planning and solar potential surveying.
DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains)
Details (click to expand)
Intercomparison of global storm-resolving (5km or less) model simulations; used as the target of the emulator. Data can be found here.
Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge in emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g. turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
Direct measurement of methane emission of rice paddies
Details (click to expand)
Direct measurement of methane emissions from rice paddies via instruments and sampling systems placed in the fields to measure methane concentrations in the air above the fields or in the soil.
There is a lack of direct observation of methane emissions from rice paddies.
Data Gap Type
Data Gap Details
W: Wish
Direct measurement of methane emissions is often expensive and labor-intensive. But this data is essential as it provides the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Distribution system simulators
Details (click to expand)
Distribution system simulators such as OpenDSS and GridLab-D are crucial for understanding the hosting capacity of distribution level substation feeders because they allow for the analysis of various factors that can affect the stability and reliability of the power grid. These factors include voltage limits, thermal capability, control parameters, and fault current, among others. By simulating different scenarios and conditions, such as the integration of distributed energy resources (DERs) such as photovoltaic (PV) solar panels, these tools can provide insights into how the grid can be optimized to accommodate these resources without compromising safety and reliability. OpenDSS is free to use as an alternative when distribution utility real circuit feeder data is unavailable.
While OpenDSS and GridLab-D are free to use as an alternative when real distribution substation circuit feeder data is unavailable, to perform site-specific or scenario studies, data from different sources may be needed to verify simulation results. Actual hosting capacity may vary from simulation results due to differences in load, environmental conditions, and the level of DER penetration.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
To perform a realistic distribution system-level study for a particular region of interest, data concerning topology, loads, and penetration of DERs needs to be aggregated and collated from external sources.
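Once a feeder description has been assembled from these sources, the study itself can be scripted. A minimal sketch using the opendssdirect Python bindings for OpenDSS follows; "master.dss" stands in for a real feeder model aggregated from utility or DSO data.

import opendssdirect as dss  # pip install opendssdirect.py

dss.Text.Command("Redirect master.dss")  # load feeder topology and loads
dss.Solution.Solve()                     # run the power flow
voltages = dss.Circuit.AllBusMagPu()     # per-unit voltage magnitudes
print(min(voltages), max(voltages))      # quick check against voltage limits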
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities and/or the Distribution System Operator (DSO) to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
OpenDSS and GridLab-D studies require real deployment data for verification of results from substations. Additionally, distribution level substation feeder hosting capacity may vary based on load, environmental conditions, and the level of DER penetration in a service area.
Drone images
Details (click to expand)
Drone images have revolutionized various fields by providing high-resolution, aerial perspectives that were previously difficult to obtain. Equipped with advanced cameras and sensors, drones capture detailed visual data from above, offering insights into landscapes, infrastructure, and environmental changes.
Thermal images captured by drones are highly valuable, but high-resolution thermal sensors are costly.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Thermal images are highly valuable, but their resolution is often too low (commonly 120x90 pixels) and their field of view is limited. Commercially available sensors can achieve 640x480 pixels but are much more expensive (~$10K). Even higher-resolution sensors exist but are currently restricted to military use due to security, ethical, and privacy concerns. Those seeking such high-resolution sensors should carefully weigh the benefits and drawbacks of their request.
U6: Usability > Large Volume
Data volume is a concern for those collecting drone images and seeking to share them with the public. Finding a platform that offers adequate storage for hosting the data is challenging, as it must ensure that users can download the data efficiently without issues.
Drone images for biodiversity
Details (click to expand)
Like camera traps, drone images can offer high-resolution and relatively close-range images for species identification, individual identification, and environment reconstruction. As with camera traps, most drone images are scattered across disparate sources. Some such data is hosted on www.lila.science.
There is a lack of standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes, including clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze the data consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes, including clearly written guidance on what variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze the data consistently across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge that applies to almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also fail to cover a diverse spectrum of species, a problem intertwined with gaps in taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
ENS
Details (click to expand)
Ensemble forecast up to 15 days ahead, generated by ECMWF numerical weather prediction model; used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
As with HRES, the biggest challenge with ENS is that only a portion of it is freely available to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only and are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
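For tasks covered by WeatherBench 2, the data can be streamed from cloud storage rather than downloaded in bulk. A minimal sketch with xarray (plus gcsfs for gs:// access) follows; the store path is a placeholder to be filled in from the WeatherBench 2 documentation.

import xarray as xr  # requires gcsfs installed for gs:// access

# Placeholder path: consult the WeatherBench 2 docs for current store URLs.
store = "gs://weatherbench2/datasets/<dataset>/<resolution>.zarr"
ds = xr.open_zarr(store)  # lazy: only metadata is fetched here

# Pull a single field instead of transferring the whole archive.
subset = ds["2m_temperature"].sel(time="2020-01-01")
print(subset)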
EPRI10: Transmission control center alarm and operational data set
Details (click to expand)
Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format which includes semi-structured text descriptions of individual alarm events. Often the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.
Access to EPRI10 grid alarm data is currently limited within EPRI. Data gaps with respect to usability are the result of redundancies in grid alarm codes, requiring significant preprocessing and analysis of code ids, alarm priority, location, and timestamps. Alarm codes can vary based on sensor, asset and line. Resulting actions as a consequence of alarm trigger events need field verifications to assess the presence of fault or non-fault events.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data access is limited within EPRI due to restrictions with respect to data provided by utilities. Anonymization and aggregation of data to a benchmark or toy dataset by EPRI to the wider community can be a means of circumventing the security issues at the cost of operational context.
U1: Usability > Structure
Grid alarm codes may be non-unique for different lines and grid assets. In other words, two different codes could represent equivalent information due to differences in naming conventions requiring significant alarm data pre-processing and analysis in identifying unique labels from over 2000 code words. Additional labels expressing alarm priority, for example high alarm type indicative of events such as fire, gas, or lightning, are also encoded into the grid alarm trigger event code. Creation of a standard structure for operational text data such as those already utilized in operational systems by companies such as General Electric or Siemens can avoid inconsistencies in data.
U3: Usability > Usage Rights
Usage rights are currently constrained to those working within EPRI at this time.
U4: Usability > Documentation
Remote signaling identification information from monitoring sensors and devices encode data with respect to the alarm trigger event in the context of fault priority. Based on the asset, line, or sensor, this identification code can vary depending on naming conventions used. Documentation on remote signal ids associated with a dictionary of finite alarm code types can facilitate pre-processing of alarm data and assessment on the diversity of fault events occurring in real-time systems (as different alarm trigger codes may correspond to redundant events similar in nature).
U5: Usability > Pre-processing
In addition to challenges with respect to the decoding of remote signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of text detail provided. Typically the details cover information with respect to the grid asset and its action. For example, a text description from a line monitoring device may describe the power, temperature, and the action taken in response to the grid alarm trigger event. Often in real world systems the majority of grid alarm trigger events are short circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To combat these issues, data pre-processing becomes necessary. For remote signal identification data this includes parsing and hashing through text codes, assessing code components for redundancies, and building an associated reduced dictionary of alarm codes. For textual description fields, and post-fault field reports, the use of natural language processing techniques to extract key information can provide more consistency between sensor data. Additionally, techniques like diverse sampling can account for the class imbalance with respect to the associated fault that can trigger the alarm.
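A sketch of what such pre-processing can look like in practice is shown below: naming-convention variants of the same alarm code are collapsed into a reduced dictionary, and a priority keyword is pulled out of the free-text description. All column names and code formats are hypothetical, chosen only to illustrate the steps described above.

import re
import pandas as pd

# Hypothetical alarm records with inconsistent code naming conventions.
alarms = pd.DataFrame({
    "code": ["LN-04_FIRE-HI", "ln04/FIRE_HI", "XFMR-12_TEMP-LO"],
    "description": ["Fire alarm high, breaker opened",
                    "FIRE high alarm breaker opened",
                    "Transformer oil temp low"],
})

def normalize(code):
    """Strip separators and case so naming-convention variants hash together."""
    return re.sub(r"[^A-Z0-9]", "", code.upper())

alarms["code_key"] = alarms["code"].map(normalize)
reduced_dictionary = sorted(alarms["code_key"].unique())  # 2 keys, not 3

# Crude priority extraction from the unstructured description text.
alarms["priority"] = alarms["description"].str.extract(
    r"\b(high|low)\b", flags=re.IGNORECASE)[0].str.upper()
print(reduced_dictionary)
print(alarms[["code_key", "priority"]])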
U6: Usability > Large Volume
Operational alarm data volume is large given the cadence of measurements made in the system at every millisecond. This can result in high data volume that is tabular in nature, but also unstructured with respect to text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains locations and grid asset information, spatio-temporal analysis can be made with respect to a single sensor and the conditions over which that sensor is operating. Therefore indexing and mining time series data can be an approach for facilitating faster search over alarm data leading up to a fault event. Additionally natural language processing and text mining techniques can also be utilized to facilitate search over alarm text and details.
R1: Reliability > Quality
Alarm trigger events and the corresponding action taken by the events, require post assessment by field workers especially in cases of faults or perceived faults for verification.
U2: Usability > Aggregation
Reports on location, asset, and time can result in false alarm triggers requiring operators to send field workers to investigate, fix, and recalibrate field sensors. The data with respect to field assessments can be incorporated into the original data to provide greater context resulting in compilation of multimodal datasets which can enhance alarm data understanding.
ERA5
Details (click to expand)
Atmospheric reanalysis data integrates both in-situ and remote sensing observations, including data from weather stations, satellites, and radar. This comprehensive dataset can be downloaded from the provided link.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge to resolve is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
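Requests are scripted through the cdsapi client, and keeping each request small and targeted (one variable, one month) generally moves through the queue faster than bulk retrievals. A minimal sketch, assuming a Climate Data Store account and a configured ~/.cdsapirc key; request field names should be checked against the current CDS API schema.

import cdsapi  # pip install cdsapi

c = cdsapi.Client()
c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        "variable": "2m_temperature",
        "year": "2020",
        "month": "01",
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
        "format": "netcdf",
    },
    "era5_t2m_jan2020.nc",
)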
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
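One such option is the analysis-ready, cloud-optimized (ARCO) ERA5 mirror on Google Cloud, which can be read lazily with xarray so that only the slices actually needed are transferred. The store path below is quoted from the ARCO-ERA5 project and may change; check the project page for the current URL.

import xarray as xr  # requires gcsfs installed for gs:// access

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
)
# Select one global field; only this slice is fetched from the cloud.
t2m = ds["2m_temperature"].sel(time="2020-01-01T00:00")
print(float(t2m.mean()))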
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.
ESRI land cover map
Details (click to expand)
Sentinel-2 10-m annual map of Earth’s land surface from 2017-2023.
There are also other land cover maps available: https://gisgeography.com/free-global-land-cover-land-use-data/.
Use Case
Data Gap Summary
Emission dataset compiled from FAO statistics
Details (click to expand)
Dataset taken from FAO statistics and extrapolated spatially
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Exposure data
Details (click to expand)
Exposure is defined as the representative value of assets potentially exposed to a natural hazard occurrence. It can be described by a wide range of features, such as GDP, population, buildings, or agriculture, depending on the hazard in question.
There are global open datasets as well as proprietary data with more detailed information from well-established insurance markets.
Exposure data can be socio-economic or structural (building occupancy and construction class). Two open sources of structural data are OpenStreetMap and the OpenQuake GEM project.
Country-specific exposure data can range from extensive and detailed to almost completely unavailable, even if they exist as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
For some data, e.g. population data, several datasets are available and they differ substantially from one another. Validation is needed before the data can be used comfortably and confidently.
Some data, e.g. geospatial socioeconomic data provided by the UNEP Global Resource Information Database, are not always current or complete.
S3: Sufficiency > Granularity
For open global data, the resolution and completeness are usually insufficient for the desired purposes; e.g., GDP data from the World Bank or the US CIA is not detailed enough for assessing risks from natural hazards.
Faraday: Synthetic smart meter data
Details (click to expand)
Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open up smart meter data, Octopus Energy’s Centre for Net Zero has generated a synthetic dataset, conditioned on the presence of low-carbon technologies, energy efficiency, and property type, from a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier.
Faraday synthetic AMI data is a response to the consumer-privacy bottlenecks faced in retrieving building-level demand data. However, despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and the presence of low-carbon technology. Furthermore, since the model is trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, because the data is synthetically generated, studies will require validation and verification against real data or substation-level aggregates to assess its effectiveness. Faraday is currently accessible through the Centre for Net Zero’s API.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The variational autoencoder model can generate synthetic AMI data conditioned on several inputs. The presence of low-carbon technology (LCT) for a given household or property type depends on access to battery storage, rooftop solar panels, and electric vehicles; this type of data may require curating LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric based on actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid-modeling manner (a minimal aggregation sketch follows this table). For example, load demand at the substation level can be estimated as the sum of the individual building loads the substation serves; this estimate can then be compared to actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, assessing the accuracy of a specific demand profile for a property or set of properties would require identifying a population of buildings, a connected real-world substation, and the residential low-carbon technology investment for the properties under study.
S2: Sufficiency > Coverage
Faraday is trained on utility-provided AMI data from the UK, which may not be representative of the load demand, building types, and climate zones of other global regions. To generate similar synthetic data elsewhere, custom data may be retrieved through a pilot test bed for private collection or through a partnership with a local utility. Additionally, pre-existing AMI data over an area of interest can be used to generate similar synthetic data.
Datasets are restricted to past pilot-study coverage areas, requiring further data collection to fine-tune models for a different coverage area.
S3: Sufficiency > Granularity
Data granularity is limited to the granularity of the data the model was trained on. Generative modeling approaches similar to Faraday can be built using higher-resolution data, or interpolation methods could be employed.
S4: Sufficiency > Timeliness
Keeping the dataset current would require continuous integration and deployment of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy’s OpenSynth initiative, the Centre for Net Zero hopes to build a global community of contributors to facilitate research.
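The bottom-up verification described under R1 above can be expressed in a few lines of pandas. A minimal sketch; all file and column names are illustrative placeholders for synthetic per-building loads, a building-to-substation mapping, and metered substation loads.

```python
# Sketch: sum synthetic per-building loads over the buildings served by each
# substation and compare against the metered feeder load. Assumes metered
# loads are nonzero; file and column names are placeholders.
import pandas as pd

synthetic = pd.read_csv("synthetic_ami.csv", parse_dates=["timestamp"])
# columns: timestamp, building_id, kwh
mapping = pd.read_csv("building_to_substation.csv")  # building_id, substation_id
metered = pd.read_csv("substation_load.csv", parse_dates=["timestamp"])
# columns: timestamp, substation_id, kwh

bottom_up = (
    synthetic.merge(mapping, on="building_id")
    .groupby(["substation_id", "timestamp"])["kwh"]
    .sum()
    .rename("kwh_synthetic")
    .reset_index()
)
joined = bottom_up.merge(metered, on=["substation_id", "timestamp"])
mape = ((joined["kwh_synthetic"] - joined["kwh"]).abs() / joined["kwh"]).mean()
print(f"MAPE between aggregated synthetic and metered load: {mape:.1%}")
```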
FathomNet
Details (click to expand)
FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. It can be used to train, test, and validate state-of-the-art artificial intelligence algorithms to help us understand our ocean and its inhabitants.
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Data Gap Type
Data Gap Details
M: Misc/Other
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Financial loss datasets related to the impacts of disasters
Details (click to expand)
Financial loss datasets related to disasters track the economic impacts of catastrophic events, including insurance claims and damages to infrastructure. They help assess financial repercussions and guide risk management and preparedness strategies.
The financial loss data is usually proprietary and not open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Data tends to be proprietary, as the most consistent loss data is produced by the insurance industry.
O2: Obtainability > Accessibility
Even for a single event, collecting a robust set of homogeneous loss data poses a significant challenge.
U4: Usability > Documentation
With existing data, determining whether the data is complete can be a challenge, as little or no metadata is commonly associated with loss data.
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes.
Floating INfrastructure for Ocean observations FINO3
Details (click to expand)
FINO3 is an offshore, mast-based research platform located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Its datasets include time series of wind speed and wind direction as well as temperature, air pressure, relative humidity, global radiation, and precipitation. Images taken from the platform provide direct snapshots of environmental conditions. Wind measurements are taken between 32 and 102 meters above sea level, with wind speed measured every 10 meters. Data has been collected from August 2009 to the present.
Due to its exposed location, FINO platform sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather amenable to human intervention; the resulting degradation in data quality can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires registering for a login account at: https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to measurement sensor failures. Issues with data loggers, power supplies, and adverse conditions such as low aerosol concentrations can influence data quality. High wind and wave conditions impact the ability to correct or recalibrate sensors, creating data gaps that can last for several weeks or a season.
S2: Sufficiency > Coverage
Coverage of wind farms is relegated to the dimensions of the platform itself and the wind farm that it is built in proximity to. For locations with different offshore characteristics similar testbed platforms or buoys can be developed.
S5: Sufficiency > Proxy
Because the sensors are exposed to ocean conditions and storms, FINO sensors often need maintenance and repair but are difficult to access physically. Gaps in the record can be addressed by substituting mesoscale wind-model output, as sketched below.
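A minimal sketch of that proxy approach: fill sensor-outage gaps in a wind-speed series with mesoscale model output, after a simple linear bias adjustment fitted where both records overlap. File and column names are illustrative placeholders.

```python
# Sketch: gap-fill an observed wind-speed series with bias-adjusted
# mesoscale model output. Placeholder files; columns ws_obs and ws_model.
import numpy as np
import pandas as pd

obs = pd.read_csv("fino3_wind.csv", parse_dates=["time"], index_col="time")
model = pd.read_csv("mesoscale_wind.csv", parse_dates=["time"], index_col="time")
df = obs[["ws_obs"]].join(model[["ws_model"]], how="outer")

# Fit a least-squares linear adjustment of the model toward the sensor record
# over periods where both are available.
overlap = df.dropna()
slope, intercept = np.polyfit(overlap["ws_model"], overlap["ws_obs"], deg=1)

# Use observations where present; otherwise substitute the adjusted model.
df["ws_filled"] = df["ws_obs"].fillna(slope * df["ws_model"] + intercept)
```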
GBIF
Details (click to expand)
GBIF—the Global Biodiversity Information Facility—is an international network and data infrastructure funded by the world’s governments. It offers open access to global biodiversity data. It sets common standards for sharing species records collected from various sources, like museum specimens and modern technologies. Using standards like Darwin Core, GBIF.org indexes millions of species records, accessible under open licenses, supporting scientific research and policy-making.
While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Though GBIF provides a common standard, species classifications in the data are not always accurate or consistent, and the same species may even be assigned to different groups over time. Records can be harmonized by resolving names against the GBIF taxonomic backbone, as sketched below.
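A minimal sketch of that harmonization step using the pygbif client: resolve each raw name against the GBIF backbone so that synonyms and reclassified taxa collapse onto one accepted usage key. The example names are illustrative.

```python
# Sketch: normalize inconsistent species labels via the GBIF taxonomic
# backbone. Requires `pip install pygbif`; performs live API calls.
from pygbif import species

raw_names = ["Puma concolor", "Felis concolor"]  # synonym pair, for example
resolved = {}
for name in raw_names:
    match = species.name_backbone(name=name, strict=False)
    # Synonyms carry an acceptedUsageKey; fall back to usageKey otherwise.
    key = match.get("acceptedUsageKey", match.get("usageKey"))
    resolved[name] = (key, match.get("scientificName"), match.get("matchType"))

for name, (key, sci, how) in resolved.items():
    print(f"{name!r} -> key={key}, matched={sci!r}, matchType={how}")
```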
GEDI lidar
Details (click to expand)
The Global Ecosystem Dynamics Investigation (GEDI) is a joint mission between NASA and the University of Maryland. It uses three lasers to capture and then construct detailed three-dimensional (3D) maps of forest canopy height and the distribution of branches and leaves. By accurately measuring forests in 3D, GEDI data play an important role in estimating forest canopy height and vertical structure, and thus in understanding how much biomass and carbon forests store and how much they lose when disturbed.
GEDI is globally available but has some intricacies, e.g. geolocation errors and weak return signals over dense forest, which introduce uncertainty and error into canopy height estimates.
Grid event signature library
Details (click to expand)
Grid2Op is a power systems simulation framework for applying reinforcement learning to electricity network operation, focusing on the use of topology to control flows on the grid. Grid2Op lets users control voltages by manipulating shunts or changing generator setpoints, influence active generation through redispatching, and operate storage units such as batteries or pumped storage to inject or absorb energy as needed. The grid is represented as a graph whose nodes are buses and whose edges are power lines and transformers. Grid2Op ships with several environments offering different network topologies, along with different variables that can be monitored as observations.
The environment is designed for reinforcement learning agents to act upon, with a variety of actions, some binary and some continuous: changing bus assignments, changing line status, setting storage, curtailment, redispatching, setting bus values, and setting line status. Multiple reward functions are also available for experimenting with different agents. Note that Grid2Op has no internal model of the grid equations and does not prescribe a solver. Data on how the power grid evolves is represented by the “Chronics”; the solver that computes the state of the grid is represented by the “Backend,” which uses PandaPower to compute power flows.
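A minimal sketch of the loop just described: build an environment and step a “do nothing” agent through one episode (chronic). The environment name below is one of the publicly distributed sandboxes; substitute whichever scenario you actually use.

```python
# Sketch: minimal Grid2Op episode with a do-nothing agent.
# Requires `pip install grid2op`; the environment downloads on first use.
import grid2op

env = grid2op.make("l2rpn_case14_sandbox")  # IEEE 14-bus based sandbox
obs = env.reset()                           # first observation of the chronic
done, total_reward, steps = False, 0.0, 0
while not done:
    action = env.action_space({})           # empty dict -> "do nothing"
    obs, reward, done, info = env.step(action)
    total_reward += reward
    steps += 1
print(f"Survived {steps} steps (5-minute increments), reward {total_reward:.1f}")
```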
Grid2Op is a reinforcement learning framework that builds an environment from a topology, selected grid observations, a selected reward function, and a set of actions for an agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps representing 5 minutes cannot capture complex transients and can limit the effectiveness of certain actions within the action space relative to others. Furthermore, customizing Grid2Op can be challenging: the platform does not support single- to multi-agent conversion, and it is not a suitable environment for cascading failure scenarios due to its game-over rules.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
In the customization of the reward function, there are several TODOs concerning the units and attributes of the redispatching-related reward. Documentation and code comments can sometimes provide conflicting information. The modularity of reward, adversary, action, environment, and backend is non-intuitive, requiring pregenerated dictionaries rather than dynamic inputs or conversion from single-agent to multi-agent functionality. Refactoring the documentation and comments to reflect updates would assist users and avoid the need to cross-reference information from the “Learning to Run a Power Network” Discord channel and GitHub issues.
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading failure scenarios and more complex adversaries such as natural disasters. Codebase variations between versions, especially between the native and Gym-formatted frameworks, have lost features present in the legacy version, including topology graphics. Open-source refactoring efforts could help update the code base so the latest and previous versions run without loss of features.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the Grid2Op environment, which are based on different IEEE bus topologies. While customization of the “Backend,” “Parameters,” and “Rules” is possible, dependent modules may still enforce game-over rules. Furthermore, since backend modeling is not the focus of Grid2Op, verifying that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observed grid time series, or chronics. Furthermore, this granularity may limit the effectiveness of specific actions in the provided action space. For example, using energy storage devices during overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage. Expanding the framework, with efforts from the open-source community, to include multiple time resolutions may allow the tool to generalize to different forecasting time horizons and support action evaluation.
R1: Reliability > Quality
The Grid2Op framework relies on mathematically robust control laws and rewards that train the RL agent on fixed observation assumptions rather than actual system dynamics, which are subject to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal model of the grid equations, nor does it suggest which solver should be adopted for traditional nonlinear optimal power flow; modeling specifics and solver preferences require users to customize or create a new “Backend.” Additionally, such human-in-the-loop RL systems require trustworthiness and risk quantification in practice. A library of open-source contributed “Backends” from independent projects, with supplemental documentation and paper references, could support further development of the environment for different conditions. Human-in-the-loop studies could test environment scenarios and the system’s control response over a model of a real grid; generated observations and control actions could then be compared with historical event sequences and grid operator responses.
Ground survey of building information
Details (click to expand)
On-site collection of data to accurately map and measure the physical dimensions and boundaries of buildings. This survey is typically conducted using a variety of methods and tools to ensure precise and detailed mapping.
Use Case
Data Gap Summary
Ground survey of land use and land management
Details (click to expand)
The direct collection of data through field observations to understand how land is utilized and managed.
Data access is restricted due to institutional barriers and other restrictions.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
Ground-survey based forest inventory data
Details (click to expand)
Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models.
The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
The data contains many missing values and duplicate records (a minimal cleaning pass is sketched after this table).
S2: Sufficiency > Coverage
Since the data is collected manually, collection is hard to scale and coverage is limited to certain regions.
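A minimal sketch of a first cleaning pass for manually recorded plot inventories: drop exact duplicates, deduplicate repeated plot/date measurements, and flag physically implausible values. Column and file names are illustrative placeholders.

```python
# Sketch: basic pre-processing for a manually collected forest inventory.
import pandas as pd

plots = pd.read_csv("forest_inventory.csv", parse_dates=["survey_date"])
plots = plots.drop_duplicates()  # exact duplicate records
plots = plots.drop_duplicates(subset=["plot_id", "survey_date", "tree_id"])
plots = plots.dropna(subset=["height_m", "dbh_cm"])   # required measurements
plots = plots[plots["height_m"].between(0.1, 120)]    # implausible heights
```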
HRES
Details (click to expand)
A single high-resolution forecast up to 10 days ahead generated by ECMWF’s numerical weather prediction model, the Integrated Forecasting System (IFS). It is commonly used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
Hazard data
Details (click to expand)
Hazard data used for risk assessments is usually presented as a catalog of hypothetical events with characteristics derived from, and statistically consistent with, the observational record. Hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse, as well as in the Risk Data Library of the World Bank.
The resolution of current hazard data is not sufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Climate hazard data (e.g., floods, tropical cyclones, droughts) is often too coarse for effective physical risk assessments, which focus on evaluating damage to infrastructure such as buildings and power grids. While exposure data, including information on buildings and power grids, is available at resolutions ranging from 25 meters to 250 meters, climate hazard projections, especially those extending beyond a year, are typically at resolutions of 25 kilometers or more.
To provide meaningful risk assessments, more granular data is required. This necessitates downscaling efforts, both dynamical and statistical, to refine the resolution of climate hazard data; machine learning (ML) can play a valuable role in these downscaling processes (a minimal statistical baseline is sketched below). Additionally, the downscaled data should be made publicly available, and a dedicated portal should be established to facilitate access and sharing of this refined information.
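As a minimal illustration of statistical downscaling, the sketch below trains a regressor to correct a coarse hazard field, interpolated to the fine grid, using a static high-resolution covariate such as elevation. The arrays are synthetic placeholders standing in for real hazard and observation grids.

```python
# Sketch: simple statistical-downscaling baseline with scikit-learn.
# In practice the fitted model would be applied to future coarse projections.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 20_000  # fine-grid cells with paired coarse field and observed target
coarse_interp = rng.gamma(2.0, 10.0, n)      # coarse hazard field, interpolated
elevation = rng.uniform(0, 3000, n)          # static high-resolution covariate
obs_fine = coarse_interp * (1 + 2e-4 * elevation) + rng.normal(0, 2, n)  # "truth"

X = np.column_stack([coarse_interp, elevation])
model = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(X, obs_fine)
downscaled = model.predict(X)
```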
R1: Reliability > Quality
Projecting future climate hazards is crucial for assessing long-term risks. Climate simulations from CMIP models are currently our primary source of future climate projections. However, these simulations carry significant uncertainties stemming from both the models themselves and the emission scenarios. To improve their utility for disaster risk assessment and other applications, increased funding and effort are needed to advance climate model development for greater accuracy. Additionally, machine learning methods can help mitigate some of these uncertainties by bias-correcting the simulations.
S6: Sufficiency > Missing Components
Seasonal climate hazard forecasts are crucial for disaster risk assessment, management, and preparation. However, high-resolution data at this scale is often lacking for many hazards. This challenge is likely due to the difficulty in generating accurate seasonal weather forecasts. ML has the potential to address this gap by improving forecast accuracy and granularity.
Health data
Details (click to expand)
Health data refers to information related to individuals’ physical and mental well-being. This can include a wide range of data, such as medical records, health surveys, healthcare utilization, and epidemiological data.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
In general, few datasets cover the full spectrum of population, age, gender, economic status, and so on. To make good use of available data, more effort should go into integrating data from disparate sources, such as creating data repositories and open community data standards.
U4: Usability > Documentation
Some data repositories are available, but the data is not always accompanied by the source code that produced it or by other good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data usually comes in raster or gridded formats, whereas health data is usually tabular. Mapping climate data onto the geospatial entities used by health data is also computationally expensive (a zonal-aggregation sketch follows).
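A minimal sketch of that mapping step: aggregate a gridded climate raster onto the administrative units used by tabular health data, producing one row per region joinable on a shared identifier. File and column names are illustrative placeholders; requires the rasterstats and geopandas packages.

```python
# Sketch: zonal aggregation of a climate raster onto health-data regions.
import geopandas as gpd
import pandas as pd
from rasterstats import zonal_stats

regions = gpd.read_file("health_regions.geojson")  # one polygon per unit
stats = zonal_stats(regions.geometry, "tmax_2020.tif", stats=["mean"])
regions["tmax_mean"] = [s["mean"] for s in stats]

health = pd.read_csv("health_outcomes.csv")        # has a region_id column
merged = health.merge(regions[["region_id", "tmax_mean"]], on="region_id")
```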
High-resolution weather forecast (HRRR)
Details (click to expand)
The High-Resolution Rapid Refresh (HRRR) is NOAA’s real-time, 3-km, hourly updated, convection-allowing numerical weather prediction model. Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
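One way to reduce the processing burden locally is to decode only the fields needed from each GRIB2 file rather than the whole multi-hypercube. A minimal sketch using xarray’s cfgrib engine; the filename is a placeholder and the filter keys and variable name follow common HRRR usage, so treat them as assumptions.

```python
# Sketch: lazily open one HRRR GRIB2 file and extract a single 2-m field.
# Requires xarray and cfgrib; filename is a placeholder.
import xarray as xr

ds = xr.open_dataset(
    "hrrr.t18z.wrfsfcf00.grib2",  # placeholder local file
    engine="cfgrib",
    backend_kwargs={
        "filter_by_keys": {"typeOfLevel": "heightAboveGround", "level": 2}
    },
)
t2m = ds["t2m"]  # 2-m temperature on the 3-km CONUS grid
```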
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Historical climate observations
Details (click to expand)
Climate observations of the past. Reanalysis datasets like ERA5 provide global coverage at coarse resolution, while climate data aggregated from local weather station observations offers a more granular view.
Processing climate data and integrating it with health data is a major challenge.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform cataloging all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data usually comes in raster or gridded formats, whereas health data is usually tabular. Mapping climate data onto the geospatial entities used by health data is also computationally expensive.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data capturing the spatial heterogeneity of climate, hydrology, soil, and other variables important for biodiversity patterns, because observation systems are not dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, an effort that cannot be carried out by any single country.
LBNL: Solar panel PV system dataset
Details (click to expand)
Lawrence Berkeley National Lab (LBNL) Solar Panel PV System Dataset is a small tabular dataset that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery rated capacity. The LBNL solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes.
The LBNL solar panel PV system dataset excludes third-party-owned systems and systems with battery backup. Since the data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap: it includes historical records that may not reflect current PV system pricing and costs.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excludes third-party-owned systems, systems with battery backup, self-installed systems, and records missing installation prices. Data was self-reported and may be inconsistent in how component costs were reported. Furthermore, some state markets were underrepresented or missing, which can be alleviated by new data collection or by using the dataset jointly with simulation studies.
S4: Sufficiency > Timeliness
The dataset includes historical records that may not reflect current PV system pricing. To alleviate this, updated pricing may be incorporated from external data or as additional synthetic data from simulation.
Large-eddy simulations
Details (click to expand)
Very high resolution (finer than 150 m) atmospheric simulations where atmospheric turbulence is explicitly resolved in the model.
Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
The resolution of current high-resolution simulations is still insufficient for resolving many physical processes, such as turbulence. To address this, extremely high-resolution simulations, like large-eddy simulations (with sub-kilometer or even tens of meter resolution), are needed. By explicitly resolving those turbulent processes, these simulations represent a more realistic realization of the atmosphere and therefore theoretically give better model results. These simulations may serve as ground truth for training machine learning models and offer a more accurate basis for understanding and predicting climate phenomena. Long-term climate simulations at this ultra-high resolution would significantly enhance both hybrid climate modeling and climate emulation, providing deeper insights into global warming scenarios.
Given the high computational cost of running such simulations, creating and sharing benchmark datasets based on these simulations is essential for the research community. This would facilitate model development and validation, promoting more accurate and efficient climate studies.