Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.
In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.
This project is currently in its beta phase, with ongoing improvements to content and usability. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.
Analysis of grid reliability events
Power grid control centers receive multiple high-volume streams of semi-structured data, covering alarms, sensors, and field reports. ML can assist in interpreting these data to better understand the sequence of events leading up to an incident, and to identify and detect the causes behind system disturbances affecting grid reliability.
Access to EPRI grid alarm data is currently limited to within EPRI. Usability gaps stem from redundancies in grid alarm codes, which require significant preprocessing and analysis of code IDs, alarm priority, location, and timestamps. Alarm codes can vary by sensor, asset, and line. Actions taken in response to alarm trigger events require field verification to assess whether a fault or non-fault event occurred.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data access is limited within EPRI due to restrictions with respect to data provided by utilities.
Anonymizing and aggregating the data into a benchmark or toy dataset that EPRI releases to the wider community could be a means of circumventing these security issues, at the cost of operational context.
U1: Usability > Structure
Grid alarm codes may be non-unique across different lines and grid assets: two different codes can represent equivalent information due to differences in naming conventions, requiring significant pre-processing and analysis to identify unique labels from over 2,000 code words. Additional labels expressing alarm priority (for example, a high alarm type indicating events such as fire, gas, or lightning) are also encoded into the grid alarm trigger event code.
Creating a standard structure for operational text data, such as those already used in operational systems by companies like General Electric or Siemens, could avoid inconsistencies in the data.
U3: Usability > Usage Rights
Usage rights are currently constrained to those working within EPRI.
U4: Usability > Documentation
Remote signaling identification information from monitoring sensors and devices encodes data about the alarm trigger event and its fault priority. This identification code can vary by asset, line, or sensor, depending on the naming conventions used. Documentation linking remote signal IDs to a dictionary of finite alarm code types would facilitate pre-processing of alarm data and assessment of the diversity of fault events occurring in real-time systems (as different alarm trigger codes may correspond to redundant events similar in nature).
U5: Usability > Pre-processing
In addition to challenges with decoding remote signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of text detail provided. Typically the details cover the grid asset and its action; for example, a text description from a line monitoring device may describe the power, temperature, and the action taken in response to the grid alarm trigger event. In real-world systems, the majority of grid alarm trigger events are short-circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To combat these issues, data pre-processing becomes necessary. For remote signal identification data, this includes parsing the text codes, assessing code components for redundancies, and building a reduced dictionary of alarm codes. For textual description fields and post-fault field reports, natural language processing techniques can extract key information and provide more consistency between sensor data. Additionally, techniques like diverse sampling can account for the class imbalance in the fault types that trigger alarms.
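To make the pre-processing steps above concrete, here is a minimal Python sketch of building a reduced alarm-code dictionary. The code format (`<substation>_<device>_<event>`) and all code words are invented for illustration; real grid alarm codes would require a format-specific parser.

```python
from collections import defaultdict

def build_reduced_dictionary(raw_codes):
    """Collapse redundant alarm codes into canonical event labels.

    Assumes a hypothetical code format "<substation>_<device>_<event>",
    where the same event type may appear under many substation/device
    prefixes. Returns a mapping from raw code to canonical label, plus
    the reduced dictionary of label -> contributing raw codes.
    """
    code_to_label = {}
    label_to_codes = defaultdict(list)
    for code in raw_codes:
        parts = code.strip().upper().split("_")
        event = parts[-1]  # event type is the last component
        code_to_label[code] = event
        label_to_codes[event].append(code)
    return code_to_label, dict(label_to_codes)

raw = ["subA_line1_OVERCURRENT", "subB_line7_overcurrent", "subA_tx2_GASALARM"]
mapping, reduced = build_reduced_dictionary(raw)
# The two differently named overcurrent codes collapse to one label.
```

The same pattern extends to stripping alarm-priority suffixes or location prefixes before deduplication, depending on how a given utility encodes its codes.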
U6: Usability > Large Volume
Operational alarm data volume is large, given that measurements are made in the system every millisecond. The result is high-volume data that is tabular in nature but also unstructured with respect to the text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains location and grid asset information, spatiotemporal analysis can be performed for a single sensor and the conditions under which it operates. Indexing and mining time series data can therefore facilitate faster search over the alarm data leading up to a fault event; natural language processing and text mining techniques can likewise facilitate search over alarm text and details.
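As a hypothetical illustration of time-series indexing for faster search, the sketch below keeps alarm events sorted by timestamp and uses binary search to retrieve the window of events leading up to a fault, avoiding a linear scan over the full log. The event tuples and millisecond timestamps are made up for illustration.

```python
import bisect

class AlarmIndex:
    """Time-sorted index over alarm events for fast window queries."""

    def __init__(self, events):
        # events: iterable of (timestamp_ms, alarm_code) pairs
        self.events = sorted(events)
        self.times = [t for t, _ in self.events]

    def window_before(self, fault_time_ms, window_ms):
        """Return events in [fault_time_ms - window_ms, fault_time_ms)."""
        lo = bisect.bisect_left(self.times, fault_time_ms - window_ms)
        hi = bisect.bisect_left(self.times, fault_time_ms)
        return self.events[lo:hi]

idx = AlarmIndex([(100, "A1"), (950, "B2"), (990, "C3"), (1200, "A1")])
recent = idx.window_before(1000, 100)  # events in the 100 ms before a fault
```

Production systems would use a time-series database or columnar storage for this, but the principle of querying a sorted time index rather than scanning is the same.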
R1: Reliability > Quality
Alarm trigger events, and the corresponding actions taken in response to them, require post-hoc assessment by field workers for verification, especially in cases of faults or perceived faults.
U2: Usability > Aggregation
Reports on location, asset, and time can result in false alarm triggers, requiring operators to send field workers to investigate, fix, and recalibrate field sensors. Data from these field assessments can be incorporated to provide greater context.
Assessing forest restoration outcomes
Efforts are being made to restore ecosystems like forests and mangroves. ML can be used to monitor biodiversity changes before and after restoration efforts, in order to quantify their effectiveness and outcomes.
There is a lack of a standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes, including clearly written guidance on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: there is a lack of standardized tools and workflows to turn raw data into analysis-ready form and analyze it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes, including clearly written guidance on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: there is a lack of standardized tools and workflows to turn raw data into analysis-ready form and analyze it consistently across projects.
Assessment of climate impacts on public health
Climate change has major implications for public health. ML can help analyze the relationships between climate variables and health outcomes to assess how changes in climate conditions affect public health.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
In general, there are few datasets that cover the full spectrum of population characteristics (age, gender, economic status, etc.). To make good use of available data, more effort should go into integrating data from disparate sources, for example through the creation of data repositories and open community data standards.
U4: Usability > Documentation
There are some data repositories available. However, the data is not always accompanied by the source code that generated it, or by other forms of good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data is usually in raster files or gridded format, whereas health data is usually in tabular format. Mapping climate data to the same geospatial entity of health data is also computationally expensive.
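The climate-to-health mapping described above amounts to zonal aggregation of a gridded field over the geospatial entities of the health data. A minimal NumPy sketch, with invented region IDs and grid values (real pipelines would rasterize administrative boundaries into such a mask first):

```python
import numpy as np

def zonal_mean(grid, region_mask):
    """Average a gridded climate field over each region in a mask.

    grid:        2-D array of a climate variable (e.g., temperature)
    region_mask: 2-D integer array of the same shape; each cell holds
                 the ID of the administrative region it belongs to
    Returns {region_id: mean value}, i.e., one tabular row per region,
    ready to join against region-keyed health records.
    """
    out = {}
    for rid in np.unique(region_mask):
        out[int(rid)] = float(grid[region_mask == rid].mean())
    return out

temp = np.array([[10.0, 12.0], [20.0, 30.0]])
mask = np.array([[1, 1], [2, 2]])
table = zonal_mean(temp, mask)  # {1: 11.0, 2: 25.0}
```

At scale, libraries such as rasterio and geopandas handle the boundary rasterization and the expensive raster-to-polygon mapping that this toy mask stands in for.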
Processing climate data and integrating it with health data is a major challenge.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform they can turn to for all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
Automatic individual re-identification for wildlife
Individual re-identification in wildlife refers to recognizing and confirming the identity of an individual animal during subsequent encounters. It is crucial for identifying and monitoring endangered species, to better understand their needs and threats and to aid in conservation efforts. Computer vision-based ML techniques are widely used for automatic individual identification.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, a problem intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
One data gap is the incompleteness of barcoding reference databases.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
eDNA is an emerging technique in biodiversity monitoring, but many issues still impede the application of eDNA-based tools. One data gap is the incompleteness of barcoding reference databases. However, considerable attention and effort are being devoted to filling this gap, for example through the BIOSCAN project. Notably, BIOSCAN-5M is a comprehensive multi-modal dataset containing DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.
Bias-correction of climate projections
Climate projection provides essential information about future climate conditions, guiding efforts in mitigation and adaptation, such as disaster risk assessments and power grid optimization. ML enhances the accuracy of these projections by bias-correcting forecasts generated by physics-based climate models (e.g., CMIP6). ML achieves this by learning the relationship between historical climate simulations (e.g., CMIP6 data) and observed ground truth data (such as ERA5 or weather station observations).
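One classical baseline for this bias-correction task is empirical quantile mapping, which maps each model value to the observed value at the same quantile. The sketch below uses synthetic NumPy data and is a simplified illustration, not a production bias-correction method:

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_future):
    """Empirical quantile mapping bias correction.

    For each future model value, find its quantile in the historical
    model distribution, then return the observed value at that quantile.
    """
    model_sorted = np.sort(model_hist)
    obs_sorted = np.sort(obs_hist)
    # Quantile of each future value within the historical model data
    q = np.searchsorted(model_sorted, model_future) / len(model_sorted)
    q = np.clip(q, 0.0, 1.0)
    # Look up the same quantile in the observed distribution
    idx = np.clip((q * (len(obs_sorted) - 1)).astype(int),
                  0, len(obs_sorted) - 1)
    return obs_sorted[idx]

rng = np.random.default_rng(0)
model = rng.normal(2.0, 1.0, 1000)  # model runs ~2 degrees too warm
obs = rng.normal(0.0, 1.0, 1000)
corrected = quantile_map(model, obs, rng.normal(2.0, 1.0, 500))
# corrected values are pulled toward the observed distribution
```

ML-based approaches generalize this idea by learning conditional, spatially varying mappings between simulations and observations rather than a single marginal transform.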
The large uncertainties in future climate projections are a major problem with CMIP6. The large volume of data and the lack of uniform structure (inconsistent variable names, data formats, and resolutions across different CMIP6 models) also make it challenging to utilize data from multiple models effectively.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Cloud-based access is available from various providers, but often at high cost. To address these challenges, it would be beneficial to make computational resources available alongside the stored data.
U1: Usability > Structure
Data from different models comes in different resolutions and under different variable names, which makes assimilating data from multiple models challenging.
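Harmonization can be sketched as a variable-name rename map plus interpolation onto a common grid. The variable aliases below are illustrative, and the 1-D interpolation is a deliberate simplification (in practice, tools such as xarray and dedicated regridding libraries handle this in two dimensions and at scale):

```python
import numpy as np

# Hypothetical aliases used by different models for the same variables
RENAME = {
    "tas": "surface_air_temperature",
    "t2m": "surface_air_temperature",
    "pr": "precipitation",
}

def harmonize(name, lats, values, target_lats):
    """Map a model's variable name to a standard name and linearly
    interpolate its values onto a common latitude grid (1-D here,
    purely for simplicity of illustration)."""
    std_name = RENAME.get(name, name)
    regridded = np.interp(target_lats, lats, values)
    return std_name, regridded

name, vals = harmonize("t2m", [0.0, 10.0], [280.0, 290.0], [0.0, 5.0, 10.0])
# name == "surface_air_temperature"; vals interpolated onto target grid
```

The key design point is that the rename map and the target grid are fixed once, so every model's output lands in the same schema before multi-model analysis.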
R1: Reliability > Quality
There are large biases and uncertainties in the data, which can be reduced by improving the climate models used to generate the simulations.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of the data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Cloud-based access is available from various providers, but often at high cost. To address these challenges, it would be beneficial to make computational resources available alongside the stored data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to weather station data in some regions can be heavily restricted; only a small fraction of the data is open to the public.
U1: Usability > Structure
The data is not regularly gridded and requires preprocessing before being used in an ML model. In regions with dense station coverage, making decisions about how to handle overlapping data can be somewhat arbitrary. Machine learning can assist in optimizing this process.
S2: Sufficiency > Coverage
Data from the Global South is insufficient or entirely missing.
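The regridding step described above (interpolating scattered station observations onto a regular grid, as an ML model expects) can be sketched with standard scattered-data interpolation. All station coordinates and values below are synthetic:

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Synthetic "stations": irregular (lon, lat) points with temperature values.
lons = rng.uniform(0.0, 10.0, size=200)
lats = rng.uniform(40.0, 50.0, size=200)
temps = 15.0 + 0.5 * lats - 0.2 * lons + rng.normal(0, 0.3, size=200)

# Regular 0.5-degree target grid, as an ML model would expect.
grid_lon, grid_lat = np.meshgrid(np.arange(0, 10.5, 0.5),
                                 np.arange(40, 50.5, 0.5))

# Interpolate the scattered observations onto the regular grid; a
# nearest-neighbour pass fills cells outside the stations' convex hull.
gridded = griddata((lons, lats), temps, (grid_lon, grid_lat), method="linear")
fill = griddata((lons, lats), temps, (grid_lon, grid_lat), method="nearest")
gridded = np.where(np.isnan(gridded), fill, gridded)

print(gridded.shape)  # one regular 2-D field, ready for an ML model
```

In dense-coverage regions, the choice of interpolation method and how overlapping stations are weighted is exactly the somewhat arbitrary decision the text mentions; learned interpolators are one way to optimize it.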
Bias-correction of weather forecasts
Details (click to expand)
ML can be used to improve the fidelity of high-impact weather forecasts by post-processing outputs from physics-based numerical forecast models and by learning to correct the systematic biases associated with physics-based numerical forecasting models.
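As a toy illustration of such post-processing, the sketch below fits the simplest possible corrector, a linear regression from forecast to observation, on synthetic paired data (real post-processing uses many predictors and more expressive models):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic paired data: "observations" and a forecast with a systematic
# bias (amplitude error + offset) plus random error.
obs = rng.normal(20.0, 5.0, size=1000)
fcst = 0.8 * obs + 3.0 + rng.normal(0, 1.0, size=1000)

# Fit obs ~= a * fcst + b by least squares -- a linear bias corrector.
A = np.vstack([fcst, np.ones_like(fcst)]).T
(a, b), *_ = np.linalg.lstsq(A, obs, rcond=None)

corrected = a * fcst + b
raw_rmse = np.sqrt(np.mean((fcst - obs) ** 2))
cor_rmse = np.sqrt(np.mean((corrected - obs) ** 2))
print(f"RMSE raw={raw_rmse:.2f}  corrected={cor_rmse:.2f}")
```

Because the bias here is systematic, even this trivial corrector reduces the error; the same logic, scaled up, is what ML post-processing of HRES/ENS outputs does.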
As with HRES, the biggest challenge with ENS is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecasts are available for purchase only, and they are expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
Data is not regularly gridded and needs to be preprocessed before being used in an ML model.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to weather station data in some regions is heavily restricted; only a small fraction of the data is open to the public.
U1: Usability > Structure
The data is not regularly gridded and requires preprocessing before being used in an ML model. In regions with dense station coverage, making decisions about how to handle overlapping data can be somewhat arbitrary. Machine learning can assist in optimizing this process.
S2: Sufficiency > Coverage
Data from the Global South is insufficient or entirely missing.
Data-driven generation of climate simulations
Details (click to expand)
Generating climate simulations by running physics-based climate models is time consuming. ML can be used to more quickly generate climate simulations corresponding to different greenhouse gas emissions scenarios. Specifically, ML can be used to learn a surrogate model that approximates computationally-intensive climate simulations generated via Earth system models.
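The surrogate idea can be sketched end-to-end with a toy stand-in for the expensive simulator: fit a cheap regressor on simulated input/output pairs, then evaluate it on held-out scenarios. Everything below (the "Earth system model" function, the forcing variables) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for an Earth system model: temperature response as a
# nonlinear function of cumulative CO2 and aerosol forcing, plus noise.
def esm(co2, aer):
    return 2.5 * np.log1p(co2) - 0.8 * aer + rng.normal(0, 0.05, size=co2.shape)

co2 = rng.uniform(0.0, 4.0, size=500)
aer = rng.uniform(0.0, 1.0, size=500)
y = esm(co2, aer)

# Cheap polynomial surrogate fitted to the "expensive" simulations.
X = np.column_stack([np.ones_like(co2), co2, aer, co2**2, co2 * aer])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Evaluate the surrogate on held-out emissions scenarios.
co2_t = rng.uniform(0.0, 4.0, size=100)
aer_t = rng.uniform(0.0, 1.0, size=100)
Xt = np.column_stack([np.ones_like(co2_t), co2_t, aer_t, co2_t**2, co2_t * aer_t])
pred = Xt @ coef
truth = 2.5 * np.log1p(co2_t) - 0.8 * aer_t
rmse = np.sqrt(np.mean((pred - truth) ** 2))
print(f"surrogate RMSE: {rmse:.3f}")
```

Real climate emulators predict full spatial fields from scenario forcings rather than scalars, but the training loop is structurally the same, which is why single-model training data limits generalization: the surrogate inherits whatever the one simulator got wrong.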
The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
The large data volume and lack of uniform structure (no consistent variable names, data structures, or resolutions across models) make it difficult to use data from more than one CMIP6 model.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
U1: Usability > Structure
Different models provide data at different resolutions and under different variable names, which makes combining data from multiple models challenging.
R1: Reliability > Quality
There are large biases and uncertainties in the data, which can be reduced by improving the climate models used to generate the simulations.
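As a minimal illustration of the harmonization step, the sketch below maps two toy "model" outputs onto a canonical variable name and a common grid. All names, grids, and values are invented; real CMIP6 workflows typically use xarray with proper (e.g. conservative) regridding:

```python
import numpy as np

# Two toy "models" disagreeing on variable name and grid resolution.
model_a = {"pr": np.ones((4, 8))}            # 4x8 grid, variable "pr"
model_b = {"precip": np.full((8, 16), 2.0)}  # 8x16 grid, variable "precip"

# Map each model's variable name onto one canonical name.
CANONICAL = {"pr": "precipitation", "precip": "precipitation"}

def regrid(field, shape):
    """Crude nearest-neighbour regrid to a common (lat, lon) shape."""
    ny, nx = shape
    yi = np.arange(ny) * field.shape[0] // ny
    xi = np.arange(nx) * field.shape[1] // nx
    return field[np.ix_(yi, xi)]

def harmonize(ds, shape=(4, 8)):
    return {CANONICAL[k]: regrid(v, shape) for k, v in ds.items()}

a, b = harmonize(model_a), harmonize(model_b)
ensemble_mean = (a["precipitation"] + b["precipitation"]) / 2
print(ensemble_mean.shape, ensemble_mean[0, 0])
```

Once every model is on one grid under one naming scheme, multi-model ensembles reduce to simple array arithmetic; without that step, each new model adds bespoke preprocessing code.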
Detection of climate-induced ecosystem changes
Details (click to expand)
Climate change is inducing significant changes in ecosystems. ML can be used to assess the impact of climate change on biodiversity and identify critical areas for conservation.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Funding presents a major bottleneck for ecosystem monitoring initiatives. While most funding allocations are short-term, there is a critical need for sustained and adequate funding to support ongoing monitoring efforts and maintain data processing capabilities.
Data access is restricted due to institutional barriers and other restrictions.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and so on, which are important for biodiversity patterns. This is because of a lack of observation systems that are dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, which cannot be done by one single country.
Development of hybrid-climate models
Details (click to expand)
Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.
An ML-ready benchmark dataset designed for hybrid ML-physics research, e.g. emulation of subgrid clouds and convection processes.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g. turbulence and tornadoes. Extremely high-resolution simulations, like large-eddy simulations, are what is needed.
Intercomparison of global storm-resolving (5km or less) model simulations; used as the target of the emulator. Data can be found here.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing the data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g. turbulence and tornadoes. Extremely high-resolution simulations, like large-eddy simulations, are what is needed.
ERA5 is now the most widely used reanalysis dataset due to its high resolution, good structure, and global coverage. The most urgent challenge is that downloading ERA5 from the Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
The resolution of current high-resolution simulations is still insufficient for resolving many physical processes, such as turbulence. To address this, extremely high-resolution simulations, like large-eddy simulations (with sub-kilometer or even tens of meter resolution), are needed. By explicitly resolving those turbulent processes, these simulations represent a more realistic realization of the atmosphere and therefore theoretically give better model results. These simulations may serve as ground truth for training machine learning models and offer a more accurate basis for understanding and predicting climate phenomena. Long-term climate simulations at this ultra-high resolution would significantly enhance both hybrid climate modeling and climate emulation, providing deeper insights into global warming scenarios.
Given the high computational cost of running such simulations, creating and sharing benchmark datasets based on these simulations is essential for the research community. This would facilitate model development and validation, promoting more accurate and efficient climate studies.
Though a lot of data is available, a set of regularly gridded 3D high-resolution observations of the atmosphere state (like a higher-resolution version of ERA5) is still needed. This is essential for both an improved understanding of the atmospheric processes and the development of ML-based weather forecast models and climate models.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
ML models are currently trained on high-resolution climate model simulations, so their skill is limited by the performance of those climate models. True observations of the atmosphere at high resolution (< 5 km) are needed to train and validate the ML models and hence improve their performance.
Digital reconstruction of the environment
Details (click to expand)
Modeling digital representations of environmental conditions and habitats using remote sensing data, such as satellite images, is crucial for understanding how environmental factors impact animal behavior and conservation efforts. This approach provides valuable insights into habitat conditions and changes, which are essential for effective wildlife conservation and management. ML can enhance this process by efficiently processing large volumes of data from various sources, leading to more detailed and accurate environmental reconstructions.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where the availability of sufficiently large and diverse datasets, encompassing a wide array of species, remains limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
One gap in data is the incomplete barcoding reference databases.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
eDNA is an emerging technique in biodiversity monitoring, but a number of issues still impede the application of eDNA-based tools. One gap is the incompleteness of barcoding reference databases. However, considerable attention and effort are being devoted to filling this gap, for example through the BIOSCAN project. It is worth mentioning that BIOSCAN-5M is a comprehensive dataset containing multi-modal information, including DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.
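For a sense of how such barcode reference libraries feed ML pipelines, a common baseline featurization turns each DNA barcode into a k-mer count vector that a classifier can consume. The sequences below are invented; real references come from libraries such as the BIOSCAN datasets:

```python
from itertools import product
import numpy as np

def kmer_counts(seq, k=3):
    """Count vector over all 4**k DNA k-mers, a standard barcode baseline."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {m: i for i, m in enumerate(kmers)}
    vec = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        mer = seq[i:i + k]
        if mer in index:          # skip windows with ambiguous bases like 'N'
            vec[index[mer]] += 1
    return vec

# Invented barcode fragments for two hypothetical specimens.
v1 = kmer_counts("ACGTACGTACGT")
v2 = kmer_counts("TTTTCCCCGGGG")
print(v1.sum(), v2.sum())  # 10 overlapping 3-mers each for length-12 sequences
```

A length-L sequence yields L - k + 1 overlapping k-mers, so the vectors are directly comparable across specimens; species- and genus-level classifiers are then trained on these (or learned) representations against the reference labels.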
Satellite images provide environmental information for habitat monitoring. Combined with other data, e.g. bioacoustic data, they have been used to model and predict species distribution, richness, and interaction with the environment. High-resolution images are needed but most of them are not open to the public for free.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
The resolution of publicly open satellite images is not sufficient for some environment reconstruction studies.
O2: Obtainability > Accessibility
The resolution of publicly open satellite images is not high enough; higher-resolution images are usually commercial products and are not freely available.
Disaster risk assessment
Details (click to expand)
As climate change progresses, extreme weather events and related hazards are expected to become more frequent and severe. To effectively address these challenges, robust disaster risk assessment and management are crucial. ML can be used within these efforts to analyze satellite imagery and geographic data, in order to pinpoint vulnerable areas and produce comprehensive risk maps.
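The risk-mapping step can be sketched with the standard hazard x exposure x vulnerability composition on gridded layers (a common simplification of risk assessment; all layers below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic gridded layers on a common raster.
hazard = rng.uniform(0, 1, size=(50, 50))         # e.g. flood probability
exposure = rng.uniform(0, 100, size=(50, 50))     # e.g. asset value per cell
vulnerability = rng.uniform(0, 1, size=(50, 50))  # fraction of value lost

# Classic composition: expected loss per cell.
risk = hazard * exposure * vulnerability

# Flag the top-decile cells as priority areas for a risk map.
threshold = np.quantile(risk, 0.9)
priority = risk >= threshold
print(priority.sum(), "of", priority.size, "cells flagged")
```

In practice the hazard layer comes from models or satellite-derived products, exposure from building footprints and socioeconomic data, and vulnerability from fragility curves; the data gaps listed below affect each of those inputs.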
More information, such as age of the building, should be included in the dataset.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Building footprint datasets are usually provided in formats or coordinate systems other than those used by the government. To ensure these datasets are usable for local government applications, it would be helpful to align them with the government's preferred format and coordinate system.
S6: Sufficiency > Missing Components
More information about the building, such as its age and the source of the data, should be included in the dataset.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent accurate identification of buildings.
Country-specific exposure data can range from extensive and detailed to almost completely unavailable, even if they exist as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
For some data, e.g. population data, several datasets are available and they differ substantially from one another. Validation is needed before the data can be used comfortably and confidently.
Some data, e.g. geospatial socioeconomic data provided by the UNEP Global Resource Information Database, are not always current or complete.
S3: Sufficiency > Granularity
For open global data, the resolution and completeness are usually not sufficient for desired purposes, e.g. GDP data from the World Bank or US CIA is not sufficiently detailed for assessing risks from natural hazards.
Data tends to be proprietary, as the most consistent loss data is produced by the insurance industry.
O2: Obtainability > Accessibility
Even for a single event, collecting a robust set of homogeneous loss data poses a significant challenge.
U4: Usability > Documentation
With existing data, determining whether the data is complete can be a challenge as it is common that little or no metadata is associated with the loss data.
The resolution of current hazard data is insufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Hazard (e.g. floods, tropical cyclones) data of global coverage tends to be of coarse resolution and variable quality. More detailed data and models with higher resolution should be used in risk assessments for the design of specific disaster risk management projects.
R1: Reliability > Quality
Projections of future climate hazards are essential for assessing long-term risks, but these data currently carry large uncertainties. For example, there is large uncertainty in the wildfire projections in CMIP6 data.
The dataset lacks metadata on when infrastructure (e.g., buildings) was built; this information is important for determining building age, which in turn characterizes exposure to hazards.
The availability, usability, and reliability of socioeconomic data all pose difficulties. In general, there is a notable scarcity of data from the Global South, while more granular data is often missing for the Global North. Where data does exist, it often lacks consistency across sources.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is scattered and not easily findable in one single place.
U4: Usability > Documentation
The accompanying documentation lacks clarity or consistency over time.
U5: Usability > Pre-processing
There is a lack of standardized and coherent structures for data entry in socioeconomic datasets, resulting in users spending significant time on pre-processing tasks such as translating, standardizing, and harmonizing data. Integrating socioeconomic data with other types of data, such as environmental and health data for comprehensive analysis, can be even more challenging due to the differences in data formats and standards.
R1: Reliability > Quality
There is no single unified data source for many kinds of socio-economic data, e.g. population or human settlement. Each organization and company produces its own dataset, and these differ substantially from one another. Data validation (e.g. talking to the government and asking for more information) is needed before using the data, but it is hard to do this validation in a scalable way because of the time, financial, and technical effort required.
To improve the reliability and usability of the data, data entry methods need to be standardized and automated; a unified platform for data entry would also be desirable.
S2: Sufficiency > Coverage
There is often limited socio-economic data availability in developing regions due to inadequate data collection infrastructure, financial constraints, and political instability.
S3: Sufficiency > Granularity
In developed regions (Global North), there is a lack of granular data for precise analysis. For example, neither the GDP data from the World Bank nor the US CIA is sufficiently detailed for assessing risks from natural hazards. There is a lack of asset-level data for studying the physical risk of infrastructure.
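The pre-processing burden described above often comes down to reconciling identifiers and units across sources before any analysis can start. A minimal sketch with invented records (a real workflow would use a full ISO-3166 lookup and many more fields):

```python
import pandas as pd

# Two invented sources disagreeing on country identifiers and units.
src_a = pd.DataFrame({"country": ["Kenya", "Viet Nam"],
                      "population_millions": [55.1, 98.2]})
src_b = pd.DataFrame({"iso3": ["KEN", "VNM"],
                      "gdp_usd": [1.13e11, 4.09e11]})

# In practice this mapping comes from a standard country-code table.
ISO3 = {"Kenya": "KEN", "Viet Nam": "VNM"}

harmonized = (src_a.assign(iso3=src_a["country"].map(ISO3),
                           population=src_a["population_millions"] * 1e6)
                    .merge(src_b, on="iso3")
                    [["iso3", "population", "gdp_usd"]])
print(harmonized)
```

Standardizing on shared keys (here ISO3 codes) and base units once, at data entry, is what removes this repeated per-user harmonization cost, which is the motivation for the unified entry platform mentioned above.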
Very high-resolution reference data, for example digital elevation models (DEMs), is currently not freely open to the public.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Surface elevation data defined by a digital elevation model (DEM) is one of the most essential types of reference data. The high-resolution elevation data has huge value for disaster risk assessment, particularly for the Global South.
Open DEM data with global coverage now reaches a resolution of 30 m, but this is still insufficient for many disaster risk assessments. Higher-resolution datasets exist, but they either have limited spatial coverage or are commercial products that are very expensive to obtain.
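Why DEM resolution matters can be seen in a simple slope calculation, a core input to flood and landslide hazard models: coarser grids systematically smooth out the steepest terrain. The terrain below is synthetic; a real workflow would read a GeoTIFF raster:

```python
import numpy as np

def slope_deg(dem, cell_size):
    """Maximum-gradient slope in degrees from a DEM via finite differences."""
    dzdy, dzdx = np.gradient(dem, cell_size)
    return np.degrees(np.arctan(np.hypot(dzdx, dzdy)))

# Synthetic Gaussian hill (100 m tall) sampled on a fine 10 m grid.
x = np.linspace(-500, 500, 101)
xx, yy = np.meshgrid(x, x)
dem_fine = 100.0 * np.exp(-(xx**2 + yy**2) / (2 * 150.0**2))

# Coarsen to 100 m by subsampling, mimicking a lower-resolution open product.
dem_coarse = dem_fine[::10, ::10]

fine_max = slope_deg(dem_fine, 10.0).max()
coarse_max = slope_deg(dem_coarse, 100.0).max()
print(f"max slope: fine={fine_max:.1f} deg, coarse={coarse_max:.1f} deg")
```

The coarse grid underestimates the maximum slope because each finite difference averages over a longer span, which is one concrete way insufficient DEM resolution degrades downstream hazard estimates.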
Distribution-side hosting capacity estimation
Details (click to expand)
The hosting capacity of a distribution-level substation feeder on a power grid is crucial because it determines the amount of distributed energy resources (DERs) that a distribution circuit can safely accommodate without compromising grid reliability. Distribution-level substation feeders are especially susceptible to voltage fluctuations caused by solar PV generation. Operationally, distribution level substation feeders must surmount voltage sags or fault ride-through to ensure generating sources stay connected without incurring major generation loss even if disconnected for short time frames or maintained as reactive power. This may be achieved with the use of inverter technology to modulate output to match the larger power system’s needs and protect it from faults and overloads.
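A rough feel for feeder hosting limits comes from the textbook voltage-rise approximation dV ~= (R*P + X*Q)/V at the point of connection. The sketch below sweeps PV injection until a 1.05 pu planning limit is hit, and shows how inverter reactive absorption raises the limit, as described above. All feeder parameters are invented; a real study would use a full power-flow tool such as OpenDSS:

```python
# Simple voltage-rise screen for PV hosting capacity on a radial feeder.
# dV ~= (R*P + X*Q) / V  (per-unit approximation at the PV bus)

R, X = 0.03, 0.02   # invented feeder impedance to the PV bus, pu
V0 = 1.0            # nominal voltage, pu
V_MAX = 1.05        # upper planning limit, pu

def voltage(p_pv, q_pv=0.0):
    return V0 + (R * p_pv + X * q_pv) / V0

# Sweep PV injection (pu on the feeder base) until the limit is hit.
p = 0.0
while voltage(p + 0.01) <= V_MAX:
    p += 0.01
print(f"approximate hosting capacity: {p:.2f} pu")

# Inverter reactive absorption (negative Q) offsets the rise from P,
# so the same feeder hosts more PV before violating the limit.
p_q = 0.0
while voltage(p_q + 0.01, q_pv=-0.3 * (p_q + 0.01)) <= V_MAX:
    p_q += 0.01
print(f"with Q absorption:            {p_q:.2f} pu")
```

This screen captures only the steady-state voltage constraint; thermal limits, protection coordination, and fault ride-through behavior all further shape the real hosting capacity.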
While OpenDSS and GridLab-D are free to use as an alternative when real distribution substation circuit feeder data is unavailable, to perform site-specific or scenario studies, data from different sources may be needed to verify simulation results. Actual hosting capacity may vary from simulation results due to differences in load, environmental conditions, and the level of DER penetration.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
OpenDSS is free to use as an alternative when real circuit feeder data from distribution utilities is unavailable.
U2: Usability > Aggregation
To perform a realistic distribution system-level study for a particular region of interest, data concerning topology, loads, and penetration of DERs needs to be aggregated and collated from external sources.
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
OpenDSS and GridLab-D studies require real deployment data from substations for verification of results. Additionally, distribution-level substation feeder hosting capacity may vary based on load, environmental conditions, and the level of DER penetration in a service area.
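The verification step described above can be roughly sketched as comparing simulated hosting capacity against field measurements and flagging feeders that deviate beyond a tolerance; all feeder names and capacity figures below are hypothetical illustrations, not real utility data.

```python
# Sketch: flag feeders whose simulated hosting capacity (MW) deviates from
# field measurements by more than a relative tolerance. Feeders without
# field data cannot be verified and are skipped.

def flag_divergent_feeders(simulated, measured, rel_tol=0.15):
    """Return feeders where |simulated - measured| / measured > rel_tol."""
    flagged = []
    for feeder, sim_mw in simulated.items():
        meas_mw = measured.get(feeder)
        if meas_mw is None:
            continue  # no field verification data available
        if abs(sim_mw - meas_mw) / meas_mw > rel_tol:
            flagged.append(feeder)
    return flagged

simulated = {"feeder_A": 4.2, "feeder_B": 2.8, "feeder_C": 6.1}
measured = {"feeder_A": 4.0, "feeder_B": 2.1}  # feeder_C lacks field data

print(flag_divergent_feeders(simulated, measured))  # → ['feeder_B']
```

Feeder B's simulated value deviates from the measurement by roughly 33%, exceeding the 15% tolerance, while feeder C simply cannot be verified — mirroring the coverage limits of real deployment data.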
Early detection and active monitoring of fire
Details (click to expand)
Climate change is expected to increase both the frequency and intensity of wildfires, as well as lengthen the fire season due to rising temperatures and shifting precipitation patterns. ML can play a crucial role in wildfire detection and monitoring by synthesizing data from various sources in order to provide more timely and precise information. For instance, ML algorithms can analyze satellite imagery from different regions to detect early signs of fires and track their progression. Additionally, ML can enhance automatic fire detection systems, improving their accuracy and responsiveness.
Thermal images captured by drones have high value but the cost of good sensors is high.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Thermal images are highly valuable, but their resolution is often too low (commonly 120x90 pixels) and their field of view is limited. Commercially available sensors can achieve 640x480 pixels, but they are much more expensive (~$10K). Even higher-resolution sensors exist but are currently restricted to military use due to security, ethical, and privacy concerns. Those seeking such high-resolution sensors should carefully weigh the benefits and drawbacks of their request.
U6: Usability > Large Volume
Data volume is a concern for those collecting drone images and seeking to share them with the public. Finding a platform that offers adequate storage for hosting the data is challenging, as it must ensure that users can download the data efficiently without issues.
Satellites capture very few images of a given location per day.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Constrained by satellite revisit frequency, very few images of a given location are captured per day.
Earth observation for climate-related applications
Details (click to expand)
Many climate-related applications suffer from a lack of real-time and/or on-the-ground data. ML can be used to analyze satellite imagery at scale in order to fill some of these gaps, via applications such as land cover classification, footprint detection for buildings, solar panel detection, deforestation detection, and emissions monitoring.
Satellite images are used intensively for Earth system monitoring. One of the two biggest challenges of using satellite images is the sheer volume of data, which makes downloading, transferring, and processing all difficult; the other is the lack of annotated data. For many use cases, the lack of publicly open high-resolution imagery is also a bottleneck.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Publicly available datasets often lack sufficient granularity. This is particularly challenging for the Global South, which typically lacks the funding for high-resolution commercial satellite imagery.
U6: Usability > Large Volume
The sheer volume of data now poses one of the biggest challenges for satellite imagery. When data reaches the terabyte scale, downloading, transferring, and hosting become extremely difficult. Those who create these datasets often lack the storage capacity to share the data. This challenge can potentially be addressed by one or more of the following strategies:
Data compression: Compress the data while retaining lower-dimensional information.
Lightweight models: Build models with fewer features selected through feature extraction. A successful example can be found here.
Large foundation models for remote sensing data: Purposefully construct large models (e.g., foundation models) that can handle vast amounts of data. This requires changes to the research pipeline, such as modifications to the preprocessing architecture.
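As a minimal sketch of the data-compression strategy above, the following projects multi-band pixels onto a few principal components via a plain SVD, retaining lower-dimensional information; the band count and tile size are illustrative, not tied to any particular satellite product.

```python
import numpy as np

# Sketch: reduce a (height, width, bands) image cube to a few principal
# components in band space, shrinking storage while keeping most variance.

def compress_bands(cube, n_components=3):
    """cube: (H, W, B) array -> (H, W, n_components) projection."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b).astype(float)
    pixels -= pixels.mean(axis=0)           # center each band
    # SVD of the pixel matrix yields principal directions in band space
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    reduced = pixels @ vt[:n_components].T  # project onto top components
    return reduced.reshape(h, w, n_components)

rng = np.random.default_rng(0)
cube = rng.random((64, 64, 13))             # e.g., a 13-band tile
small = compress_bands(cube, n_components=3)
print(cube.nbytes, "->", small.nbytes)      # bytes before vs. after
```

In practice the retained components and reconstruction error would be chosen per application; this is only meant to show the shape of the approach.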
O2: Obtainability > Accessibility
Very high-resolution satellite images (e.g., finer than 10 meters) typically come from commercial satellites and are not publicly available. This is particularly challenging for the Global South. One exception is the NICFI dataset, which offers high-resolution, analysis-ready mosaics of the world’s tropics.
U5: Usability > Pre-processing
Satellite images often contain a lot of redundant information, such as large amounts of data over the ocean that do not always contain useful information. It is usually necessary to filter out some of this data during model training.
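A minimal sketch of such filtering, assuming a binary land mask (1 over land, 0 over ocean) is available for each tile — the mask source and threshold are hypothetical:

```python
import numpy as np

# Sketch: drop training tiles whose land fraction falls below a threshold,
# so mostly-ocean tiles do not dominate the training set.

def keep_tile(land_mask_tile, min_land_fraction=0.05):
    return land_mask_tile.mean() >= min_land_fraction

mask = np.zeros((4, 32, 32))   # four tiles, all ocean...
mask[0, :16, :] = 1            # ...except tile 0, which is half land

kept = [i for i in range(mask.shape[0]) if keep_tile(mask[i])]
print(kept)  # → [0]; only the tile with land survives the filter
```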
U2: Usability > Aggregation
Due to differences in orbits, instruments, and sensors, imagery from different satellites can vary in projection, temporal and spatial coverage, and cloud blockage, with each source having its own pros and cons. To overcome data gaps (e.g., cloud blockage) or errors, multiple satellite images are often assimilated. Harmonizing these differences is challenging, and sometimes arbitrary decisions must be made.
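One small piece of such harmonization can be sketched as nearest-neighbor resampling of a coarser product onto a finer reference grid; real pipelines also reproject coordinates and handle cloud masks, and the grid sizes here are purely illustrative.

```python
import numpy as np

# Sketch: resample a coarse raster onto a finer target grid by
# nearest-neighbor lookup, so two products share a common resolution.

def regrid_nearest(coarse, target_shape):
    th, tw = target_shape
    ch, cw = coarse.shape
    rows = np.arange(th) * ch // th   # nearest source row per target row
    cols = np.arange(tw) * cw // tw   # nearest source column per target column
    return coarse[np.ix_(rows, cols)]

coarse = np.arange(9).reshape(3, 3)      # 3x3 product
fine = regrid_nearest(coarse, (6, 6))    # resampled onto a 6x6 reference grid
print(fine.shape)  # → (6, 6)
```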
U5: Usability > Pre-processing
The lack of annotated data presents another major challenge for satellite imagery. Sector-level collaboration and coordination should be organized to facilitate annotation efforts across multiple sectors and use cases. Additionally, the granularity of annotations needs to increase: for example, specifying crop types instead of just “crops,” and detailing flood damage levels rather than using a generic “damaged” label, is necessary for more precise analysis.
M: Misc/Other
Cloud cover presents a major technical challenge for satellite imagery, significantly reducing its usability. To obtain information beneath the clouds, pixels from clear-sky images captured by other satellites are often used. However, this method can introduce noise and errors.
M: Misc/Other
There is also a lack of technical capacity in the Global South to effectively utilize satellite imagery.
Energy data fusion for policy and market analysis in energy systems
Details (click to expand)
Data collected by energy regulatory committees from public utilities, energy companies, and government agencies can provide detailed information on generation, fuel consumption, emissions, and finances. This information can better inform domestic policies that enforce and promote emissions reductions through carbon pricing and renewable incentives, grid modernization and resilience planning for severe weather events, and equitable energy transitions. By providing continuously updated, well-curated, analysis-ready energy system data, climate advocates gain better quantitative tools to influence political and administrative processes, thereby encouraging the energy transition.
Public datasets from government agencies such as the EIA, EPA, FERC, and PHMSA are not ready for use in analysis-ready data products. Data often arrives as tabular zip files in differing file formats that may not share common identifiers or schemas, making joins difficult. Collecting, collating, and merging these datasets can provide greater context on the state of the energy system and the effectiveness of policy measures. Data can also be missing due to reporting gaps and redacted per-plant pricing information. While PUDL seeks to overcome these gaps by merging datasets based on entity matching and interpolation, challenges remain in terms of maintenance, as usability is sensitive to source data format changes, updates, and new initiatives. The data gaps experienced in maintaining this dataset are highlighted with respect to the source data that PUDL mines.
Data Gap Type
Data Gap Details
U1: Usability > Structure
The structure of PUDL is maintained; however, the structure of the source material from FERC, EIA, EPA, and other data providers can vary significantly between reporting years, new initiatives, and individual reporting details. For example, individual power plant identification numbers and associated operational data may differ across sources despite referencing the same plant.
Additionally, data versioning for spreadsheet-format files is non-existent, making it difficult to track updates and content changes made to an individual file provided by the regulatory agencies between website updates.
Common standards and formatting across agency datasets and reporting, backed by documentation, would give the open-source community improved direction and responsiveness to changes between years and forms.
U2: Usability > Aggregation
Aggregation of data across publicly available agency materials is challenging, as schemas, naming conventions, and resolution for similar information can vary between sources. Probabilistic named entity recognition and interpolation techniques are utilized to join datasets where feasible. However, when aggregating public data with private data obtained through paid access, the data may require the use of private APIs and may also follow a significantly different schema or format. Adopting relational database formatting standards and best practices universally across data providers would make joining data along common aliases, IDs, and correlations easier.
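A toy sketch of probabilistic name matching in this spirit, using only string similarity — production pipelines like PUDL's use richer features (location, capacity, fuel type), and the plant names and 0.8 cutoff below are illustrative:

```python
from difflib import SequenceMatcher

# Sketch: link a plant name from one agency dataset to its best candidate
# in another dataset, accepting the match only above a similarity cutoff.

def best_match(name, candidates, cutoff=0.8):
    scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

eia_names = ["Barry Steam Plant", "Big Bend Station"]
print(best_match("Barry Steam Plnt", eia_names))  # → Barry Steam Plant
```

Low-similarity names fall below the cutoff and return no match, which is where human verification (discussed below under pre-processing) takes over.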
U3: Usability > Usage Rights
PUDL uses the Creative Commons Attribution License v4.0.
While PUDL sources data from public government regulatory agencies, data provided by private utility system operators may carry quasi-public licensing restrictions. Although the utility data is provided to regulatory agencies, use by the wider public could be legally disputed. To address this, usage rights should be explicitly stated by the provider and/or the government agency publishing the data.
U4: Usability > Documentation
PUDL maintains and updates documentation as more datasets are incorporated into its database. However, Catalyst Cooperative, the group responsible for the continuous development of the project, must monitor source datasets for changes in format and requirements between years.
U5: Usability > Pre-processing
PUDL maintains a parsing pipeline that concatenates and collates the information requested by regulatory agencies. However, this pipeline faces challenges as schemas and formats change over time; for example, FERC has changed its filing format from PDF to XBRL. Additionally, data sources can be semi-structured, requiring significant pre-processing and expertise in a variety of data formats, including formats custom to the reporting agency.
Energy data in the form of PDFs can bottleneck data pipelines that rely on optical character recognition (OCR), as scan quality affects data extraction, especially when the data does not follow a unified reporting format.
While PUDL relies on natural language processing techniques such as probabilistic entity matching to mitigate these challenges, the data gaps often require human verification and extensive manual pre-processing. This can introduce technical debt while working with a solution provider to understand and better digest the data contents.
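One pre-processing step of this kind — mapping year-specific column names onto a canonical schema before concatenation — might look like the following sketch; the column aliases are invented for illustration and are not PUDL's actual mappings.

```python
import pandas as pd

# Sketch: rename heterogeneous, year-specific column names to one canonical
# schema so files from different reporting years can be concatenated.
# The alias table below is hypothetical.

CANONICAL = {
    "Plant Id": "plant_id", "plant_id_eia": "plant_id",
    "Net Generation (MWh)": "net_generation_mwh",
    "net_gen_mwh": "net_generation_mwh",
}

def normalize(df):
    return df.rename(columns=CANONICAL)

df_2019 = pd.DataFrame({"Plant Id": [1], "Net Generation (MWh)": [500.0]})
df_2021 = pd.DataFrame({"plant_id_eia": [1], "net_gen_mwh": [650.0]})
merged = pd.concat([normalize(df_2019), normalize(df_2021)], ignore_index=True)
print(list(merged.columns))  # → ['plant_id', 'net_generation_mwh']
```

A real pipeline would also validate dtypes and flag unmapped columns when an agency changes its forms, which is exactly the maintenance burden described above.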
S2: Sufficiency > Coverage
PUDL coverage is limited to regulatory agencies and third-party organizations within the United States.
S3: Sufficiency > Granularity
Granularity of data is constrained to an annual, monthly, or quarterly basis depending on the source of the data compiled by PUDL. PUDL utilizes interpolation techniques to unify resolution; however, this requires continuous maintenance and examination of historical data.
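As an illustration of unifying mixed reporting cadences, the sketch below upsamples quarterly values onto a monthly index by linear interpolation; the figures are invented and this is not PUDL's actual procedure.

```python
import pandas as pd

# Sketch: bring a quarterly series onto a monthly grid so it can be joined
# with monthly sources. Values between reported quarters are interpolated.

quarterly = pd.Series(
    [120.0, 150.0],
    index=pd.to_datetime(["2022-01-01", "2022-04-01"]),
)
monthly = quarterly.resample("MS").interpolate("linear")
print(monthly.tolist())  # → [120.0, 130.0, 140.0, 150.0]
```

Interpolated values are estimates, not observations — one reason such harmonization needs ongoing checks against historical data.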
S6: Sufficiency > Missing Components
PUDL would like to incorporate open weather model data into the database to facilitate research into load demand and renewable generation. This would require down-sampling of weather model outputs in order to match pre-existing database resolution. Additionally, data with respect to transmission and congestion from grid operators would provide greater system context to the data provided by public agencies.
Energy-efficient building design
Details (click to expand)
The built environment contributes significantly to global carbon dioxide emissions, from the embodied carbon associated with building materials and construction to the operational emissions of lighting and ambient temperature control, as well as overall urbanization.
Featured datasets can vary in their types of data gaps depending on content, coverage area, location, building type, building spatial plan, quantity measured, ambient environment, or the availability of power consumption or metered data.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
All data featured on the platform is open access, with a standardized metadata format that eases use and provides building-specific information on type, location, and climate zone. Data quality, guidance on curation and cleaning, and any access restrictions are specified in the metadata of each hosted dataset.
U3: Usability > Usage Rights
Licensing information for each individual featured dataset is provided.
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected.
S6: Sufficiency > Missing Components
Building data typically does not include grid interactive data, or signals from the utility side with respect to control or demand side management. Such data can be difficult to obtain or require special permissions. By enabling the collection of utility side signals, utility-initiated auto-demand response (auto-DR) and load shifting could be better assessed.
S2: Sufficiency > Coverage
Featured datasets are from test-beds, buildings, and contributing households from the United States.
The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data specific to universities and higher-education institutions. While the dataset has clear documentation, the lack of diversity in building types necessitates further development of the dataset to include other building types, as well as expansion to coverage areas and time periods beyond those currently available.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data was collated from 7 open-access public data sources as well as 12 privately curated datasets from facilities management at different college sites; the latter required manual site visits and are not included in the data repository at this time.
U3: Usability > Usage Rights
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, thereby limiting the diversity of building representation. To overcome this lack of diversity, data-sharing incentives and open-source community contributions could allow for expansion of the dataset.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies.
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018.
Estimation of forest carbon stock
Details (click to expand)
Forests are one of the Earth’s major carbon sinks, absorbing carbon dioxide (CO₂) from the atmosphere through photosynthesis and storing it in biomass (trees and vegetation) and soil. Accurate estimates of carbon stock help quantify the amount of CO₂ forests are sequestering, which is essential for climate change mitigation efforts. ML can help by providing more precise and large-scale estimates of forest carbon through the analysis of satellite imagery. This approach can significantly improve upon traditional, labor-intensive forest inventory surveys, making carbon stock assessments more efficient and scalable.
GEDI is globally available but has some intricacies, e.g., geolocation errors and weak return signals over dense forest, which introduce uncertainties and errors into canopy height estimates.
The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
The data contains many missing values and duplicates.
S2: Sufficiency > Coverage
Since the data is collected manually, collection is hard to scale and coverage is limited to certain regions.
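The kind of cleaning these gaps imply can be sketched as deduplicating manually recorded inventory plots and filling gaps; the plot IDs and diameter measurements below are hypothetical, and mean-imputation stands in for whatever gap-filling a real survey would justify.

```python
import pandas as pd

# Sketch: drop duplicate manual records, then fill a missing diameter
# measurement (dbh = diameter at breast height) with the column mean.

records = pd.DataFrame({
    "plot_id": [101, 101, 102, 103],
    "dbh_cm": [32.5, 32.5, None, 41.0],   # one duplicate row, one gap
})

clean = (records.drop_duplicates()
                .assign(dbh_cm=lambda d: d["dbh_cm"].fillna(d["dbh_cm"].mean())))
print(len(clean), int(clean["dbh_cm"].isna().sum()))  # → 3 0
```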
Estimation of methane emissions from rice paddies
Details (click to expand)
Rice paddies are a major source of global anthropogenic methane emissions. Accurate quantification of CH₄ emissions, especially how they vary with different agricultural practices, is crucial for addressing climate change. ML can enhance methane emission estimation by automatically processing and analyzing remote-sensing data, leading to more efficient assessments.
There is a lack of direct observation of methane emissions.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
Direct measurement of methane emissions is often expensive and labor-intensive. However, this data is essential, as it provides the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Extreme heat prediction
Details (click to expand)
Extreme heat is becoming more common in a changing climate, but predicting and accurately modeling extreme heat is difficult. ML can help by improving extreme heat prediction.
The major challenge involves managing the size of the data. While cloud platforms offer convenience, they come with costs. Additionally, handling large datasets requires specific techniques, such as distributed computing and occasionally large-memory computing nodes (for certain statistics).
Fault detection in low voltage distribution grids
Details (click to expand)
The low voltage distribution portion of the grid directly supplies power to consumers. As consumers integrate more distributed energy resources (DERs) and dynamic loads (such as electric vehicles), low voltage distribution systems are susceptible to power quality issues that can affect the stability and reliability of the grid. Fault-inducing harmonics can be challenging to monitor, diagnose, and control due to the number of nodes/buses that connect various grid assets and the short distances between them. As load and generation change rapidly throughout the day, advanced monitoring systems become essential for detecting and localizing faults so that adaptive protection and network reconfiguration efforts can be made to ensure reliability and stability.
For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
For µPMU data to be utilized for fault localization, the distribution circuit model must be provided by the partnering utility. Typically, the distribution circuit model lacks annotation of phase identification and impedance values, often providing only rough approximations, which can ultimately reduce the accuracy of localization as well as the time-series contextualization of a fault. Decreased localization accuracy can then affect downstream control mechanisms meant to ensure operational reliability.
U5: Usability > Pre-processing
µPMU data is sensitive to noise, especially from geomagnetic storms, which can induce electric currents in the atmosphere and impact measurement accuracy. Measurements can also be compromised by errors introduced by current and potential transformers.
When additional data from other sensors or field reports is used to classify µPMU time series data, the quality of the resulting joint sensor dataset depends on the sampling rate and format of the additional non-µPMU data.
U6: Usability > Large Volume
Due to high sampling rates, the continuous data volume from each individual µPMU can be challenging to manage and analyze. Coupled with the number of µPMUs required to monitor a portion of the distribution network, the amount of data can easily exceed terabytes. Automated indexing and mining of time series by transient characteristics can facilitate domain specialists’ verification efforts.
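A minimal sketch of such automated indexing: flag windows whose short-term variability jumps well above the running baseline, so specialists review only candidate events rather than the full continuous stream. The threshold, window size, and synthetic waveform are all illustrative.

```python
import numpy as np

# Sketch: index a continuous measurement stream by flagging windows whose
# standard deviation exceeds k times the median window std (the baseline).

def transient_windows(signal, window=64, k=5.0):
    n = len(signal) // window
    stds = np.array([signal[i*window:(i+1)*window].std() for i in range(n)])
    baseline = np.median(stds)
    return np.flatnonzero(stds > k * baseline)

rng = np.random.default_rng(1)
v = rng.normal(0, 0.01, 64 * 100)   # quiet voltage deviations
v[64*40:64*40 + 32] += 1.0          # injected step transient in window 40
print(transient_windows(v))         # → [40]
```

Windows flagged this way would still need cross-verification against other data sources, as noted under Reliability below.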
R1: Reliability > Quality
Since µPMU data is continuously captured, time series data leading up to or even identifying a fault or potential fault requires verification from other data sources.
Digital Fault Recorders (DFRs) capture high-resolution, event-driven data such as disturbances due to faults, switching, and transients. They can detect rapid events like lightning strikes and breaker trips while also recording current and voltage magnitudes over time. Additionally, system dynamics over a longer period following a disturbance can be captured. When used in conjunction with µPMU data, DFR data can help verify significant transients found in the µPMU data, facilitating improved analysis of signals both before and after an event from the perspective of the distribution-side state.
S2: Sufficiency > Coverage
Currently, µPMU installation on existing distribution grids carries significant financial costs, so most deployments have been pilot projects with utilities. Based on North American SynchroPhasor Initiative (NASPI) reports, pilot studies include the FLEXGRID testing facility at Lawrence Berkeley National Laboratory (LBNL), the Philadelphia Navy Yard microgrid (2016-2017), the micro-synchrophasors for distribution systems plus-up project (2016-2018), resilient electricity grids in the Philippines (2016), the GMLC 1.2.5 sensing and measurement strategy (2016), and the bi-national laboratory for the intelligent management of energy sustainability and technology education in Mexico City (2017-2018).
Coverage is also limited by acceptance of this technology, owing to a pre-existing reliance on SCADA systems, which measure grid conditions on a 15-minute cadence. As transients become more common, especially on the low voltage distribution grid, a transition to higher-resolution monitoring will become necessary.
S4: Sufficiency > Timeliness
µPMU data can suffer from multiple latencies within the grid monitoring system, which cannot keep up with the high sampling rate of the continuous measurements that µPMUs generate. Latencies occur as signals are recorded, processed, sent, and received, and depend on the communication medium used, cable distance, amount of processing, and computational delay. More specifically, these latencies are measurement-, transmission-, channel-, receiver-, and algorithm-related.
Utilization of computer vision for right-of-way (RoW) transmission line clearance detection, identifying areas where vegetation and other objects are in close proximity to towers, lines, and other grid assets and may increase fire risk.
Grid inspection robot imagery may require coordination efforts with local utilities to gain access over multiple robot trips, image preprocessing to remove ambient artifacts, position and location calibration, as well as limitations in the identification of degradation patterns based on the resolution of the robot mounted camera.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data needs to be aggregated and collated across multiple cable inspection robots for generalizability, and collection requires multiple robot trips: an initial inspection to identify target locations needing further data collection, followed by a second trip for camera capture at those locations.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks.
S2: Sufficiency > Coverage
It is necessary to supplement the data with position orientation system data to locate the cable inspection robot (this may involve having the robot complete two inspections: a preliminary one to identify inspection targets, followed by a more detailed autonomous inspection with additional PTZ camera image capture of the targets on device).
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized.
Unmanned aerial vehicle (UAV) or drone imagery for vegetation management near transmission and distribution lines may require partnerships with private companies and utilities for access and usage. LiDAR data is sparse and may only partially scan power transmission lines, resulting in poor data quality. Coverage is often limited to rights of way (RoWs) of interest, which may require continuous monitoring for future vegetation growth and inspection.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Must be involved in an active study with a partnering utility to get access to pre-existing drone data or to get permission to collect drone data.
U3: Usability > Usage Rights
Once collected, data is private as RoWs represent critical energy infrastructure.
U5: Usability > Pre-processing
LiDAR data is sparse when equipment only partially scans power transmission lines, resulting in weak features that may make it difficult for computer vision algorithms to detect and distinguish lines from the pylons or towers that support overhead power lines. Preprocessing the data to identify partial scans may be helpful.
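Identifying partial scans might be sketched as counting LiDAR returns per segment along the corridor and flagging low-density segments; the corridor length, coordinates, and density threshold below are synthetic.

```python
import numpy as np

# Sketch: bin LiDAR return positions along a line corridor and flag
# segments whose point density falls below a minimum count.

def sparse_segments(x_coords, corridor_length, n_segments=10, min_points=20):
    edges = np.linspace(0, corridor_length, n_segments + 1)
    counts, _ = np.histogram(x_coords, bins=edges)
    return np.flatnonzero(counts < min_points)

x = np.repeat(np.arange(100) + 0.5, 5).astype(float)  # 5 returns per metre
x = x[(x < 70) | (x > 80)]   # segment 7 was only partially scanned

print(sparse_segments(x, corridor_length=100))  # → [7]
```

Flagged segments could then be excluded from feature extraction or queued for a repeat UAV pass.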
S2: Sufficiency > Coverage
Coverage can vary depending on the RoW examined. Often, multiple datasets containing UAV imagery from multiple transmission RoWs would be necessary to increase the number of image examples in the dataset.
S4: Sufficiency > Timeliness
Measurements should be taken over multiple time periods to examine transmission line characteristics with respect to both vegetation growth and/or line sag caused by overvoltage conditions.
Identification and mapping of climate policy
Details (click to expand)
Laws and regulations relevant to climate change mitigation and adaptation are essential for assessing progress on climate action and addressing various research and practical questions. ML can be employed to identify climate-related policies and categorize them according to different focus areas.
Data is not available in machine-readable formats and is limited to English-language literature from major journals.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Many data sources that should be open are not fully accessible. For instance, abstracts are generally expected to be openly available, even for proprietary data. In practice, however, only a subset of abstracts is accessible for some papers.
U1: Usability > Structure
Most of the data is in PDF format and should be converted to machine-readable formats.
S2: Sufficiency > Coverage
Research is currently limited to literature published in English (at least the abstracts) and from major journals. Many region-specific journals or literature indexed in other languages are not included. These should be translated into English and incorporated into the database.
Laws and regulations for climate action are published in various formats through national and subnational governments, and most are not labeled as a “climate policy”. There are a number of initiatives that take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work and continuous updating, and the resulting datasets are not complete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be structured into a machine-readable format. Much of it is in the original language of the publishing country and needs to be translated into English.
U2: Usability > Aggregation
Legislation data is published through national and subnational governments and often is not explicitly labeled as “climate policy”. Determining whether it is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
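As an illustration of the labeling challenge, a first-pass screen could score legislative text by climate-related keyword hits before handing candidates to a trained classifier. This is a hypothetical sketch: the term list and threshold are invented for illustration, not a vetted taxonomy.

```python
# Illustrative keyword list -- an assumption, not an authoritative vocabulary.
CLIMATE_TERMS = ("emission", "renewable", "carbon", "adaptation",
                 "greenhouse gas", "net zero", "climate")

def climate_score(text):
    """Count how many distinct climate-related terms appear in the text."""
    lower = text.lower()
    return sum(1 for term in CLIMATE_TERMS if term in lower)

def likely_climate_policy(text, threshold=2):
    """Crude pre-filter: flag text that mentions several climate terms."""
    return climate_score(text) >= threshold
```

A screen like this would miss climate-relevant laws that avoid such vocabulary entirely, which is exactly why ML classifiers trained on curated corpora are needed.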
Improving power grid optimization
Details (click to expand)
Traditionally, optimal power flow (OPF) seeks to minimize the cost of power generation needed to meet a given load (economic dispatch), subject to line limits (thermal, voltage, or stability), generation limits, and power balance at each bus in the transmission system. Traditional techniques formulate OPF as a non-linear, constrained, non-convex optimization problem, which can be solved for AC and DC systems separately. Traditional OPF solvers use a linear program to determine the generation needed to minimize cost and satisfy load demand while adhering to the physical constraints of the system. However, as the grid integrates more renewable generation sources, there are trends towards hybrid AC/DC power grids that address the limitations of traditional AC transmission systems and enable access to remote renewables. Such hybrid systems present new challenges to traditional OPF by enabling bidirectional power flow, requiring the OPF objective function and constraints to be adapted to account for new losses, increased costs, and congestion. ML can be used to approximate OPF problems, allowing them to be solved at greater speed, scale, and fidelity.
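The economic-dispatch core of the objective above can be illustrated with a merit-order sketch. This is deliberately simplified: it ignores line limits, losses, and power balance per bus, all of which a real OPF must enforce, and the generator names and costs are invented for illustration.

```python
def merit_order_dispatch(generators, load):
    """Dispatch the cheapest generators first until load is met.

    `generators`: list of (name, marginal_cost_per_MWh, capacity_MW).
    Returns {name: dispatched_MW}. Network constraints -- central to
    real OPF -- are deliberately omitted in this sketch.
    """
    dispatch = {}
    remaining = load
    for name, cost, cap in sorted(generators, key=lambda g: g[1]):
        mw = min(cap, remaining)
        dispatch[name] = mw
        remaining -= mw
    if remaining > 1e-9:
        raise ValueError("insufficient generation to meet load")
    return dispatch

# Hypothetical fleet: zero-marginal-cost wind is dispatched first.
fleet = [("coal", 30, 100), ("wind", 0, 50), ("gas", 60, 100)]
result = merit_order_dispatch(fleet, 120)
```

Adding the per-bus power balance and line-flow constraints turns this greedy loop into the constrained optimization problem that OPF solvers, and the ML surrogates approximating them, actually handle.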
Grid2Op is a reinforcement learning framework that builds an environment based on topologies, selected grid observations, a selected reward function, and selected actions for an agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps representing 5 minutes are unable to capture complex transients and can limit the effectiveness of certain actions within the action space over others. Furthermore, customization of Grid2Op can be challenging, as the platform does not allow for single-agent to multi-agent conversion and is not a suitable environment for cascading failure scenarios due to its game-over rules.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
In the customization of the reward function, there are several TODOs in place concerning the units and attributes of the reward function related to redispatching. Documentation and code comments can sometimes provide conflicting information. The modularity of the reward, adversary, action, environment, and backend is nonintuitive, requiring pregenerated dictionaries rather than dynamic inputs, and there is no conversion from single-agent to multi-agent functionality.
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading failure scenarios, which may be interesting to conduct for more complex adversaries such as natural disasters. Codebase variations between versions, especially between the native and Gym-formatted frameworks, lose features present in the legacy version, including topology graphics.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the grid2op environment, which is based on different IEEE bus topologies. While customization of the environment in terms of the “Backend,” “Parameters,” and “Rules” is possible, there may be dependent modules that still enforce game-over rules. Furthermore, since backend modeling is not the focus of grid2op, verification that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observed time-series grid data, or chronics. Furthermore, the granularity may limit the effectiveness of specific actions in the provided action space. For example, the use of energy storage devices in the presence of overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage.
R1: Reliability > Quality
The grid2op framework relies on mathematically robust control laws and rewards, which train the RL agent based on set observation assumptions rather than actual system dynamics, which are susceptible to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal modeling of the grid equations or of the kind of solver needed to solve traditional nonlinear optimal power flow equations; specifics concerning modeling and the preferred solver require users to customize or create a new “Backend.” Additionally, such RL human-in-the-loop systems in practice require trustworthiness and quantification of risk.
Depending on whether it is open source, traditional OPF simulation software may require the purchase of licenses for advanced features and functionality. To simulate more complex systems or regions, additional data regarding energy infrastructure, region-specific load demand, and renewable generation may be needed to conduct studies. OPF simulation output requires verification and performance evaluation to assess results in practice. Increasing the granularity of the simulation model by increasing the number of buses, limits, or additional parameters increases the complexity of the OPF problem, thereby increasing the computational time and resources required.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
MATPOWER is open source, and PowerWorld Simulator has several options for industry practitioners as well as those who would like to use it for educational or academic purposes. There is demo software licensed for educational use that includes simulator features such as available transfer capability, optimal power flow, security-constrained OPF, OPF reserves, the PV/QV curve tool, transient stability, and geomagnetically induced currents. In terms of topology, the free version supports up to 13 buses, while the full version of the simulator can handle 250,000 buses.
U2: Usability > Aggregation
In MATPOWER and PowerWorld, outside data may be required to simulate conditions over a specific region with a given amount of DERs, generating sources, bus topology, and line limits.
U3: Usability > Usage Rights
Depending on whether proprietary simulators are pursued (e.g., PowerWorld), there may be licensing costs for the use of certain features.
R1: Reliability > Quality
Traditional OPF simulation software simplifies the power system and makes assumptions about the system behavior such as perfect power factor correction or constant system parameters. Simulation results may need to be verified with real-world results.
S3: Sufficiency > Granularity
In PowerWorld, the available bus topologies may be simplified representations of actual grids, simplifying the modeling and simulation techniques used to represent overall system behavior. MATPOWER requires the user to define the bus matrix. As the number of buses in a power system increases, the computational complexity of OPF increases, requiring more resources and time to solve. Additional parameters such as line limits, the number of generating sources, the number of DERs, and load demand also increase the complexity of the model as more constraints and assets are introduced.
While network datasets are open source, maintenance of the repository requires continuous curation and collection of more complex benchmark data to enable further AC-OPF simulation and scenario studies. Industry engagement can assist in developing more realistic data, though such data may be hard to find without cooperative effort.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Industry engagement can assist in developing detailed and realistic networked datasets and operating conditions, limits, and constraints.
O2: Obtainability > Accessibility
PGLib-OPF is open source.
U2: Usability > Aggregation
Repository maintenance requires continuous curation of more complex networked benchmark data for more realistic AC-OPF simulation studies.
Marine wildlife detection and species classification
Details (click to expand)
Marine wildlife detection and species classification are crucial for understanding the impacts of climate change on marine ecosystems. These processes involve identifying and categorizing different marine species. ML can significantly enhance these efforts by automatically processing large volumes of data from diverse sources, improving accuracy and efficiency in monitoring and analyzing marine life.
Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
An API is needed to download data, but many ecologists are not familiar with scripting languages.
M: Misc/Other
It would be ideal if Copernicus also made biodiversity data available on its website. Having access to both biodiversity data and associated environmental ocean data on the same platform would significantly enhance efficiency and accessibility. This integration would eliminate the need to download massive datasets for local analysis, streamlining the process for users.
As with terrestrial biodiversity data, the lack of good annotated data is the biggest bottleneck. Regarding existing data, enabling broader data sharing is the most critical challenge to address. Data collection efforts should also be strategic, targeting places where biodiversity is high but currently available data is sparse.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
A lot of ocean data is collected, but a persistent challenge is ensuring that this data is shared and utilized by those who need it. Currently, much of the data remains siloed within individual institutions, making it difficult to access for collaborative purposes. Despite numerous initiatives, such as Ocean Biodiversity Information System (OBIS), Integrated Ocean Observing System (IOOS), and others, data accessibility continues to be the biggest hurdle.
To facilitate large-scale data sharing, there is a need for incentives, robust platforms for data storage, clear guidelines, and straightforward pipelines for data sharing.
U1: Usability > Structure
There is a lack of data format standardization across the ocean science community.
U2: Usability > Aggregation
Much of the data is siloed within individual institutions and not easily accessible for collaboration.
U5: Usability > Pre-processing
The volume of raw data (e.g., from imaging and acoustics) is massive and requires significant effort to annotate and extract insights. To accelerate this, some or all of the following solutions can be adopted.
Accelerate the data analysis pipeline, particularly for visual data, through the Ocean Vision AI initiative
Engage the broader community to participate in exploration, discovery, and annotation through initiatives like a mobile game
Target annotation efforts strategically to maximize the impact on model performance, such as focusing on rare species.
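The last idea, prioritizing annotation by class rarity, could be sketched as below. This is a hypothetical illustration (the species names and counts are invented); a production pipeline would also weight by model confidence.

```python
def annotation_priority(candidates, label_counts):
    """Rank unlabeled detections for human annotation, rarest first.

    `candidates`: list of (item_id, predicted_class) pairs.
    `label_counts`: {class_name: number of existing annotations}.
    Classes with few existing labels are surfaced to annotators first,
    so each annotation adds the most value to model training.
    """
    return sorted(candidates, key=lambda c: label_counts.get(c[1], 0))

# Illustrative example: a suspected whale sighting outranks yet another
# sardine image because only a handful of whale labels exist so far.
queue = annotation_priority(
    [("img_a", "sardine"), ("img_b", "whale")],
    {"sardine": 900, "whale": 3},
)
```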
S2: Sufficiency > Coverage
There are massive gaps in the coverage of ocean biodiversity data, with only about 7% of the upper 5 meters of the ocean being regularly monitored, and 30-60% of ocean life still unknown to science. The data is also heavily biased towards coastal regions, with much less coverage of the open ocean.
While collecting data from the deep ocean is technologically challenging, the primary issue is the lack of financial incentives. High seas fall outside national jurisdictions, so data collection often occurs only through mining companies, military operations, or ad hoc research expeditions. The absence of marine protected areas on high seas and the migratory nature of species like phytoplankton further complicate data collection. Financial tools or regulations could incentivize data collection.
Only data from 2019 to March 2022 is publicly available. Registration is required to access the data.
Modeling effects of soil processes on soil organic carbon
Details (click to expand)
Understanding the causal relationship between soil organic carbon and soil management or farming practices is crucial for enhancing agricultural productivity and evaluating agriculture-based climate mitigation strategies. ML can significantly contribute to this understanding by integrating data from diverse sources to provide more precise spatial and temporal analyses.
Data collection is extremely expensive for some variables, so simulated variables are used instead. However, the simulated values have large uncertainties due to the assumptions and simplifications made in the model.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Soil carbon values generated from simulators are not reliable because these process-based models might be obsolete or might have certain kinds of systematic bias that get reflected in the simulated variables. However, the ML scientists who use these simulated variables usually lack the knowledge needed to properly calibrate the process-based models.
In general, there is insufficient (in both coverage and granularity) soil organic carbon data for training a well-generalized ML model.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Data is collected by different farmers on different farms and therefore has consistency issues and needs to be structured.
S3: Sufficiency > Granularity
In general, there is insufficient (in both coverage and granularity) soil organic carbon data for training a well-generalized ML model. One reason is that collecting such data is very expensive: the hardware is costly, and collecting data at high frequency is even more expensive.
S2: Sufficiency > Coverage
In general, there is insufficient (in both coverage and granularity) soil organic carbon data for training a well-generalized ML model. One reason is that collecting such data is very expensive: the hardware is costly, and collecting data at high frequency is even more expensive.
Non-intrusive electricity load monitoring
Details (click to expand)
Non-intrusive load monitoring (NILM) is a strategy to disaggregate a building's total energy consumption profile into individual appliance load profiles. This strategy can provide insight into individual consumer behavior for real-time pricing, target customers who may be due for an appliance upgrade, and enable building energy management systems (EMS) to enact demand response strategies such as load shifting for sheddable or curtailable loads. These strategies help reduce peaks in demand, thereby maintaining grid stability.
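The disaggregation idea can be illustrated with a brute-force combinatorial sketch in the spirit of classic NILM approaches: find the on/off combination of appliances whose rated powers best explain the aggregate reading. The appliance names and wattages are invented for illustration, and real NILM systems use far more scalable, learned models.

```python
from itertools import product

def disaggregate(total_w, appliance_ratings):
    """Return the on/off states whose summed rated powers best match the
    aggregate reading.

    `appliance_ratings`: {name: rated_watts}. Brute force over all 2^n
    combinations -- fine for a handful of appliances, exponential in general.
    """
    names = list(appliance_ratings)
    best_states, best_err = None, float("inf")
    for states in product((0, 1), repeat=len(names)):
        modeled = sum(s * appliance_ratings[n] for s, n in zip(states, names))
        err = abs(total_w - modeled)
        if err < best_err:
            best_states, best_err = states, err
    return {n: bool(s) for n, s in zip(names, best_states)}

# Hypothetical reading of 2150 W is best explained by fridge + kettle.
states = disaggregate(2150, {"fridge": 150, "kettle": 2000, "tv": 100})
```

Benchmark datasets with sub-metered ground truth are what make it possible to evaluate whether models like this attribute consumption correctly.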
Pecan Street DataPort requires non-academic and academic users to purchase access via licensing, priced by the number of building data features requested. Coverage is primarily concentrated in the Mueller planned housing community in Austin, Texas, a modern built environment that is not representative of older historical buildings that may need energy-efficient upgrades and retrofits. In customer segmentation studies and consumer-in-the-loop load consumption modeling, annual sociodemographic survey data may be too coarse to provide insight into the behavioral effects of household members on consumption profiles over time.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data is downloadable as a static file or accessible via the DataPort API. Under the licensing agreement, a small dataset is available free to academic individuals, with pricing for larger datasets. Commercial use requires paid access based on the desired features, with standard versus unlimited licensing.
U3: Usability > Usage Rights
Usage rights vary depending on the agreed-upon licensing agreement and use case.
S2: Sufficiency > Coverage
The Pecan Street dataset represents the same type of modern planned building, with similar floor plans and little variation between buildings. The initial dataset covers the Mueller community in Austin, Texas, which began development after 1999 and may not be a representative sample of older buildings, which are more likely to require upgrades to energy-efficient appliances and systems.
Additionally, the dataset primarily covers Texas, which is supplied by ERCOT, with some limited coverage in New York and California, and is growing to include Puerto Rico depending on volunteer participation. This could introduce self-selection bias, as households who participate are already interested in energy saving.
Enrollment of homes from older built environments and from different temperate regions, within the United States and globally, may provide greater insight into household engagement with energy consumption behavior and efficiency monitoring.
S6: Sufficiency > Missing Components
The data does not track real-time occupancy of individuals in the household, which could provide insight into behavioral effects on energy consumption. Adding this data could also improve customer segmentation models based on consumption, as patterns change with respect to time and day of the week. The data would also be amenable to consumer-in-the-loop energy management studies with respect to comfort, based on customer habitual activity, location in the house, and number of occupants.
S3: Sufficiency > Granularity
For customer segmentation studies by consumption patterns, the dataset contains annual survey responses from participants with respect to household demographics and home features, which may be too coarse in granularity for tracking how customer segments change over time.
For accurate NILM studies, benchmark datasets are required that include not only consumption but also generation. While some datasets include generation information, most studies do not take rooftop solar generation into account. Additionally, loads such as electric vehicles and battery storage were not present. Building-level data focused mostly on single-family housing units in specific areas, limiting the diversity of representation. Furthermore, most datasets are no longer maintained after study close.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
While all datasets are open for use, DRED can be accessed by request. GREEND can also be accessed by request.
U1: Usability > Structure
For the Non-intrusive Load Monitoring Toolkit, which provides researchers a toolkit for evaluating different algorithms on the datasets introduced, developers had to collate and fuse data from a variety of formats specific to each study. Using the REDD dataset format as inspiration, the toolkit developers created a standard energy disaggregation data format (NILMTK-DF) with common metadata and structure, which required manual dataset-specific preprocessing.
U5: Usability > Pre-processing
For the Non-intrusive Load Monitoring Toolkit, which provides researchers a toolkit for evaluating different algorithms on the datasets introduced, developers had to retrieve consumption data from a variety of sources with bespoke formats specific to each study. Using the REDD dataset format as inspiration, the toolkit developers created a standard energy disaggregation data format (NILMTK-DF) with common metadata and structure, which required manual dataset-specific preprocessing.
Sub-metered data relies on the sensor network installed in the building. Depending on the technology used, some sensors required calibration or were prone to malfunctions and delays. Additionally, interference from other devices could be present in the aggregate building-level readings, such as that experienced by REFIT, which needed to be addressed manually to enhance the usability of the dataset.
R1: Reliability > Quality
In the AMPds2 data, the sum of the sub-metered consumption data did not add up to the whole-house consumption due to rounding error in the meter measurements.
Datasets that required self-reporting may introduce participant bias, as the resolution with which households update occupancy information may vary. Additionally, volunteer households may have a pre-existing propensity for energy-efficient actions based on their enthusiasm for study participation, and may therefore not be representative of the general population. For example, UK-DALE participants were masters and PhD students from Imperial College, whose energy usage patterns and behaviors may not be representative of the general population.
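A reconciliation check in the spirit of the AMPds2 mismatch noted above could be sketched as follows. This is an illustrative sketch, not part of any released toolkit; the 5% tolerance is an assumption, since unmetered loads make an exact match unrealistic.

```python
def submeter_mismatch(mains_w, submeters_w, tol_fraction=0.05):
    """Return the unexplained fraction of the mains reading, or 0.0 if
    the sub-metered channels reconcile with the mains within tolerance.

    `mains_w`: whole-house power reading in watts.
    `submeters_w`: list of per-appliance readings in watts.
    """
    residual = abs(mains_w - sum(submeters_w)) / mains_w
    return 0.0 if residual <= tol_fraction else residual
```

Running such a check per timestamp would surface rounding errors, meter dropouts, or large unmetered loads before a dataset is used for benchmarking.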
S2: Sufficiency > Coverage
In the AMPds2 data there were some missing data from electricity, water, and natural gas readings. Additionally, there existed unmetered household loads that were not accounted for in the whole-house reading, and dishwasher consumption did not have a direct water meter for monitoring. REFIT also did not monitor appliances that could not be connected to individual appliance monitors, such as electric ovens. Depending on the built environment, even with the use of external plug-load sensors for disaggregated load consumption, some larger loads may not be able to be connected to building-level meters. For example, in the GREEND dataset, electric boilers in Austria were connected to separate meters. In the UMass smart home dataset, gas furnaces, exhaust fans, and recirculator pump loads could not be monitored.
AMPds2, DEDDIAG, DRED, iAWE, REDD, REFIT, and the UMass smart home dataset all gather data in single-family homes, which may not be representative of the diversity of buildings in terms of age, location, construction, and household demographics. REFIT data covers different single-family home types such as detached, semi-detached, and mid-terrace homes, ranging from 2 to 6 bedrooms and built between the 1850s and 2005. GREEND covers apartments in addition to single-family homes, but the number of households was only 9. AMPds2, DRED, and iAWE each cover only a single household. Additionally, datasets are specific to the location where the measurements were taken and are thus subject to the environmental conditions of the region as well as the culture of the population. For example, REDD consists of data from 10 monitored homes, which may not be representative of the common appliances contributing to overall load in the broader population outside of Boston.
COMBED contains complex load types that may rely on variable-speed drives as well as multi-state devices, which the other datasets do not contain. This may be due to the difference in building type but could also be due to a lack of diversity in appliance representation.
The ECO data relied on smart plugs for disaggregated load consumption measurements, which varied between households depending on smart-plug appliance coverage. For all households, the total consumption was not equal to the sum of the consumption measured from the plugs alone, indicating a high proportion of non-attributed consumption.
S6: Sufficiency > Missing Components
None of the datasets mentioned include electric vehicle loads. REDD, AMPds2, COMBED, DEDDIAG, DRED, GREEND, iAWE, and UK-DALE do not include generation from rooftop solar. REFIT contains solar from three homes, but these were not the focus of the study and were treated as solar interference to the aggregate load. The UMass smart home dataset had only one home with solar and wind generation, though at a significantly larger square footage and build compared to the other two featured homes.
While DRED provided occupancy information through data collected from wearable devices, and ECO and IDEAL through self-reporting and an infrared entryway sensor, the other studies did not.
Due to their lack of representation, the majority of datasets are not amenable to human-in-the-loop analysis of user behavior with respect to consumption patterns, response to feedback, and the effectiveness of load shifting in promoting energy-conserving behaviors.
While AMPds2 includes some utility data, most datasets do not incorporate billing or real-time pricing. This type of data would be beneficial, as it varies by time, season, region, and utility.
Battery storage was not taken into account in any of the building consumption datasets.
Offshore wind power forecasting: Long-term (3 hours-1 year)
Details (click to expand)
Long-term wind forecasting can allow for resource assessment studies for offshore energy production, wind resource mapping, and wind farm modeling.
Due to their location, FINO platform measurement sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather conditions amenable to human intervention. This directly affects data quality, with gaps that can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires sign up through a login account at: https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to measurement sensor failures. Issues with data loggers and power supplies, as well as adverse conditions such as low aerosol concentrations, can influence data quality. High wind and wave conditions impact the ability to correct or recalibrate sensors, creating data gaps that can last for several weeks or seasons.
S2: Sufficiency > Coverage
Coverage of wind farms is restricted to the dimensions of the platform itself and the wind farm it is built in proximity to. For locations with different offshore characteristics, similar testbed platforms or buoys could be developed.
S5: Sufficiency > Proxy
Because the sensors are exposed to ocean conditions and storms, FINO sensors often need maintenance and repair but are difficult to access physically. The resulting gaps in the data can be addressed by utilizing mesoscale wind modeling output.
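As a minimal sketch of the proxy approach, measured values can be preferred and mesoscale model output substituted where sensors failed; the series names and values below are invented for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical hourly wind-speed series: platform sensor with outages,
# plus mesoscale model output for the same hours (all values invented).
idx = pd.date_range("2021-01-01", periods=6, freq="h")
measured = pd.Series([8.1, np.nan, np.nan, 7.4, np.nan, 6.9], index=idx)
model = pd.Series([7.8, 8.3, 8.0, 7.1, 7.0, 6.5], index=idx)

# Prefer measurements; fall back to the mesoscale proxy where sensors failed.
filled = measured.combine_first(model)
# Flag which values came from the proxy so downstream QC can treat them separately.
source = np.where(measured.isna(), "model", "sensor")
```

Keeping a provenance flag alongside the filled series lets later analyses weight or exclude proxy values.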
Spatiotemporal coverage of the offshore meteorological and wind speed platform data is restricted to the dimensions of the platform itself as well as the time of construction. Depending on the data provider, access to the data may require signing a non-disclosure agreement.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to data must be requested, with different data providers having varying levels of restriction. For data obtained from Orsted, access is only provided after signing a standard non-disclosure agreement. For more information, email R&D at datasharing@orsted.com.
S2: Sufficiency > Coverage
Spatiotemporal coverage of the dataset varies depending on the construction of the platform testbed and its location, but overall data is available from 2014 to the present. While LiDAR measurements have higher resolution than wind mast data, sensor information is still restricted to the dimensions of the platform and the associated offshore wind farm when present. Data provided by Orsted from LiDAR sensors includes 10-minute statistics.
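The 10-minute statistics mentioned above can be derived from higher-rate samples roughly as follows; the 1 Hz sampling rate and the values are assumptions for illustration only.

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1 Hz LiDAR wind-speed samples for one hour (illustrative values).
idx = pd.date_range("2021-06-01", periods=3600, freq="s")
ws = pd.Series(10 + rng.normal(0, 1.2, size=3600), index=idx)

# Reduce to the 10-minute statistics typically distributed with such datasets.
stats = ws.resample("10min").agg(["mean", "std", "min", "max"])
```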
Offshore wind power forecasting: Short-term (10 min)
Details (click to expand)
Short-term wind forecasting can enable estimation of active power generated by wind farms in the absence of curtailment.
Data obtainability is achieved by requesting access via the Orsted form. Sufficiency of the dataset is constrained by volume: only a finite number of offshore wind farm datasets exist. Expanding the coverage area, volume, and time granularity of the data to under 10 minutes may enable transient detection in generated active power.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Need to request access via form from Orsted.
S1: Sufficiency > Insufficient Volume
Would require data from multiple wind farms over a variety of regions to enable a more accurate comparison against weather model data.
S2: Sufficiency > Coverage
Coverage is over Europe specifically; offshore wind conditions vary by environment and cannot simply scale or transfer to other temperate regions of the world.
S3: Sufficiency > Granularity
The time granularity of 10 minutes is too coarse to capture transients in generated active power.
S4: Sufficiency > Timeliness
Only two years' worth of data (2016-2018) is provided; this could be addressed through additional data collection from offshore wind farms or through simulation.
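The granularity gap above can be illustrated numerically: averaging to 10-minute intervals attenuates a short transient that is clearly visible at native resolution (signal values invented for illustration).

```python
import numpy as np

# Illustrative only: a 30-second active-power transient sampled at 10 s,
# versus the same signal reduced to 10-minute averages.
t = np.arange(0, 3600, 10)                # one hour at 10 s resolution
power = np.full(t.shape, 5.0)             # MW, steady output
power[100:103] = 9.0                      # a 30 s spike in active power

avg_10min = power.reshape(-1, 60).mean(axis=1)  # 60 samples per 10-min bin

peak_native = power.max()        # 9.0 MW: transient visible
peak_coarse = avg_10min.max()    # ~5.2 MW: transient nearly averaged away
```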
Post-disaster damage assessment
Details (click to expand)
Post-disaster evaluations are crucial for identifying vulnerabilities exposed by climate-related events, which is essential for enhancing resilience and informing climate adaptation strategies. ML can help by rapidly identifying and quantifying damage, such as structural collapse or vegetation loss, thereby improving response and recovery efforts.
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes.
The resolution of publicly available datasets is insufficient for accurate damage assessments. To improve this, some commercial high-resolution images should be made accessible for research purposes.
Data Gap Type
Data Gap Details
S4: Sufficiency > Timeliness
Both pre- and post-disaster imagery are needed, but pre-disaster imagery is sometimes outdated and does not reflect conditions immediately before the disaster.
S3: Sufficiency > Granularity
Accurate damage assessment requires high-resolution images, but the resolution of current publicly open datasets is inadequate for this purpose. While companies offer high-resolution images, they are often prohibitively expensive. To address this, some commercial high-resolution images should be made available for research purposes at no cost.
Data is highly biased towards North America. Similar datasets focusing on other parts of the world are needed. Additionally, the dataset should include more detailed information on the severity of the damage.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is no differentiation of grades of damage. More granular information about the severity of damage is needed.
S2: Sufficiency > Coverage
Data is also highly biased towards North America. Data from other parts of the world is highly needed.
Short-term electricity load forecasting
Details (click to expand)
Short-term load forecasting (STLF) is critical for utilities to balance demand with supply. Utilities need accurate forecasts (e.g., on the scale of hours, days, weeks, up to a month) to plan, schedule, and dispatch energy while decreasing costs and avoiding service interruptions. Furthermore, for grids that may have portions privatized, utilities rely on forecasts to procure (i.e., source and purchase) energy to meet demand. In peak conditions, where loads have been underestimated, utilities have limited options. One option is to utilize reserve capacity, or additional electric supply, to ensure reliable power to customers; this usually entails recruiting expensive fossil-fuel peaker plants in city centers to meet immediate demand over short distances. Another option is for the utility to initiate an outage to clip peaks. In the worst case, grid assets can be overloaded, resulting in system failure and unplanned blackouts. Because STLF relies on historical electricity load data, weather forecasts, calendar time (day, week, or month), and continuous streams of advanced metering infrastructure (AMI) data, machine learning models are well suited to this task: they can handle large amounts of data and capture non-linearities that traditional linear models may struggle with.
AMI data is challenging to obtain without pilot-study partnerships with utilities, since data collection on individual building consumer behavior can infringe on customer privacy, especially at the residential level. The granularity of the time series can also vary depending on the level of access (aggregated and anonymized or not) and on the resolution of the metering system. Additionally, coverage is limited to utility pilot-test service areas, restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult due to privacy concerns. Even when partnered with a utility, AMI data may undergo anonymization and aggregation to protect individual customers. Some ISOs can provide data if a written records request is submitted. If requesting personal consumption data, tiered pricing enrollment (for example, time-of-use) may limit the temporal resolution of data the utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
Using AMI data jointly with other data that may influence demand, such as weather data, rooftop solar availability, presence of electric vehicles, building specifications, and appliance inventories, may require significant additional data collection or retrieval.
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted for evaluation by issda@ucd.ie. For data obtained through utility collaborative partnerships, usage rights may vary.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on formats of data stream output from the sensor network system installed.
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open.
S2: Sufficiency > Coverage
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware, ranging from 1-hour to 30-minute to 15-minute measurement intervals. Depending on the level of anonymization and aggregation, granularity may also be constrained by factors such as the cadence of time-of-use pricing and other tiered demand response programs employed by the partnering utility.
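One practical consequence of mixed meter resolutions is the need to coarsen interval energy readings to a common interval before joint analysis. A minimal sketch, with invented readings:

```python
import pandas as pd

# Hypothetical kWh interval readings from meters at different resolutions.
m15 = pd.Series(
    [0.4, 0.5, 0.3, 0.6, 0.5, 0.4, 0.7, 0.2],
    index=pd.date_range("2023-01-01 00:00", periods=8, freq="15min"),
)
m30 = pd.Series(
    [0.9, 1.1, 0.8, 1.0],
    index=pd.date_range("2023-01-01 00:00", periods=4, freq="30min"),
)

# Energy readings are summed (not averaged) when coarsening to a common hour.
hourly_15 = m15.resample("h").sum()
hourly_30 = m30.resample("h").sum()
combined = pd.DataFrame({"meter_a": hourly_15, "meter_b": hourly_30})
```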
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gais Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change with time. Similarly, pilot programs through participating utilities are finite in nature.
Faraday synthetic AMI data is a response to the bottlenecks faced in retrieving building-level demand data due to consumer privacy. However, despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and presence of low-carbon technology. Furthermore, since the model is trained on UK building data, the generated AMI time series may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification against real data, or against aggregated substation-level data, to assess its effectiveness.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Faraday is currently accessible through Centre for Net Zero’s API.
The variational autoencoder model can generate synthetic AMI data conditioned on several inputs. The presence of low-carbon technology (LCT) for a given household or property type depends on access to battery storage, rooftop solar panels, and electric vehicles; this type of data may require curating LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric based on actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid modeling manner. For example, load demand at the substation level can be estimated as the sum of the individual building loads the substation serves; this value can then be compared to the actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, assessing the accuracy of a specific demand profile per property or group of properties would require identifying a population of buildings, a connected real-world substation, and the residential low-carbon technology investment for the properties under study.
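The bottom-up verification described above can be sketched as follows; the building count, loss multiplier, and all values are hypothetical stand-ins for real DNO data.

```python
import numpy as np

# Hypothetical: synthetic half-hourly profiles for buildings served by one
# substation, plus the substation's measured feeder load (values invented).
rng = np.random.default_rng(42)
n_buildings, n_intervals = 200, 48
synthetic = rng.uniform(0.1, 2.0, size=(n_buildings, n_intervals))  # kW
losses = 1.04  # assumed distribution-loss multiplier (illustrative)

# Bottom-up estimate: sum the individual building loads the substation serves.
bottom_up = synthetic.sum(axis=0) * losses
measured = bottom_up * rng.normal(1.0, 0.05, n_intervals)  # stand-in for real data

# Mean absolute percentage error between aggregate estimate and measurement.
mape = np.mean(np.abs(bottom_up - measured) / measured) * 100
```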
S2: Sufficiency > Coverage
Faraday is trained on utility-provided AMI data from the UK, which may not be representative of the load demand, building stock, and climate of other global regions.
Coverage of data is restricted to the pilot test bed whether it be through private collection, partnership with the utility, or use of pre-existing demand data.
S3: Sufficiency > Granularity
Data granularity is limited to the granularity of data the model was trained on.
S4: Sufficiency > Timeliness
Timeliness of dataset would require continuous integration and development of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy’s OpenSynth initiative, Centre for Net Zero hopes to build a global community of contributors to facilitate research.
Smart inverter management for distributed energy resources
Details (click to expand)
Distributed energy resources (DERs) such as solar photovoltaics and energy storage systems are part of low-inertia power systems that do not rely on traditional rotating machinery. These DERs rely on distributed inverters to convert power from DC to AC, typically configured at unity power factor. As an alternative to unity power factor, inverters can be “smart”, dynamically managing the effects of intermittency before feeding power back to feeder circuits at the distribution substation level. Smart inverters can perform Volt-VAR (Voltage-VAR) and Volt-Watt (Voltage-Watt) operations, adjusting the inverter's reactive and active power output in response to measured voltage to maintain grid stability. In other words, the DER inverter is controlled to dynamically adjust reactive power injection back into the grid. This is crucial for preventing voltage sags and swells that can occur due to the integration of DERs into the grid.
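A Volt-VAR characteristic of the kind described is typically a piecewise-linear curve mapping measured voltage to a reactive power setpoint. The sketch below uses illustrative breakpoints, not the defaults of any particular standard.

```python
def volt_var_setpoint(v_pu, q_max):
    """Piecewise-linear Volt-VAR curve (illustrative breakpoints only):
    inject reactive power when voltage is low, absorb when high, with a
    deadband around nominal voltage."""
    # (voltage in per-unit, reactive power as a fraction of q_max)
    points = [(0.92, 1.0), (0.98, 0.0), (1.02, 0.0), (1.08, -1.0)]
    if v_pu <= points[0][0]:
        return q_max * points[0][1]
    if v_pu >= points[-1][0]:
        return q_max * points[-1][1]
    # Linear interpolation between the surrounding breakpoints.
    for (v1, q1), (v2, q2) in zip(points, points[1:]):
        if v1 <= v_pu <= v2:
            frac = q1 + (q2 - q1) * (v_pu - v1) / (v2 - v1)
            return q_max * frac

# Full injection at a sag, zero in the deadband, full absorption at a swell:
volt_var_setpoint(0.90, 100.0)  # 100.0 kVAR injected
volt_var_setpoint(1.00, 100.0)  # 0.0 (deadband)
volt_var_setpoint(1.08, 100.0)  # -100.0 kVAR absorbed
```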
There is a need to enhance existing simulation tools to study inverter-based power systems rather than traditional machine-based ones. Simulations should be able to represent a large number of distribution-connected inverters, incorporating DER models into the larger power system. Uncertainty surrounding model outputs will require verification and testing. Furthermore, access to simulations and hardware-in-the-loop facilities requires submitting a user access proposal for NREL's Energy Systems Integration Facility; similar testing laboratories may require access requests and funding.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Contact NREL (precise@nrel.gov) for access to the PRECISE model.
Submit an Energy Systems Integration Facility (ESIF) laboratory request form to userprogram.esif@nrel.gov to gain access to hardware-in-the-loop inverter simulation systems. Access to particular hardware may require collaboration with inverter manufacturers, which may have additional permission requirements.
R1: Reliability > Quality
The optimization routine of the simulation model may face challenges in determining the precise balance between grid operation criteria and impacts on customer PV generation. Generation may still require curtailment by the utility to prioritize grid stability. To circumvent this gap external data on distribution side operating conditions, load demand, solar generation, and utility-initiated generation curtailment can be collected and introduced into expanded simulation studies.
Smart inverter operational data is not publicly available and requires partnerships with research labs, utilities, and smart inverter manufacturers. However, the California Energy Commission maintains a database of UL 1741-SB-compliant manufacturers and smart inverter models that can be contacted for research partnerships. In terms of coverage, while California and Hawaii are moving towards standardizing smart inverter technology in their power systems, regions outside the United States may locate similar manufacturers through partnerships and collaborations.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
For the CEC database in particular, one will need to contact the CEC or the manufacturer for additional information on a given smart inverter. Detailed studies using smart inverter hardware may require collaboration with a utility and a research organization.
U2: Usability > Aggregation
To retrieve additional data beyond the single entry model and manufacturer of a particular smart inverter, one may need to contact a variety of manufacturers to get access to datasets and specifications for operational smart inverter data, laboratories to get access to hardware in the loop test centers, and utilities or local energy commissions for smart inverter safety compliance and standards.
S2: Sufficiency > Coverage
New grid support functions defined by UL 1741-SA and UL 1741-SB are optional but will be required in California and Hawaii, so public manufacturing data is currently available only via the CEC website. Collaboration with manufacturers outside the US may be necessary to compile a similar database, and contact with utilities can provide a better understanding of UL 1741-SB criteria adoption elsewhere.
Solar installation site assessment
Details (click to expand)
Statistical analysis on solar PV system components for pricing, logistics, planning, and site capacity studies is an important part of the process for siting solar PV systems. Spatiotemporal generation forecasting using pre-existing site data can be used to inform future site recommendations, policy, and decision making with respect to new developments.
The LBNL solar PV system dataset excludes third-party-owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap: some of the historical data may not reflect current pricing and costs of PV systems.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excludes third-party-owned systems, systems with battery backup, self-installed systems, and entries missing installation prices. Data was self-reported and may be inconsistent in how component costs were reported. Furthermore, some state markets were underrepresented or missing, which can be alleviated by new data collection or by using the dataset jointly with simulation studies.
S4: Sufficiency > Timeliness
Dataset includes historical data which may not reflect current pricing for PV systems. To alleviate this, updated pricing may be incorporated in the form of external data or as additional synthetic data from simulation.
The USPVDB data must be accessed through the United States Geological Survey (USGS) mapper browser application or downloaded as GIS data in the form of shapefiles or GeoJSONs. Tabular data and metadata are provided in CSV and XML formats. Coverage of the dataset is isolated to the US, specifically densely populated regions. Supplementing the data with international large-scale photovoltaic satellite imagery can expand the coverage area.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data may be accessed through the USGS's designated USPVDB mapper or downloaded as GIS shapefiles, tabular data, or XML metadata. Data is open and easily obtainable.
S2: Sufficiency > Coverage
Coverage is over the US and specifically over densely populated regions that may or may not correlate to areas of low cloud cover and high solar irradiance. Representation of smaller scale private PV systems could expand the current dataset to less populated areas as well as regions outside the US.
Solar power forecasting: Long-term (>24 hours)
Details (click to expand)
Longer-term solar forecasts are beneficial for energy market pricing, investment decisions, and integration with other renewable energy sources such as hydroelectric plants to allow for larger scale coordination and grid operational studies. Additionally, inclusion of energy storage systems to harvest solar energy on longer time scales can be better aligned with longer term demand forecasting and predicted solar peaks.
While the synthetic PV plant data is beneficial for forecasting and control simulation case studies when actual data is not available, there are limitations with respect to verification for site-specific projects, representation of coverage areas outside of the US, and modeling assumptions based on data proxies that must be taken into account when interpreting results.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
May not be suitable for site-specific projects because the simulation output may not be representative of a particular region, requiring additional outside data or adaptation of the simulation to better represent the spatiotemporal region of interest.
Simulated PV is based on numerical weather prediction and sub-hour irradiance algorithms for the day-ahead and 5-minute data, respectively. These serve as a proxy for real PV data and may require verification with supplemental data or measurements from solar power inverter outputs.
S2: Sufficiency > Coverage
Synthetic data is based on solar conditions over the US for 2006 and may not be suitable for other locations.
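Verifying synthetic plant output against measured inverter data, as suggested above, typically comes down to simple bias and error metrics; the values below are invented for illustration.

```python
import numpy as np

# Hypothetical comparison of synthetic plant output against measured inverter
# data for the same hours (values invented for illustration).
simulated = np.array([0.0, 1.2, 3.5, 5.1, 4.8, 2.9, 0.4])  # MW
measured = np.array([0.0, 1.0, 3.8, 5.0, 4.5, 3.1, 0.5])   # MW

# Common verification metrics for proxy data: bias and RMSE.
mbe = np.mean(simulated - measured)                  # mean bias error
rmse = np.sqrt(np.mean((simulated - measured) ** 2))
```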
Solar power forecasting: Medium-term (6-24 hours)
Details (click to expand)
Medium-term solar forecasts can be beneficial for simulation case studies in demand response, microgrid behavior, electricity markets, and solar site planning.
Depending on the region of interest, data can be retrieved from different open-data satellites, both geostationary and swath, which may differ in spatial and temporal resolution and coverage area. Additionally, multispectral data may pose challenges in preprocessing and preparing the data for analysis. Specifically for medium-term solar forecasting, actual ground irradiance may differ from approximations made by models that utilize satellite-derived cloud cover products, because different cloud types have different impacts on irradiance. Supplementation with ground-based measurements for verification and improvements in granularity are suggested solutions.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data can be retrieved from different open-data satellites, both geostationary and swath, requiring collation of data if multiple regions of interest are selected.
U5: Usability > Pre-processing
Multispectral remote sensing data may require preprocessing depending on the wavelength, band combinations, and satellite products chosen for data analysis and model training. Solar forecasting may require band combinations in the visible and infrared spectra.
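A minimal sketch of combining visible and infrared bands into a single model input, assuming invented band arrays rather than any particular satellite product:

```python
import numpy as np

# Hypothetical preprocessing: stack visible and infrared channels from a
# satellite scene into a single model input (band names/values invented).
rng = np.random.default_rng(7)
visible = rng.uniform(0, 1, size=(64, 64))        # reflectance, 0-1
infrared = rng.uniform(200, 300, size=(64, 64))   # brightness temperature, K

# Normalize each band before stacking so scales are comparable.
ir_norm = (infrared - infrared.min()) / (infrared.max() - infrared.min())
stacked = np.stack([visible, ir_norm], axis=-1)   # shape (64, 64, 2)
```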
R1: Reliability > Quality
Different cloud types impact actual ground irradiance differently, requiring verification and supplementation with ground-based cloud cover sensor data.
S3: Sufficiency > Granularity
Spatial and temporal resolution can vary depending on the satellite selected for cloud cover and solar irradiance forecasting over a region of interest. Accurately forecasting changes in global irradiance during partly cloudy days can be difficult due to the variability of cloud coverage over short time frames.
Solar power forecasting: Short-term (30 min-6 hours)
Details (click to expand)
Hourly site-specific solar forecasting can assist with solar energy estimates based on measured irradiance, photovoltaic inverter output energy, and turbine level output. Forecasting at this level can prove beneficial for joint distributed energy resource and energy storage microgrid scheduling studies, and system reliability studies.
While NOAA's SOLRAD is an excellent data source for long-term solar irradiance and climate studies, data gaps exist for the short-term solar forecasting use case (which requires hourly averages). The data quality of hourly averages is lower than that of native-resolution data, impacting effective short-term forecasting for real-time energy management, demand response, real-time market price prediction, and dispatch. Coverage is also constrained to certain parts of the United States based on the SOLRAD network locations.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Coverage area is constrained to SOLRAD network locations in the United States, namely: Albuquerque, NM; Bismarck, ND; Hanford, CA; Madison, WI; Oak Ridge, TN; Salt Lake City, UT; Seattle, WA; Sterling, VA; and Tallahassee, FL. For the dataset to generalize to other regions, regions with similar climates and temperate zones would have to be identified.
S3: Sufficiency > Granularity
The data quality of the hourly averages is lower than that of the native-resolution data, which can impact effective short-term forecasting for real-time energy management, demand response, real-time market price prediction, and dispatch. To address this, either use the data only at very short horizons, or utilize additional data such as sky imagers and other sensors with frequent measurement outputs.
While data coverage is global and derived from satellite imagery fed into the Fast All-sky Radiation Model (FARM), a radiative transfer model, the output is calculated over specific time frames and would need to be recalculated and updated for recent years. Furthermore, the data is unbalanced: the United States has the longest temporal coverage. Satellite-based estimation of solar resource information may be susceptible to cloud cover, snow, and bright surfaces, requiring additional verification from ground-based measurements and collation of outside data sources. Additionally, since the data is derived from satellites, it may require preprocessing to account for parallax effects, which depend on the field of view of the coverage satellite and the region of interest and may not be expressed in the FARM higher-level tabular products.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Since the data is derived from satellite imagery, it may require pre-processing to account for pixel variability and parallax effects, plus additional radiative transfer modeling to improve solar radiation estimates.
R1: Reliability > Quality
Satellite-based estimation of solar resource information for sites susceptible to cloud cover, snow, and bright surfaces may not be accurate, thereby requiring verification from ground-based measurements.
S4: Sufficiency > Timeliness
Data flow from satellite imagery to solar radiation output from the Fast All-sky Radiation Model for Solar applications (FARMS) needs to be recalculated and updated to extend beyond the current coverage years of the represented global regions. For information on the coverage area and years covered, visit this link.
While NREL's SRRL BMS provides real-time joint-variable data from ground-based sensors, coverage is restricted to the sensor network in Golden, CO, in the United States. Since the measurement system comprises diverse sensors, individual sensors may malfunction or drift out of calibration, requiring human detection and maintenance; delays in detection can lead to inaccuracies in the data.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Instrument malfunction or calibration drift requires human intervention; if detection is delayed, the measured quantities become inaccurate, affecting solar forecast accuracy. Despite this, the dataset continues to be maintained.
S2: Sufficiency > Coverage
Coverage is restricted to Golden, CO, though other locations could also benefit from similar sensor monitoring systems, especially those with variations in weather patterns that affect solar irradiance forecasting and thereby energy harvesting.
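As a hedged illustration of how delayed drift detection might be automated, the sketch below (entirely synthetic data; the 15 W/m² tolerance and 50-sample window are assumptions, not SRRL practice) flags calibration drift by comparing a sensor against a co-located reference using a rolling mean of their difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
truth = 600 + 50 * np.sin(np.linspace(0, 6, n))      # "true" irradiance signal
sensor = truth + rng.normal(0, 5, n)                 # monitored sensor with noise
sensor[300:] += np.linspace(0, 40, n - 300)          # calibration drift starts at t=300
reference = truth + rng.normal(0, 5, n)              # co-located reference sensor

# Flag drift when the rolling mean difference exceeds a tolerance.
diff = sensor - reference
window = 50
rolling = np.convolve(diff, np.ones(window) / window, mode="valid")
flagged = int(np.argmax(np.abs(rolling) > 15.0)) + window - 1  # end of first offending window

print("drift first flagged near sample", flagged)
```

A real deployment would compare redundant instruments or model-predicted values rather than an unobservable "truth", but the rolling-difference idea carries over.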
PV Anlage-Reinhart system information for PV systems, collated and compiled by SMA with PV inverter data, requires creating a user profile and requesting access to specific systems, may lack clear instructions in languages other than German, and over-represents systems located in Germany, the Netherlands, and Australia, despite the presence of data globally. Furthermore, a subset of the curated systems contains joint energy storage data, which may be valuable for DER-specific load forecasting studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Users need to utilize the web interface or create a user profile and request membership to be granted access to additional data or data in a desired format. For immediate use, data cannot be downloaded in zip or raw format and must be scraped from the web browser portal to be used for free. Please contact SMA for membership/usage rights.
U4: Usability > Documentation
Documentation is primarily in German, and the English version of the website lacks the same level of detail. Companion works utilizing the data are not readily cited or linked. Since data can only be viewed via the portal unless express permission is given to download it, language barriers can make the displayed data values challenging to interpret.
S2: Sufficiency > Coverage
Coverage depends on the country of interest, with greater representation of PV system data in Europe. Across the countries represented, measurements vary from one system per country to 43,665 systems per country. Germany, the Netherlands, and Australia have more PV system testbeds than other global regions. Furthermore, some testbed systems have additional information on battery storage, though this is inconsistent across most testbeds. This gap can be addressed by increasing the amount of privately contributed system data from diverse regions to supplement those already curated by SMA.
While SOLETE is well suited to joint wind-solar DER forecasting studies at the inverter level, the dataset can be improved by addressing several gaps in data sufficiency, namely expanding the temporal coverage to include seasonal variations, which may be addressed with additional outside data or simulation. Outside data or simulation may also help scale the study to multiple generation sources (more than one PV array and wind turbine) and the coordination between them to maintain grid reliability and stability. Additionally, a data wish for SOLETE is the addition of maintenance schedules or system downtime data to model system dynamics with DERs more realistically.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
SOLETE only covers a single wind turbine and PV array, which does not capture scenarios with multiple generation sources of the same type that require coordination. This gap can be alleviated by physically expanding the network to include multiple PV arrays and wind turbines, or by combining SOLETE with data from outside sources such as utilities, power electronics firms, and energy tech companies that may have similar datasets, to perform larger coordination and grid control studies.
S2: Sufficiency > Coverage
The temporal coverage of SOLETE is limited to 15 months which is unable to capture long-term seasonal variations in joint wind and irradiance data.
S3: Sufficiency > Granularity
The resolution and sampling rate of the joint dataset can impact the precision of the analysis, especially when fusing data of different temporal resolutions, from second-level to hourly. Aggregating second-level data to hourly data may affect the outcomes of joint short-term solar and wind forecasting.
S6: Sufficiency > Missing Components
SOLETE does not include maintenance schedule data or system downtimes that occurred during data gathering. Retroactively assessing and supplementing the dataset with maintenance schedule data either by simulation or actual data collection from SYSLAB records may improve system forecasting and modeling to include uncertainties with respect to scheduled system maintenance.
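A minimal sketch of the resolution-fusion issue raised above, using hypothetical series standing in for SOLETE's channels: the fast (1-second) signal is averaged down to a common 1-minute grid while the slow (hourly) signal is forward-filled onto it.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-resolution inputs: 1-second PV power and hourly wind speed.
sec_idx = pd.date_range("2024-01-01", periods=7200, freq="1s")  # two hours
pv_power = pd.Series(np.linspace(0, 100, 7200), index=sec_idx)

hr_idx = pd.date_range("2024-01-01", periods=2, freq="1h")
wind_speed = pd.Series([4.0, 6.0], index=hr_idx)

# Align to a common 1-minute grid: average the fast signal, forward-fill the slow one.
pv_1min = pv_power.resample("1min").mean()
wind_1min = wind_speed.reindex(pv_1min.index, method="ffill")
joint = pd.concat({"pv": pv_1min, "wind": wind_1min}, axis=1)

print(joint.head())
```

Note that forward-filling the hourly channel discards any sub-hourly wind variability, which is exactly the precision loss the granularity gap describes.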
Solar power forecasting: Very-short-term (0-30min)
Details (click to expand)
Very-short-term solar power forecasting is critical for time series irradiance forecasting and solar ramp event identification. Solar irradiance ramp events can be defined as sudden changes in solar irradiance within a short time interval. These events are often caused by transient clouds that can lead to abrupt fluctuations in the incoming solar energy. Cloud analysis using cloud segmentation and classification as a proxy for determining solar irradiance attenuation can assist in determining solar generation for photovoltaics and concentrated solar power towers. Solar generation predictions are important for real-time electricity market and pricing studies, real-time dispatch of other generating sources, and energy storage control studies.
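A ramp event of the kind defined above can be flagged with a simple threshold rule. The sketch below uses a hypothetical 1-minute GHI trace and an illustrative threshold of 100 W/m² over a 5-minute window (both values are assumptions, not a standard definition):

```python
import numpy as np

# Toy 1-minute GHI trace (W/m^2) with a cloud-induced dip and recovery.
ghi = np.array([820, 815, 810, 600, 420, 400, 405, 700, 810, 815], dtype=float)
window, threshold = 5, 100.0  # 5-minute window, 100 W/m^2 change

# A ramp event: |change over the window| exceeds the threshold.
events = []
for t in range(len(ghi) - window):
    delta = ghi[t + window] - ghi[t]
    if abs(delta) > threshold:
        events.append((t, "down" if delta < 0 else "up"))

print(events)
```

Operational definitions vary (some use normalized or clear-sky-indexed changes), so the window and threshold should be tuned to the market or control application at hand.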
The ARM dataset includes data from various DOE sites with sensor information from sun-tracking photometers, radiometers, and spectrometers, which is helpful for understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets that can be challenging to store, stream, analyze, and archive; may be sensitive to sensor noise; and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. Automating ingestion and analysis using artificial intelligence can alleviate volume pressures by compressing or reducing data storage and provide novel ways to index and access the data.
R1: Reliability > Quality
Data quality from ARM site sensors can be sensitive to noise and calibration issues requiring field specialists to identify potential problems. Since data volume is large, ingestion of data and identification of measurement drift benefit from automation.
S2: Sufficiency > Coverage
Spatial coverage of radiation and associated ground-based atmospheric phenomena is limited to ARM sites within the United States. To increase spatial context, collaboration with partner sensor network sites through the DOE and the ARM program can expand coverage within the United States. Similar initiatives outside the United States would enable better solar potential studies in regions with different environments.
S3: Sufficiency > Granularity
There is a need for enhanced aerosol composition measurements, in addition to ice-nucleating particle measurements, to better understand cloud and weather dynamics jointly with solar irradiance for DER site planning and solar potential surveying.
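As a rough illustration of the volume-reduction idea mentioned under the large-volume gap, the sketch below (synthetic 1 Hz data; precision reduction plus zlib compression are illustrative choices, not ARM practice) compares raw and compressed storage for one day of radiometer readings:

```python
import os
import tempfile
import numpy as np

# Hypothetical high-rate radiometer stream: float64 at 1 Hz for one day.
data = np.random.default_rng(1).normal(500, 80, 86_400)

raw_path = os.path.join(tempfile.mkdtemp(), "day.npy")
np.save(raw_path, data)  # raw float64 on disk

# Reduced-precision, compressed storage: float32 inside a zlib-compressed archive.
comp_path = raw_path.replace(".npy", ".npz")
np.savez_compressed(comp_path, ghi=data.astype(np.float32))

print("raw bytes:       ", os.path.getsize(raw_path))
print("compressed bytes:", os.path.getsize(comp_path))
```

Chunked, indexed formats (e.g., NetCDF or Zarr) would be the production-grade route; this only demonstrates that even naive precision and compression choices roughly halve storage.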
Data coverage is limited to the NIST campus in Gaithersburg, MD, and the dataset has not been maintained since July 2017.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Since the testbeds are located on the NIST campus, spatial coverage is limited to the institution's site. Similar datasets combining sensor information on solar irradiance conditions with the associated solar power generated at the inverter output would require investment in comparable site-specific testbeds in different regions.
S4: Sufficiency > Timeliness
The dataset has not been maintained since July 2017. Given the investment in equipment for the project, it may be worth revisiting to study long-term changes in panel solar efficiency with respect to time and operational degradation.
Data from Solcast is accessible via collaborating academic or research institutions. Solcast uses coarse surface elevation models aligned with reanalysis data, leading to significant elevation differences between ground data sites and cell height. While a global dataset, coverage is limited to 33 sites, with 18 in tropical/subtropical locations and 15 in temperate locations. Time granularity is between 5 and 60 minutes.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
Data is only accessible via a collaborating academic or research institution.
R1: Reliability > Quality
Significant changes in elevation between sites can impact clear-sky solar irradiance estimation due to variations in the atmosphere that radiation travels through and interacts with. Furthermore, ground stations may sit above clouds, which can affect the accuracy of satellite-derived solar irradiance estimates. Solcast uses a coarse surface elevation model aligned with the underlying reanalysis data, which may lead to significant elevation differences between ground data sites and the cell height.
S2: Sufficiency > Coverage
While data products cover global sites, only 33 sites are included, with 18 representing tropical and subtropical regions and 15 temperate regions. Further work would be needed to create data products for regions outside the 33 represented, as well as for areas with different environmental conditions.
S3: Sufficiency > Granularity
The current time granularity of the dataset ranges across products from 5- to 60-minute resolution. For forecasting at horizons shorter than 5 minutes, supplemental data would be needed.
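For context on clear-sky estimation, here is a minimal sketch of the Haurwitz clear-sky model (zenith-angle-only; the coefficients follow the commonly cited form and are treated as illustrative). Note that it ignores site altitude entirely, which is exactly why the elevation mismatches described above matter; models such as Ineichen-Perez do take altitude into account.

```python
import math

def haurwitz_ghi(zenith_deg):
    """Approximate clear-sky GHI (W/m^2) from the Haurwitz model.

    Coefficients as commonly cited; illustrative only. No altitude term.
    """
    cz = math.cos(math.radians(zenith_deg))
    if cz <= 0:
        return 0.0  # sun below the horizon
    return 1098.0 * cz * math.exp(-0.059 / cz)

for z in (0, 30, 60, 75):
    print(f"zenith {z:>2} deg -> ~{haurwitz_ghi(z):.0f} W/m^2")
```

Even this toy model shows how strongly clear-sky GHI depends on solar geometry; adding altitude and turbidity terms is what elevation-aware models contribute.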
Data coverage and granularity are limited by the locations of the cameras and constrained to 10-minute increments. Resolution is also limited to 352×288 24-bit JPEG images (see device specifications).
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Coverage is constrained by the location of the sensor network, the number of sensors within the network, and the spatial distances between sensors. To improve coverage, similar sensor networks must be created in different environmental conditions with varying granularity.
S3: Sufficiency > Granularity
Image resolution is limited to 352×288 24-bit JPEG images taken at 10-minute increments, per device specifications. Studies fusing information from other sensors with multispectral capabilities, or that measure additional quantities, may provide information that facilitates better solar irradiance predictions and models the effect of water vapor.
S2: Sufficiency > Coverage
The current dataset is derived from two previous sky imager datasets in Singapore. Studies extending beyond Singapore would require a similar sensor testbed network, or the collation and curation of sky image datasets over different climatic environments, in addition to the use of proxy data that may not be ground-based (remote sensing).
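The classic red/blue-ratio heuristic is a common baseline for the cloud segmentation task these sky images support. The sketch below runs it on a synthetic image; the 0.6 threshold and pixel values are illustrative assumptions and would need per-camera tuning:

```python
import numpy as np

# Toy "sky image": blue-dominant clear-sky pixels with a whitish cloud patch.
h, w = 64, 64
img = np.zeros((h, w, 3))
img[..., 0] = 60.0          # red channel: low over clear sky
img[..., 2] = 200.0         # blue channel: dominant over clear sky
img[20:40, 20:40] = 180.0   # "cloud": red and blue roughly equal (whitish)

# Red/blue-ratio threshold: clouds scatter red and blue similarly, so R/B
# approaches 1, while clear sky stays strongly blue (low R/B).
ratio = img[..., 0] / np.clip(img[..., 2], 1.0, None)
cloud_mask = ratio > 0.6    # illustrative threshold; tune per camera

print("cloud fraction:", cloud_mask.mean())
```

Learned segmentation models outperform this heuristic, but it remains a useful sanity check and weak-label generator when annotated masks are scarce.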
There is a need for annotated sky image data with labels for cloud detection and segmentation to improve local and PV-site-specific irradiance predictions. The data is ultimately constrained to the coverage area of Singapore and restricts users from commercial use.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The dataset is under a Creative Commons license. Commercial use is not allowed, and users need to request access to the data via the form.
S1: Sufficiency > Insufficient Volume
There is a need for a larger dataset containing manually annotated cloud mask labels. The current dataset is also unbalanced, with less nighttime data; though this may not directly impact solar irradiance studies, it does affect cloud dynamics, which may be crucial for forecasting irradiance at different times and timescales.
Terrestrial wildlife detection and species classification
Details (click to expand)
Terrestrial wildlife detection and species classification are essential for understanding the impacts of climate change on terrestrial ecosystems. Similarly to marine wildlife studies, ML can greatly improve these efforts by automatically processing large volumes of data from diverse sources, enhancing the accuracy and efficiency of monitoring and analyzing terrestrial species.
The first and foremost challenge of bioacoustic data is its sheer volume, which makes data sharing especially difficult given limited storage options and high costs. Urgent solutions are needed for cheaper and more reliable data hosting and sharing platforms.
Additionally, there is a significant shortage of large and diverse annotated datasets, much more severe than for image data such as camera trap, drone, and crowd-sourced images.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges in bioacoustic data lies in its sheer volume, stemming from continuous monitoring processes. Researchers face significant hurdles in sharing and hosting this data, as existing online platforms often do not provide sufficient long-term storage capacity, or are very expensive. Urgent solutions are needed to provide cheaper and more reliable hosting options. Moreover, accessing these extensive datasets demands advanced computing infrastructure and solutions. With sufficient funding, many researchers would be willing to share their bioacoustic data.
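One hedged way to shrink continuous-monitoring archives is to store coarse spectral summaries rather than raw waveforms, accepting that the raw audio cannot be reconstructed. The sketch below (synthetic audio; the frame length and 8-band count are arbitrary choices) illustrates the size reduction:

```python
import numpy as np

# One minute of hypothetical mono audio at 48 kHz (float32).
sr, seconds = 48_000, 60
rng = np.random.default_rng(3)
audio = rng.normal(0, 0.1, sr * seconds).astype(np.float32)

# Reduce to a coarse band-energy summary: per-frame magnitude spectra
# averaged into 8 frequency bands (a crude stand-in for a log-mel summary).
frame = 1024
n_frames = len(audio) // frame
frames = audio[: n_frames * frame].reshape(n_frames, frame)
spectra = np.abs(np.fft.rfft(frames, axis=1))[:, 1:]   # drop DC bin -> 512 bins
bands = spectra.reshape(n_frames, 8, 64).mean(axis=2)  # 8 coarse bands per frame

print("raw audio:   ", audio.nbytes, "bytes")
print("band summary:", bands.astype(np.float32).nbytes, "bytes")
```

Such summaries support coarse detection and indexing at a fraction of the storage cost; raw clips can then be retained only around detections of interest.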
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in data collected, with insufficient representation from highly biodiverse regions. The data also does not cover a diverse spectrum of species, intertwining with taxonomy insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased toward more abundant species.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in collected data, with insufficient representation from highly biodiverse regions. The data also fails to cover a diverse spectrum of species, a problem intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
One gap in data is the incomplete barcoding reference databases.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
eDNA is an emerging technique in biodiversity monitoring, and several issues still impede the application of eDNA-based tools. One data gap is the incompleteness of barcoding reference databases. However, considerable attention and effort are being devoted to filling this gap, for example through the BIOSCAN project. Notably, BIOSCAN-5M is a comprehensive multi-modal dataset containing DNA barcode sequences and taxonomic labels for over 5 million insect specimens, serving as a large reference library for species- and genus-level classification tasks.
While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Though GBIF provides a common standard, the species classifications in the data are not always accurate or consistent, and the same specimens have even been classified into different groups over time.
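A minimal sketch of the kind of pre-processing this implies, using a hand-made synonym table. A real pipeline would resolve names against a taxonomic backbone such as GBIF's rather than a hard-coded dict; the two example mappings below are real reclassifications, but the helper itself is purely illustrative.

```python
# Hypothetical synonym table; real harmonization would query a taxonomic
# backbone service (e.g., GBIF's) rather than maintain a dict by hand.
SYNONYMS = {
    "Parus caeruleus": "Cyanistes caeruleus",      # blue tit: genus reassignment
    "Corvus corone cornix": "Corvus cornix",       # hooded crow: elevated to species
}

def harmonize(name: str) -> str:
    """Map a recorded scientific name to its currently accepted name,
    leaving unrecognized names unchanged."""
    return SYNONYMS.get(name, name)
```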
Some commercial high-resolution satellite images can also be used to identify large animals such as whales, but those images are not open to the public.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The resolution of publicly available satellite images is not high enough. High-resolution images are usually commercial and not freely available.
Variability analysis of wind power generation
Details (click to expand)
The shift from high-inertia generation sources, such as thermal plants, to low-inertia inverter-coupled generation from distributed energy resources introduces new stability and reliability issues. It is imperative to maintain the frequency of the system at a nominal level to prevent damage, instability, and blackouts. Wind turbines can contribute some frequency response and inertia that benefit the grid by providing a combination of synthetic inertial and primary frequency response.
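As a rough illustration of the combined response described above, a simplified model might look like the sketch below. All parameter values and the function itself are hypothetical, not taken from FESTIV or any dataset.

```python
def frequency_response_power(f, dfdt, f_nom=60.0, h_syn=4.0, droop=0.05, p_rated=2.0):
    """Combined synthetic-inertia and primary (droop) frequency response
    of a wind plant, in MW. All parameter values here are illustrative."""
    # Synthetic inertia: power proportional to the rate of change of frequency (ROCOF)
    p_inertia = -2.0 * h_syn * p_rated * dfdt / f_nom
    # Primary response: droop control, power proportional to frequency deviation
    p_droop = -(f - f_nom) / (droop * f_nom) * p_rated
    return p_inertia + p_droop
```

At nominal frequency with no ROCOF the response is zero; a falling frequency yields a positive (injecting) response from both terms.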
To gain access, particularly to NREL’s FESTIV model, permission must be requested. Since FESTIV is a simulation model, it may not account for all real-time system dynamics and complexities, requiring validation and verification against real-world data. Furthermore, since the granularity of the model is hourly, it may not capture very short-term impacts, frequencies, and reactive power flows that affect power system stability.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
To gain access to the FESTIV model, contact the group manager at: rui.yang@nrel.gov
R1: Reliability > Quality
The model may not account for all real-time system dynamics and complexities and may need verification from operational data. Additionally, since data relies on scenario-based forecasting it may not capture real-world uncertainties. Furthermore, operating reserve values may be inaccurate and need validation in practice.
S3: Sufficiency > Granularity
FESTIV is based on hourly unit-commitment time resolution, which may not capture reliability impacts that occur on a sub-hourly scale. The focus on short-term operational impacts rather than very short-term (sub-hourly) dynamics leaves out frequency response, voltage magnitudes, and reactive power flows, all of which affect system stability and reliability.
Weather forecasting: Near-term (< 24 hours)
Details (click to expand)
Near-term weather forecasting (< 24 hours ahead) of temperature, precipitation, etc. at km-level spatial and minute-level temporal resolution, in an accurate and computationally-efficient manner, has implications for many climate change mitigation and adaptation applications. ML can help provide more accurate near-term weather forecasts.
Data volume is large and only data specific to the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Obtaining and integrating radar data from various sources is challenging.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Radar data from many countries are not open to the public; they must be purchased or applied for. In addition, different agencies tend to apply differing quality-control protocols, posing challenges to scaling data analysis up to the global level.
U1: Usability > Structure
Radar data from different sources vary in format, spatial resolution, and temporal resolution, making assimilation challenging.
U3: Usability > Usage Rights
Much radar data is restricted to academic and research purposes only.
S2: Sufficiency > Coverage
Data from the Global South is insufficient, and in some cases entirely absent.
An enhanced version of ERA5 with higher granularity and fidelity is needed. In fact, a lot of surface observations and remote sensing data are in place for developing such a dataset.
An enhanced version of ERA5 with higher granularity and fidelity is needed. In fact, a lot of surface observations and remote sensing data are in place for developing such a dataset.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
ERA5 is currently widely used in ML-based weather forecasting and climate modeling because of its high resolution and analysis-ready structure. However, large volumes of observations, e.g., data from radiosondes, balloons, and weather stations, remain largely under-utilized. A dataset as well-structured as ERA5 but built from more of these observations would be highly valuable.
Weather forecasting: Short-to-medium term (1-14 days)
Details (click to expand)
Weather forecasting at 1-14 days ahead has implications for real-time response and planning applications within both climate change mitigation and adaptation. ML can help improve short-to-medium-term weather forecasts.
The biggest challenge of ENS is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available for purchase only, and it is expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
ERA5 is now the most widely used reanalysis data due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
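One practical way to mitigate the download bottleneck is to request only the variables, months, and regions actually needed. Below is a sketch of building such a request for the Climate Data Store API; the helper function and its defaults are our own illustration, and actually submitting the request requires a CDS account and the `cdsapi` client.

```python
def era5_request(variables, year, months, area=None):
    """Build a CDS API request dict for ERA5 single-level reanalysis.
    Keeping requests small (few variables, one year, a bounding box)
    reduces queue and transfer time on the Climate Data Store."""
    req = {
        "product_type": "reanalysis",
        "format": "netcdf",
        "variable": variables,
        "year": str(year),
        "month": [f"{m:02d}" for m in months],
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(24)],
    }
    if area is not None:  # [North, West, South, East] bounding box in degrees
        req["area"] = area
    return req

# Submitting the request (requires the `cdsapi` package and an API key):
#   import cdsapi
#   cdsapi.Client().retrieve("reanalysis-era5-single-levels",
#                            era5_request(["2m_temperature"], 2020, [1]),
#                            "era5_t2m_2020_01.nc")
```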
The biggest challenge of HRES is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available for purchase only, and it is expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
WeatherBench 2 is based on ERA5, so it inherits ERA5's issues; in particular, the data is biased over regions where there are no observations.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
Weather forecasting: Subseasonal horizon
Details (click to expand)
High-fidelity weather forecasts at subseasonal to seasonal scales (3-4 weeks ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
ERA5 is now the most widely used reanalysis data due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
More data is needed to develop a more accurate and robust ML model. It is also important to note that SUBX data contains biases and uncertainties, which can be inherited by ML models trained with this data.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Larger models generally offer improved performance for developing data-driven sub-seasonal forecast models. However, with only a limited number of models contributing to the SUBX dataset, there is a scarcity of training data. To enhance ML model performance, more SUBX data generated by physics-based numerical weather forecast models is required.
Weather forecasting: Subseasonal-to-seasonal horizon
Details (click to expand)
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station observations. Biases are large over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
Resolution is 0.5 degrees (roughly 50 km), which is not sufficiently fine for many applications.
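To see why sparse gauges produce large biases, consider a toy inverse-distance-weighted interpolation. This is a simplified stand-in for the actual gridding algorithm (which this sketch does not reproduce): where stations are sparse, a few distant gauges dominate the estimate at a grid point.

```python
import math

def idw(stations, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` from (lat, lon, value)
    station tuples. With few nearby stations, distant gauges carry most of
    the weight, which is one source of bias in gauge-based analyses."""
    num = den = 0.0
    for lat, lon, val in stations:
        d = math.hypot(lat - target[0], lon - target[1])
        if d == 0.0:
            return val  # target coincides with a station
        w = 1.0 / d ** power
        num += w * val
        den += w
    return num / den
```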
More data is needed to take advantage of the large ML models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
Currently available data is not sufficient for training large ML models. More data is needed.
Wildfire prediction: Short-term (3-7 days)
Details (click to expand)
Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.
A huge data gap is that there is no active fire data in the afternoon (1-5 pm) when most fires ignite.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
There is currently no active fire data available for the afternoon period (1-5 pm), when most fires tend to ignite, due to a lack of satellite coverage during these hours (after 1:30 pm). Some companies are developing their own satellites to address this gap and provide crucial afternoon data.
U5: Usability > Pre-processing
Available active fire data are of various qualities and with false positives or negatives. They have to be cleaned, validated, and corrected before use. Some research companies are assimilating all available satellite data or active fire products to create their own dataset of active fire.
R1: Reliability > Quality
Available active fire data are of various qualities and with false positives or negatives. They have to be cleaned, validated, and corrected before use. Some research companies are assimilating all available satellite data or active fire products to create their own dataset of active fire.
ERA5 is now the most widely used reanalysis data due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months.
Socioeconomic data, e.g., human behaviors, are significant predictors of fire. Beyond the inherent challenges and gaps of socioeconomic data, aggregating those datasets and harmonizing them with other fire predictors in the spatial domain is especially tricky.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Socio-economic data, e.g., population and building types, usually come in a different format and structure than other fire predictors and fire hazard data. Aggregating different kinds of socio-economic data and harmonizing them with other fire predictors and fire hazard data is challenging, especially in the spatial dimension.
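A minimal sketch of the spatial harmonization step: snapping point-level socio-economic records onto the same regular grid used by a fire-predictor raster. The cell size and the sum aggregation are illustrative choices; real pipelines must also handle projections, polygons, and areal weighting, which this sketch omits.

```python
def to_grid(points, cell_size=0.5):
    """Aggregate point records (lat, lon, value) onto a regular grid keyed
    by (row, col) cell indices, so socio-economic layers can be aligned
    with gridded fire predictors at the same resolution."""
    grid = {}
    for lat, lon, val in points:
        key = (int(lat // cell_size), int(lon // cell_size))
        grid[key] = grid.get(key, 0.0) + val
    return grid
```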
Data is not available in machine-readable formats and is limited to English-language literature from major journals.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Many data sources that should be open are not fully accessible. For instance, abstracts are generally expected to be openly available, even for proprietary data; in practice, however, only a subset of abstracts is accessible for some papers.
U1: Usability > Structure
Most of the data is in PDF format and should be converted to machine-readable formats.
S2: Sufficiency > Coverage
Research is currently limited to literature published in English (at least the abstracts) and from major journals. Many region-specific journals or literature indexed in other languages are not included. These should be translated into English and incorporated into the database.
Active fire data
Details (click to expand)
Active fire data are derived from images taken by satellites such as MODIS, VIIRS, and Landsat, at different spatial resolutions and temporal coverages. Data can be downloaded here: https://firms.modaps.eosdis.nasa.gov/active_fire.
A huge data gap is that there is no active fire data in the afternoon (1-5 pm) when most fires ignite.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
There is currently no active fire data available for the afternoon period (1-5 pm), when most fires tend to ignite, due to a lack of satellite coverage during these hours (after 1:30 pm). Some companies are developing their own satellites to address this gap and provide crucial afternoon data.
U5: Usability > Pre-processing
Available active fire data are of various qualities and with false positives or negatives. They have to be cleaned, validated, and corrected before use. Some research companies are assimilating all available satellite data or active fire products to create their own dataset of active fire.
R1: Reliability > Quality
Available active fire data are of various qualities and with false positives or negatives. They have to be cleaned, validated, and corrected before use. Some research companies are assimilating all available satellite data or active fire products to create their own dataset of active fire.
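A minimal sketch of the cleaning step for MODIS-style detection records, where each detection carries a 0-100 confidence value. The threshold here is an illustrative choice, and VIIRS reports categorical confidence (low/nominal/high) instead, so it would need a different filter; real validation against false positives and negatives goes well beyond this.

```python
def filter_detections(rows, min_confidence=80):
    """Keep only high-confidence active-fire detections from records that
    carry a numeric 0-100 `confidence` field (MODIS-style). Records
    without a confidence value are dropped."""
    return [r for r in rows if r.get("confidence", 0) >= min_confidence]
```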
Advanced metering infrastructure data
Details (click to expand)
Advanced Metering Infrastructure (AMI) facilitates communication between utilities and customers through smart meter device systems that collect, store, and analyze per-building energy consumption.
AMI data can be retrieved through individual data collection, research partnerships with utilities such as the Irvine Smart Grid Demonstration (ISGD) project conducted by Southern California Edison (SCE) or the smart meter pilot test from the Sacramento Municipal Utility, and through aggregated and anonymized household consumption data such as the Commission for Energy Regulation (CER) hosted by the Irish Social Science Data Archive (ISSDA).
AMI data is challenging to obtain without pilot-study partnerships with utilities, since data collection on individual building consumer behavior can infringe upon customer privacy, especially at the residential level. The granularity of the time series data can also vary, depending both on the level of access to the data (aggregated and anonymized or not) and on the resolution of the readings and metering system. Additionally, data coverage is limited to utility pilot-test service areas, restricting the scope and scale of demand studies.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to real AMI data can be difficult to obtain due to privacy concerns. Even when partnered with a utility, AMI data may undergo anonymization and aggregation to protect individual customers. Some ISOs can provide data if a written records request is submitted. If requesting personal consumption data, tiered-pricing enrollment (for example, time-of-use) may limit the temporal resolution of the data the utility can provide. Open datasets, on the other hand, may only be available for academic research or teaching use (e.g., the ISSDA CER data).
U2: Usability > Aggregation
Using AMI data jointly with other data that may influence demand, such as weather data, availability of rooftop solar, presence of electric vehicles, building specifications, and appliance inventories, may require significant additional data collection or retrieval.
U3: Usability > Usage Rights
For ISSDA CER data use, a request form must be submitted for evaluation by issda@ucd.ie. For data obtained through utility collaborative partnerships, usage rights may vary.
U5: Usability > Pre-processing
Data cleanliness may vary depending on the data source. For individual private data collection through testbed development, cleanliness can depend on formats of data stream output from the sensor network system installed.
R1: Reliability > Quality
Anonymized data may not be verifiable or useful once it is open.
S2: Sufficiency > Coverage
S3: Sufficiency > Granularity
Meter resolution can vary based on the hardware ranging from 1 hour, 30 minute, to 15 minute measurement intervals. Depending on the level of anonymization and aggregation of data, the granularity may be constrained to other factors such as the cadence of time of use pricing and other tiered demand response programs employed by the partnering utility.
S4: Sufficiency > Timeliness
With respect to the CER Smart Metering Project and the associated Customer Behavior Trials (CBT), Electric Ireland and Bord Gais Energy smart meter installation and monitoring occurred from 2009-2010. This anonymized dataset may no longer be representative of current usage behavior, as household compositions and associated loads change over time. Similarly, pilot programs through participating utilities are finite in nature.
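A minimal sketch of normalizing mixed-interval meter feeds (1-hour, 30-minute, 15-minute) to a common hourly resolution. It assumes each reading reports the energy consumed over the preceding interval and is timestamped in minutes since midnight; both are assumptions for illustration, not properties of any particular AMI system.

```python
from collections import defaultdict

def to_hourly_kwh(readings):
    """Sum interval meter readings, given as (minutes_since_midnight, kWh)
    pairs, into hourly totals keyed by hour-of-day, so feeds at different
    meter resolutions can be compared at one common granularity."""
    hourly = defaultdict(float)
    for minute, kwh in readings:
        hourly[minute // 60] += kwh
    return dict(hourly)
```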
Automatic surface observation (ASOS)
Details (click to expand)
Data volume is large and only data specific to the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Benchmark datasets for short-term wildfire prediction
Details (click to expand)
Benchmark datasets for wildfire prediction are standardized collections of data that include historical and real-time wildfire occurrences, remote sensing imagery, fuel information, and meteorological data. These datasets provide a common framework for training, validating, and testing machine learning models. By integrating various modalities and sources of data, benchmark datasets simplify the process of data collection, integration, and preprocessing, ensuring consistency and efficiency in developing and evaluating wildfire prediction models.
Use Case
Data Gap Summary
Benchmark datasets of building environmental conditions and occupancy
Details (click to expand)
The US Office of Energy Efficiency and Renewable Energy hosts 15 building datasets for 10 states covering 7 climate zones and 11 different building types. The data covers energy, indoor air quality, occupancy, environment, HVAC, lighting, and energy consumption to name a few. Datasets are organized by name and points of contact.
Datasets featured can vary in types of data gaps depending on the content, coverage area, location, building type, building spatial plan, quantity measured, ambient environment, or power consumption or metered data availability.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
All data featured on the platform is open access with standardization on metadata format to allow for ease of use and information specific to buildings based on type, location, and climate zone. Data quality and guidance on curation and cleaning in addition to access restrictions are specified in the metadata of each hosted dataset.
U3: Usability > Usage Rights
Licensing information for each individual featured dataset is provided.
S3: Sufficiency > Granularity
Dataset time resolution and period of temporal coverage vary depending on the dataset selected.
S6: Sufficiency > Missing Components
Building data typically do not include grid-interactive data, i.e., signals from the utility side related to control or demand-side management. Such data can be difficult to obtain or require special permissions. Collecting utility-side signals would allow utility-initiated automated demand response (auto-DR) and load shifting to be better assessed.
S2: Sufficiency > Coverage
Featured datasets are from test-beds, buildings, and contributing households from the United States.
Bioacoustic recordings
Details (click to expand)
Passive acoustic recording provides continuous monitoring of both the environment and the species within it.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: there is a lack of standardized tools and workflows for producing analysis-ready data and analyzing it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: there is a lack of standardized tools and workflows for producing analysis-ready data and analyzing it consistently across projects.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Funding presents a major bottleneck for ecosystem monitoring initiatives. While most funding allocations are short-term, there is a critical need for sustained and adequate funding to support ongoing monitoring efforts and maintain data processing capabilities.
The first and foremost challenge of bioacoustic data is its sheer volume, which makes data sharing especially difficult due to limited storage options and high costs. Cheaper and more reliable data hosting and sharing platforms are urgently needed.
Additionally, there is a significant shortage of large and diverse annotated bioacoustic datasets, a problem far more severe than for image data such as camera trap, drone, and crowd-sourced imagery.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation of highly biodiverse regions. The data also do not cover a diverse spectrum of species, an issue intertwined with gaps in taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
U6: Usability > Large Volume
One of the biggest challenges with bioacoustic data lies in its sheer volume, stemming from continuous monitoring. Researchers face significant hurdles in sharing and hosting these data, as existing online platforms often do not provide sufficient long-term storage capacity or are prohibitively expensive. Cheaper and more reliable hosting options are urgently needed. Moreover, accessing these extensive datasets demands advanced computing infrastructure. With sufficient funding for storage and hosting, many researchers would be willing to share their bioacoustic data.
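A back-of-envelope calculation illustrates the volume problem. The recorder settings below (48 kHz, 16-bit, mono, recording continuously) are assumed for illustration; real deployments vary.

```python
# Back-of-envelope storage estimate for continuous passive acoustic monitoring.
# Assumed (hypothetical) recorder settings: 48 kHz, 16-bit, mono, uncompressed.
SAMPLE_RATE_HZ = 48_000
BYTES_PER_SAMPLE = 2
SECONDS_PER_DAY = 86_400

bytes_per_day = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1e9
gb_per_year_per_sensor = gb_per_day * 365

print(round(gb_per_day, 1))           # 8.3 (GB per sensor per day)
print(round(gb_per_year_per_sensor))  # 3027 (GB per sensor per year)
```

At roughly 3 TB per sensor per year before compression, even a modest network of recorders quickly outgrows typical research-lab storage and upload budgets.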
Building data genome project
Details (click to expand)
The Building Data Genome Project 2 dataset contains hourly whole-building data from 3,053 energy meters across 1,636 non-residential buildings, covering two years' worth of metered electricity, water, and solar data, along with metadata on floor area, primary building-use category, time zone, weather, and smart-meter type. The goal of the dataset is to enable the development of generalizable building models for energy-efficiency analysis studies.
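A minimal sketch of how hourly meter data of this shape might be aggregated for analysis, using a toy two-meter frame; the column names and values are invented, and the real files cover thousands of meters.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of a meter file: one column per meter,
# hourly timestamps as the index (actual column names differ).
idx = pd.date_range("2016-01-01", periods=24, freq="h")
meters = pd.DataFrame({"bldg_A_electricity": np.full(24, 2.0),
                       "bldg_B_electricity": np.full(24, 3.0)}, index=idx)

# Aggregate hourly readings into daily totals per building
daily_kwh = meters.resample("D").sum()
print(daily_kwh.iloc[0].tolist())  # [48.0, 72.0]
```

The same resampling pattern extends to weekly or monthly benchmarking, which is one reason a consistent timestamp index across all meters matters so much for cross-building studies.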
The Building Data Genome Project 2 compiles building data from public open datasets along with privately curated building data from universities and higher-education institutions. While the dataset has clear documentation, the lack of diversity in building types necessitates expanding the dataset to include other building types, as well as broader coverage areas and time periods.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data were collated from 7 open-access public data sources as well as 12 privately curated datasets obtained from facilities management at different college sites, which required manual site visits; the privately curated data are not included in the data repository at this time.
U3: Usability > Usage Rights
S2: Sufficiency > Coverage
The dataset is curated from buildings on university campuses, thereby limiting the diversity of building representation. To overcome this lack of diversity, data-sharing incentives and open-source community contributions could enable expansion of the dataset.
S3: Sufficiency > Granularity
The granularity of the meter data is hourly, which may not be adequate for short-term load forecasting and efficiency studies.
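A toy example of why hourly granularity can hide information relevant to short-term forecasting: a brief 15-minute demand spike disappears into the hourly average. The numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical 15-minute load profile within one hour (kW): a short spike.
load_15min = np.array([2.0, 2.0, 10.0, 2.0])

# An hourly meter would report only the average of the four intervals,
# so the 10 kW peak is invisible at hourly granularity.
hourly_avg = load_15min.mean()
print(load_15min.max(), hourly_avg)  # 10.0 4.0
```

Peak-sensitive applications (demand charges, transformer sizing, fast demand response) are exactly where this smoothing matters most.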
S4: Sufficiency > Timeliness
The dataset covers hourly measurements from January 1, 2016 to December 31, 2018.
More information, such as the age of the building, should be included in the dataset.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Building footprint datasets are usually in formats or coordinate systems other than those used by the government. To ensure these datasets are usable for local-government applications, it would be helpful to align them with the government's preferred format and coordinate system.
S6: Sufficiency > Missing Components
More information about the building, such as its age and the source of the data, should be included in the dataset.
R1: Reliability > Quality
The building footprint data can contain errors due to detection inaccuracies in the models used to generate the dataset, as well as limitations of satellite imagery. These limitations include outdated images that may not reflect recent developments and visibility issues such as cloud cover or obstructions that can prevent accurate identification of buildings.
CMIP6
Details (click to expand)
Climate simulations from a consortium of state-of-the-art climate models. Data can be found here.
Large uncertainties in future climate projections are a major problem with CMIP6. In addition, the large volume of data and the lack of uniform structure, such as inconsistent variable names, data formats, and resolutions across different CMIP6 models, make it challenging to use data from multiple models effectively.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The sheer volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes at a high cost. Making computational resources available alongside the stored data would help address these challenges.
U1: Usability > Structure
Data from different models have different resolutions and variable names, which makes assimilating data from multiple models challenging.
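One common way to handle this is to rename each model's variables to a single project convention and regrid all fields onto a shared target grid. The sketch below is an illustrative simplification under assumed names (the `NAME_MAP` entries and nearest-neighbour regridding are stand-ins, not a recommended production pipeline, which would use conservative or bilinear regridding).

```python
import numpy as np

# Hypothetical name map: CMIP-style variable names -> one project convention.
NAME_MAP = {"pr": "precip", "tas": "t2m"}

def harmonize_names(fields):
    """fields: dict of variable name -> 2D numpy array (one model's output)."""
    return {NAME_MAP.get(name, name): arr for name, arr in fields.items()}

def regrid_nearest(field, src_lat, src_lon, dst_lat, dst_lon):
    """Crude nearest-neighbour regridding onto a common target grid."""
    li = np.abs(src_lat[:, None] - dst_lat[None, :]).argmin(axis=0)
    lj = np.abs(src_lon[:, None] - dst_lon[None, :]).argmin(axis=0)
    return field[np.ix_(li, lj)]

# Toy model output on a coarse 3x4 grid, mapped onto a finer 5x7 common grid
fields = harmonize_names({"tas": np.zeros((3, 4))})
out = regrid_nearest(fields["t2m"],
                     np.array([-30.0, 0.0, 30.0]),
                     np.array([0.0, 90.0, 180.0, 270.0]),
                     np.linspace(-30, 30, 5), np.linspace(0, 270, 7))
print(sorted(fields), out.shape)  # ['t2m'] (5, 7)
```

Once every model is expressed in the same names and on the same grid, multi-model ensembles can be built by simple array stacking.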
R1: Reliability > Quality
There are large biases and uncertainties in the data, which could be reduced by improving the climate models used to generate the simulations.
The large data volume and lack of uniform structure (no consistent variable names, data structures, or resolutions across models) make it difficult to use data from more than one CMIP6 model.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The sheer volume of CMIP6 data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes at a high cost. Making computational resources available alongside the stored data would help address these challenges.
U1: Usability > Structure
Data from different models have different resolutions and variable names, which makes assimilating data from multiple models challenging.
R1: Reliability > Quality
There are large biases and uncertainties in the data, which could be reduced by improving the climate models used to generate the simulations.
CPC Precipitation
Details (click to expand)
CPC Global Unified gauge-based analysis of daily precipitation https://psl.noaa.gov/data/gridded/data.cpc.globalprecip.html
High-fidelity weather forecasts at the subseasonal-to-seasonal (S2S) scale (i.e., 10-46 days ahead) are important for a variety of climate change-related applications. ML can be used to postprocess outputs from physics-based numerical forecast models in order to generate higher-fidelity forecasts.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
There is large uncertainty in the data, as it is derived by interpolating station observations; biases are large over areas where rain gauge stations are sparse.
S3: Sufficiency > Granularity
The resolution is 0.5 degrees (roughly 50 km), which is not sufficiently fine for many applications.
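The "roughly 50 km" figure can be checked with simple spherical-Earth arithmetic; note that the east-west extent of a 0.5 degree cell shrinks with latitude.

```python
import math

# Approximate ground size of a grid cell on a near-spherical Earth.
DEG_KM = 111.32  # km per degree of latitude

def cell_km(res_deg, lat_deg):
    ns = DEG_KM * res_deg                                     # north-south
    ew = DEG_KM * math.cos(math.radians(lat_deg)) * res_deg   # east-west
    return ns, ew

print(tuple(round(v, 1) for v in cell_km(0.5, 0)))   # (55.7, 55.7)
print(tuple(round(v, 1) for v in cell_km(0.5, 45)))  # (55.7, 39.4)
```

So a single cell spans tens of kilometers, far coarser than the scale of individual storms, fields, or catchments that many applications care about.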
Cable inspection robot data
Details (click to expand)
Cable inspection robot LiDAR data are beneficial for Specific Power Line (SPL) partitions, which include dampers, insulators, broken strands, and attachments that may have degraded due to exposure to the elements. Specific Fitting Detection partition data focus on assessing risk at the lowest part of the power line, near trees, roofs, and other power lines that may cross it. Since the robots physically crawl along the lines, degradation detection on high-voltage transmission lines is useful for maintenance scheduling, as is obstruction detection at the lower levels of the power line.
Grid inspection robot imagery may require coordination with local utilities to gain access across multiple robot trips, image preprocessing to remove ambient artifacts, and position and location calibration; identification of degradation patterns is also limited by the resolution of the robot-mounted camera.
Data Gap Type
Data Gap Details
U2: Usability > Aggregation
Data need to be aggregated and collated from multiple cable inspection robots for generalizability. Collection requires multiple robot trips: an initial inspection to identify target locations needing further data collection, followed by a second trip for camera capture at those locations.
U3: Usability > Usage Rights
Data is proprietary and requires coordination with utility.
U5: Usability > Pre-processing
Data may need significant preprocessing and thresholding to perform image segmentation tasks.
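As a toy illustration of the kind of thresholding step involved, the sketch below binarizes a grayscale inspection image into foreground and background. The fallback threshold rule (image mean) is a deliberate simplification; real pipelines typically use adaptive methods such as Otsu's thresholding.

```python
import numpy as np

def binarize(gray, threshold=None):
    """Global thresholding of a grayscale image into a 0/1 mask.
    If no threshold is given, fall back to the image mean (a toy rule)."""
    t = gray.mean() if threshold is None else threshold
    return (gray > t).astype(np.uint8)

# Toy 2x2 "image": dark cable pixels vs. bright background artifacts
img = np.array([[10, 200], [30, 220]], dtype=np.uint8)
mask = binarize(img)
print(mask.tolist())  # [[0, 1], [0, 1]]
```

In practice, segmentation of damper or strand defects would follow this kind of preprocessing with a learned model rather than a fixed threshold.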
S2: Sufficiency > Coverage
It is necessary to supplement the data with position and orientation system data to locate the cable inspection robot (this may involve the robot completing two inspections: a preliminary one to identify inspection targets, followed by a more detailed autonomous inspection with additional PTZ camera image capture of the targets on device).
S3: Sufficiency > Granularity
Spatial resolution depends on the type of cable inspection robot utilized.
Camera trap images
Details (click to expand)
Camera traps are likely the most widely used sensors in automated biodiversity monitoring due to their low cost and simple installation. This medium offers close-range monitoring over long-time scales. The image sequences can be used to not only classify species but to identify specifics about the individual, e.g. sex, age, health, behavior, and predator-prey interactions. Camera trap data has been used to estimate species occurrence, richness, distribution, and density.
In general, raw images from camera traps need to be annotated before they can be used to train ML models. Some annotated camera trap images are shared via Wildlife Insights (www.wildlifeinsights.org) and LILA BC (www.lila.science), while others are listed on GBIF (https://www.gbif.org/dataset/search?q=). However, the majority of camera trap data is likely scattered across individual research labs or organizations and not publicly available. Sharing such images could go a long way toward filling the gaps in annotated data that currently hinder the efficient use of ML in biodiversity studies. This is what initiatives like Wildlife Insights aim to do.
There is a lack of standardized protocols to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: there is a lack of standardized tools and workflows for producing analysis-ready data and analyzing it consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation when measuring biodiversity and other complex ecological outcomes. In particular, clearly written guidance is needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge: there is a lack of standardized tools and workflows for producing analysis-ready data and analyzing it consistently across projects.
The scarcity of publicly available, well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for model training.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation of highly biodiverse regions. The data also do not cover a diverse spectrum of species, an issue intertwined with gaps in taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
A lot of well-annotated datasets are currently scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The major data gap is the limited institutional capacity to process and analyze data promptly to inform decision-making.
Data Gap Type
Data Gap Details
M: Misc/Other
There is a significant institutional challenge in processing and analyzing data promptly to inform decision-making. To enhance institutional capacity for leveraging global data sources and analytical methods effectively, a strategic, ecosystem-building approach is essential, rather than solely focusing on individual researcher skill development. This approach should prioritize long-term sustainability over short-term project-based funding.
Funding presents a major bottleneck for ecosystem monitoring initiatives. While most funding allocations are short-term, there is a critical need for sustained and adequate funding to support ongoing monitoring efforts and maintain data processing capabilities.
The scarcity of publicly available, well-annotated data poses a significant challenge for applying ML in biodiversity studies. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for model training.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also fail to cover a diverse spectrum of species, a problem compounded by gaps in current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Changes in marine ecosystems
Details (click to expand)
Annual data on changes (e.g. extent) in marine ecosystems such as mangroves, seagrasses, salt marshes, and wetlands due to various factors including coastal erosion, aquaculture, and others.
Use Case
Data Gap Summary
ClimSim
Details (click to expand)
An ML-ready benchmark dataset designed for hybrid ML-physics research, e.g. emulation of subgrid clouds and convection processes.
Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.
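To make the emulation idea concrete, here is a minimal, hedged sketch: synthetic data and ordinary least squares stand in for the neural networks typically trained on ClimSim-scale output, and every variable name is illustrative rather than part of the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for coarse-grid state variables (e.g., temperature
# and humidity profiles flattened into features) -- purely illustrative.
X = rng.normal(size=(1000, 8))
true_w = rng.normal(size=8)
# Synthetic "subgrid tendency" target with observation noise.
y = X @ true_w + 0.1 * rng.normal(size=1000)

# Fit a linear emulator by least squares; a real emulator would be a
# neural network trained on physics-model output such as ClimSim.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Emulated tendencies for new coarse-grid states.
X_new = rng.normal(size=(10, 8))
y_hat = X_new @ w
print(y_hat.shape)  # (10,)
```

The point of the sketch is only the input/output contract: the emulator maps resolved-scale state to a subgrid tendency far more cheaply than the physics it replaces.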
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
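As one mitigation pattern for large volumes, statistics can be computed chunk by chunk over a memory-mapped file rather than loading everything into RAM. A sketch with a small synthetic stand-in array (the file layout and sizes are assumptions, not ClimSim specifics):

```python
import os
import tempfile
import numpy as np

# Create an array on disk to stand in for a climate-model output file;
# real files are far larger, but the access pattern is the same.
path = os.path.join(tempfile.mkdtemp(), "field.npy")
data = np.random.default_rng(0).normal(size=(10_000, 32)).astype(np.float32)
np.save(path, data)

# Memory-map the file so only the chunks we touch are read into RAM.
arr = np.load(path, mmap_mode="r")

# Accumulate a running mean chunk by chunk instead of calling .mean()
# on a fully loaded array.
total, count = 0.0, 0
for start in range(0, arr.shape[0], 1_000):
    chunk = np.asarray(arr[start:start + 1_000], dtype=np.float64)
    total += chunk.sum()
    count += chunk.size
mean = total / count
```

The same chunked-accumulation idea extends to variances, histograms, and normalization statistics computed during preprocessing.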
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
Climate-related laws and regulations
Details (click to expand)
Laws and regulations for climate action that are published through national and subnational governments. There are some centralized databases, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that have selected, aggregated, and structured these data into comprehensive resources.
Laws and regulations for climate action are published in various formats through national and subnational governments, and most are not labeled as a “climate policy”. There are a number of initiatives that take on the challenge of selecting, aggregating, and structuring the laws to provide a better overview of the global policy landscape. This, however, requires a great deal of work, needs to be permanently updated, and datasets are not complete.
Data Gap Type
Data Gap Details
U1: Usability > Structure
Much of the data is in PDF format and needs to be structured into a machine-readable format. Much of the data is also in the original language of the publishing country and needs to be translated into English.
U2: Usability > Aggregation
Legislation data is published through national and subnational governments, and often is not explicitly labeled as “climate policy”. Determining whether a given law is climate-related is not simple.
This information is usually published on local websites and must be downloaded or scraped manually. There are a number of initiatives, such as Climate Policy Radar, International Energy Agency, and New Climate Institute that are working to address this by selecting, aggregating, and structuring these data to provide a better overview of the global policy landscape. However, this process is labor-intensive, requires continuous updates, and often results in incomplete datasets.
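As an illustration of why this aggregation step is hard, a naive keyword filter is about the simplest possible first pass for flagging candidate climate laws. The keyword list below is invented for illustration; real initiatives such as Climate Policy Radar rely on ML classifiers plus expert review rather than anything this crude.

```python
# A naive keyword baseline for flagging potentially climate-related
# legislation. The keyword list is a hypothetical example, not a
# vetted taxonomy.
CLIMATE_KEYWORDS = {
    "climate", "emission", "greenhouse", "carbon", "renewable",
    "adaptation", "mitigation", "net zero", "decarbonization",
}

def maybe_climate_related(text: str) -> bool:
    """Return True if the document mentions any climate keyword."""
    lowered = text.lower()
    return any(kw in lowered for kw in CLIMATE_KEYWORDS)

print(maybe_climate_related("An Act to promote renewable energy targets"))  # True
print(maybe_climate_related("An Act regulating municipal parking fees"))    # False
```

A filter like this over-flags (e.g., "carbon" in unrelated contexts) and misses laws that never use climate vocabulary, which is exactly why curated, expert-labeled corpora remain necessary.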
ClimateBench v1.0
Details (click to expand)
A benchmark dataset derived from a full-complexity Earth System Model (NorESM2, a participant in CMIP6) for emulation of key climate variables: https://zenodo.org/records/7064308.
The dataset currently includes simulations from only one model. To enhance accuracy and reliability, it is important to include simulations from multiple models.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Currently, the dataset includes information from only one model. Training a machine learning model with this single source of data may result in limited generalization capabilities. To improve the model’s robustness and accuracy, it is essential to incorporate data from multiple models. This approach not only enhances the model’s ability to generalize across different scenarios but also helps reduce uncertainties associated with relying on a single model.
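The benefit of multi-model training data can be sketched with synthetic "models" that share a common signal but carry model-specific biases; training across models lets the biases average out, while training on a single model bakes its bias in. All numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three synthetic "climate models" sharing a common signal but with
# model-specific biases -- a stand-in for multi-model CMIP-style output.
w_true = rng.normal(size=4)
models = []
for bias in (0.0, 0.5, -0.5):
    X = rng.normal(size=(300, 4))
    y = X @ w_true + bias + 0.1 * rng.normal(size=300)
    models.append((X, y))

def fit(X, y):
    # Least-squares fit with an intercept column.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))

# Leave-one-model-out: train on two models, test on the held-out one.
X_test, y_test = models[0]
X_tr = np.vstack([models[1][0], models[2][0]])
y_tr = np.concatenate([models[1][1], models[2][1]])
multi = mse(fit(X_tr, y_tr), X_test, y_test)

# Single-model training: train on model 1 only, test on model 0.
single = mse(fit(*models[1]), X_test, y_test)
print(multi, single)
```

Here the two training models' opposite biases cancel, so the multi-model fit generalizes better to the held-out model; with real simulations, the analogous gain is reduced structural uncertainty.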
Community science data
Details (click to expand)
Images and recordings contributed by citizen scientists and volunteers represent another significant source of data on biodiversity and ecosystems. Crowdsourcing platforms, such as iNaturalist, eBird, Zooniverse, and Wildbook, facilitate the sharing of community science data. Many of these platforms also serve as hubs for collating and annotating datasets.
The main challenge with community science data is its lack of diversity. Data tends to be concentrated in accessible areas and primarily focuses on charismatic or commonly encountered species.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Data is often concentrated in easily accessible areas and focuses on more charismatic or easily identifiable species. Data is also biased towards species with larger, denser populations.
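One common, partial remedy for this sampling bias is inverse-density weighting, so that heavily sampled locations do not dominate downstream analyses. A toy sketch with invented records:

```python
from collections import Counter

# Hypothetical community-science records: (grid_cell, species) pairs.
# Accessible cells near cities are heavily over-represented.
records = [
    ("city_park", "robin"), ("city_park", "robin"), ("city_park", "pigeon"),
    ("city_park", "robin"), ("city_park", "pigeon"), ("city_park", "robin"),
    ("remote_forest", "warbler"), ("remote_forest", "owl"),
]

# Inverse-density weights: each grid cell contributes equal total weight,
# so the two remote records count as much as the six urban ones.
cell_counts = Counter(cell for cell, _ in records)
weights = [1.0 / cell_counts[cell] for cell, _ in records]

# Total weight per cell is now equal.
per_cell = Counter()
for (cell, _), w in zip(records, weights):
    per_cell[cell] += w
print({cell: round(w, 9) for cell, w in per_cell.items()})
# {'city_park': 1.0, 'remote_forest': 1.0}
```

Reweighting only corrects for where people looked, not for what they failed to record, so it complements rather than replaces targeted data collection in under-sampled regions.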
Copernicus Marine Data Store
Details (click to expand)
https://data.marine.copernicus.eu/products Free-of-charge state-of-the-art data on the state of the Blue (physical), White (sea ice) and Green (biogeochemical) ocean, on a global and regional scale.
Data downloading is a bottleneck because it requires familiarity with APIs, which not all users possess.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
An API is needed to download data, but many ecologists are not familiar with scripting languages.
M: Misc/Other
It would be ideal if Copernicus also made biodiversity data available on its website. Having access to both biodiversity data and associated environmental ocean data on the same platform would significantly enhance efficiency and accessibility. This integration would eliminate the need to download massive datasets for local analysis, streamlining the process for users.
DOE Atmospheric Radiation Measurement (ARM) research facility data products
Details (click to expand)
ARM represents data from various field measurement programs sponsored by the US Department of Energy with a focus on ground-based pyrheliometer and spectrometer data which is useful for solar radiation time series forecasting and solar potential assessment.
The ARM dataset includes data from various DOE sites, with sensor information from sun-tracking photometers, radiometers, and spectrometers that is helpful in understanding hyperspectral solar irradiance and cloud dynamics. ARM sites generate large datasets that can be challenging to store, stream, analyze, and archive; the data may be sensitive to sensor noise and require further measurement verification, especially with respect to aerosol composition. Additionally, ARM data coverage is limited to ARM sites, motivating future collaboration with partner networks to enhance observational spatial coverage.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
ARM sites generate large datasets which can be challenging to store, analyze, stream, and archive. Automating ingestion and analysis of data using artificial intelligence can alleviate volume issues by compressing or reducing data storage and by providing novel ways to index and access the data.
R1: Reliability > Quality
Data quality from ARM site sensors can be sensitive to noise and calibration issues requiring field specialists to identify potential problems. Since data volume is large, ingestion of data and identification of measurement drift benefit from automation.
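A simple form of such automation is flagging calibration drift by comparing a trailing-window mean against a reference period. A hedged sketch on a synthetic, irradiance-like series (the thresholds and window sizes are illustrative choices, not ARM practice):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic irradiance-like sensor series: stable for 500 steps, then a
# slow calibration drift of +0.002 per step is injected.
n = 1000
series = 100 + rng.normal(scale=1.0, size=n)
series[500:] += 0.002 * np.arange(500)

# Compare each trailing window's mean against a reference period; flag
# drift when the shift exceeds a threshold in reference-std units.
ref_mean = series[:200].mean()
ref_std = series[:200].std()
window = 100
flags = []
for end in range(window, n + 1, 50):
    m = series[end - window:end].mean()
    flags.append(abs(m - ref_mean) > 0.3 * ref_std)

first_flag_end = [window + 50 * i for i, f in enumerate(flags) if f][0]
print(first_flag_end)
```

In practice the reference period itself must be verified by field specialists, and changepoint methods are more robust than a fixed threshold; the sketch only shows why the check is automatable at scale.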
S2: Sufficiency > Coverage
Spatial coverage of radiation and associated ground-based atmospheric phenomena is limited to ARM sites within the United States. To increase spatial context, collaboration with partner sensor networks within the DOE and the ARM program can expand coverage within the United States, while similar initiatives outside the United States can enable better solar potential studies in regions with different environments.
S3: Sufficiency > Granularity
Enhanced aerosol composition measurements, along with ice-nucleating particle measurements, are needed to better understand cloud and weather dynamics jointly with solar irradiance, supporting DER site planning and solar potential surveying.
DYAMOND (DYnamics of the Atmospheric general circulation Modeled On Non-hydrostatic Domains)
Details (click to expand)
Intercomparison of global storm-resolving (5km or less) model simulations; used as the target of the emulator. Data can be found here.
Physics-based climate models incorporate numerous complex components, such as radiative transfer, subgrid-scale cloud processes, deep convection, and subsurface ocean eddy dynamics. These components are computationally intensive, which limits the spatial resolution achievable in climate simulations. ML models can emulate these physical processes, providing a more efficient alternative to traditional methods. By integrating ML-based emulations into climate models, we can achieve faster simulations and enhanced model performance.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
A common challenge for emulating climate model components, especially subgrid-scale processes, is the large data volume, which makes downloading, transferring, processing, and storing data challenging. Computational resources, including GPUs and storage, are urgently needed by most ML practitioners. Technical help with optimizing code for large data volumes would also be appreciated.
S3: Sufficiency > Granularity
The current resolution is still insufficient to resolve some physical processes, e.g., turbulence and tornadoes. Extremely high-resolution simulations, such as large-eddy simulations, are needed.
Direct measurement of methane emission of rice paddies
Details (click to expand)
Direct measurement of methane emissions from rice paddies, using instruments and sampling systems placed in the fields to measure methane concentrations in the air above them or in the soil.
There is a lack of direct observation of methane emissions.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
Direct measurement of methane emissions is often expensive and labor-intensive. However, these data are essential, as they provide the ground truth for training and constraining ML models. Increased funding is needed to support and encourage comprehensive data collection efforts.
Distribution system simulators
Details (click to expand)
Distribution system simulators such as OpenDSS and GridLab-D are crucial for understanding the hosting capacity of distribution level substation feeders because they allow for the analysis of various factors that can affect the stability and reliability of the power grid. These factors include voltage limits, thermal capability, control parameters, and fault current, among others. By simulating different scenarios and conditions, such as the integration of distributed energy resources (DERs) like photovoltaic (PV) solar panels, these tools can provide insights into how the grid can be optimized to accommodate these resources without compromising safety and reliability.
While OpenDSS and GridLab-D are free to use as an alternative when real distribution substation circuit feeder data is unavailable, to perform site-specific or scenario studies, data from different sources may be needed to verify simulation results. Actual hosting capacity may vary from simulation results due to differences in load, environmental conditions, and the level of DER penetration.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
OpenDSS is free to use as an alternative when real circuit feeder data from distribution utilities is unavailable.
U2: Usability > Aggregation
To perform a realistic distribution system-level study for a particular region of interest, data concerning topology, loads, and penetration of DERs needs to be aggregated and collated from external sources.
U3: Usability > Usage Rights
Rights to external data for use with OpenDSS or GridLab-D may require purchase or partnerships with utilities to perform scenario studies with high DER penetration and load demand.
R1: Reliability > Quality
OpenDSS and GridLab-D studies require real deployment data for verification of results from substations. Additionally, distribution level substation feeder hosting capacity may vary based on load, environmental conditions, and the level of DER penetration in a service area.
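For intuition only, the kind of quantity these simulators compute can be approximated with a back-of-the-envelope voltage-rise screen, ΔV ≈ (P·R + Q·X)/V². All feeder parameters below are invented, and real hosting-capacity studies require full OpenDSS or GridLab-D feeder models rather than a single-segment approximation.

```python
# A deliberately simplified voltage-rise screen for PV hosting capacity
# on one feeder segment. Numbers are illustrative assumptions only.

def voltage_rise_pu(p_kw: float, r_ohm: float, x_ohm: float,
                    q_kvar: float, v_ll_kv: float) -> float:
    """Approximate per-unit voltage rise: (P*R + Q*X) / V^2."""
    v_v = v_ll_kv * 1e3
    return (p_kw * 1e3 * r_ohm + q_kvar * 1e3 * x_ohm) / v_v**2

def hosting_capacity_kw(r_ohm, x_ohm, v_ll_kv, limit_pu=0.05, step_kw=10.0):
    """Largest PV injection (unity power factor) keeping rise under limit."""
    p = 0.0
    while voltage_rise_pu(p + step_kw, r_ohm, x_ohm, 0.0, v_ll_kv) <= limit_pu:
        p += step_kw
    return p

# Example: a hypothetical 12.47 kV feeder segment with 1.2 ohm resistance.
cap = hosting_capacity_kw(r_ohm=1.2, x_ohm=0.8, v_ll_kv=12.47)
print(cap)
```

A full simulator replaces this single R/X pair with the whole feeder topology, load shapes, and control settings, which is why simulated and screened capacities can differ substantially.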
Drone images for biodiversity
Details (click to expand)
Like camera traps, drone images can offer high-resolution and relatively close-range images for species identification, individual identification, and environment reconstruction. As with camera traps, most drone images are scattered across disparate sources. Some such data is hosted on www.lila.science.
There is a lack of a standardized protocol to guide data collection for restoration projects, including which variables to measure and how to measure them. Developing such protocols would standardize data collection practices, enabling consistent assessment and comparison of restoration successes and failures on a global scale.
Data Gap Type
Data Gap Details
U1: Usability > Structure
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, clearly written guidance is urgently needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze the data consistently across projects.
U4: Usability > Documentation
For restoration projects, there is an urgent need for a standardized protocol to guide data collection and curation for measuring biodiversity and other complex ecological outcomes. For example, clearly written guidance is urgently needed on which variables to collect and how to collect them.
Turning raw data into usable insights is also a significant challenge. There is a lack of standardized tools and workflows to turn raw data into analysis-ready form and to analyze the data consistently across projects.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and it is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There is a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also fail to cover a diverse spectrum of species, a problem compounded by gaps in current taxonomy.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets are currently scattered across the hard drives of individual researchers and organizations. Aggregating and sharing these datasets would largely address the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and publication credit, must be instituted to encourage data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors through appropriate data attribution practices and the assignment of digital object identifiers (DOIs) to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
The scarcity of publicly available and well-annotated data poses a significant challenge for applying ML in biodiversity study. Addressing this requires fostering a culture of data sharing, implementing incentives, and establishing standardized pipelines within the ecology community. Incentives such as financial rewards and recognition for data collectors are essential to encourage sharing, while standardized pipelines streamline efforts to leverage existing annotated data for training models.
Data Gap Type
Data Gap Details
S1: Sufficiency > Insufficient Volume
One of the foremost challenges is the scarcity of publicly open and adequately annotated data for model training. The lack of annotated datasets is a common and major challenge across almost every modality of biodiversity data, and is particularly acute for bioacoustic data, where sufficiently large and diverse datasets encompassing a wide array of species remain limited for model training purposes.
This scarcity of publicly open and well annotated datasets arises from the labor-intensive nature of data annotation and the dearth of expertise necessary to accurately identify species. This is the case even for relatively well-studied taxa such as birds, and is even more notable for taxa such as Orthoptera (grasshoppers and relatives). The incompleteness of our current taxonomy also compounds this issue.
Though some data cannot be shared with the public because of concerns about poaching or other legal restrictions imposed by local governments, there are still considerable sources of well-annotated data that currently reside within the confines of individual researchers’ or organizations’ hard drives. Unlocking these data will partially address the data gap.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
S2: Sufficiency > Coverage
There exists a significant geographic imbalance in the data collected, with insufficient representation from highly biodiverse regions. The data also fail to cover a diverse spectrum of species, a problem intertwined with taxonomic insufficiencies.
Addressing these gaps involves several strategies:
Data sharing: Unlocking existing datasets scattered across individual laboratories and organizations can partially alleviate the geographic and taxonomic gaps in data coverage.
Annotation Efforts:
Leveraging crowdsourcing platforms enables collaborative data collection and facilitates volunteer-driven annotation, playing a pivotal role in expanding annotated datasets.
Collaborating with domain experts, including ecologists and machine learning practitioners, is crucial. Incorporating local ecological knowledge, particularly from indigenous communities in target regions, enriches data annotation efforts.
Developing AI-powered methods for automatic cataloging, searching, and data conversion is imperative to streamline data processing and extract relevant information efficiently.
Data Collection:
Prioritize continuous data collection efforts, especially in underrepresented regions and for less documented species and taxonomic groups.
Explore innovative approaches like Automated Multisensor Stations for Monitoring of Species Diversity (AMMODs) to enhance data collection capabilities.
Strategically plan data collection initiatives to target species at risk of extinction, ensuring data capture before irreversible loss occurs.
U2: Usability > Aggregation
Many well-annotated datasets currently sit scattered within the confines of individual researchers’ or organizations’ hard drives. Aggregating and sharing these datasets would go a long way toward addressing the lack of publicly open annotated data needed for model training. This initiative would also mitigate geographic and taxonomic imbalances in existing data.
To achieve this goal:
A concerted effort to foster a culture of data sharing at both the individual and cross-institutional levels within the ecology community is imperative.
Effective incentives, including financial rewards, funding opportunities, and incentives for publication, must be instituted to incentivize data sharing within the ecology domain. Moreover, due recognition should be accorded to data collectors, facilitated through appropriate data attribution practices and the assignment of digital object identifiers to datasets.
The establishment of standardized pipelines, protocols, and computational infrastructures is essential to streamline efforts aimed at integrating and archiving data from disparate sources.
Drone images for wildfire
Details (click to expand)
Equipped with advanced cameras and sensors, drones capture real-time, high-resolution aerial images of wildfires, helping monitor fire behavior and assess its damage. Drones can access hard-to-reach areas and offer timely data, which is crucial for effective response and management of wildfires.
Thermal images captured by drones have high value but the cost of good sensors is high.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Thermal images are highly valuable, but their resolution is often too low (commonly 120x90 pixels) and their field of view is limited. Commercially available sensors can reach 640x480 pixels, but they are much more expensive (~$10K). Even higher-resolution sensors exist but are currently restricted to military use due to security, ethical, and privacy concerns. Those seeking such high-resolution sensors should carefully weigh the benefits and drawbacks of their request.
U6: Usability > Large Volume
Data volume is a concern for those collecting drone images and seeking to share them with the public. Finding a platform that offers adequate storage for hosting the data is challenging, as it must ensure that users can download the data efficiently without issues.
ENS
Details (click to expand)
Ensemble forecasts up to 15 days ahead, generated by the ECMWF numerical weather prediction model; used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
As with HRES, the biggest challenge of ENS is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available for purchase only, and it is expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, ENS data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving ENS data.
EPRI10: Transmission control center alarm and operational data set
Details (click to expand)
Supervisory Control and Data Acquisition (SCADA) systems collect data from sensors throughout the power grid. Alarm operational data, a portion of the data received by SCADA, provides discrete event-based information on the status of protection and monitoring devices in a tabular format which includes semi-structured text descriptions of individual alarm events. Often the data is formatted based on timestamp (in milliseconds), station, signal identification information, location, description, and action. Encoded within the identification information is the alarm message.
Access to EPRI10 grid alarm data is currently limited within EPRI. Data gaps with respect to usability are the result of redundancies in grid alarm codes, requiring significant preprocessing and analysis of code ids, alarm priority, location, and timestamps. Alarm codes can vary based on sensor, asset and line. Resulting actions as a consequence of alarm trigger events need field verifications to assess the presence of fault or non-fault events.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Data access is limited within EPRI due to restrictions with respect to data provided by utilities.
Anonymization and aggregation of data to a benchmark or toy dataset by EPRI to the wider community can be a means of circumventing the security issues at the cost of operational context.
U1: Usability > Structure
Grid alarm codes may be non-unique across different lines and grid assets. In other words, two different codes can represent equivalent information due to differences in naming conventions, requiring significant alarm data pre-processing and analysis to identify unique labels from over 2,000 code words. Additional labels expressing alarm priority (for example, a high alarm type indicative of events such as fire, gas, or lightning) are also encoded into the grid alarm trigger event code.
Creating a standard structure for operational text data, such as those already used in operational systems by companies like General Electric or Siemens, can avoid inconsistencies in the data.
U3: Usability > Usage Rights
Usage rights are currently constrained to those working within EPRI at this time.
U4: Usability > Documentation
Remote signaling identification information from monitoring sensors and devices encode data with respect to the alarm trigger event in the context of fault priority. Based on the asset, line, or sensor, this identification code can vary depending on naming conventions used. Documentation on remote signal ids associated with a dictionary of finite alarm code types can facilitate pre-processing of alarm data and assessment on the diversity of fault events occurring in real-time systems (as different alarm trigger codes may correspond to redundant events similar in nature).
U5: Usability > Pre-processing
In addition to challenges in decoding remote signal identification data, the description fields associated with alarm trigger events are unstructured and vary in the amount of text detail provided. Typically the details cover the grid asset and its action; for example, a text description from a line monitoring device may describe the power, temperature, and the action taken in response to the grid alarm trigger event. In real-world systems, the majority of grid alarm trigger events consist of short-circuit faults and non-fault events, limiting the diversity of fault types found in the data.
To combat these issues, data pre-processing becomes necessary. For remote signal identification data this includes parsing and hashing through text codes, assessing code components for redundancies, and building an associated reduced dictionary of alarm codes. For textual description fields, and post-fault field reports, the use of natural language processing techniques to extract key information can provide more consistency between sensor data. Additionally, techniques like diverse sampling can account for the class imbalance with respect to the associated fault that can trigger the alarm.
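The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the actual EPRI10 pipeline: the record format, field names, and alarm codes are invented for demonstration, and a real system would build the synonym dictionary by analyzing code components rather than hard-coding it.

```python
from collections import defaultdict

# Hypothetical semi-structured alarm records: timestamp (ms), station,
# signal id, free-text description. This layout is an illustrative
# assumption, not the actual EPRI10 schema.
RAW_ALARMS = [
    "1696502400123|STN_A|L1-OC-HI|Overcurrent trip on line 1, breaker opened",
    "1696502400456|STN_A|LINE1_OVERCURR_H|Overcurrent trip on line 1, breaker opened",
    "1696502401789|STN_B|T2-TEMP-HI|Transformer 2 temperature high, fan started",
]

# Map redundant, vendor-specific codes to one canonical label; in practice
# this reduced dictionary is derived by parsing and comparing code components.
CODE_SYNONYMS = {
    "L1-OC-HI": "LINE1_OVERCURRENT_HIGH",
    "LINE1_OVERCURR_H": "LINE1_OVERCURRENT_HIGH",
    "T2-TEMP-HI": "XFMR2_TEMP_HIGH",
}

def parse_alarm(line: str) -> dict:
    """Split one semi-structured record and canonicalize its alarm code."""
    ts, station, code, desc = line.split("|", 3)
    return {
        "timestamp_ms": int(ts),
        "station": station,
        "code": CODE_SYNONYMS.get(code, code),
        "description": desc,
    }

def reduced_code_dictionary(lines):
    """Group alarms under canonical codes, exposing redundant code words."""
    groups = defaultdict(list)
    for line in lines:
        rec = parse_alarm(line)
        groups[rec["code"]].append(rec)
    return dict(groups)
```

Here the two overcurrent records collapse into a single canonical code, which is the kind of redundancy reduction the pre-processing step aims for.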
U6: Usability > Large Volume
Operational alarm data volume is large given the millisecond cadence of measurements made across the system. The result is high-volume data that is tabular in nature but also unstructured with respect to the text details associated with alarm trigger events, sensor measurements, and controller actions. Since the data also contains locations and grid asset information, spatiotemporal analysis can be performed for a single sensor and the conditions under which it operates. Indexing and mining time series data can therefore facilitate faster search over alarm data leading up to a fault event, and natural language processing and text mining techniques can likewise facilitate search over alarm text and details.
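As a small sketch of time-series indexing for this kind of search, the snippet below retrieves all alarms in a window leading up to a fault using binary search over sorted timestamps; the event stream and codes are illustrative, and a production system would index millions of millisecond-cadence records rather than a short list.

```python
from bisect import bisect_left, bisect_right

# Hypothetical (timestamp_ms, alarm_code) stream, assumed sorted by time.
events = [
    (1000, "VOLT_DIP"),
    (1500, "VOLT_DIP"),
    (2000, "BREAKER_OPEN"),
    (2600, "SHORT_CIRCUIT_FAULT"),
]
timestamps = [t for t, _ in events]

def alarms_before(fault_time_ms: int, window_ms: int):
    """Return alarms in the window leading up to (and including) a fault,
    found via binary search instead of a full scan."""
    lo = bisect_left(timestamps, fault_time_ms - window_ms)
    hi = bisect_right(timestamps, fault_time_ms)
    return events[lo:hi]
```

The same idea extends to a per-sensor index keyed on (station, timestamp) for spatiotemporal queries.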
R1: Reliability > Quality
Alarm trigger events, and the corresponding actions taken, require post-event assessment by field workers for verification, especially in cases of faults or perceived faults.
U2: Usability > Aggregation
Reports on location, asset, and time can result in false alarm triggers, requiring operators to send field workers to investigate, fix, and recalibrate field sensors. Data from these field assessments can be incorporated to provide greater context.
ERA5
Details (click to expand)
Atmospheric reanalysis data integrates both in-situ and remote sensing observations, including data from weather stations, satellites, and radar. This comprehensive dataset can be downloaded from the provided link.
ERA5 is now the most widely used reanalysis data due to its high resolution, good structure, and global coverage. The most urgent challenge that needs to be resolved is that downloading ERA5 from Copernicus Climate Data Store is very time-consuming, taking from days to months. The sheer volume of data is another common challenge for users of this dataset.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Downloading ERA5 data from the Copernicus Climate Data Store can be very time-consuming, taking anywhere from days to months. This delay is due to the large size of the dataset, which results from its high spatio-temporal resolution, its high demand, and the fact that the data is stored on tape.
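One common workaround for slow CDS downloads is to split a large request into many small ones, which tend to clear the queue faster and fail independently. The sketch below builds per-month request dictionaries in the style used by the `cdsapi` Python client for the `reanalysis-era5-single-levels` dataset; the exact request keys should be checked against the current CDS documentation, and submitting them requires a free CDS account.

```python
# Build one request per (year, month) instead of a single huge request.
# Keys follow the long-standing cdsapi convention for ERA5 single-level
# data; treat them as an assumption and verify against current CDS docs.
def monthly_requests(variable: str, years, months):
    requests = []
    for year in years:
        for month in months:
            requests.append({
                "product_type": "reanalysis",
                "variable": variable,
                "year": str(year),
                "month": f"{month:02d}",
                "day": [f"{d:02d}" for d in range(1, 32)],
                "time": [f"{h:02d}:00" for h in range(24)],
                "format": "netcdf",
            })
    return requests

# To actually submit (not run here; needs CDS credentials):
#   import cdsapi
#   client = cdsapi.Client()
#   for i, req in enumerate(monthly_requests("2m_temperature", [2020], range(1, 13))):
#       client.retrieve("reanalysis-era5-single-levels", req, f"era5_{i:02d}.nc")
```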
U6: Usability > Large Volume
The large volume of data poses significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. Although cloud-based access is available from various providers, it often comes with high costs and is not free. To address these challenges, it would be beneficial to make additional computational resources available alongside the storage of data.
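A back-of-envelope calculation shows why the volume is challenging even for a single variable. The numbers below assume ERA5's 0.25-degree grid (721x1440 points), hourly output, and uncompressed 4-byte floats; actual archive sizes differ with packing and compression.

```python
# Rough uncompressed size of one single-level ERA5 variable for one year.
def era5_field_size_gb(n_lat=721, n_lon=1440, hours=24 * 365, bytes_per_value=4):
    return n_lat * n_lon * hours * bytes_per_value / 1e9

size = era5_field_size_gb()  # on the order of tens of GB per variable-year
```

Multiplying by dozens of variables, pressure levels, and 80+ years of record makes clear why colocating compute with the data is preferable to downloading it.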
R1: Reliability > Quality
ERA5 is often used as “ground truth” in bias-correction tasks, but it has its own biases and uncertainties. ERA5 is derived from observations combined with physics-based models to estimate conditions in areas with sparse observational data. Consequently, biases in ERA5 can stem from the limitations of these physics-based models. It is worth noting that precipitation and cloud-related fields in ERA5 are less accurate compared to other fields and are not suitable for validating ML models. A higher-fidelity atmospheric dataset, such as an enhanced version of ERA5, is greatly needed. Machine learning can play a significant role in this area by improving the assimilation of atmospheric observation data from various sources.
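A minimal additive bias-correction sketch of the kind often applied with ERA5 as the reference is shown below. The caveat above applies: ERA5 has biases of its own, especially for precipitation and cloud fields, so "corrected" here means corrected toward ERA5, not toward truth. The numbers are made up for illustration.

```python
# Additive (delta-method) bias correction: shift future model output by
# the historical model-minus-reference mean bias. ERA5 plays the role of
# "reference" here despite its own known biases.
def mean_bias_correct(model_hist, reference_hist, model_future):
    bias = sum(model_hist) / len(model_hist) - sum(reference_hist) / len(reference_hist)
    return [x - bias for x in model_future]
```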
Wildfires are becoming more frequent and severe due to climate change. Accurate early prediction of these events enables timely evacuations and resource allocation, thereby reducing risks to lives and property. ML can enhance fire prediction by providing more precise forecasts quickly and efficiently.
Data Gap Type
Data Gap Details
ESRI land cover map
Details (click to expand)
Sentinel-2 10-m annual map of Earth’s land surface from 2017-2023.
There are also other land cover maps available: https://gisgeography.com/free-global-land-cover-land-use-data/.
Use Case
Data Gap Summary
Emission dataset compiled from FAO statistics
Details (click to expand)
Dataset taken from FAO statistics and extrapolated spatially
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Data Gap Type
Data Gap Details
R1: Reliability > Quality
Data is extrapolated from statistics on a national level. It is unknown how accurate this data is when focusing on local information.
Exposure data
Details (click to expand)
Exposure is defined as the representative value of assets potentially exposed to a natural hazard occurrence. It can be described by a wide range of features, such as GDP, population, buildings, or agriculture, depending on the risk in question.
There are global open data as well as proprietary data with more detailed information coming from well-established insurance markets.
It can be socio-economic data or structural (building occupancy and construction class) data. Two open-source structural data are OpenStreetMap and OpenQuake GEM project.
Country-specific exposure data can range from extensive and detailed to almost completely unavailable, even if they exist as hard copies in government offices.
U3: Usability > Usage Rights
OpenQuake GEM project provides comprehensive data on the residential, commercial, and industrial building stock worldwide, but it is restricted to non-commercial use only.
R1: Reliability > Quality
For some data, e.g. population data, there are several datasets available and they all differ from each other by a lot. Validation is needed before the data can be used comfortably and confidently.
Some data, e.g. geospatial socioeconomic data provided by the UNEP Global Resource Information Database, are not always current or complete.
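The validation step suggested above can start with a simple cross-dataset consistency check: compare estimates for the same region from several sources and flag regions where the spread exceeds a tolerance. The dataset names, figures, and 20% tolerance below are illustrative assumptions, not real data.

```python
# Flag regions where independent exposure estimates (e.g. population)
# disagree by more than rel_tol, as a cheap first-pass validation.
def flag_disagreements(estimates_by_region, rel_tol=0.2):
    flagged = {}
    for region, estimates in estimates_by_region.items():
        values = list(estimates.values())
        spread = (max(values) - min(values)) / min(values)
        if spread > rel_tol:
            flagged[region] = round(spread, 3)
    return flagged

# Hypothetical population estimates from three sources.
estimates = {
    "region_a": {"dataset_x": 1.00e6, "dataset_y": 1.05e6, "dataset_z": 0.98e6},
    "region_b": {"dataset_x": 2.00e5, "dataset_y": 3.10e5, "dataset_z": 2.40e5},
}
```

Flagged regions would then warrant deeper validation against ground-truth sources before use in risk assessment.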
S3: Sufficiency > Granularity
For open global data, the resolution and completeness are usually not sufficient for desired purposes, e.g. GDP data from the World Bank or US CIA is not sufficiently detailed for assessing risks from natural hazards.
Faraday: Synthetic smart meter data
Details (click to expand)
Due to consumer privacy protections, advanced metering infrastructure (AMI) data is unavailable for realistic demand response studies. In an effort to open smart meter data, Octopus Energy’s Centre for Net Zero has generated a synthetic dataset conditioned on the presence of low-carbon technologies, energy efficiency, and property type, using a model trained on 300 million actual smart meter readings from a United Kingdom (UK) energy supplier.
Faraday synthetic AMI data is a response to the bottlenecks in retrieving building-level demand data that stem from consumer privacy. However, despite the model being trained on actual AMI data, synthetic data generation requires a priori knowledge of building type, efficiency, and the presence of low-carbon technology. Furthermore, since the model is trained on UK building data, the AMI time series generated may not accurately represent load demand in regions outside the UK. Finally, since the data is synthetically generated, studies will require validation and verification against real data or data aggregated at the substation level to assess its effectiveness.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Faraday is currently accessible through Centre for Net Zero’s API.
The variational autoencoder model can generate synthetic AMI data based on several conditions. The presence of low-carbon technology (LCT) for a given household or property type depends on access to battery storage solutions, solar rooftop panels, and electric vehicles; this type of data may require curation of LCT purchases by type and household. Building type and efficiency at the residential and commercial/industrial level may also be difficult to access, requiring the user to set initial assumptions or seek additional datasets. Furthermore, data verification requires a performance metric grounded in actual readings, which may be obtained through access to substation-level load demand data.
U3: Usability > Usage Rights
Faraday is open for alpha testing by request only.
R1: Reliability > Quality
Synthetic AMI data requires verification, which can be done in a bottom-up grid modeling manner. For example, load demand at the substation level can be estimated as the sum of the individual building loads that the substation services. This value can then be compared to the actual substation load demand provided through private partnerships with distribution network operators (DNOs). However, verifying a specific demand profile for a property or set of properties would require identifying a population of buildings, a connected real-world substation, and the residential low-carbon technology investment for the properties under study.
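The bottom-up check described above can be sketched in a few lines: sum the synthetic building loads assigned to a substation and compare the aggregate to the metered substation total. The building IDs, load values, and 10% tolerance are illustrative assumptions.

```python
# Bottom-up validation: does the sum of synthetic building loads under a
# substation roughly match the metered substation demand?
def validate_substation(synthetic_loads_kw, metered_total_kw, rel_tol=0.10):
    """Return (aggregate load, relative error, pass/fail) for one substation."""
    aggregate = sum(synthetic_loads_kw.values())
    rel_err = abs(aggregate - metered_total_kw) / metered_total_kw
    return aggregate, rel_err, rel_err <= rel_tol
```

In practice the metered total would come from a DNO partnership, and the check would be repeated across many substations and time intervals.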
S2: Sufficiency > Coverage
Faraday is trained on utility-provided AMI data from the UK, which may not be representative of the load demand, building stock, and climate zones of other global regions.
Coverage of data is restricted to the pilot test bed whether it be through private collection, partnership with the utility, or use of pre-existing demand data.
S3: Sufficiency > Granularity
Data granularity is limited to the granularity of data the model was trained on.
S4: Sufficiency > Timeliness
Timeliness of dataset would require continuous integration and development of the model using MLOps best practices to avoid data and model drift. By contributing to Linux Foundation Energy’s OpenSynth initiative, Centre for Net Zero hopes to build a global community of contributors to facilitate research.
FathomNet
Details (click to expand)
FathomNet is an open-source image database that standardizes and aggregates expertly curated labeled data. It can be used to train, test, and validate state-of-the-art artificial intelligence algorithms to help us understand our ocean and its inhabitants.
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Data Gap Type
Data Gap Details
M: Misc/Other
The biggest challenge for the developer of FathomNet is enticing different institutions to contribute their data.
Financial loss datasets related to the impacts of disasters
Details (click to expand)
Financial loss datasets related to disasters track the economic impacts of catastrophic events, including insurance claims and damages to infrastructure. They help assess financial repercussions and guide risk management and preparedness strategies.
Data tends to be proprietary, as the most consistent loss data is produced by the insurance industry.
O2: Obtainability > Accessibility
Even for a single event, collecting a robust set of homogeneous loss data poses a significant challenge.
U4: Usability > Documentation
With existing data, determining whether the data is complete can be a challenge, as little or no metadata is commonly associated with the loss data.
Financial loss data is typically proprietary and held by insurance and reinsurance companies, as well as financial and risk management firms. Some of the data should be made available for research purposes.
Floating INfrastructure for Ocean observations FINO3
Details (click to expand)
FINO3 is an offshore research platform whose wind-mast datasets include time series of wind speed and wind direction as well as temperature, air pressure, relative humidity, global radiation, and precipitation. Images from the perspective of the platform provide a direct snapshot of environmental conditions. The platform is located in the northern part of the German Bight, 80 km northwest of the island of Sylt, in the midst of wind farms. Wind measurements are taken between 32 and 102 meters above sea level, with wind speed measured every 10 meters. Data has been collected from August 2009 until the present day.
Due to its location, FINO platform measurement sensors are prone to failure under adverse outdoor conditions such as high wind and high waves. When sensors fail, manual recalibration or repair is nontrivial, requiring weather conditions amenable to human intervention. This directly affects data quality, with gaps that can last from several weeks to a season. Coverage is also constrained to the platform and the associated wind farm location.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The data is free to use but requires registration for a login account at: https://login.bsh.de/fachverfahren/
U5: Usability > Pre-processing
The dataset is prone to measurement sensor failures. Issues with data loggers, power supplies, and adverse conditions such as low aerosol concentrations can degrade data quality. High wind and wave conditions hinder sensor correction or recalibration, creating data gaps that can last for several weeks or a full season.
S2: Sufficiency > Coverage
Coverage is limited to the dimensions of the platform itself and the wind farm it is built in proximity to. For locations with different offshore characteristics, similar testbed platforms or buoys could be developed.
S5: Sufficiency > Proxy
Because the sensors are exposed to ocean conditions and storms, FINO sensors often need maintenance and repair but are difficult to access physically. The resulting gaps in the data can be addressed by using mesoscale wind modeling output as a proxy.
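As a rough sketch of this proxy approach, the example below fills gaps in a 10-minute wind speed series using co-located mesoscale model output, with a simple linear bias correction fitted where both sources overlap. The timestamps, values, and variable names are illustrative, not part of FINO3's actual processing chain.

```python
# Fill gaps in mast measurements with mesoscale model output (hypothetical values).
# A linear bias correction (least squares) is fitted on timestamps where both exist.

def fill_gaps(measured, model):
    """measured: dict timestamp -> wind speed (m/s) or None; model: timestamp -> speed."""
    # Fit scale/offset on overlapping, valid samples.
    pairs = [(model[t], v) for t, v in measured.items() if v is not None and t in model]
    n = len(pairs)
    sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    offset = (sy - slope * sx) / n
    # Replace missing measurements with bias-corrected model output.
    return {t: v if v is not None else slope * model[t] + offset
            for t, v in measured.items()}

measured = {0: 8.1, 10: 8.4, 20: None, 30: 9.0, 40: None, 50: 9.6}
model    = {0: 7.5, 10: 7.9, 20: 8.2, 30: 8.5, 40: 8.9, 50: 9.1}
filled = fill_gaps(measured, model)
```

Real gap-filling would also propagate an uncertainty flag for the filled points, since model output is not an in-situ measurement.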
GBIF
Details (click to expand)
GBIF—the Global Biodiversity Information Facility—is an international network and data infrastructure funded by the world’s governments. It offers open access to global biodiversity data. It sets common standards for sharing species records collected from various sources, like museum specimens and modern technologies. Using standards like Darwin Core, GBIF.org indexes millions of species records, accessible under open licenses, supporting scientific research and policy-making.
While GBIF provides a common standard, the accuracy of species classification in the data can vary, and classifications may change over time.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Though GBIF provides a common standard, species classifications in the data are not always accurate or consistent, and the same species may even have been assigned to different groups over time.
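One common mitigation is to harmonize records against a synonym table before analysis, mapping historical or variant names to a single accepted name and flagging anything unresolvable for manual review. The synonym table and record schema below are illustrative, not GBIF's actual taxonomic backbone.

```python
# Harmonize species records to accepted names; flag unknowns for review.
# SYNONYMS here is a toy lookup, standing in for a real taxonomic backbone.

SYNONYMS = {
    "Felis concolor": "Puma concolor",   # historical synonym
    "Puma concolor": "Puma concolor",    # already the accepted name
}

def harmonize(records):
    """records: list of dicts with a 'scientificName' field (Darwin Core style)."""
    out = []
    for rec in records:
        name = rec["scientificName"].strip()
        accepted = SYNONYMS.get(name)
        out.append({**rec,
                    "acceptedName": accepted,
                    "needsReview": accepted is None})
    return out

records = [{"scientificName": "Felis concolor"},
           {"scientificName": "Puma concolor"},
           {"scientificName": "Puma con-color"}]  # malformed entry
clean = harmonize(records)
```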
GEDI lidar
Details (click to expand)
The Global Ecosystem Dynamics Investigation (GEDI) is a joint mission between NASA and the University of Maryland. It uses three lasers to capture and then construct detailed three-dimensional (3D) maps of forest canopy height and the distribution of branches and leaves. By accurately measuring forests in 3D, GEDI data play an important role in estimating canopy height, and thus in understanding how much biomass and carbon forests store and how much they lose when disturbed.
GEDI is globally available but has some limitations, e.g., geolocation errors and weak return signals over dense forest, which introduce uncertainty and error into canopy height estimates.
Grid event signature library
Details (click to expand)
Grid2Op is a power systems simulation framework for applying reinforcement learning to electricity network operation, focusing on the use of topology to control flows on the grid. Grid2Op allows users to control voltages by manipulating shunts or changing generator setpoint values, influence active generation through redispatching, and manipulate storage units such as batteries or pumped storage to produce or absorb energy from the grid when needed. The grid is represented as a graph whose nodes are buses and whose edges correspond to power lines and transformers. Grid2Op offers several environments with different network topologies, as well as variables that can be monitored as observations. The environment is designed for reinforcement learning agents to act upon with a variety of actions, some binary and some continuous, including topology changes such as changing bus, changing line status, setting storage, curtailment, redispatching, setting bus values, and setting line status. Multiple reward functions are also available in the platform for experimentation with different agents. It is important to note that Grid2Op has no internal modeling of the grid equations or of what kind of solver should be adopted. Data on how the power grid evolves is represented by the "Chronics." The solver that computes the state of the grid is represented by the "Backend," which uses PandaPower to compute power flows.
Grid2Op is a reinforcement learning framework that builds an environment from topologies, selected grid observations, a selected reward function, and a set of actions for an agent to choose from. The framework relies on control laws rather than direct system observations, which are subject to multiple constraints and changing load demand. Time steps representing 5 minutes cannot capture complex transients and can limit the effectiveness of certain actions within the action space over others. Furthermore, customizing Grid2Op can be challenging, as the platform does not allow conversion from single-agent to multi-agent setups, and it is not a suitable environment for cascading failure scenarios due to its game-over rules.
Data Gap Type
Data Gap Details
U4: Usability > Documentation
In the customization of the reward function, several TODOs remain concerning the units and attributes of the redispatching-related reward. Documentation and code comments can sometimes provide conflicting information. The modularity of reward, adversary, action, environment, and backend is nonintuitive, requiring pregenerated dictionaries rather than dynamic inputs or conversion from single-agent to multi-agent functionality.
U5: Usability > Pre-processing
The game-over rules and constraints are difficult to adapt when customizing the environment for cascading failure scenarios, which may be interesting to conduct for more complex adversaries such as natural disasters. Codebase variations between versions, especially between the native and Gym-formatted frameworks, lose features present in the legacy version, including topology graphics.
S2: Sufficiency > Coverage
Coverage is limited to the network topologies provided by the Grid2Op environment, which are based on different IEEE bus topologies. While customization of the environment's "Backend," "Parameters," and "Rules" is possible, dependent modules may still enforce game-over rules. Furthermore, since backend modeling is not the focus of Grid2Op, verifying that customizations obey physical laws or models is necessary.
S3: Sufficiency > Granularity
The time resolution of 5-minute increments may not represent realistic observation time series grid data, or chronics. Furthermore, this granularity may limit the effectiveness of specific actions in the provided action space. For example, the use of energy storage devices in the presence of overvoltage has little effect on energy absorption, incentivizing the agent to select grid topology actions such as changing line status or changing bus rather than setting storage.
R1: Reliability > Quality
The Grid2Op framework relies on mathematically robust control laws and rewards, training the RL agent on fixed observation assumptions rather than actual system dynamics, which are susceptible to noise, uncertainty, and disturbances not represented in the simulation environment. It has no internal modeling of the grid equations, or of the solver needed to solve traditional nonlinear optimal power flow equations; specifics of the modeling and preferred solver require users to customize or create a new "Backend." Additionally, such RL human-in-the-loop systems in practice require trustworthiness and quantification of risk.
Ground survey of building information
Details (click to expand)
On-site collection of data to accurately map and measure the physical dimensions and boundaries of buildings. This survey is typically conducted using a variety of methods and tools to ensure precise and detailed mapping.
Ground survey of land use and land management
Details (click to expand)
The direct collection of data through field observations to understand how land is utilized and managed.
Data access is restricted due to institutional barriers and other restrictions.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
Access to the data is restricted, with limited availability to the public. Users often find themselves unable to access the comprehensive information they require and must settle for suboptimal or outdated data. Addressing this challenge necessitates a legislative process to facilitate broader access to data.
Ground-survey based forest inventory data
Details (click to expand)
Forest information collected directly from forested areas through on-the-ground observations and measurements serves as ground truth for training and validating estimates. This data is crucial for accurate assessments, such as estimating forest canopy height using machine learning models.
The data is manually collected and recorded, resulting in errors, missing values, and duplicates. Additionally, it is limited in coverage and collection frequency.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
The data contains many missing values and duplicate records.
S2: Sufficiency > Coverage
Since data is collected manually, collection is hard to scale and is limited to certain regions only.
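A minimal cleaning pass for such manually recorded inventories might drop exact duplicates from double data entry, exclude records with missing measurements, and keep the latest record per plot. The schema (plot_id, year, height_m) is a hypothetical illustration, not an actual inventory format.

```python
# Clean a manually recorded plot inventory (hypothetical schema):
# drop exact duplicates, skip missing measurements, keep latest record per plot.

def clean_inventory(rows):
    """rows: list of dicts with plot_id, year, height_m (may be None)."""
    seen, deduped = set(), []
    for r in rows:
        key = (r["plot_id"], r["year"], r["height_m"])
        if key in seen:
            continue          # exact duplicate from double data entry
        seen.add(key)
        deduped.append(r)
    latest = {}
    for r in deduped:
        if r["height_m"] is None:
            continue          # missing measurement: exclude from training labels
        prev = latest.get(r["plot_id"])
        if prev is None or r["year"] > prev["year"]:
            latest[r["plot_id"]] = r
    return latest

rows = [
    {"plot_id": "A1", "year": 2019, "height_m": 21.5},
    {"plot_id": "A1", "year": 2019, "height_m": 21.5},   # duplicate entry
    {"plot_id": "A1", "year": 2022, "height_m": 23.0},
    {"plot_id": "B7", "year": 2021, "height_m": None},   # missing value
]
latest = clean_inventory(rows)
```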
HRES
Details (click to expand)
A single high-resolution forecast up to 10 days ahead, generated by ECMWF's numerical weather prediction model, the Integrated Forecasting System (IFS). It is usually used as a benchmark/baseline for evaluating ML-based weather forecasts. Data can be found here.
The biggest challenge with using HRES data is that only a portion of it is available to the public for free.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
The high-resolution real-time forecast is available for purchase only, and it is expensive to obtain.
U6: Usability > Large Volume
Due to its high spatio-temporal resolution, HRES data is quite large, creating significant challenges for downloading, transferring, storing, and processing with standard computational infrastructure. To address this, the WeatherBench 2 benchmark dataset provides a solution for certain machine learning tasks involving HRES data.
Hazard data
Details (click to expand)
Hazard data used for risk assessments are usually presented as a catalog of hypothetical events with characteristics derived from, and statistically consistent with, the observational record. Some hazard data catalogs can be found at https://sedac.ciesin.columbia.edu/theme/hazards/data/sets/browse, as well as in the Risk Data Library of the World Bank.
The resolution of current hazard data is not sufficient for effective physical risk assessment.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
Hazard (e.g. floods, tropical cyclones) data of global coverage tends to be of coarse resolution and variable quality. More detailed data and models with higher resolution should be used in risk assessments for the design of specific disaster risk management projects.
R1: Reliability > Quality
Projection of future climate hazards is essential for assessing long-term risks, but such projections currently carry large uncertainties. For example, there is large uncertainty in wildfire projections in CMIP6 data.
Health data
Details (click to expand)
Health data refers to information related to individuals’ physical and mental well-being. This can include a wide range of data, such as medical records, health surveys, healthcare utilization, and epidemiological data.
The biggest issue for health data is its limited and restricted access.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
In general, there are few datasets that cover the full spectrum of population, age, gender, economic factors, etc. To make good use of available data, more effort should go into integrating data from disparate sources, such as creating data repositories and open community data standards.
U4: Usability > Documentation
Some data repositories are available, but the data is not always accompanied by the source code that created it or by other forms of good documentation.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data is usually in raster files or gridded format, whereas health data is usually in tabular format. Mapping climate data to the same geospatial entity of health data is also computationally expensive.
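The core of the mapping step can be sketched as a zonal average: for each tabular health region, average the values of the grid cells that fall inside it. The grid, region bounds, and values below are made up for illustration; real work would use polygon-based zonal statistics with geospatial libraries rather than bounding boxes.

```python
# Map gridded climate data onto tabular health regions by averaging the grid
# cells whose centers fall inside each region's (toy) bounding box.

def regional_mean(grid, region_bounds):
    """grid: dict (lat, lon) -> value; region_bounds: (lat0, lat1, lon0, lon1)."""
    lat0, lat1, lon0, lon1 = region_bounds
    vals = [v for (lat, lon), v in grid.items()
            if lat0 <= lat < lat1 and lon0 <= lon < lon1]
    return sum(vals) / len(vals) if vals else None

# 0.25-degree temperature grid (toy values, degrees C)
grid = {(10.0, 20.0): 29.5, (10.0, 20.25): 30.1,
        (10.25, 20.0): 28.8, (10.25, 20.25): 29.4}

# Tabular health data keyed by region, with a crude bounding box per region
health = [{"region": "district_1", "cases": 120, "bounds": (10.0, 10.5, 20.0, 20.5)}]
for row in health:
    row["mean_temp_c"] = regional_mean(grid, row["bounds"])
```

The computational cost mentioned above comes from doing this intersection for millions of cells against thousands of region polygons, which is why spatial indexing is usually required.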
High-resolution weather forecast (HRRR)
Details (click to expand)
Data volume is large, and only data covering the US is available.
Data Gap Type
Data Gap Details
U6: Usability > Large Volume
The large data volume, resulting from its high spatio-temporal resolution, makes transferring and processing the data very challenging. It would be beneficial if the data were accessible remotely and if computational resources were provided alongside it.
S2: Sufficiency > Coverage
This assimilated dataset currently covers only the continental US. It would be highly beneficial to have a similar dataset that includes global coverage.
Historical climate observations
Details (click to expand)
Climate observations of the past. Reanalysis datasets like ERA5 provide global-scale data at coarse resolution. Climate data aggregated from local weather station observations offers a more granular view.
Processing climate data and integrating it with health data is a major challenge.
Data Gap Type
Data Gap Details
O1: Obtainability > Findability
For people without expertise in climate data, it is hard to find the right data for their needs, as there is no centralized platform for all available climate data.
U1: Usability > Structure
Datasets are of different formats and structures.
U2: Usability > Aggregation
Integrating climate data and health data is challenging. Climate data is usually in raster files or gridded format, whereas health data is usually in tabular format. Mapping climate data to the same geospatial entity of health data is also computationally expensive.
For highly biodiverse regions, like tropical rainforests, there is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and the relationship with biodiversity.
Data Gap Type
Data Gap Details
S3: Sufficiency > Granularity
There is a lack of high-resolution data that captures the spatial heterogeneity of climate, hydrology, soil, and so on, which are important for biodiversity patterns. This is because observation systems are not dense enough to capture the subtleties in those variables caused by terrain. It would be helpful to establish decentralized monitoring networks to cost-effectively collect and maintain high-quality data over time, something no single country can do alone.
LBNL: Solar panel PV system dataset
Details (click to expand)
Lawrence Berkeley National Lab (LBNL) Solar Panel PV System Dataset is a small tabular dataset that includes specific feature data on PV system size, rebate, construction, tracking, mounting, module types, number of inverters and types, capacity, electricity pricing, and battery rated capacity. The LBNL solar panel PV system dataset was created by collecting and cleaning data for 1.6 million individual PV systems, representing 81% of all U.S. distributed PV systems installed through 2018. The analysis of installed prices focused on a subset of roughly 680,000 host-owned systems with available installed price data, of which 127,000 were installed in 2018. The dataset was sourced primarily from state agencies, utilities, and organizations administering PV incentive programs, solar renewable energy credit registration systems, or interconnection processes.
The LBNL solar panel PV system dataset excluded third-party-owned systems and systems with battery backup. Since data was self-reported, component cost data may be imprecise. The dataset also has a timeliness gap, as some of it is historical data that may not reflect current pricing and costs of PV systems.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
The dataset excluded third-party-owned systems, systems with battery backup, self-installed systems, and records missing installation prices. Data was self-reported and may be inconsistent in how component costs were reported. Furthermore, some state markets were underrepresented or missing, which could be alleviated by new data collection or by using the dataset jointly with simulation studies.
S4: Sufficiency > Timeliness
The dataset includes historical data that may not reflect current PV system pricing. To alleviate this, updated pricing could be incorporated from external data or as additional synthetic data from simulation.
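One simple way to make historical installed prices comparable is to deflate them to a common dollar year with a price index. The deflator values below are placeholders, not real CPI or PV price index data, and the price is illustrative.

```python
# Illustrative adjustment of historical installed prices ($/W) to a common
# reference year using a deflator series. Index values are hypothetical.

DEFLATOR = {2016: 0.93, 2017: 0.95, 2018: 0.97, 2023: 1.00}  # made-up index

def to_reference_dollars(price_usd_per_w, install_year, ref_year=2023):
    return price_usd_per_w * DEFLATOR[ref_year] / DEFLATOR[install_year]

price_2018 = 3.70  # $/W installed, illustrative
price_adj = to_reference_dollars(price_2018, 2018)
```

A PV-specific cost index would be preferable to a general deflator, since module prices have fallen much faster than overall inflation.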
Large-eddy simulations
Details (click to expand)
Very high resolution (finer than 150 m) atmospheric simulations where atmospheric turbulence is explicitly resolved in the model.
Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.
Extremely high-resolution simulations, such as large-eddy simulations, are needed. By explicitly resolving processes that are not resolved in current climate models, these simulations may serve as better ground truth for training machine learning models that emulate the physical processes and offer a more accurate basis for understanding and predicting climate phenomena.
Data Gap Type
Data Gap Details
S6: Sufficiency > Missing Components
The resolution of current high-resolution simulations is still insufficient for resolving many physical processes, such as turbulence. To address this, extremely high-resolution simulations, like large-eddy simulations (with sub-kilometer or even tens of meter resolution), are needed. By explicitly resolving those turbulent processes, these simulations represent a more realistic realization of the atmosphere and therefore theoretically give better model results. These simulations may serve as ground truth for training machine learning models and offer a more accurate basis for understanding and predicting climate phenomena. Long-term climate simulations at this ultra-high resolution would significantly enhance both hybrid climate modeling and climate emulation, providing deeper insights into global warming scenarios.
Given the high computational cost of running such simulations, creating and sharing benchmark datasets based on these simulations is essential for the research community. This would facilitate model development and validation, promoting more accurate and efficient climate studies.
LiDAR
Details (click to expand)
LiDAR (Light Detection and Ranging) data provides high-resolution, three-dimensional information about surfaces and objects captured using LiDAR technology. Some open datasets that can be used for roof classification include OpenTopography and the USGS 3D Elevation Program (3DEP). Many cities, like Boston and London, also have their own LiDAR datasets.
Micro-synchrophasors (µPMU data)
Details (click to expand)
Micro-phasor measurement units (µPMUs) provide synchronized voltage and current measurements with higher accuracy, precision, and sampling rates, making them ideal for distribution network monitoring. For example, µPMUs have an angle accuracy allowance of 0.01 degrees and a total vector error allowance of 0.05%, in contrast to 1 degree and 1% total vector error for classic PMUs. With sampling rates of 10-120 samples per second, µPMUs can capture dynamic and transient states within the low-voltage distribution network, allowing for improved event and fault detection and localization. Today most µPMU datasets can be accessed through manual field deployments in testbeds, collaborative research studies, or publicly available datasets.
For effective fault localization using µPMU data, it is crucial that distribution circuit models supplied by utility partners are accurate, though they often lack detailed phase identification and impedance values, leading to imprecise fault localization and time series contextualization. Noise and errors from transformers can degrade µPMU measurement accuracy. Integrating additional sensor data affects data quality and management due to high sampling rates and substantial data volumes, necessitating automated analysis strategies. Cross-verifying µPMU time series data, especially around fault occurrences, with other data sources is essential due to the continuous nature of µPMU data capture. Additionally, µPMU installations can be costly, limiting deployments to pilot projects over a small network. Furthermore, latencies in the monitoring system due to communication and computational delays challenge real-time data processing and analysis.
Data Gap Type
Data Gap Details
O2: Obtainability > Accessibility
For µPMU data to be used for fault localization, the distribution circuit model must be provided by the partnering utility. Typically the distribution circuit model lacks annotation of phase identification and impedance values, often providing rough approximations, which can ultimately reduce the accuracy of localization as well as the time series contextualization of a fault. Decreased localization accuracy can then affect downstream control mechanisms that ensure operational reliability.
U5: Usability > Pre-processing
µPMU data is sensitive to noise, especially from geomagnetic storms, which can induce electric currents in the atmosphere that impact measurement accuracy. Measurements can also be compromised by errors introduced by current and potential transformers.
Depending on whether additional data from other sensors or field reports is used to classify µPMU time series data, creating a joint sensor dataset can affect quality based on the overall sampling rate and format of the additional non-µPMU data.
U6: Usability > Large Volume
Due to the high sampling rates, the data volume from each individual µPMU can be challenging to manage and analyze because of its continuous nature. Coupled with the number of µPMUs required to monitor a portion of the distribution network, the amount of data can easily exceed terabytes. Automated indexing and mining of time series by transient characteristics can aid domain specialists' verification efforts.
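A back-of-the-envelope calculation shows how quickly the volume grows. All parameters below are assumptions chosen to illustrate the order of magnitude (the 120 samples/s figure is the upper end of the range above; channel count, sample size, and fleet size are hypothetical).

```python
# Order-of-magnitude data volume for a hypothetical µPMU deployment.

samples_per_sec = 120          # upper end of the 10-120 samples/s range
channels = 6                   # e.g., 3 voltage + 3 current phasors (assumed)
bytes_per_sample = 8           # magnitude + angle as packed floats (assumed)
units = 50                     # µPMUs monitoring a feeder section (assumed)

bytes_per_day = samples_per_sec * channels * bytes_per_sample * 86400 * units
terabytes_per_year = bytes_per_day * 365 / 1e12
```

Under these assumptions the deployment produces roughly 25 GB/day, approaching 9 TB of raw data per year, which is why automated indexing and mining become necessary.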
R1: Reliability > Quality
Since µPMU data is continuously captured, time series data leading up to, or even identifying, a fault or potential fault requires verification from other data sources.
Digital Fault Recorders (DFRs) capture high resolution event driven data such as disturbances due to faults, switching and transients. They are able to detect rapid events like lightning strikes and breaker trips while also recording the current and voltage magnitude with respect to time. Additionally, system dynamics over a longer period following a disturbance can also be captured. When used in conjunction with µPMU data, DFR data can assist in verifying significant transients found in the µPMU data which can facilitate improved analysis of both signals leading up to and after an event from the perspective of distribution-side state.
S2: Sufficiency > Coverage
Currently, µPMU installations on existing distribution grids carry significant financial costs, so most deployments have been pilot projects with utilities. Pilot studies include the Flexgrid testing facility at Lawrence Berkeley National Laboratory (LBNL), the Philadelphia Navy Yard microgrid (2016-2017), the micro-synchrophasors for distribution systems plus-up project (2016-2018), resilient electricity grids in the Philippines (2016), the GMLC 1.2.5 sensing and measurement strategy (2016), and the bi-national laboratory for the intelligent management of energy sustainability and technology education in Mexico City (2017-2018), based on North American Synchrophasor Initiative (NASPI) reports.
Coverage is also limited by acceptance of this technology, due to a pre-existing reliance on SCADA systems, which measure grid conditions on a 15-minute cadence. As transients become more common, especially on the low-voltage distribution grid, a transition to higher-resolution monitoring will become necessary.
S4: Sufficiency > Timeliness
µPMU data can suffer from multiple latencies within the grid monitoring system, which cannot keep up with the high sampling rate of the continuous measurements µPMUs generate. Latencies occur in the system communications as signals are recorded, processed, sent, and received, and can be due to the communication medium used, cable distance, amount of processing, and computational delay. More specifically, the named latencies are measurement, transmission, channel, receiver, and algorithm related.
NASA-USDA global soil moisture data
Details (click to expand)
The NASA-USDA Global soil moisture data offers detailed global soil moisture information at a 0.25°x0.25° resolution, including surface and subsurface moisture, moisture profiles, and anomalies. This dataset integrates satellite observations from SMAP and SMOS with a modified Palmer model using the Ensemble Kalman Filter to enhance soil moisture predictions, especially in areas with sparse precipitation data.
The major challenge involves managing the size of the data. While cloud platforms offer convenience, they come with costs. Additionally, handling large datasets requires specific techniques, such as distributed computing and occasionally large-memory compute nodes (for certain statistics).
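The assimilation step behind this product can be illustrated with a scalar Kalman update: blend the model forecast of soil moisture with a satellite retrieval, weighting each by its error variance. The numbers are purely illustrative; the actual product applies an Ensemble Kalman Filter over the modified Palmer model state rather than this single-variable form.

```python
# Scalar Kalman update: combine a model forecast and a satellite observation
# of soil moisture, weighted by (illustrative) error variances.

def kalman_update(forecast, var_f, obs, var_o):
    gain = var_f / (var_f + var_o)            # Kalman gain
    analysis = forecast + gain * (obs - forecast)
    var_a = (1.0 - gain) * var_f              # reduced analysis variance
    return analysis, var_a

forecast = 0.25   # m3/m3, model soil moisture (toy value)
obs = 0.31        # m3/m3, SMAP-like retrieval (toy value)
analysis, var_a = kalman_update(forecast, var_f=0.004, obs=obs, var_o=0.002)
```

The ensemble variant estimates `var_f` empirically from a set of perturbed model runs, which is what lets it work for high-dimensional model states.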
NIST campus photovoltaic (PV) arrays and weather station data sets
Details (click to expand)
National Institute of Standards and Technology (NIST) campus photovoltaic array data, collected from August 2014 to July 2017, measures electrical, temperature, meteorological, spectral, UV, and infrared quantities from PV sensors, along with solar inverter power data from multiple testbeds on the NIST campus. The testbeds include a parking lot canopy array, a ground-mount array, a roof-tilted array, a rooftop weather station, and a rooftop module test station. Measurements are sampled and saved at high frequency, with one-minute averages. The dataset includes metadata on latitude, longitude, and elevation.
Data coverage is limited to Gaithersburg, MD NIST campus and is no longer being maintained after July 2017.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Since the testbeds are located on the NIST campus, spatial coverage is limited to the institution's site. Producing similar datasets elsewhere, combining solar irradiance measurements with the associated solar power at the inverter output, would require investment in site-specific testbeds in other regions.
S4: Sufficiency > Timeliness
The dataset has not been maintained since July 2017; given the investment in equipment for the project, it may be worth revisiting to study the long-term change in panel solar efficiency with time and operational degradation.
NOAA's SOLRAD network
Details (click to expand)
The National Oceanic and Atmospheric Administration's SOLRAD Network monitors surface radiation in various regions of the United States as part of NOAA's SURface RADiation budget (SURFRAD) measurement network. The data includes measurements from different types of instruments and sensors, such as pyrheliometers, pyranometers, radiometers, and UV radiometers. These instruments collect data on incoming radiation, including both visible and UV components, with specific measurement resolutions and accuracy requirements to characterize the Earth's surface radiation budget. By taking minute-interval measurements of incoming solar radiation and accounting for reflection, absorption, and emission, the solar energy available for power generation can be accurately forecast for solar farms and large-scale solar grid planning projects.
While NOAA’s SOLRAD is an excellent data source for long-term solar irradiance and climate studies, data gaps exist for the short-term solar forecasting use case (which requires hourly averages). The quality of hourly averages is lower than that of native-resolution data, limiting effective short-term forecasting for real-time energy management for grid stability, demand response, real-time market price prediction, and dispatch. The coverage area is also constrained to certain parts of the United States based on the SURFRAD network locations.
Data Gap Type
Data Gap Details
S2: Sufficiency > Coverage
Coverage area is constrained to SOLRAD network locations in the United States, namely: Albuquerque, NM
Bismarck, ND
Hanford, CA
Madison, WI
Oak Ridge, TN
Salt Lake City, UT
Seattle, WA
Sterling, VA
Tallahassee, FL
For the dataset to generalize to other regions, regions with climates and temperate zones similar to these locations would have to be identified.
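As an illustration of working within this coverage constraint, one simple step is relating an arbitrary site to the existing network via a nearest-station lookup. The sketch below is our own illustrative code, not part of the dataset: the station coordinates are approximate values we supply for demonstration, and the function names are ours.

```python
import math

# Approximate SOLRAD station coordinates (illustrative values, not from NOAA).
STATIONS = {
    "Albuquerque, NM": (35.04, -106.62),
    "Bismarck, ND": (46.77, -100.77),
    "Hanford, CA": (36.31, -119.63),
    "Madison, WI": (43.07, -89.41),
    "Oak Ridge, TN": (35.96, -84.29),
    "Salt Lake City, UT": (40.77, -111.97),
    "Seattle, WA": (47.69, -122.25),
    "Sterling, VA": (38.97, -77.49),
    "Tallahassee, FL": (30.44, -84.30),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(lat, lon):
    """Return (station name, distance in km) of the closest SOLRAD site."""
    return min(
        ((name, haversine_km(lat, lon, slat, slon))
         for name, (slat, slon) in STATIONS.items()),
        key=lambda t: t[1],
    )
```

Distance alone is of course a crude proxy; a more faithful approach would also compare climate classifications, as noted above.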
S3: Sufficiency > Granularity
Data quality of the hourly averages is lower than that of the native-resolution data, which can impact effective short-term forecasting for real-time energy management for grid stability, demand response, real-time market price predictions, and dispatch. To mitigate this, the native-resolution data can be used directly for very short-term forecasts, or supplemented with additional sources such as sky imagers and other sensors with frequent measurement outputs.
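To make the granularity trade-off concrete, the sketch below (assuming pandas and a minute-resolution global horizontal irradiance series; the function and threshold names are ours, not part of SOLRAD) aggregates native-resolution data to hourly means while masking hours with too many missing samples:

```python
import numpy as np
import pandas as pd

def hourly_average(minute_ghi: pd.Series, min_coverage: float = 0.9) -> pd.DataFrame:
    """Aggregate minute-resolution GHI (W/m^2) to hourly means,
    masking hours where too many of the 60 samples are missing."""
    grouped = minute_ghi.resample("60min")
    out = pd.DataFrame({
        "ghi_mean": grouped.mean(),
        # Fraction of the 60 expected minute samples actually present.
        "coverage": grouped.count() / 60.0,
    })
    out.loc[out["coverage"] < min_coverage, "ghi_mean"] = np.nan
    return out
```

Carrying the coverage fraction alongside the mean lets downstream forecasting code decide for itself which hours are trustworthy, rather than silently averaging over gaps.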
NREL Solar Radiation Database (NSRDB)
Details (click to expand)
The National Renewable Energy Laboratory (NREL)’s Solar Radiation Database is part of the NREL solar radiation resource assessment project. It includes hourly and half-hourly data modeled using NREL’s Physical Solar Model (PSM), with measurements derived from the National Oceanic and Atmospheric Administration (NOAA)’s Geostationary Operational Environmental Satellite (GOES), the Interactive Multisensor Snow and Ice Mapping System (IMS), the Moderate Resolution Imaging Spectroradiometer (MODIS), and the Modern-Era Retrospective Analysis for Research and Applications v2 (MERRA-2). PSM derives cloud and aerosol properties and feeds these values into a radiative transfer model, the Fast All-sky Radiation Model for Solar applications (FARMS). The dataset can provide users with on-demand spectral irradiances based on time, location, and photovoltaic (PV) orientation.
While data coverage is global and based on satellite imagery fed into the Fast All-sky Radiation Model for Solar applications (FARMS), a radiative transfer model, the output is computed over specific time frames and would need to be recalculated and updated to cover recent years. Furthermore, the data are unbalanced: the United States is the region with the longest temporal coverage. Satellite-based estimation of solar resource information may be unreliable under cloud cover, snow, and bright surfaces, which would require additional verification from ground-based measurements and collation of outside data sources. Additionally, since the data are satellite-derived, preprocessing may be needed to account for parallax effects, which depend on the covering satellite’s field of view relative to the region of interest and may not be expressed in the higher-level FARMS tabular products.
Data Gap Type
Data Gap Details
U5: Usability > Pre-processing
Since the data are derived from satellite imagery, pre-processing may be required to account for pixel variability and parallax effects, along with additional radiative transfer modeling to improve solar radiation estimates.
R1: Reliability > Quality
Satellite-based estimates of solar resource information for sites affected by cloud cover, snow, or bright surfaces may be inaccurate, thereby requiring verification against ground-based measurements.
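As a minimal sketch of the kind of ground-truth verification described here, the snippet below (our own illustrative code, assuming NumPy; the function name and metric choices are ours) computes two standard validation metrics, mean bias error and RMSE, between satellite-derived and co-located ground-measured irradiance:

```python
import numpy as np

def validation_metrics(satellite, ground) -> dict:
    """Mean bias error (MBE) and RMSE of satellite-derived irradiance
    against co-located ground measurements, ignoring missing samples."""
    sat = np.asarray(satellite, dtype=float)
    gnd = np.asarray(ground, dtype=float)
    mask = ~(np.isnan(sat) | np.isnan(gnd))  # keep only paired valid samples
    diff = sat[mask] - gnd[mask]
    return {
        "mbe": diff.mean(),                    # positive => satellite overestimates
        "rmse": np.sqrt((diff ** 2).mean()),
        "n": int(mask.sum()),                  # number of paired samples used
    }
```

Stratifying these metrics by condition (cloudy, snow-covered, high-albedo sites) would reveal exactly the failure modes this gap describes.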
S4: Sufficiency > Timeliness
The data flow from satellite imagery to the solar radiation output of the Fast All-sky Radiation Model for Solar applications needs to be recalculated and updated to extend beyond the currently covered years for each represented global region. For information on the coverage areas and years covered, visit this link.
NREL Solar Radiation Research Laboratory (SRRL): Baseline Measurement System (BMS)
Details (click to expand)
SRRL BMS provides data from over 130 instruments at 60-second intervals for joint-variable studies of environmental factors specific to the Golden, CO site, which may be used in photovoltaic potential studies and renewable resource climatology studies. Available joint datasets include co-located data from sensors measuring temperature, pressure, precipitation, wind speed, wind direction, humidity, UV index, aerosol optical depth (AOD), albedo, and percent cloud cover (by category: opaque, thin, and clear).
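Joint-variable studies of this kind require aligning the co-located sensor streams on a common time base, and independently sampled instruments rarely share exact timestamps. A hedged sketch of one common approach (assuming pandas; the function name, column names, and tolerance are our own illustrative choices) is a nearest-timestamp join:

```python
import pandas as pd

def join_colocated(irradiance: pd.DataFrame, weather: pd.DataFrame,
                   tolerance: str = "30s") -> pd.DataFrame:
    """Align two co-located 60 s sensor streams on their timestamps,
    tolerating small clock offsets via a nearest-timestamp join."""
    return pd.merge_asof(
        irradiance.sort_index(), weather.sort_index(),
        left_index=True, right_index=True,
        direction="nearest", tolerance=pd.Timedelta(tolerance),
    )
```

Rows in one stream with no counterpart within the tolerance simply receive missing values, so gaps in any single instrument do not silently distort the joint record.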