Data Gaps (Beta) - More Info
About
Artificial intelligence (AI) and machine learning (ML) offer a powerful suite of tools to accelerate climate change mitigation and adaptation across different sectors. However, the lack of high-quality, easily accessible, and standardized data often hinders the impactful use of AI/ML for climate change applications.
In this project, Climate Change AI, with the support of Google DeepMind, aims to identify and catalog critical data gaps that impede AI/ML applications in addressing climate change, and lay out pathways for filling these gaps. In particular, we identify candidate improvements to existing datasets, as well as "wishes" for new datasets whose creation would enable specific ML-for-climate use cases. We hope that researchers, practitioners, data providers, funders, policymakers, and others will join the effort to address these critical data gaps.
Our list of critical data gaps is available at the following link: www.climatechange.ai/dev/datagaps. This page provides more details on the methodology through which this list was compiled, as well as on our taxonomy of data gaps.
This project is currently in its beta phase, with ongoing improvements to content and usability. We encourage you to provide input and contributions via the routes listed below, or by emailing us at datagaps@climatechange.ai. We are grateful to the many stakeholders and interviewees who have already provided input.
- Contribute a new data gap by filling out this form.
- Provide updates to an existing data gap by clicking the "Give feedback" button within the Details view for that data gap on the Data Gaps main page.
- Provide general feedback (e.g., on content, usability, or actionability) by filling out this form.
Methodology
Climate Change AI's list of critical data gaps was compiled via a combination of desk research and stakeholder interviews. Please check back soon for more details on our methodology, as well as a list of stakeholders interviewed.
Taxonomy of Data Gaps
Data gaps are classified within six categories: Wish, Obtainability, Usability, Reliability, Sufficiency, and Miscellaneous/Other.
‣Type W: Wish - Dataset does not exist.
‣Type O: Obtainability - Dataset is not easily obtainable.
- O1: Findability - Dataset is not easy to find for humans and/or computers.
- Dataset is not Findable according to FAIR Principles (“metadata and data should be easy to find for both humans and computers”).
- O2: Accessibility - Dataset is difficult to access.
- Dataset is not Accessible according to FAIR Principles (“once the user finds the required data, she/he/they need to know how they can be accessed, possibly including authentication and authorisation”).
- Dataset is not freely available to the public.
- Obtaining access to the dataset requires lengthy bureaucratic approval or may otherwise be difficult/infeasible in practice.
‣Type U: Usability - Data is not readily usable.
- U1: Structure - Dataset is not machine-readable, well-formatted, and/or interoperable.
- Dataset is not in a machine-readable format.
- Dataset is not in a uniform, consistent, and standardized format. (See also FAIR Principles R1.3.)
- Dataset is not Interoperable according to FAIR Principles (“data [are amenable to being] integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing”).
- U2: Aggregation - Data is scattered and requires consolidation.
- Data is scattered and requires consolidation into a centralized dataset.
- U3: Usage Rights - Data usage rights are unclear or restrictive.
- Data usage rights are unclear (see also FAIR Principles R1.1).
- Dataset is released under a restrictive usage license.
- U4: Documentation - Documentation on data usage requires improvement.
- Documentation or other types of metadata required to help users understand how to use the data are incomplete, lacking in detail, or unclear (see also FAIR Principles R1 and Datasheets for Datasets).
- U5: Pre-processing - Data needs to be processed or cleaned prior to analysis.
- Data contains excessive missing values, noise, or duplicates that need to be cleaned.
- Data needs to be annotated.
- U6: Large Volume - Usability is impeded by large data volume.
- Dataset requires significant computational resources to process, presenting challenges for users who lack sufficient computing power.
- Dataset is too large to easily be downloaded and transferred to computing infrastructure that is not already co-located with the dataset.
- Data provider is unable to store or host the dataset effectively due to insufficient storage space.
- Dataset is not partitioned or searchable, requiring bulk download with significant computational resources.
‣Type R: Reliability - Data needs to be improved, validated, and/or verified.
- R1: Quality - Data quality needs to be improved, validated, and/or verified.
- Data may contain significant errors or inaccuracies.
- Data needs to be “ground truthed” or otherwise validated/verified.
- R2: Provenance - Data integrity needs to be validated/verified due to provenance.
- Data provenance is not properly documented (see also FAIR Principles R1.2 and Datasheets for Datasets).
- Integrity of data needs to be validated/verified by a trustworthy source due to provenance-related issues (e.g., data is self-reported or comes from an unverified source).
‣Type S: Sufficiency - Data is insufficient and needs to be collected or simulated.
- S1: Insufficient Volume - Data volume is insufficient for intended tasks.
- Amount of data is insufficient for intended machine learning tasks.
- S2: Coverage - Data coverage is limited (e.g., geographically, temporally, or demographically).
- Data is only available for certain regions, time periods, demographic groups, etc., thereby limiting its usefulness.
- S3: Granularity - Data is lacking in granularity/resolution.
- More granular data is needed, e.g., with respect to spatial or temporal resolution.
- S4: Timeliness - Data is not released promptly or is otherwise out of date.
- Data is released with significant delay after collection, or has not been updated recently enough to reflect current conditions.
- S5: Proxy - Data needs to be inferred or simulated.
- Ground truth data is difficult or impossible to collect, and instead needs to be inferred or simulated.
- S6: Missing Components - Dataset is missing important variables or types of information.
- Dataset is missing additional variables or types of information that are important for downstream analysis (e.g., because this information has not yet been collected).
‣Type M: Miscellaneous/Other - Challenges or gaps that do not fit into the other categories, including challenges that arise from the use of multiple datasets.
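For those who want to work with the taxonomy programmatically (e.g., to tag dataset records by gap type), the category codes above can be captured in a simple lookup table. The following is an illustrative sketch only; the `DATA_GAP_TYPES` mapping and `describe_gaps` helper are hypothetical names, not part of any official Climate Change AI API.

```python
# Hypothetical sketch: the data gap taxonomy as a Python mapping.
# Codes and short descriptions follow the taxonomy listed above.
DATA_GAP_TYPES = {
    "W":  "Wish - Dataset does not exist",
    "O1": "Obtainability / Findability - Dataset is not easy to find",
    "O2": "Obtainability / Accessibility - Dataset is difficult to access",
    "U1": "Usability / Structure - Not machine-readable or interoperable",
    "U2": "Usability / Aggregation - Scattered data requires consolidation",
    "U3": "Usability / Usage Rights - Rights are unclear or restrictive",
    "U4": "Usability / Documentation - Documentation needs improvement",
    "U5": "Usability / Pre-processing - Data needs cleaning or annotation",
    "U6": "Usability / Large Volume - Usability impeded by data volume",
    "R1": "Reliability / Quality - Quality needs validation or verification",
    "R2": "Reliability / Provenance - Integrity needs verification",
    "S1": "Sufficiency / Insufficient Volume - Not enough data for ML tasks",
    "S2": "Sufficiency / Coverage - Limited geographic, temporal, or demographic coverage",
    "S3": "Sufficiency / Granularity - Insufficient spatial or temporal resolution",
    "S4": "Sufficiency / Timeliness - Not released promptly or out of date",
    "S5": "Sufficiency / Proxy - Data must be inferred or simulated",
    "S6": "Sufficiency / Missing Components - Missing important variables",
    "M":  "Miscellaneous/Other - Challenges not covered by other categories",
}

def describe_gaps(codes):
    """Return human-readable descriptions for a list of gap type codes."""
    return [DATA_GAP_TYPES[code] for code in codes]

# Example: a dataset that is hard to access and poorly documented.
print(describe_gaps(["O2", "U4"]))
```

Note that a single dataset can exhibit several gap types at once, which is why the example tags a record with a list of codes rather than a single category.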