6.5 Data catalogs

Choosing an approach to structure the inventory of your data assets

Why should I do this?

To provide a central, accessible repository of metadata about your data assets, making it easy for users to discover, understand, and effectively use the available data resources.

 

How do data catalogs relate to FAIR?

 

  • Findability: Publishing a granular metadata schema corresponding to your dataset can enhance its findability on dataset search engines.
  • Interoperability: Cataloging can increase semantic interoperability by building a coherent administrative structure around your data. That structure can work alongside the CDM, easing collaboration between team members and synergy with other projects.
  • Reusability: By enhancing findability, cataloging also innately enhances reusability, making it easier for others to use your dataset in their work.

Download this data catalogs factsheet for more insights.

 

What is a data catalog?

A data catalog is a structured inventory of data assets within an organization or for a particular project.

 

1) If you are a Program Officer (PO), you may want to share this page directly with your grantee, so they can act on it.

2) If you are a grantee, ensure you have technical team members involved in this process. While the content is accessible to both technical and non-technical members, technical expertise will be required to make decisions for the investment in this step.

3) If you have not already downloaded ‘Project SIS’ or ‘Waterways’, the illustrative scenarios provide examples on how each theme is navigated. These scenarios are frequently referred to across the content in Step 6 to help you understand how different aspects within a theme are applied.

 

Things to consider for your investment:

©Gates Archive/Mansi Midha
  • Refer to the illustrative scenario that you have downloaded to see how this has been considered.
  • Ensure any work notes or decisions taken are being documented as this would be useful to refer to at later stages or for someone new joining the team.

Only the specific theme-related content has been highlighted here. To get a feel for the scenario, read here.

 

1. Data onboarding

As per the CDM, the input datasets will be .csv files. To ensure efficient findability and accessibility, each dataset will be stored in its own folder alongside a JSON metadata file. We have designed a descriptive metadata schema for this purpose; it makes all further steps of the project easier by supporting inventory and discovery of datasets within the centralized repository. This schema will be carried across the CDM for all datasets.
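The folder-plus-JSON arrangement described above can be sketched as follows. This is a minimal illustration, not the project's actual schema: the field names (`title`, `state`, `columns`, `created`, `modified`) are assumptions chosen to match the details mentioned elsewhere in this section.

```python
import json
from pathlib import Path
from datetime import date

def write_metadata(folder: Path, title: str, columns: list) -> Path:
    """Write a descriptive metadata JSON into the dataset's folder,
    next to its CSV file. Field names here are illustrative only."""
    record = {
        "title": title,
        "state": "raw",          # updated as the dataset moves through the CDM
        "format": "csv",
        "columns": columns,
        "created": date.today().isoformat(),
        "modified": date.today().isoformat(),
    }
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "metadata.json"
    path.write_text(json.dumps(record, indent=2))
    return path

# One folder per input dataset: the CSV and its metadata sit side by side,
# so a simple directory walk doubles as a catalog inventory.
write_metadata(Path("data/rainfall"), "Monthly rainfall", ["station_id", "month", "mm"])
```

Keeping the metadata as a plain file beside the data (rather than in a separate database) means the catalog travels with the repository and stays discoverable even without special tooling.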

 

 

2. Data processing

Once the datasets are updated, their metadata must be updated too: the ‘state’ field will change to ‘clean’ or ‘processed’, the column names will differ, and the modification date will need refreshing.
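A sketch of that update step, under the same assumed field names as the onboarding example (the real schema may differ):

```python
import json
from pathlib import Path
from datetime import date

def mark_processed(metadata_path: Path, new_columns: list) -> dict:
    """After cleaning a dataset, record its new state, the renamed
    columns, and a fresh modification date in its metadata file."""
    record = json.loads(metadata_path.read_text())
    record["state"] = "processed"
    record["columns"] = new_columns
    record["modified"] = date.today().isoformat()
    metadata_path.write_text(json.dumps(record, indent=2))
    return record

# Create a raw metadata file, then update it after processing
path = Path("metadata.json")
path.write_text(json.dumps({"state": "raw",
                            "columns": ["A", "B"],
                            "modified": "2024-01-01"}))
updated = mark_processed(path, ["area_ha", "yield_kg"])
```

Updating the metadata in the same step that writes the processed dataset keeps the catalog from silently drifting out of sync with the data.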

 

3. Data enrichment

The metadata for the final dataset will need to be updated.

 

4. Data products

The metadata will be audited to ensure accuracy and usability for potential users of SIS. This also involves checking that the metadata is easy to discover and read on the platform hosting SIS. Prior to release, we may bring together a focus group of researchers to determine whether the final metadata schema needs modification.

Only the specific theme-related content has been highlighted here. To get a feel for the scenario, read here.

 

1. Data onboarding

We will not build a specific metadata schema for this project, as there are only four data sources, none of which will be published as part of the final outputs. Nonetheless, in our repository we will keep a diary and a README for ease of use and to help onboard new researchers.

 

2. Data enrichment

Cataloging interviews and attaching data to them is extremely important for providing context. Simply recording details like the date of the interview, the location of the farm, and how long the farmer has been working in Waterways builds a bigger picture for the interview content to sit in. This enriches the interview data and makes it easier to use in the topographical map and as a starting point for empirical analysis. Cataloging will also enable further analysis of the interviews, perhaps with textual analysis techniques.
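The contextual details listed above could be captured in a simple tabular catalog. This is a hypothetical sketch: the column names and the `interviews.csv` filename are illustrative assumptions, not part of the Waterways scenario materials.

```python
import csv
from pathlib import Path

# Illustrative columns matching the context mentioned above:
# interview date, farm location, and the farmer's time in Waterways.
FIELDS = ["interview_id", "date", "farm_location",
          "years_in_waterways", "recording_file"]

def add_interview(catalog: Path, row: dict) -> None:
    """Append one interview's contextual record to a CSV catalog,
    writing the header row if the catalog does not exist yet."""
    new_file = not catalog.exists()
    with catalog.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

add_interview(Path("interviews.csv"), {
    "interview_id": "INT-001",
    "date": "2024-03-14",
    "farm_location": "north bank",
    "years_in_waterways": 12,
    "recording_file": "audio/INT-001.mp3",
})
```

Linking each recording file to a catalog row like this is what lets later steps, such as the topographical map or textual analysis, locate interviews by date, place, or farmer tenure.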

The theme of data catalogs can be important at different stages of your project, whether or not you anticipate it. To help you incorporate catalogs into your project planning, this section suggests where to think about the theme, structured using the stages from the Data Value Chain (DVC).

 

The DVC is a way of viewing the process of running a project from the point of view of the data, identifying how it is onboarded, processed, enriched, analyzed, and released in a product. In doing so, the DVC exposes the moving parts of a project's implementation, making it a useful framework for the general steps of any project working with data.

 

 

FAIR data principles ensure that data is shared in a manner that enhances reuse by both humans and machines, fostering further research and innovation.
