6.4 Data standards

Establishing conventions or rules that specify how data is structured, represented and exchanged within a particular context or domain

Why should I do this?

To ensure consistency and interoperability, enabling different systems and applications to communicate and share data effectively. Before your investment can comply with data standards, it first needs to identify which standards are relevant to your project.

 

How do data standards relate to FAIR?

 

  • Interoperability: If your dataset adheres to a recognized ontology, it can be linked easily with other datasets that follow the same ontology.
  • Reusability: Likewise, a dataset that adheres to an ontology is easy to integrate into other projects that use the same ontology, or a compatible one.

Download this data standards factsheet for more insights.

What are data standards?

Data standards are established conventions or rules that specify how data is structured, represented and exchanged within a particular context or domain. Data standards can cover various aspects of data, including its format, syntax, semantics and transmission protocols.
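As a concrete illustration of a format-level standard, the sketch below normalizes dates written in mixed local conventions to ISO 8601 (YYYY-MM-DD), a widely used date standard. The function name and the list of input formats are assumptions for this example, not part of any particular standard.

```python
from datetime import datetime

# Date formats we expect to encounter in incoming data (an assumption
# for this sketch; extend the list for your own sources).
INPUT_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %B %Y"]

def to_iso8601(raw: str) -> str:
    """Parse a date in any known local format and re-emit it as ISO 8601."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(to_iso8601("03/11/2024"))    # day/month/year input
print(to_iso8601("3 March 2024"))  # written-out month input
```

Agreeing on one output convention like this is exactly what lets two systems exchange dates without ambiguity about day/month ordering.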

 

1) If you are a Program Officer (PO), you may want to share this page directly with your grantee, so they can act on it.

2) If you are a grantee, ensure you have technical team members involved in this process. While the content is accessible to both technical and non-technical members, technical expertise will be required to make decisions for the investment in this step.

3) If you have not already done so, download the ‘Project SIS’ or ‘Waterways’ illustrative scenario: these scenarios provide examples of how each theme is navigated. They are frequently referred to across the content in Step 6 to help you understand how different aspects within a theme are applied.

 

Things to consider for your investment:

  • Refer to the illustrative scenario that you have downloaded to see how this has been considered.
  • Ensure any work notes or decisions taken are being documented, as this would be useful to refer to at later stages or for someone new joining the team.

Only the specific theme-related content has been highlighted here. To get a feel for the scenario, read here.

 

1. Data processing

SoilScience research has previously used the AgrO data standard, so it is likely that we will use it again for this project. The Plant Experimental Conditions Ontology (PECO), a sub-component of AgrO, will be particularly useful for standardizing the results of tests on soil samples. We are still checking whether AgrO has a suitable ontology for meteorological data, but this should not be an issue for this project, as meteorological data is likely to come from one source only (Visual Crossing).

 

Further processing will ensure numerical data is formatted to three decimal places, incomplete entries are cleaned out of the dataset, and anomalies are identified and double-checked. These standardization processes will mostly be done in an IPython notebook, as will the aggregation and averaging of meteorological data to a uniform standard, resulting in measures like ‘aggregate rainfall’, ‘average sun per day’ and ‘average wind speed’, all useful for understanding soil erosion in Dataland’s highlands.
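The cleaning and aggregation steps above can be sketched in notebook-style Python. The records, field names and the plausible-range threshold for anomalies are all invented for illustration; the scenario does not specify them.

```python
from statistics import mean

# Hypothetical daily weather records; None marks a missing reading.
records = [
    {"date": "2024-06-01", "rainfall_mm": 12.3456, "wind_kph": 14.2},
    {"date": "2024-06-02", "rainfall_mm": None,    "wind_kph": 11.0},
    {"date": "2024-06-03", "rainfall_mm": 8.1111,  "wind_kph": 90.0},
    {"date": "2024-06-04", "rainfall_mm": 10.5,    "wind_kph": 13.5},
]

# 1) Clean incomplete entries out of the dataset.
complete = [r for r in records if None not in r.values()]

# 2) Format numerical data to three decimal places.
for r in complete:
    r["rainfall_mm"] = round(r["rainfall_mm"], 3)
    r["wind_kph"] = round(r["wind_kph"], 3)

# 3) Identify anomalies for double-checking: wind readings outside an
#    assumed plausible range (0-60 kph) are flagged, not silently dropped.
anomalies = [r for r in complete if not 0 <= r["wind_kph"] <= 60]

# 4) Aggregate and average to uniform project-level measures.
aggregate_rainfall = round(sum(r["rainfall_mm"] for r in complete), 3)
average_wind_speed = round(mean(r["wind_kph"] for r in complete), 3)
```

Flagging anomalies for review, rather than deleting them, preserves the ability to confirm whether an extreme reading was a sensor error or a genuine event.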

 

2. Data products

Within the metadata schema, the data standards will need to be made clear for users (in the ‘ontology’ field).
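A minimal metadata record might declare the standards in the ‘ontology’ field like this. The other field names and values are assumptions for illustration, not a fixed schema.

```python
import json

# Sketch of a metadata record for one data product, with the data
# standards declared in the 'ontology' field as the scenario requires.
metadata = {
    "title": "Soil sample test results, Dataland highlands",
    "format": "CSV",
    "ontology": ["AgrO", "Plant Experimental Conditions Ontology (PECO)"],
}

print(json.dumps(metadata, indent=2))
```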

Only the specific theme-related content has been highlighted here. To get a feel for the scenario, read here.

 

1. Data processing

Satellite data can come as KMLs, shapefiles, or ordinary CSVs, whereas ground field data from WRO will likely come as a CSV. Data processing at this stage of the project will involve standardizing all files so that they can be easily integrated into a topographical map via Geographic Information System (GIS) software.
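One leg of this standardization can be sketched without GIS libraries: ground field data arriving as a CSV of coordinates is converted to GeoJSON, a point-based format that GIS software can load alongside the satellite layers. GeoJSON is used here instead of a shapefile only because shapefile writing needs dedicated tooling; the column names and values are invented for the example.

```python
import csv
import io
import json

# Hypothetical field data from WRO, as it might arrive in a CSV.
raw_csv = """site_id,lat,lon,erosion_index
W01,-1.2921,36.8219,0.42
W02,-1.3001,36.7900,0.57
"""

features = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    features.append({
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point",
                     "coordinates": [float(row["lon"]), float(row["lat"])]},
        "properties": {"site_id": row["site_id"],
                       "erosion_index": float(row["erosion_index"])},
    })

collection = {"type": "FeatureCollection", "features": features}
print(json.dumps(collection, indent=2))
```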

 

2. Data enrichment

The interview transcripts will be stored in the SoilScience repository in a folder that is open to the public. Links to individual transcripts will be created and attached to the topographical map.

 

This attachment will likely be made via a shapefile rather than a KML, though we will explore our options once we have the interviews. The interviews will not be time-differentiated, so the main strength of KMLs is not needed, and the transcripts will be displayed as ‘points’ in planar geometry, which shapefiles handle well. Each point is simply the latitude and longitude of the center of an interviewed farmer’s farm.
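The transcript link can simply ride along as an attribute of each point. The shapefile itself would be written with GIS tooling; this GeoJSON-style sketch only shows the shape of the attribute table, and the farmer ID, coordinates and URL are all made up.

```python
# One point feature per interviewed farmer: the geometry is the farm
# center, and the transcript link is an ordinary attribute (all values
# here are hypothetical).
farm_point = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [36.81, -1.29]},  # lon, lat
    "properties": {
        "farmer_id": "F-017",
        "transcript_url": "https://example.org/soilscience/transcripts/F-017.txt",
    },
}
```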

The theme of data standards can be important at different stages of your project, whether or not you expect that to be the case. To help you incorporate them into your project planning, this section provides suggestions about where you should think about the theme, structured using the stages from the Data Value Chain (DVC).

 

The DVC is a way of viewing a project from the point of view of its data, identifying how that data is onboarded, processed, enriched, analyzed and released in a product. In doing so, the DVC shows the moving parts of project implementation, making it a useful framework for the general steps of any project working with data.

 

 

FAIR data fundamentally aims to improve data access, increasing the data available for AI to learn from. This is critical, as more data-intensive AI innovation (such as generative AI, for example ChatGPT, or more generally large language models, LLMs) becomes commonplace in Agricultural Development.

Ameen Jauhar, Data Governance Lead, CABI
