The Lifeguards of the Data Lake | by Stephen Horgan

Published on July 25, 2023

In this blog we will look at Metadata and how it can be used to improve the reliability and value of a dataset.

What is Metadata?  ‘Data that provides information about other data’

Metadata summarises basic information about data, making it easier to find and work with particular instances of data.

Trust
Can you trust the quality of your data? A key factor for any dataset is the reliability of the data collected and its accuracy in measuring and predicting trends. Both can be improved by following a set of core observations throughout data collection, mining, formatting, and storage.

Metadata is useful in creating uniform data that can easily be categorised and sorted, giving you a stable base on which to build your dataset. Below is a breakdown of the key questions Metadata should answer when structuring a dataset with a referencing mindset.

What…

What is the Dataset about?

Who…

Who created it?

Why…

Why does the data exist?

How…

How should the data be used?

When…

What is the timeline of the data?

Where…

Where is the data from, and what area does it cover?
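
The six questions above could be captured as a simple record stored alongside the dataset. The sketch below is only an illustration: the class name DatasetMetadata, the helper missing_fields, and the example values are all made up for this post and are not part of any standard.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetMetadata:
    """One field per question; every name here is hypothetical."""
    what: str = ""   # What is the dataset about?
    who: str = ""    # Who created it?
    why: str = ""    # Why does the data exist?
    how: str = ""    # How should the data be used?
    when: str = ""   # What is the timeline of the data?
    where: str = ""  # Where does the data cover?


def missing_fields(meta):
    """Return the names of the questions that have not been answered yet."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name).strip()]


# Made-up example values, purely for illustration.
example = DatasetMetadata(
    what="Hourly sensor readings",
    who="Maintenance team",
    when="2023-01 to 2023-06",
)

print(missing_fields(example))
```

A check like this can run before a dataset is shared, so that gaps in the Metadata are spotted early rather than by the person trying to use the data.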

 

Some machine learning algorithms can use Metadata to improve their processes, learning about datasets in much the same way you or I would. An interesting case study I researched is the development of semantic technology. Semantic technology was created to define and link data across a large web of raw data. Metadata categories like those above are used to build data profiles, and those profiles connect and map the data into knowledge graphs that hold vast amounts of information. The aim is to turn vast amounts of descriptive data into a language and format that machines can process and understand, and in doing so create meaningful and logical relationships. See the W3C website for details: https://www.w3.org/2001/sw/Activity
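
To make the idea of linked data a little more concrete, here is a toy sketch in plain Python. Semantic Web data is commonly described as statements made of a subject, a predicate, and an object. The dataset names, the predicates, and the helper related are invented for this example, and a real project would use a proper graph library rather than a list of tuples.

```python
# Each statement links a subject to an object through a predicate.
# All names below are made up for illustration.
triples = [
    ("sensor_readings", "createdBy", "maintenance_team"),
    ("sensor_readings", "covers", "plant_a"),
    ("fault_log", "createdBy", "maintenance_team"),
    ("fault_log", "covers", "plant_b"),
]


def related(statements, predicate, obj):
    """Return every subject linked to obj by the given predicate."""
    return [s for s, p, o in statements if p == predicate and o == obj]


# Which datasets were created by the maintenance team?
print(related(triples, "createdBy", "maintenance_team"))
```

Even at this tiny scale, the Metadata answers (who created it, where it covers) are what allow two separate datasets to be connected.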

Reliable
Data engineers and scientists monitor and report findings during the processing and profiling of datasets from the pipeline, producing alerts and notifications on any anomalies. This improves the flow of data and also works as an early warning system for rogue or faulty data streams.
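
A very basic version of such an early warning check might look like the sketch below. It assumes that each reading should be present and should fall inside an expected range. The function name find_anomalies, the limits, and the sample readings are assumptions made for this example, and real monitoring tools use far richer rules.

```python
def find_anomalies(readings, low, high):
    """Return (position, reason) pairs for missing or out-of-range readings."""
    alerts = []
    for index, value in enumerate(readings):
        if value is None:
            alerts.append((index, "missing value"))
        elif not (low <= value <= high):
            alerts.append((index, f"value {value} outside {low} to {high}"))
    return alerts


# Made-up readings: one is missing and one is far too high.
sample = [20.5, None, 21.0, 95.2]

for position, reason in find_anomalies(sample, low=0, high=50):
    print(f"Alert at position {position}: {reason}")
```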

Value
The value of this system is that it improves data flow and management, and makes it possible to create Machine Learning programs that automate these processes in the future with the same accuracy and attention. The result is a more reliable and trustworthy dataset that stakeholders can use to identify trends and improvements.

Together this creates a central network of data observation. With this network, data can be observed and monitored closely, and issues can be resolved quickly using early identifiers, predictive engines, and stopgaps.

Our roles in this process will vary, but our shared reliance on accurate data means that, no matter where we are in the observation cycle, we have others around us who can assist and observe the constant flow of data.

What we learn from this as apprentices and data collectors is to be mindful of the data we collect, be vigilant in how we record and process it, and always strive to improve our methods of monitoring, collecting, and presenting data.

Image retrieved from 4 Pillars of Modern Data Quality (datasciencecentral.com)


Stephen Horgan is a Data Fellowship Apprentice at Jaguar Land Rover and is writing for the Apprentice Lens as part of the Blogging Team. Stephen is based in Halewood, Liverpool. Here is a little more about him:

“Hi, I'm Stephen. I'm a time-served Maintenance Engineer. I currently work for JLR where I'm trying to progress into an engineering role. I hope to gain experience in working with data and develop the skills needed to progress into a role where using data is an everyday task. I'm inspired by technology and the innovations that are occurring on a daily basis in all areas of life.”