Storing Data

Clifford Frempong
5 min read · Feb 15, 2021

A picture representing storage

Let’s continue to explore the world of Data Engineering. This time, I will focus on data storage.

In this article, you will learn about the different ways data can be structured and about the storage solutions that suit each of them: data lakes and data warehouses.

Let’s begin!

With all the hullabaloo over Big Data, you may be wondering what types of data we have. Well, the first thing you should know is that not all data is created the same. Data generated by social media apps is way different from data generated by a POS (point-of-sale) system. Some data is structured and some is unstructured. Some may even be semi-structured.

An “I don’t understand” meme

Take a deep breath. Let’s break them down.

We say data is structured when it has defined schemas and can be organized neatly into rows and columns.

Note: A schema is the skeleton structure that represents the logical view of the entire database.

This is the kind of data you typically see in Excel sheets, for example. It is easy to organize structured data. We store structured data in relational databases and we use Structured Query Language (SQL) to query it.
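To make the idea of a schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns and the sample rows are made up for illustration; they are not from any real system.

```python
import sqlite3

# An in-memory relational database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")

# The schema: every row must fit these named, typed columns.
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, "
    "product TEXT NOT NULL, "
    "amount REAL NOT NULL)"
)

# Structured data slots neatly into rows and columns.
conn.executemany(
    "INSERT INTO sales (product, amount) VALUES (?, ?)",
    [("coffee", 3.5), ("tea", 2.0), ("coffee", 4.0)],
)

# Query it with SQL: total amount per product.
totals = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)

conn.close()
```

Because the schema is fixed up front, a question like "how much did each product sell?" is a single query.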

Unstructured data is the opposite of structured data.

It is complex, mostly qualitative information that does not fit neatly into rows and columns. Examples are photos, videos, PDFs, and social media content. It is difficult to organize unstructured data. We usually store unstructured data in data lakes, although it can also appear in data warehouses or databases.

Here’s a fun fact: Most of the data around us is unstructured.

In a day, over 500 million tweets and 294 billion emails are sent.

If you think these numbers are fascinating, see what happens in an internet minute.

Unstructured data can be extremely valuable, but because it is difficult to organize, much of that value was hard to extract until the advent of Machine Learning and Artificial Intelligence.

Semi-structured data resembles structured data, but there’s more freedom.

It has some consistent characteristics but does not conform to a rigid structure. An example is an email. The sender and recipient fields are structured, while the content of the email is unstructured.
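The email example can be sketched with Python's standard `email` package. The addresses, subject and message text below are invented for illustration.

```python
from email import message_from_string

# A tiny raw email; all names and text here are made up.
raw_email = """\
From: ama@example.com
To: kofi@example.com
Subject: Lunch on Friday?

Hey Kofi, are you free for lunch on Friday? Let me know!
"""

msg = message_from_string(raw_email)

# Structured part: headers are named fields you can look up directly.
sender = msg["From"]
recipient = msg["To"]
subject = msg["Subject"]

# Unstructured part: the body is free text with no fixed shape.
body = msg.get_payload()

print(sender, recipient, subject)
print(body)
```

The headers behave like columns in a table, while the body can be anything the sender typed.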

A picture showing the difference between structured, semi-structured and unstructured data

Data Lake vs Data Warehouses

A picture showing Data lake vs Data warehouse

A Data Lake is where all the raw data extracted from the various sources is stored.

Think of a Data Lake as if it were a real lake. Just as a lake has multiple tributaries flowing into it, a Data Lake has structured, semi-structured and unstructured data flowing into it in real time. The data is unprocessed and messy.

While Data Lakes store all the raw data, Data Warehouses store specific data for a specific use.
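As a rough sketch of what "raw data flowing in" can look like, the snippet below lands three different kinds of files, untouched, in one place. A temporary local folder stands in for the lake (real data lakes usually sit on object storage), and the `land` helper, the source names and the file contents are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

# A temporary local folder standing in for the data lake (an assumption
# made so the example runs anywhere).
lake = Path(tempfile.mkdtemp(prefix="data_lake_"))


def land(source: str, filename: str, payload: bytes) -> Path:
    """Store raw bytes as-is under raw/<source>/<filename> (hypothetical helper)."""
    target = lake / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


# Structured: a CSV export from a point-of-sale system.
land("pos", "sales.csv", b"id,product,amount\n1,coffee,3.5\n")

# Semi-structured: a JSON document from a social media app.
post = {"user": "ama", "text": "Hello!"}
land("social", "post.json", json.dumps(post).encode("utf-8"))

# Unstructured: the first bytes of an image file.
land("uploads", "photo.jpg", bytes([0xFF, 0xD8, 0xFF]))

# List everything that has landed in the lake.
stored = sorted(
    p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file()
)
print(stored)
```

Nothing is validated or transformed on the way in, which is exactly why the lake accepts anything and also why it gets messy.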

Data warehouses are built using dimensional data models. These dimensional models consist of dimension tables and fact tables.

Dimension tables contain dimension keys, values and attributes which are used to describe dimensions.

For example, the product dimension could contain the name and description of each product you sell, its price per unit and other attributes as applicable.

A fact table, on the other hand, contains the measures of interest.

The sales amount, for example, could be a measure, and it would be stored in the fact table.
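Here is a minimal sketch of a dimension table and a fact table working together, again using `sqlite3`. The table names (`dim_product`, `fact_sales`), the columns and the rows are illustrative assumptions, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table: a key plus descriptive attributes of each product.
conn.execute(
    "CREATE TABLE dim_product ("
    "product_key INTEGER PRIMARY KEY, "
    "name TEXT, description TEXT, unit_price REAL)"
)

# Fact table: the measures of interest, linked to the dimension by its key.
conn.execute(
    "CREATE TABLE fact_sales ("
    "sale_id INTEGER PRIMARY KEY, "
    "product_key INTEGER REFERENCES dim_product(product_key), "
    "quantity INTEGER, sales_amount REAL)"
)

conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?, ?)",
    [(1, "Notebook", "A5 ruled notebook", 2.5),
     (2, "Pen", "Blue ballpoint pen", 1.0)],
)
conn.executemany(
    "INSERT INTO fact_sales (product_key, quantity, sales_amount) VALUES (?, ?, ?)",
    [(1, 2, 5.0), (2, 3, 3.0), (1, 4, 10.0)],
)

# A typical warehouse query: summarize a measure by a dimension attribute.
sales_by_product = conn.execute(
    "SELECT p.name, SUM(f.sales_amount) "
    "FROM fact_sales AS f "
    "JOIN dim_product AS p ON p.product_key = f.product_key "
    "GROUP BY p.name ORDER BY p.name"
).fetchall()
print(sales_by_product)

conn.close()
```

The fact table stays narrow (keys and numbers), and the descriptive text lives once in the dimension table.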

Because a Data Lake keeps everything while a Data Warehouse keeps only what a specific use needs, a Data Lake can contain petabytes of data (1 petabyte is 1 million GB), while Data Warehouses are usually pretty small. By small, I mean small on the scale of Big Data. They can still be bigger than your hard drive.

A Data Lake can store any kind of data, be it structured, semi-structured or unstructured. You’re free to put anything in there. Because Data Lakes hold all kinds of messy data and can be disorganized, they are difficult to analyze, but hey! They are definitely a cost-efficient data storage solution. Data Warehouses, on the other hand, usually store structured data and are optimized for analytics to drive business decisions.

Data Lakes are used by Data Scientists for real-time analytics on Big Data, while Data Warehouses are used by analysts for ad-hoc, read-only queries like aggregation and summarization.

Because any data structure can be stored in a Data Lake, it is important to keep a data catalog up to date.

Hmm, before you go online to check what a Data Catalog is, let me explain…

The Data Catalog is a source of truth that makes up for the lack of structure in the data lake.

Among other things, it keeps track of where the data comes from, how it is used, who owns the data and how often the data is updated.
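To give a feel for what a catalog records, here is a minimal sketch of a single entry. The `CatalogEntry` class, its field names, the `register` helper and the sample values are all hypothetical; real catalog tools track much more than this.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CatalogEntry:
    """One dataset's record in a (hypothetical) data catalog."""
    dataset: str                 # name people search for
    location: str                # where the data lives in the lake
    source: str                  # where the data comes from
    owner: str                   # who owns the data
    used_by: tuple               # how (and by whom) it is used
    update_frequency: timedelta  # how often it is updated


# The catalog itself: a simple lookup from dataset name to entry.
catalog = {}


def register(entry: CatalogEntry) -> None:
    """Add or replace an entry, keyed by dataset name."""
    catalog[entry.dataset] = entry


register(
    CatalogEntry(
        dataset="pos_sales_raw",
        location="raw/pos/",
        source="point-of-sale export",
        owner="sales-engineering",
        used_by=("weekly revenue report",),
        update_frequency=timedelta(days=1),
    )
)

# Anyone can now answer "who owns this?" without asking around.
print(catalog["pos_sales_raw"].owner)
```

Even a small record like this answers the questions that would otherwise need a colleague's memory.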

It is good practice in terms of data governance, and it ensures that processes are reproducible if something goes wrong or if someone wants to redo an analysis from the beginning, starting from the ingestion of data.

Because a Data Lake is so flexible about what it stores, a Data Catalog is necessary: it helps prevent the Data Lake from becoming a data swamp.

It is also good practice to have a Data Catalog that references any data that moves through an organization. That way, we don’t have to rely on tribal knowledge. This makes us autonomous and makes working with data more scalable. We can go from finding data to processing it without having to ask a person every time we have a question.

Before I end this article, let’s take a step back, shall we?

A meme representing “wait”

I mentioned databases earlier, so let me say something briefly about them.

Database is a very general term, and it can be loosely defined as organized data stored and accessed on a computer.

As I said, it’s a general term, but guess what? A data warehouse is a type of database.

I hope you enjoyed this read. Thanks for your time.

Connect with me via Twitter and LinkedIn.
