What in the world is Data Engineering?

Clifford Frempong
4 min read · Feb 6, 2021


A man looking at a world of data

I recently started exploring Data Engineering and I decided to document my learning path.

By the end of this article, you will understand:

  • What in the world Data Engineering is
  • The difference between data engineers and data scientists
  • Data pipelines

Let’s get right into it!

A “let’s do this!” meme

There are four general steps in which data flows in an organization. They are:

  1. Data Collection and Storage: Here, data is collected from various sources, such as sensors or social media, and stored in its raw format.
  2. Data Preparation: Data is prepared for analysis. By this, I mean making sure the data is in the right format: for instance, cleaning it to remove null or missing values (see the short sketch after this list).
  3. Data Exploration and Visualization: Here, data is explored to derive insights, after which it is visualized.
  4. Experimentation and Prediction: Finally, experiments are performed on the data, and sometimes predictive models are built.
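To give you a feel for step 2, here is a tiny sketch of data cleaning in Python with pandas. The file name and column names are hypothetical; the point is simply what "cleaning" looks like in practice:

```python
import pandas as pd

# Load the raw data collected in step 1
# ("raw_tweets.csv" and its columns are made up for illustration)
df = pd.read_csv("raw_tweets.csv")

# Drop rows where the text or timestamp is null/missing
df = df.dropna(subset=["text", "created_at"])

# Make sure the timestamp column is in the right format
df["created_at"] = pd.to_datetime(df["created_at"])

print(df.head())
```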

Now the question is, “Where does a Data Engineer come in?”

Well, Data Engineers are responsible for the first step of the data workflow, i.e. data collection and storage. They deliver the correct data, in the right form, to the right people as efficiently as possible.

Data Engineers are responsible for:

  • Ingesting data from different sources (a rough sketch follows this list)
  • Optimizing databases for analysis
  • Removing corrupt data
  • Developing, constructing, testing and maintaining data architecture.
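To make the first responsibility concrete, here is a rough sketch of ingesting raw JSON from a source API into a local database. The URL, database and table names are placeholders, and real setups usually rely on scheduled jobs or dedicated ingestion frameworks, but the idea is the same:

```python
import json
import sqlite3
import urllib.request

# Pull raw JSON records from a source API
# (the URL is a placeholder, not a real endpoint)
with urllib.request.urlopen("https://api.example.com/events") as resp:
    events = json.load(resp)

# Store the raw records so they can be accessed later for analysis
conn = sqlite3.connect("raw_events.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [(json.dumps(event),) for event in events],
)
conn.commit()
conn.close()
```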

With the advent of Big Data, Data Engineers have become highly sought after.

Wait a minute! What in the world is Big Data?

Big data is a term used to refer to data, both structured and unstructured, that is so large it cannot be processed using traditional methods.

Big data is characterized by five Vs:

1. Volume, which refers to the quantity of data points

2. Variety, which refers to the various types of data such as video data or audio data

3. Velocity, which refers to how fast data is generated

4. Veracity, which refers to how trustworthy the data is

5. Value, which refers to our ability and need to turn data into value

Data Engineer vs. Data Scientist

A venn diagram of data scientists and data engineers

Earlier, I mentioned that data engineers are responsible for the first step of the data workflow. Their role is to ingest the data and store it so that it can be easily accessed by data scientists for analysis.

Data scientists, on the other hand, are responsible for the rest of the steps in the data workflow. Their role is to prepare the data according to their analysis needs, explore it, visualize it, and perform experiments or even build predictive models.

Data engineers lay the groundwork that makes any data science activity possible.

Differences between a Data Engineer and a Data Scientist

  • Data engineers ingest data from many sources and store it so that data scientists can make use of it.
  • Data engineers set up and optimize databases for analysis so that data scientists can access them to extract the data they contain.
  • Data engineers build pipelines so that data scientists can use the pipeline output.

Data Pipelines

A picture of a pipeline
Photo by Victor Garcia on Unsplash

Data, as The Economist famously put it, is the new oil.

Let’s quickly look at how oil is processed to help us understand pipelines.

Crude oil is extracted from an oil field. It is then sent to a distillation unit, where it is separated into several products that are then sent on to their users. Some pipes go straight to airports to deliver kerosene. Others go to gas storage facilities to deliver gasoline, which is stored in big tanks before being distributed to gas stations. There are many pipelines tying all of this together.

Data engineers move and process data following a procedure similar to oil processing.

Companies ingest data from many different sources, and that data needs to be processed and stored in various ways. To handle this, we need data pipelines.

Data pipelines ensure that data flows efficiently from one station to another in an organization. They automate the extraction, transformation, validation, combination and loading of data to reduce human intervention, errors and the time it takes for data to flow in an organization.

ETL is a term you’ll hear a lot. Designing data pipelines usually involves an ETL process, which breaks the flow of data into 3 separate steps:

  1. E — Extraction of data
  2. T — Transforming the extracted data
  3. L — Loading the transformed data

Note that here, the data is processed before it is stored. Data pipelines often follow ETL, but not always: data may be loaded directly into applications without transformation.
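To make the three steps concrete, here is a minimal ETL sketch in Python. Everything in it (the CSV file, the column names, the SQLite database) is made up for illustration, and production pipelines use dedicated tooling, but the shape is the same:

```python
import csv
import sqlite3

def extract(path):
    """E: read raw records from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """T: clean the extracted records by dropping missing values and casting types."""
    cleaned = []
    for row in rows:
        if not row.get("temperature"):
            continue  # skip rows with null/missing readings
        cleaned.append((row["sensor_id"], float(row["temperature"])))
    return cleaned

def load(records, db_path="warehouse.db"):
    """L: write the transformed records into a database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, temperature REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()
    conn.close()

# Run the three steps in order: extract, then transform, then load
load(transform(extract("sensor_readings.csv")))
```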

You made it to the end. Wow, nice!!!

I hope you enjoyed this read. Thanks for your time.

Connect with me via Twitter
