Data Engineering: Scheduling Data

Clifford Frempong
3 min read · Jul 9, 2021


We have seen how data engineers process data. Today, let’s look at how data is scheduled.

Scheduling is the glue of a Data Engineering process.

I say this because it holds the small pieces of the process together and coordinates how they work with one another.

How does it do that?

It does that by running tasks in a specific order and resolving all their dependencies.
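To make that concrete, here's a minimal sketch of dependency resolution using Python's standard-library graphlib; the task names are made up purely for illustration:

```python
# A minimal sketch of how a scheduler can resolve dependencies: each task
# declares what it depends on, and the tasks run in an order that respects
# those dependencies. The task names here are made up for illustration.
from graphlib import TopologicalSorter

dependencies = {
    "clean_data": {"extract_data"},     # clean only after extracting
    "load_warehouse": {"clean_data"},   # load only after cleaning
    "send_report": {"load_warehouse"},  # report only after loading
}

for task in TopologicalSorter(dependencies).static_order():
    print(f"running {task}")
# running extract_data
# running clean_data
# running load_warehouse
# running send_report
```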

We could run tasks manually. For example, if an employee is moving to an office in another location of the same company, we could update the employee database by hand. But depending on a human has downsides, so ideally we want our pipelines to be as automated as possible.

Automation means setting a task to run at a specific time or when a specific condition is met.

For example, we can update our employee database at 6am every morning. That way, when a new employee was added the previous day, the change will show up the next morning.
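As a sketch of what that daily 6am job could look like, here it is written as an Apache Airflow DAG; Airflow is my choice of example tool here, and update_employee_db is a hypothetical placeholder for the real update logic:

```python
# A minimal sketch of a time-based schedule using Apache Airflow.
# update_employee_db is a hypothetical stand-in for the real update logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def update_employee_db():
    print("Refreshing the employee table...")

with DAG(
    dag_id="employee_db_refresh",
    start_date=datetime(2021, 7, 1),
    schedule_interval="0 6 * * *",  # cron expression for "every day at 6am"
    catchup=False,
) as dag:
    PythonOperator(
        task_id="update_employee_db",
        python_callable=update_employee_db,
    )
```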

We can also set a task to run only when a specific condition is met. This is known as sensor scheduling. For example, we can update the employee database only when a new employee record arrives; otherwise, there wouldn't be any reason to run the update.
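Here's the idea as a minimal sketch in plain Python; new_employee_pending and update_employee_db are hypothetical helpers standing in for the real checks:

```python
# A minimal sketch of sensor scheduling: keep polling for a condition and
# run the task only when it is met. Both helpers below are hypothetical.
import random
import time

def new_employee_pending() -> bool:
    # Stand-in check; in practice this might query a staging table or queue.
    return random.random() < 0.1

def update_employee_db():
    print("Applying pending employee changes...")

while True:
    if new_employee_pending():
        update_employee_db()
    time.sleep(30)  # the sensor never stops listening, which costs resources
```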

This sounds like the best option, right?

But this approach requires more resources, because you need sensors that are always listening for changes.

Manual and automated systems can also work together. For example, if a Netflix user manually upgrades their subscription tier, automated systems need to propagate that change to the rest of the system so that new features are unlocked and their account information is updated.

Let’s talk about how data is ingested.

Data can be ingested in batches, meaning records are sent in groups at specific intervals. Batch processing is often cheap because you can schedule it for times when resources aren't needed elsewhere. For example, updates to the employee table can be batched every morning at 6am.
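As a minimal sketch, batching might look like this; load_batch and the record counts are made up for illustration:

```python
# A minimal sketch of batch ingestion: records are grouped and sent at
# intervals instead of one at a time. load_batch is a hypothetical loader.
from itertools import islice

def batches(records, size):
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk

def load_batch(chunk):
    print(f"Loading {len(chunk)} records in one go...")

new_employee_records = range(250)  # pretend these piled up overnight
for chunk in batches(new_employee_records, size=100):
    load_batch(chunk)  # 100 + 100 + 50 records, three loads total
```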

Data can also be streamed, meaning individual records are sent through the pipeline as soon as they are updated. For example, when a user signs up for Netflix, they want to use the app right away, so we need to write their profile to the database immediately. Imagine having to wait a day before you could use a service you just signed up for.

Annoying, right?
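Here's a minimal sketch of streaming ingestion; event_stream and write_profile are hypothetical stand-ins for a real message queue and database client:

```python
# A minimal sketch of stream ingestion: each record is written the moment
# it arrives. Both functions below are hypothetical stand-ins.
import time

def event_stream():
    # Pretend sign-ups trickle in one at a time.
    for name in ["ama", "kofi", "abena"]:
        time.sleep(1)  # a new user signs up
        yield {"user": name, "plan": "basic"}

def write_profile(record):
    print(f"Writing {record['user']}'s profile immediately")

for record in event_stream():
    write_profile(record)  # no waiting for a batch window
```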

If you listen to a song on Spotify online, the parts of the song are streamed to you one after the other. But the moment you decide to save the song and listen offline, Spotify has to batch all the parts of the song together so you can download it.

There’s also a third option called real time, used in fraud detection for example, but for the sake of simplicity we’ll treat it as the same as streaming, since streaming is almost always real time.

Simple, innit? Now you know the various ways we can schedule jobs and which one to use in your pipeline.

I hope you enjoyed this read.

Thanks for your time.

Connect with me via LinkedIn and Twitter
