Data Engineering: Parallel Computing
We have seen how data engineers schedule data.
Today, let's look at parallel computing.
Parallel computing is a term you will hear often in Data Engineering.
It forms the basis of almost all modern data processing tools. It matters mainly because of memory constraints, but also because of the extra processing power it provides.
When big data processing tools perform a task, they split it into several subtasks and distribute those subtasks across several computers.
Imagine you own a tailor shop and need to make 1000 clothes. Your senior assistant can make 100 clothes in 15 minutes, while your junior assistants take 30 minutes for the same number. If only one assistant can work at a time, you would pick the quickest one for the job, right? Cool.
But what if we split the batch into four smaller batches of 250 clothes each? Having 4 junior assistants work on them in parallel will be faster: at 100 clothes every 30 minutes, each assistant finishes their 250 clothes in 75 minutes, so all four are done in 1 hour 15 minutes. Your senior assistant alone, at 100 clothes every 15 minutes, would need 2 hours 30 minutes for all 1000.
This is the same thing that happens with big data processing tasks.
In this case, the task is making the 1000 clothes, and the assistants are the processing units: the computers the subtasks are distributed to. Makes sense, right?
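To make that concrete, here is a minimal sketch using Python's multiprocessing module. The sew_batch function and the batch size of 250 are just stand-ins for the tailor example; real big data tools handle the splitting and distributing for you, across machines rather than local processes.

```python
from multiprocessing import Pool

def sew_batch(batch):
    # Stand-in for the real work: each worker processes only its own batch.
    return [f"garment_{i}" for i in batch]

if __name__ == "__main__":
    orders = list(range(1000))                                   # the 1000 clothes
    batches = [orders[i:i + 250] for i in range(0, 1000, 250)]   # 4 batches of 250

    # 4 worker processes play the role of the 4 junior assistants.
    with Pool(processes=4) as pool:
        results = pool.map(sew_batch, batches)

    # Merge the per-batch outputs back into one final output.
    finished = [garment for batch in results for garment in batch]
    print(len(finished))  # 1000
```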
One advantage of parallel computing is the extra processing power we get.
Also, since we do not have to load all the data into one computer’s memory, the memory footprint on each computer is reduced.
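As a rough sketch of that idea, imagine the data already lives in four partition files and each worker only ever reads its own partition, so no single process holds the whole dataset in memory. The file names and column layout below are hypothetical.

```python
from multiprocessing import Pool

# Hypothetical partition files; in a real cluster each node would read only
# the partition stored on it, so no machine loads the full dataset.
PARTITIONS = ["sales_part_0.csv", "sales_part_1.csv",
              "sales_part_2.csv", "sales_part_3.csv"]

def partial_sum(path):
    # Each worker streams one partition and keeps only a running total,
    # so its memory footprint stays small.
    total = 0.0
    with open(path) as f:
        next(f)                                  # skip the header row
        for line in f:
            total += float(line.split(",")[1])   # assumes the amount is in column 2
    return total

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        partial_totals = pool.map(partial_sum, PARTITIONS)
    print(sum(partial_totals))                   # merge the partial results
```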
Parallel computing sounds like the best solution to every task, right? Nope!
Moving data around incurs a cost. What's more, splitting the task into subtasks and merging the various outputs into one final result takes additional time.
So if the gains from splitting the task are minimal, parallel computing may not be worth it.
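You can see this overhead for yourself with a quick, informal experiment: the task below is so small that starting worker processes and shipping the data to them usually costs more than the parallel work saves. Exact timings will vary by machine; the point is only that parallel is not automatically faster.

```python
import time
from multiprocessing import Pool

def tiny_task(x):
    # A task so cheap that the work is dwarfed by the cost of moving
    # data to the worker processes and collecting the results.
    return x * 2

if __name__ == "__main__":
    data = list(range(100_000))

    start = time.perf_counter()
    serial = [tiny_task(x) for x in data]
    print(f"serial:   {time.perf_counter() - start:.3f}s")

    start = time.perf_counter()
    with Pool(processes=4) as pool:
        parallel = pool.map(tiny_task, data)
    print(f"parallel: {time.perf_counter() - start:.3f}s")  # often no faster, sometimes slower
```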
I hope this article served its purpose as a comprehensible introduction to parallel computing. The topic does get more advanced and more complicated, but the aim of this article is to build the foundational knowledge you need to understand it.
That's it for today, folks.