Data Engineering: Processing Data
We have seen what Data Engineering is and how to implement appropriate data storage solutions.
Let us look at moving and processing data.
So, what does it mean to process data?
Processing data consists of converting raw data into meaningful information.
Okay, got it! But why should I process data?
Well, you may not need all of the data you collect. When building a feature, for example, we tend to track many indicators to make sure it works well, but once the feature is built, we no longer need that data.
Also, processing and storage are not free, so you may want to optimize your memory, processing and network costs. Depending on the data, uncompressed files can be 10 times bigger than compressed files. Imagine processing that! It would cost a lot of money, and your business model may collapse.
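To get a feel for the size difference, here is a minimal sketch using Python's built-in gzip module. The log lines are made up for illustration, and the ratio you get depends heavily on the data: repetitive text like this compresses far better than data that is already compressed, such as most images or audio.

```python
import gzip

# Hypothetical log data: the same line repeated many times compresses very well.
raw = ("2024-01-01,user_42,page_view,/home\n" * 10_000).encode("utf-8")

# Compress the bytes in memory.
compressed = gzip.compress(raw)

# How many times bigger is the uncompressed data?
ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.1f}x")

# gzip is lossless: decompressing gives back exactly the original bytes.
restored = gzip.decompress(compressed)
print("round trip ok:", restored == raw)
```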
Some data may arrive in one format but be easier and better to use in another. For example, artistes may upload songs as .wav files, which are high-quality master files. Letting users stream these files directly would incur a lot of network cost. So, the data is processed by converting the master file to the .ogg format, which is lighter and has slightly lower sound quality. It is this file that is streamed to users.
Also, we may want to move and organize data so that data scientists and analysts can access it easily and derive insights from it. For example, product files contain metadata like the product name and product type. These files can be processed to extract that metadata and store it in a database, so that data scientists find it easy to analyse. The value data scientists add to the company comes from their analysis, so we want them to focus on just that.
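The product example above can be sketched roughly like this. Everything here is an assumption made for illustration: the product files are represented as JSON strings, the field names are invented, and an in-memory SQLite database stands in for whatever database the company actually uses.

```python
import json
import sqlite3

# Hypothetical raw product files, represented here as JSON strings.
raw_files = [
    '{"name": "Desk Lamp", "type": "lighting", "description": "Adjustable LED lamp"}',
    '{"name": "Floor Lamp", "type": "lighting", "description": "Tall standing lamp"}',
    '{"name": "Office Chair", "type": "furniture", "description": "Ergonomic chair"}',
]

def extract_metadata(raw):
    """Pull out only the fields the analysts need: name and type."""
    record = json.loads(raw)
    return record["name"], record["type"]

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO products (name, type) VALUES (?, ?)",
    [extract_metadata(raw) for raw in raw_files],
)
conn.commit()

# A data scientist can now answer questions with a simple query.
counts = conn.execute(
    "SELECT type, COUNT(*) FROM products GROUP BY type ORDER BY type"
).fetchall()
print(counts)
```

The data scientist never has to open or parse the raw files; the query is all they need.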
Data scientists need to say thank you to data engineers every day.
How Data Engineers Process Data
Data Engineers process data by performing data manipulation, cleaning and tidying tasks. These tasks can be automated so that whenever data flows into the organization, they are applied to it regardless of the analysis needs.
One example is deciding what happens with missing data. Should it be discarded? Should it be replaced with a value? Or should it be left blank?
Data Engineers also ensure that data is stored in a structured database and build views on top of the database tables so that the data is easy for data scientists to access.
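Here is a small sketch of those three choices in plain Python. The student records and the helper names drop_missing and fill_missing are hypothetical, and filling with the mean is just one possible replacement value, not a recommendation.

```python
# Hypothetical records; None marks a missing age.
records = [
    {"student": "Ama", "age": 21},
    {"student": "Kofi", "age": None},
    {"student": "Esi", "age": 23},
]

def drop_missing(rows, field):
    """Option 1: discard every row where the field is missing."""
    return [row for row in rows if row[field] is not None]

def fill_missing(rows, field, value):
    """Option 2: replace missing values with a chosen value."""
    return [
        {**row, field: value if row[field] is None else row[field]}
        for row in rows
    ]

# Use the mean of the known ages as the replacement value.
known_ages = [row["age"] for row in records if row["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)

dropped = drop_missing(records, "age")
filled = fill_missing(records, "age", mean_age)

# Option 3: leave it blank. Both helpers return new lists,
# so the original records still contain None.
print(len(dropped), filled[1]["age"], records[1]["age"])
```

Once a choice is made, a function like this can run automatically on every new batch of data.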
Oh no! What do you mean by views?
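Views look like the original table, but they are not really a copy. A view is a saved query that presents data from the original table each time it is read.
We have them so that users can work with the data without touching the original table directly, which reduces the risk of unintentionally altering it and causing problems. Whether a view can be written through depends on the database and the permissions granted.
For example, a user may want to query the database for student data. Instead of giving that user direct access to the original table, a view is created for them to query.
Makes sense, right?
There are a zillion data processing tools that Data Engineers use, but they are beyond the scope of this article.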
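Here is a minimal sketch of the student example using SQLite from Python. The table, columns and names are invented for illustration. Note that the read-only behaviour shown is how SQLite treats views; other databases may allow writes through simple views, so there the protection usually comes from permissions as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The original table, including a column most users should not need.
conn.execute(
    "CREATE TABLE students ("
    "id INTEGER PRIMARY KEY, name TEXT, grade INTEGER, home_address TEXT)"
)
conn.executemany(
    "INSERT INTO students (name, grade, home_address) VALUES (?, ?, ?)",
    [("Ama", 85, "12 Palm St"), ("Kofi", 72, "7 Hill Rd")],
)
conn.commit()

# The view exposes only the columns we want users to see.
conn.execute("CREATE VIEW student_grades AS SELECT name, grade FROM students")

rows = conn.execute(
    "SELECT name, grade FROM student_grades ORDER BY name"
).fetchall()
print(rows)

# In SQLite, trying to modify data through a view raises an error.
try:
    conn.execute("UPDATE student_grades SET grade = 0")
    write_blocked = False
except sqlite3.Error as exc:
    write_blocked = True
    print("write rejected:", exc)

# A view stores no data of its own: new rows in the table show up in it.
conn.execute(
    "INSERT INTO students (name, grade, home_address) VALUES (?, ?, ?)",
    ("Esi", 90, "3 Lake Ave"),
)
conn.commit()
view_count = conn.execute("SELECT COUNT(*) FROM student_grades").fetchone()[0]
print(view_count)
```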