Think BIG, think BIG DATA!
Over the past few years, the amount of data generated has increased exponentially.
What do you think is the reason for this?
If you thought it’s because we now have data sources that didn’t exist back then, you’re right! If you didn’t get it right, don’t worry! I’ve got you.
Pause for a moment and think about what you’ve done today…
You probably sent a text to a friend or posted a picture on social media. Maybe you even liked or commented on a post. All these activities, together with the machines and tools you used, generated data.
You may be wondering what the sources of big data are.
Well, the bulk of big data comes from three primary sources: human, machine and organizational data. Let’s look at each one separately.
Human-generated: This is data humans create and share.
Examples of human-generated data are posts on social media and emails.
Social media has been the leading source of human-generated data. Over 350,000 tweets are sent every minute.
Machine-generated: This is data generated from machines without any active human intervention.
But hey! What does it mean for data to be generated without active human intervention?
Think of a satellite! It captures optical photos of the earth, measures the temperature of the earth’s surface and much more. It is programmed to do this and does it on its own, without a human telling it to.
Examples of machine-generated data sources are sensors on cars, satellites, security cameras and fitness apps.
Organization-generated: This is data generated as organizations run their businesses.
For instance, when you make a purchase online, a record is generated with the date and time you purchased the item, the number of items you purchased and even a unique customer number. This is an example of organization-generated data.
Why do we call some data “BIG”?
Big data refers to data that is nearly impossible to process using traditional methods, like a single computer, because there’s so much of it, being generated so quickly, in many different formats.
Big data is characterized by three Vs: Volume, Velocity and Variety.
These characteristics are key to understanding how we can measure big data. Let’s tackle each one of them separately.
Volume refers to the massive amount of data generated every day.
The International Data Corporation (IDC) forecasts that the amount of data that exists in the world will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. Just to put that into perspective, the laptop I used to type this has 256GB of storage. That’s equivalent to just 0.000000000256 zettabytes (nine zeros after the decimal point).
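If you want to check that arithmetic yourself, here is a minimal sketch in Python. It assumes decimal (SI) units, where 1 GB is 10^9 bytes and 1 ZB is 10^21 bytes; the function name gb_to_zb is just made up for this illustration.

```python
# Assumption: decimal (SI) units, 1 GB = 10**9 bytes and 1 ZB = 10**21 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21


def gb_to_zb(gigabytes):
    """Convert gigabytes to zettabytes (hypothetical helper for illustration)."""
    return gigabytes * BYTES_PER_GB / BYTES_PER_ZB


# The 256GB laptop expressed in zettabytes.
laptop_zb = gb_to_zb(256)
print(f"{laptop_zb:.12f}")  # prints 0.000000000256

# How many 256GB laptops it would take to hold the 33 zettabytes of 2018.
laptops_needed = (33 * BYTES_PER_ZB) // (256 * BYTES_PER_GB)
print(f"{laptops_needed:,}")  # prints 128,906,250,000
```

In other words, you would need roughly 129 billion of those laptops just to store the data that existed in 2018.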
Velocity refers to the speed at which new data is generated and the speed at which data moves around.
A social media post going viral is a good example of data velocity.
Variety refers to the diversity of the data, in other words, the many different types of data that exist today.
What Are The Types Of Big Data?
Structured data: This refers to any data that conforms to a certain format/schema.
This kind of data fits neatly into rows and columns. A popular example of structured data is a spreadsheet, such as one created in Excel.
In the spreadsheet below, for example, we can see that the values in the Price column are given to 2 decimal places and the Product_IDs are all five-digit numeric values.
Because structured data are organized, they are generally easier to analyze.
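To see why a fixed format makes analysis easier, here is a small Python sketch that checks rows against the two rules described for the spreadsheet: a Product_ID of five digits and a Price with two decimal places. The rows themselves are made up for illustration (they are not the values from the spreadsheet), and treating "two decimal places" as "exactly two" is my assumption.

```python
import re

# Hypothetical rows standing in for a spreadsheet; the values are invented.
rows = [
    {"Product_ID": "10234", "Price": "4.99"},
    {"Product_ID": "10235", "Price": "12.50"},
    {"Product_ID": "1023", "Price": "7.5"},  # breaks both rules
]

PRODUCT_ID_PATTERN = re.compile(r"\d{5}")   # exactly five digits
PRICE_PATTERN = re.compile(r"\d+\.\d{2}")   # assumed: exactly two decimal places


def conforms(row):
    """Return True if the row follows both format rules."""
    return bool(
        PRODUCT_ID_PATTERN.fullmatch(row["Product_ID"])
        and PRICE_PATTERN.fullmatch(row["Price"])
    )


valid_rows = [row for row in rows if conforms(row)]
invalid_rows = [row for row in rows if not conforms(row)]

print(len(valid_rows), "rows follow the schema")
print(len(invalid_rows), "rows do not")
```

Because every valid row has the same shape, a question like "what is the total of the Price column?" becomes a one-line sum over valid_rows.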
Unstructured data: This is the opposite of structured data. It is also referred to as “messy data”.
If I asked you to tell me how much money KFC makes in a day and gave you the spreadsheet that has the daily sales, it would be a very easy thing for you to do. But suppose I gave you camera footage of each transaction instead and asked you to tell me how much revenue KFC makes. Now that’s a really difficult task, and you would probably call me “wicked Cliff”.
Unstructured data is the most widespread type of data. About 90% of the data we have today is unstructured. Because unstructured data is unsorted, disorganized and left in its original state, most businesses just keep it all as it is. Audio files, emails, pictures and social media posts are examples of unstructured data.
Semi-structured: This type of data fits somewhere in between structured and unstructured data.
It does not fit neatly into rows and columns but has some level of organization. A very good example is HTML. This is because, in HTML, we can organize different kinds of data in tags, for example, <p> for paragraphs and <ul> for lists.
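Here is a minimal Python sketch of how those tags give us something to hold on to. It uses the html.parser module from Python’s standard library; the HTML snippet and the class name TagTextCollector are made up for this illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside <p> and <li> tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.found = []  # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag

    def handle_endtag(self, tag):
        self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.current_tag in ("p", "li"):
            self.found.append((self.current_tag, text))


# Hypothetical snippet: one paragraph and one list with two items.
snippet = "<p>Daily sales report</p><ul><li>Burgers</li><li>Fries</li></ul>"

parser = TagTextCollector()
parser.feed(snippet)
parser.close()

for tag, text in parser.found:
    print(tag, "->", text)
```

There are no rows and columns here, yet the tags still tell us which piece of text is a paragraph and which pieces are list items. That partial organization is what makes the data semi-structured.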
I hope this article served its purpose as a comprehensible introduction to Big Data and helped build your foundation in the subject.
That’s it for today, folks.