Big Data: An Introduction

What is Big Data?

Big Data is commonly described using the 5 V's: velocity, volume, variety, veracity, and value. Some sources use only the first three, known as the 3 V's.

  • Velocity

Data changes rapidly, and the speed at which it is generated today is hard to imagine. The total amount of data created, captured, copied, and consumed globally was estimated at 64.2 zettabytes in 2020. Over the five years to 2025, global data creation is projected to grow to more than 180 zettabytes.
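To get a feel for what these two figures imply, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted above; the decimal unit conversion (1 ZB = 10^9 TB) and the choice to treat "more than 180" as exactly 180 are my own simplifying assumptions, so read the results as ballpark values.

```python
# Figures quoted in the paragraph above, in zettabytes.
start_zb = 64.2   # 2020
end_zb = 180.0    # 2025 projection ("more than 180", so treated as a lower bound)
years = 5

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (end_zb / start_zb) ** (1 / years) - 1

# Average rate during 2020, in terabytes per second.
# Assumptions: decimal units (1 ZB = 10**9 TB) and a 366-day year (2020 was a leap year).
seconds_2020 = 366 * 24 * 60 * 60
tb_per_second_2020 = start_zb * 10**9 / seconds_2020

print(f"Implied annual growth: {cagr:.1%}")
print(f"Average rate in 2020: {tb_per_second_2020:,.0f} TB/s")
```

Under these assumptions, the projection works out to roughly 23% growth per year, and the 2020 figure to about 2,000 terabytes every second.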

  • Volume

Volume refers to the size of the data. Large amounts of data are generated every second. On Facebook, for example, each photo uploaded, each post shared, and each reaction to a post counts as user data. With more than 3 billion users (as of December 2023), these interactions generate an immense amount of data.

  • Variety

Data comes in many forms: text, photos, videos, tables, and so on. By structure, data can be divided into structured, unstructured, and semi-structured data.

Structured data is data organized according to a well-defined schema. It is typically tabular, with rows and columns that clearly define the data's attributes. Computers can process structured data efficiently because its layout is known in advance.

Unstructured data has no predefined schema. Images, videos, and audio files are common examples, and there is a vast amount of unstructured text as well.

Semi-structured data combines features of both structured and unstructured data. Email is a good example: the body text is usually unstructured, while the sender's address, the recipient's address, and the subject are structured fields. HTML pages are another good example of semi-structured data.
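The email example can be made concrete with a small sketch using Python's standard `email` module. The message itself is made up purely for illustration, and the `structured` dictionary is just one possible way to hold the header fields.

```python
from email import message_from_string

# A made-up message, used only for illustration.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob,

The numbers look good this quarter. Let's talk on Monday.
"""

message = message_from_string(raw_email)

# Structured part: named header fields with a predictable meaning.
structured = {
    "sender": message["From"],
    "recipient": message["To"],
    "subject": message["Subject"],
}

# Unstructured part: free-form text with no fixed schema.
unstructured_body = message.get_payload()

print(structured)
print(unstructured_body)
```

The headers can be queried like columns in a table, while the body would need text processing before anything useful could be extracted from it.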

  • Veracity

Can the data be trusted? Is it giving accurate information? Questions like these are addressed under veracity.

  • Value

Data alone is not enough. How valuable is the data that we have collected? Is it possible to create value out of it?

How Big is Big Data?

This flood of data is coming from many sources. Consider the following:[4]

  • The New York Stock Exchange generates about 4–5 terabytes of data per day.
  • Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
  • Ancestry.com, the genealogy site, stores around 10 petabytes of data.
  • The Internet Archive stores around 18.5 petabytes of data.
  • The Large Hadron Collider near Geneva, Switzerland, produces about 30 petabytes of data per year.
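The three rates quoted above use different time units, which makes them hard to compare at a glance. The sketch below normalizes them to terabytes per day. The conversion factors (1 PB = 1,000 TB, a 30-day month, a 365-day year) and the use of 4.5 TB as the midpoint of the NYSE range are my own assumptions, so the output is approximate.

```python
# Rates quoted above, normalized to terabytes per day.
# Assumptions: decimal units (1 PB = 1,000 TB), a 30-day month, a 365-day year.
TB_PER_PB = 1_000

rates_tb_per_day = {
    "New York Stock Exchange": 4.5,                 # midpoint of 4-5 TB/day
    "Facebook photos": 7 * TB_PER_PB / 30,          # 7 PB/month
    "Large Hadron Collider": 30 * TB_PER_PB / 365,  # 30 PB/year
}

# Print from the smallest daily rate to the largest.
for source, rate in sorted(rates_tb_per_day.items(), key=lambda kv: kv[1]):
    print(f"{source:<26}{rate:>8.1f} TB/day")
```

On these assumptions, Facebook's photo growth alone comes to roughly 233 TB per day, around fifty times the NYSE figure.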

Think of the data generated on Netflix, which has 247 million subscribers as of January 2024. Each of these users has a large dataset associated with them: the total hours they have watched, the resume point for each title, the devices they watch on, and the content they have liked. Now multiply all of that by 247 million users.

Similarly, you can imagine the amount of data generated every second by websites and applications like Facebook, Instagram, TikTok, Tripadvisor, Booking, etc. Beyond these major platforms that come to mind, there are scientific applications that generate vast amounts of data. The Large Hadron Collider, for example, generates more data per year than all the tweets posted in a year.

Other scientific applications, such as weather forecasting, Google Flu Trends, and earthquake warnings, also rely on Big Data.

What is Big Data Analytics?

Big data analytics refers to the process of collecting, organizing, and analyzing big data to discover patterns and other useful information.

Note

I would like to quote the following passage from the book "Hadoop: The Definitive Guide", which adds a nice touch to this introduction.

"It has been said that “more data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms, often they can be beaten simply by having more data (and a less sophisticated algorithm). The good news is that big data is here. The bad news is that we are struggling to store and analyze it."

How Big Data is actually handled, and the concepts behind tackling these vast amounts of data, will be discussed in my next article. Stay tuned!