Hadoop Introduction


This article explores big data, its technical definition, and how Hadoop fits into the big data world.

Introduction to Big Data

Big data is exactly what the name suggests: a large amount of data. The term refers to datasets that are so large in volume and so complex that traditional data processing software cannot handle them. Big data typically contains a large amount of diverse data, both structured and unstructured.

The definition of big data rests on three major V's:

  1. Volume: The sheer volume of data matters. It can come from business transactions, social media feeds, images, and sensor data. Organizations may have petabytes of data to process. Even one person's footprint across Facebook, Instagram, LinkedIn, Snapchat, Twitter, and Google Drive alone amounts to a few gigabytes.
  2. Variety: Big data comes in different forms. It may be structured data such as database tables, semi-structured data such as XML or JSON, or unstructured data such as audio, images, and videos.
  3. Velocity: Velocity refers to how frequently data is captured and how quickly it must be handled.

Two more V's are often added to this list: Variability, meaning the data is not constant and is always changing, and Veracity, meaning the uncertainty or truthfulness of your data.

What is Hadoop?

In simple terms, Hadoop is a framework written in Java that processes big data effectively using distributed storage and distributed processing.

Major components of Hadoop:

  1. HDFS (Hadoop Distributed File System) - a file system that stores data in a distributed manner across a cluster

  2. MapReduce - the processing engine that processes the stored data in parallel.

Hadoop 2.x added a new component:

  • YARN (MRv2) - a framework for resource management and job scheduling
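To make the MapReduce idea concrete, here is a minimal sketch of its three phases (map, shuffle, reduce) using the classic word-count example. This is plain Java with no Hadoop dependency, purely for illustration; the class and method names are hypothetical, and a real Hadoop job would instead extend the `Mapper` and `Reducer` classes from the `org.apache.hadoop.mapreduce` API and run across many machines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative simulation of the MapReduce phases for word count.
// NOT a real Hadoop job - just the same data flow in miniature.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new HashMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data is big", "hadoop processes big data");
        Map<String, Integer> counts = reduce(shuffle(map(lines)));
        System.out.println(counts.get("big"));  // 3
        System.out.println(counts.get("data")); // 2
    }
}
```

In a real cluster, the map and reduce phases run on many nodes in parallel over HDFS blocks, and YARN schedules those tasks and allocates their resources.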

The next article covers the HDFS (Hadoop Distributed File System) architecture and how it works: alwayscode.hashnode.dev/hdfs-architecture