Abstract

In this world of information, the term Big Data has emerged with new opportunities and challenges for dealing with massive amounts of data. Big Data has earned a place of great importance and is becoming the choice for new research. Big data can be structured, unstructured or semi-structured, which renders conventional data management methods inadequate. The volume and diversity of the data, and the speed at which it is generated, make it difficult for the present computing infrastructure to manage Big Data. Traditional data management, warehousing and analysis systems fall short of tools to analyze this data. Due to its specific nature, Big Data is stored in distributed file system architectures. Hadoop is widely used for storing and managing Big Data. Hadoop, with the help of MapReduce and HDFS, is the core platform for structuring Big Data, and solves the problem of making it useful for analytic purposes. Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.


Keywords: Big Data, Hadoop, MapReduce, HDFS

1. Introduction

Big Data is a vague topic, and there is no exact definition that everyone follows. Big Data describes techniques and technologies to store, distribute and manage large datasets with high velocity and varied structure. Data that has large Volume, comes from a Variety of sources in a Variety of formats, and arrives with great Velocity is normally referred to as Big Data. All this data comes from smartphones, social networks, trading platforms, machines, and other sources. One of the largest technological challenges in software systems research today is to provide mechanisms for storage, manipulation, and information retrieval on large amounts of data. Web services and social media together produce an impressive amount of data, reaching the scale of petabytes daily. To harness the power of big data, you require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time. 50% to 80% of big data work is converting and cleaning the information so that it is searchable and sortable. Only a few thousand experts on our planet fully know how to do this data cleanup, and they need very specialized tools, like Hadoop, to do their craft 1. There are two ingredients driving organizations to investigate Hadoop. One is a lot of data, generally larger than 10 terabytes. The other is high computational complexity, like statistical simulations. Any combination of those two ingredients, together with the need to get results faster and cheaper, will drive your return on investment.

2. Big Data

Right now we face a problem called Big Data. What is that problem? Big data is, simply put, a huge amount of data that we are putting together.

In this world of internet-enabled devices we are generating a lot of data, and we are getting data from many different sources, so existing systems are not able to handle it. Big data raises two issues. First, data arrives from different sources continuously, and we are not able to store it on time using our existing computational techniques. Second is processing: by adding many servers to our system we can somehow manage to store the data, but often, when it comes to processing the data, we are not able to do it on time.

The 4 Vs that define Big Data are Volume, Velocity, Variety and Veracity (as shown in fig 1):

Volume – the amount of data that we are adding from different data sources, for example 100 megabytes, 100 terabytes or 1 petabyte.

Velocity – the speed at which data is generated and processed 3.

Variety – the kinds of data that we are getting from the different data sources into the warehouse, i.e. the types of data we are trying to add from different sources.

Veracity – the uncertainty or accuracy of data. Data is uncertain due to inconsistency and incompleteness 7.

3. Hadoop


The Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models 5. Hadoop is fundamentally infrastructure software for storing and processing large datasets; it is an open source project under Apache. Hadoop is influenced by Google's architecture, the Google File System and MapReduce. Hadoop processes large data sets in a distributed computing environment 3. Hadoop is an open source framework written completely in Java, but that does not mean you have to know Java in order to use it. To understand Hadoop you have to understand two fundamental things about it: one, how it stores files and data; and two, how it processes data. Hadoop can also store both structured and unstructured data, because fundamentally it is just a file system.


3.1 Hadoop architecture


Hadoop works on a master/slave architecture: there is one master computer that takes care of all the slave computers within the network topology. Whenever more than one computer is connected in a network where the machines can talk to each other, we call it a cluster.

Whenever you process terabytes of data using a single computer, it takes a long time to arrive at your end results.

For example, if I get terabytes of data from a data source such as amazon.com and put it onto my cluster, then instead of storing the entire data together on one machine, the master computer consults its configuration files and divides the data into small chunks. The request goes to the master computer, and the master computer allocates each slice to a particular machine. The way we distribute the data and divide it across multiple computers is the concept we call HDFS (the Hadoop Distributed File System).
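The idea of dividing a file into fixed-size chunks and allocating each slice to a machine can be sketched as follows. This is an illustrative toy, not Hadoop's actual code; the function names and the round-robin placement are our own simplifications (real HDFS placement is rack-aware).

```python
# Illustrative sketch (not Hadoop's real code): split a file into
# fixed-size blocks and assign each block to a slave node round-robin,
# the way the master conceptually distributes data across the cluster.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, HDFS's typical block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def assign_blocks(blocks, nodes):
    """Round-robin block-to-node assignment (real HDFS is smarter)."""
    return {block: nodes[i % len(nodes)] for i, block in enumerate(blocks)}

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
print(assign_blocks(blocks, ["slave1", "slave2"]))
```

A 300 MB file thus becomes three slices, each living on a different slave, which is exactly what lets the cluster read and process them in parallel.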

Whenever we want to increase the storage capacity, we simply add one more computer to the cluster; that is how we address the storage problem.

Processing works the other way around. The program that we write goes to the master computer, and the master computer sends the same logic to each computer where a slice of the data is stored; the program is sent there and processed there. This is the different paradigm that Hadoop uses: we do not pull the entire data together to process it on a single computer. Instead of pulling all the data together and processing it on a single machine, the Hadoop methodology sends the program to the individual machines, and each machine processes the chunk of the file stored on it. This way of processing is called MapReduce. Instead of processing ten terabytes of data on a single computer, we process the data on the individual computers. If you want more storage capacity, you add one more slave computer to the Hadoop cluster; if you want more processing capacity, you likewise add one more computer to the cluster.
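The "send the computation to the data" idea can be imitated in a few lines: each worker runs the same small program on only its local chunk, and only the small partial results travel back to be combined. This is a toy sketch using threads in place of machines; all names here are our own, not Hadoop's API.

```python
# Toy illustration of moving computation to data: each worker processes
# only its own slice, and just the partial results are combined at the end.
from concurrent.futures import ThreadPoolExecutor

def count_lines(chunk):
    """The 'program' shipped to each machine: runs on its local slice."""
    return len(chunk)

# Pretend each inner list is the slice of a file stored on one slave node.
chunks = [["a", "b"], ["c"], ["d", "e", "f"]]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_lines, chunks))  # runs concurrently

total = sum(partials)          # only tiny partial counts travel "back"
print(partials, total)         # [2, 1, 3] 6
```

Note that the combined result is built from three small numbers, not by shipping the three chunks to one place, which is the whole point of the paradigm.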

3.2 HDFS (Hadoop distributed file system)


HDFS, part of Hadoop, stands for the Hadoop Distributed File System. Hadoop lets you store files bigger than what can be stored on one particular node, server or computer, so you can store very large files. Imagine your PC could only store 50,000 files but you had a million files: with the help of Hadoop you can store them. HDFS supports parallel reading and processing of data; it supports read, write and rename operations, but it does not support random write operations. The other key property of HDFS is that it is fault tolerant and hence easy to manage: the data has built-in redundancy, typically multiple replicas of the data are kept in the system, and it tolerates disk and node failures.

The cluster manages addition and removal of nodes automatically, without requiring any operational intervention; one operator can support up to 3,000 nodes in a cluster, which is a lot of nodes for a single person. In HDFS, files are broken into blocks; these blocks are typically large, with a size of 128 megabytes. The blocks are stored as files on the data nodes, on local storage, and they are replicated for reliability; the typical block replication factor in an HDFS cluster is three. A node called the name node manages the file system namespace, i.e. the directories and files of the file system; the name node also manages the mapping of each file to the blocks that belong to it. The name node has complete information about every piece of data available in the cluster; there is always only one name node per Hadoop cluster, and its main duty is to get the data stored in the cluster with the help of the data nodes. When new data is submitted to a Hadoop cluster, the name node divides it into smaller parts, identifies the data nodes which can store this partitioned data, transfers the data to each data node, and maintains a table of data allocation. When an application wishes to retrieve data from HDFS, it contacts the name node for the location of the data; the name node looks into its allocation table and conveys the locations. The name node periodically receives the status of stored data from the data nodes, which helps it keep its allocation table up to date. Data nodes are the actual storage locations in a Hadoop cluster; a cluster can have thousands or even more data nodes. Data nodes actually store the data, and the name node coordinates all the operations related to data and data nodes (fig. 3).
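The name node's allocation table described above can be modeled in miniature. The class and method names below are our own assumptions for illustration (this is not Hadoop's API); the only thing taken from the text is the behavior: each block of a file is mapped to three data nodes, mirroring HDFS's default replication factor, and clients ask the name node where a block lives.

```python
# Toy model (assumed names, not Hadoop's API) of the name node's
# allocation table: each block maps to 3 data nodes (replication factor 3).
import itertools

class NameNode:
    REPLICATION = 3

    def __init__(self, data_nodes):
        self.table = {}                        # (filename, block_no) -> nodes
        self._cycle = itertools.cycle(data_nodes)

    def add_file(self, name, num_blocks):
        """Divide a new file's blocks among the data nodes."""
        for b in range(num_blocks):
            replicas = [next(self._cycle) for _ in range(self.REPLICATION)]
            self.table[(name, b)] = replicas

    def locate(self, name, block_no):
        """What a client asks the name node: where does this block live?"""
        return self.table[(name, block_no)]

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("orders.log", num_blocks=2)
print(nn.locate("orders.log", 0))   # ['dn1', 'dn2', 'dn3']
print(nn.locate("orders.log", 1))   # ['dn4', 'dn1', 'dn2']
```

Note that the name node never holds the data itself, only the table; the data nodes do the storing, just as the section describes.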


3.3 MapReduce


MapReduce is a programming model
introduced by Google for processing and generating large data sets on clusters
of computers.

MapReduce is the execution engine of Hadoop; its duty is to get jobs executed. There are two main components of MapReduce: the job tracker and the task tracker.

Jobtracker – the central manager for running MapReduce jobs.

Tasktracker – accepts and runs map, reduce and shuffle tasks.

The job tracker is hosted on the master computer and receives job execution requests from the client. Its main duties are to break the received job, a big computation, into small parts; to allocate the partial computations, the tasks, to the slave nodes; and to monitor the progress and reports of task execution from the slaves. The task tracker is the MapReduce component on the slave machine; as there are multiple slave machines, many task trackers are available in a cluster. Its duty is to perform the computation given by the job tracker on the data available on the slave machine; the task tracker communicates the progress and reports the results to the job tracker. The master node contains the job tracker and name node, whereas every slave contains a task tracker and data node.

A MapReduce program runs in three stages: the map stage, the shuffle stage, and the reduce stage.

Map stage: the input data is stored in the form of files in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The data is then processed by the mapper, and several small chunks of data are created.

Reduce stage: the reduce stage is the combination of the shuffle stage and the reduce stage proper. The reducer takes the output of the mapper, processes it, and generates a new set of output, which is stored in HDFS.

Hadoop sends the map and reduce tasks to the appropriate servers in the cluster while a MapReduce job is executing. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. When the task is completed, the cluster collects and reduces the data to form the final result and sends it back to the Hadoop server.
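The three stages above can be imitated in memory for the canonical MapReduce example, a word count. This is a minimal sketch of the model, not Hadoop's API; the function names are our own, and a real job would run the map and reduce functions on different machines against HDFS blocks.

```python
# Minimal in-memory imitation of the map, shuffle and reduce stages
# for a word count (names are our own, not Hadoop's API).
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each word's list of counts into a total."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs hadoop", "hadoop stores big data"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
print(result)   # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'stores': 1}
```

The mapper only sees one line at a time, the shuffle brings all counts for the same word together, and the reducer turns each group into one output record, matching the stage descriptions above.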


3.4 Advantages of Hadoop for companies



You might be thinking: instead of using just one computer, Hadoop uses many, so instead of buying one machine I will have to buy several. The obvious question that comes to mind is whether this makes Hadoop more expensive for the organization; the following features of Hadoop show that it does not.


If we use existing legacy system techniques, the server must always be up and running; if the server goes down, the data that was on it becomes inaccessible. That is why legacy systems always have to use enterprise hardware.

Enterprise hardware servers are highly reliable; they do not go down very frequently, but having this higher-grade hardware costs a lot of money. The Hadoop framework, in contrast, can store and process data using commodity hardware. Commodity hardware is cheap in terms of price: to buy one terabyte of hard drive in commodity hardware you may spend very little, whereas to buy one terabyte of hard drive space in enterprise hardware you have to spend thousands of dollars, because that is the very high-grade hardware legacy systems must use to make sure they stay up and running.

Then what if one of the machines goes down: how am I going to recover the data? By the nature of the Hadoop framework, the hardware itself is commodity hardware, cheap hardware that may go down at any time, so how do we recover the data? Using the concept of fault tolerance, specifically the replication factor, we make sure of our data availability; we increase the availability of the data through replication.
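Why replication gives this fault tolerance can be shown with a tiny sketch. With each block kept on three nodes (HDFS's default, as described in section 3.2), a block is lost only if all three of its replicas fail at once; the data layout and function below are our own illustrative assumptions.

```python
# Sketch of fault tolerance via replication: a block survives as long
# as at least one of its replica nodes is still alive.
placements = {
    "block-0": {"dn1", "dn2", "dn3"},   # 3 replicas each, HDFS's default
    "block-1": {"dn2", "dn3", "dn4"},
}

def available_blocks(placements, failed):
    """Blocks that still have at least one live replica after failures."""
    return [b for b, nodes in placements.items() if nodes - failed]

print(available_blocks(placements, failed={"dn1"}))                  # both survive
print(available_blocks(placements, failed={"dn2", "dn3"}))           # both survive
print(available_blocks(placements, failed={"dn1", "dn2", "dn3"}))    # only block-1
```

Even with two nodes down, every block here is still readable; only when all three replicas of a block fail does data become unavailable, which is why cheap, failure-prone commodity machines are acceptable.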


Whenever an organization tries to adopt any methodology, framework or technology, the first and foremost question is how much money it needs to spend to maintain it. If you go with Hadoop, you do not have to pay much for the software, and you do not have to buy highly reliable machines; you can buy the normal machines we generally use in our day-to-day activities, and store the data you get from your different sources in the Hadoop framework.

The Hadoop framework can take data from many different data sources, whether satellite data, sensor data, truck data or server data; whatever you take, Hadoop can easily collect the data and put it into its storage.


4. Conclusion

We have entered an era of Big Data. This paper describes the concept of Big Data along with its 4 Vs: Volume, Velocity, Variety and Veracity. The paper also describes Hadoop, an open source software framework used for processing Big Data. Over the long run, Hadoop will become part of our day-to-day information architecture. We will start to see Hadoop playing a central role in statistical analysis, ETL processing, and business intelligence.