Abstract

In this world of information, the term Big Data has emerged, bringing new opportunities and challenges in dealing with massive amounts of data. Big Data has earned a place of great importance and is becoming a preferred topic for new research. Big Data can be structured, unstructured, or semi-structured, which makes conventional data management methods inadequate. The volume and diversity of the data, together with the speed at which it is generated, make it difficult for present computing infrastructure to manage Big Data. Traditional data management, warehousing, and analysis systems lack the tools to analyze this data. Because of its specific nature, Big Data is stored in distributed file system architectures.
Hadoop is widely used for storing and managing Big Data. With the help of MapReduce and HDFS, Hadoop is the core platform for structuring Big Data, and it solves the problem of making that data useful for analytic purposes. Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.

Keywords: Big Data, Hadoop, MapReduce, HDFS

1. Introduction

Big Data is a vague topic, and there is no exact definition that everyone follows. The term describes techniques and technologies to store, distribute, and manage large datasets with high velocity and different structures.
Data that has a large Volume, comes from a Variety of sources and in a Variety of formats, and arrives with great Velocity is normally referred to as Big Data. This data comes from smartphones, social networks, trading platforms, machines, and other sources. One of the largest technological challenges in software systems research today is to provide mechanisms for storage, manipulation, and information retrieval on large amounts of data. Web services and social media together produce an impressive amount of data, reaching the scale of petabytes daily. To harness the power of big data, you need an infrastructure that can manage and process huge volumes of structured and unstructured data in real time. Between 50% and 80% of big data work is converting and cleaning the information so that it is searchable and sortable.
Only a few thousand experts on our planet fully know how to do this data cleanup. These experts also need very specialized tools, like Hadoop, to do their craft [1]. Two ingredients are driving organizations to investigate Hadoop. One is a lot of data, generally more than 10 terabytes. The other is high calculation complexity, as in statistical simulations. Any combination of those two ingredients, together with the need to get results faster and cheaper, will drive the return on investment.

2. Big Data

Big Data is simply a huge amount of data collected together. In today's world of internet-enabled devices, we generate a lot of data and receive it from many different sources, so existing systems are not able to handle it. Big Data raises two issues, both caused by data arriving continuously from different sources faster than existing computational techniques can handle. The first is storage: we receive tons of data from different devices and cannot store it in time. The second is processing: by adding many servers to the system we can somehow manage to store the data, but in many cases we cannot process it in time.
The 4 Vs that define Big Data are Volume, Velocity, Variety and Veracity (as shown in fig. 1):

Volume of data - the amount of data that is added from the different data sources, for example 100 megabytes, 100 terabytes, or 1 petabyte.
Velocity of data - the speed at which data is generated and processed [3].
Variety of data - the kinds of data that arrive from the different data sources at the warehouse.
Veracity - the uncertainty or accuracy of data. Data is uncertain because of inconsistency and incompleteness [7].
3. Hadoop

The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models [5]. Hadoop is fundamentally infrastructure software for storing and processing large datasets, and it is an open source project under Apache. Hadoop is influenced by Google's architecture, namely the Google File System and MapReduce.
Hadoop processes large data sets in a distributed computing environment [3]. Hadoop is an open source framework written mainly in Java, but you do not have to know Java in order to use it. To understand Hadoop, you have to understand two fundamental things: first, how it stores data, and second, how it processes data. Hadoop can store both structured and unstructured data because, fundamentally, it is just a file system.
3.1 Hadoop architecture

Hadoop is based on a master/slave architecture: one master computer takes care of all the slave computers within the network topology. Whenever more than one computer is connected in a network so that they can talk to each other, we call it a cluster.

Processing terabytes of data on a single computer takes a long time to produce the end results. Suppose, for example, that terabytes of data arrive from a data source such as amazon.com and are put onto the cluster. Instead of storing the entire data set on one machine, Hadoop divides it into chunks. The storage request goes to the master computer, which consults its configuration files and allocates each slice to a particular machine. This way of dividing the data and distributing it across multiple computers is called HDFS (Hadoop distributed file system).
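The division and allocation described above can be illustrated with a minimal Python sketch. It is not how Hadoop itself is implemented: the function names, the slave names, and the round-robin allocation policy are all assumptions made for illustration, and a real cluster applies its own placement rules.

```python
from collections import defaultdict


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut the data into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def allocate_chunks(num_chunks: int, slaves: list[str]) -> dict[str, list[int]]:
    """Decide, as the master would, which slave stores which chunk.

    Round-robin is an assumed policy chosen only to keep the sketch simple.
    """
    allocation = defaultdict(list)
    for index in range(num_chunks):
        allocation[slaves[index % len(slaves)]].append(index)
    return dict(allocation)


# Hypothetical input: 1000 bytes split into chunks of 300 bytes.
data = b"x" * 1000
chunks = split_into_chunks(data, 300)
plan = allocate_chunks(len(chunks), ["slave-1", "slave-2", "slave-3"])

for slave, chunk_ids in plan.items():
    print(slave, "stores chunks", chunk_ids)
```

The master only produces the plan; the chunks themselves live on the slave machines, which is why adding a slave adds storage capacity.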
Whenever we want to increase the storage capacity, we simply add one more computer to the cluster; that is how the storage problem is addressed.

For processing, the program we write is submitted to the master computer. The master computer sends the same logic to each computer that holds a slice of the data, and the program is executed there. This is the different paradigm that the Hadoop methodology uses: instead of pulling all of the data together to process it on a single machine, Hadoop sends the program to the individual machines. Each machine processes the chunks of the file that are stored on it, and this way of processing is called MapReduce. Instead of processing ten terabytes of data on a single computer, we process the data on the individual computers.
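The idea of sending the program to the data, rather than the data to the program, can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions: the slave names, the sample numbers, and the helper functions `local_job` and `run_on_cluster` are hypothetical and do not correspond to any Hadoop API.

```python
# Hypothetical cluster state: each slave already holds its own slice of the data.
slices = {
    "slave-1": [4, 8, 15],
    "slave-2": [16, 23],
    "slave-3": [42],
}


def local_job(values):
    """The 'program' shipped to a slave; it sees only that slave's slice."""
    return sum(values), len(values)


def run_on_cluster(job, slices_by_slave):
    """Run the same job where each slice lives and collect the small partial results."""
    return {slave: job(values) for slave, values in slices_by_slave.items()}


partials = run_on_cluster(local_job, slices)

# The master combines the partial results; the raw data never moves.
total = sum(partial_sum for partial_sum, _ in partials.values())
count = sum(partial_count for _, partial_count in partials.values())
mean = total / count
print("mean over all slices:", mean)
```

Only a pair of numbers per slave travels back to the master, which is the point of the design: the partial results are far smaller than the slices they summarize.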
If you want more storage capacity, you add one more slave computer to the Hadoop cluster; if you want more processing capacity, you likewise add one more computer to the cluster.

3.2 HDFS (Hadoop distributed file system)

HDFS is the part of Hadoop that stores data; the name stands for Hadoop distributed file system. Hadoop lets you store files bigger than what can be stored on one particular node, server, or computer, so you can store very large files. Imagine that your PC could store only 50,000 files but you had a million files; with the help of Hadoop you can store them. HDFS supports parallel reading and processing of data. It supports read, write, and rename operations, but it does not support random write operations. The other key point is that HDFS is fault tolerant and hence easy to manage: the data has built-in redundancy, typically multiple replicas of the data are kept in the system, and it tolerates disk and node failures.
The cluster manages the addition and removal of nodes automatically, without requiring any operational intervention. One operator can support up to 3,000 nodes in a cluster, which is a lot of nodes for a single person. In HDFS, files are broken into blocks. These blocks are typically large, with a size of 128 megabytes, and they are stored as files on the local storage of the data nodes. The blocks are replicated across data nodes for reliability; the block replication factor in an HDFS cluster is typically three. A node called the name node manages the file system namespace, that is, the directories and files of the file system. The name node also manages the mapping of each file to the blocks that belong to it.
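The block size and replication factor quoted above lend themselves to a short worked calculation. The Python sketch below uses the typical values from the text (128 megabytes and a replication factor of three); the 1 GB file size and the function names are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE_MB = 128      # typical HDFS block size, as stated in the text
REPLICATION_FACTOR = 3   # typical block replication factor, as stated in the text


def block_count(file_size_mb: float) -> int:
    """Number of blocks a file is broken into (the last block may be partial)."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def replica_count(file_size_mb: float) -> int:
    """Total number of block replicas kept across the data nodes."""
    return block_count(file_size_mb) * REPLICATION_FACTOR


def raw_storage_mb(file_size_mb: float) -> float:
    """Raw disk space consumed in the cluster once every block is replicated."""
    return file_size_mb * REPLICATION_FACTOR


file_size_mb = 1024  # hypothetical 1 GB file
print("blocks:", block_count(file_size_mb))
print("block replicas:", replica_count(file_size_mb))
print("raw storage (MB):", raw_storage_mb(file_size_mb))
```

The calculation shows the cost of the built-in redundancy: with a replication factor of three, every megabyte of user data occupies three megabytes of raw disk space in the cluster.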
The name node has complete information about every piece of data available in the cluster. In the basic configuration described here there is only one name node per Hadoop cluster. Its main duty is to get the data stored in the cluster with the help of the data nodes. When new data is submitted to a Hadoop cluster, it is divided into smaller parts, and the name node identifies the data nodes that can store each part. The data is then transferred to those data nodes, and the name node maintains a table of the data allocation. When an application wishes to retrieve data from HDFS, it contacts the name node for the location of the data; the name node looks into its allocation table and conveys the locations. The name node periodically receives the status of the stored data from the data nodes, which helps it keep its record of data storage up to date. Data nodes are the actual storage locations in a Hadoop cluster. A Hadoop cluster can have thousands of data nodes or even more. Data nodes store the data, and the name node coordinates all the operations related to data and data nodes (fig. 3).

3.3 MapReduce

MapReduce is a programming model introduced by Google for processing and generating large data sets on clusters of computers. MapReduce is the execution engine of Hadoop; its duty is to get jobs executed. There are two main components of MapReduce, the job tracker and the task tracker:

Jobtracker - central manager for running MapReduce jobs.
Tasktracker - accepts and runs map, reduce and shuffle tasks.
The job tracker is hosted on the master computer and receives job execution requests from the client. Its main duties are to break down the received job, which is a big computation, into small parts; to allocate these partial computations, called tasks, to the slave nodes; and to monitor the progress and reports of task execution from the slaves. The task tracker is the MapReduce component on the slave machine. As there are multiple slave machines, many task trackers are available in a cluster. Its duty is to perform the computation given by the job tracker on the data available on the slave machine. The task tracker communicates progress and reports results to the job tracker. The master node contains the job tracker and the name node, whereas every slave contains a task tracker and a data node.

A MapReduce program runs in three stages: the map stage, the shuffle stage, and the reduce stage.

Map stage: The input data is stored in the form of files in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the shuffle stage and the reduce stage.
The reducer takes the output of the mapper, processes it, and generates a new set of output, which is stored in HDFS. While a MapReduce job is executing, Hadoop sends the map and reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. When the tasks are completed, the cluster collects and reduces the data to form an appropriate result and sends it back to the Hadoop server.
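The map, shuffle, and reduce stages described above can be sketched with the classic word count example. The Python code below is a single-process simulation meant only to show the data flow; it does not use the Hadoop API, and the function names and input lines are assumptions made for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Map stage: receive one input line and emit (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle stage: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce stage: combine the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Chain the three stages; in Hadoop each stage would run across many nodes."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input file, passed to the mapper line by line.
lines = ["big data needs hadoop", "hadoop stores big data"]
counts = run_job(lines)
print(counts)
```

In a real cluster the mapper calls would run on the slaves that hold the input blocks, and the framework would handle the shuffle by copying the grouped pairs between nodes; the sketch keeps everything in one process so that the three stages are easy to follow.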
4. Advantages of Hadoop for the companies

You might be thinking that, instead of having just one computer, we now use several, so instead of buying one machine we have to buy more than one. At first glance this looks more expensive for the organization. With existing legacy systems, however, the server must always be up and running; if the server goes down, the data on it becomes inaccessible. That is why legacy systems always require enterprise hardware. Enterprise hardware servers are highly reliable and do not go down very frequently, but such high-end hardware costs a lot of money. The Hadoop framework, in contrast, can store and process data using commodity hardware. Commodity hardware is cheap: one terabyte of hard drive space costs very little, whereas with enterprise hardware one terabyte of hard drive space costs thousands of dollars, because this is the high-end hardware that legacy systems need in order to stay up and running in production.

But if one of the machines goes down, how do we recover the data? By its nature the Hadoop framework runs on commodity hardware, which is cheap and may go down at any time.
Hadoop ensures data availability through fault tolerance, using the replication factor: because each piece of data is kept as several replicas, the data remains available when a machine fails.

Whenever an organization considers adopting a methodology, framework, or technology, the first question is how much money it will cost to maintain. With Hadoop you do not have to pay much for the software, and you do not have to buy highly reliable machines; you can buy the normal machines that are generally used in day-to-day activities and store on them the data you receive from your different sources. The Hadoop framework can take data from different data sources, whether satellite data, sensor data, truck data, or server data; whatever the source, it can easily collect the data and store it.

5. Conclusions

We have entered an era of Big Data. The paper describes the concept of Big Data along with the 4 Vs of Big Data: Volume, Velocity, Variety and Veracity. The paper also describes Hadoop, an open source software framework used for processing Big Data. Over the long run, Hadoop will become part of our day-to-day information architecture. We will start to see Hadoop playing a central role in statistical analysis, ETL processing, and business intelligence.