Hadoop is an open-source Apache framework written mainly in Java.
Hadoop analyzes and processes large amounts of data, up to petabytes, in parallel and in less time, with the data spread across a distributed environment. Hadoop is not a single tool; it is a combination of sub-frameworks, including Hadoop Core, MapReduce, HDFS, Pig, and HBase. Hadoop is mostly used for batch analytics (OLAP-style workloads) rather than OLTP transactions, and big companies like Facebook use it this way. Hadoop can be set up on a cluster as well as on a single node.
HDFS is based on the Google File System (GFS).
MapReduce is based on Google's MapReduce concept.
Pig is a high-level scripting layer (its language is Pig Latin) whose scripts are turned into MapReduce jobs.
In Hadoop, data is stored in the Hadoop Distributed File System (HDFS) in the form of blocks (chunks of a fixed size), and each block is replicated on different nodes (machines) in a clustered (multi-node) environment. In real-world deployments, petabytes of data are stored across thousands of nodes in a distributed environment. The advantage of storing replicated data is that the data stays available when one of the nodes is down, so it is available to client applications all the time. Do we need high-end hardware for all these nodes? The answer is no; we can use commodity hardware for all of them.
If the data grows rapidly, we can add nodes without taking the whole system down and without losing data. In network terminology, this is called a scalable system. The system also guards against data loss while machines are being added to the cluster and after they have joined it.
As you know, a cluster has many nodes. If one node fails, Hadoop handles the failure without losing data and continues to serve the required work as expected.
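To make the failure scenario concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes each block already has replicas on several nodes, marks one node as failed, and copies under-replicated blocks to healthy nodes. The function names, node names, and block names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def live_replicas(placement: Dict[str, List[str]], failed: Set[str]) -> Dict[str, List[str]]:
    """For each block, keep only the replicas on nodes that are still up."""
    return {block: [n for n in nodes if n not in failed]
            for block, nodes in placement.items()}


def re_replicate(placement: Dict[str, List[str]], failed: Set[str],
                 all_nodes: List[str], replication: int = 3) -> Dict[str, List[str]]:
    """Return a new placement where every block has `replication` copies again.

    Illustrative assumption: new copies go to the first healthy nodes that
    do not already hold the block.
    """
    healthy = [n for n in all_nodes if n not in failed]
    repaired = {}
    for block, nodes in live_replicas(placement, failed).items():
        if not nodes:
            # Every replica sat on a failed node, so the block cannot be recovered.
            raise RuntimeError(f"block {block} lost: no live replica")
        candidates = [n for n in healthy if n not in nodes]
        missing = max(0, replication - len(nodes))
        repaired[block] = nodes + candidates[:missing]
    return repaired
```

With blocks `blk_1` on `node1, node2, node3` and `blk_2` on `node2, node3, node4`, a failure of `node2` still leaves two live copies of each block, so reads keep working while the missing copy is recreated elsewhere.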
HDFS stores data in files, and these files sit on top of the underlying operating system's file system on each node.
HDFS is suited to storing large amounts of data, from terabytes to petabytes, which is then processed with MapReduce for analytical (OLAP-style) workloads.
In HDFS, data is split into blocks, and each block is replicated on different nodes (machines). The number of nodes on which a block is replicated is configurable in Hadoop.
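As a rough illustration of blocks and replication, the following plain-Python sketch splits a file of a given size into blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 match the defaults in recent Hadoop releases (`dfs.blocksize` and `dfs.replication`). The round-robin placement is a simplifying assumption; real HDFS uses a rack-aware policy. Function and node names are hypothetical.

```python
from typing import Dict, List

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the default dfs.blocksize in recent Hadoop
REPLICATION = 3                  # the default dfs.replication


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size of each block; the last block may be smaller."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes.

    Simplifying assumption: plain round-robin, not HDFS's rack-aware policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return {b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)}
```

For example, a 300 MB file becomes two full 128 MB blocks plus one 44 MB block, and each of the three blocks lands on three different nodes.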
In a Hadoop system, petabytes of data are distributed across thousands of nodes in a distributed environment. To process the data stored in HDFS, we need an application.
MapReduce is a Java framework for writing programs that run in parallel to process large amounts of data in a clustered environment.
Hadoop provides MapReduce APIs for writing MapReduce programs. We use those APIs and put our own data analysis logic in the code.
A MapReduce program is a piece of code that fetches and processes data in a distributed environment.
Because we want to process the data stored in HDFS, we need to write programs in a language such as Java or Python.
MapReduce has two kinds of task: 1. Mapper 2. Reducer
The mapper takes the input data, divides it, and processes the pieces as a set of tasks. The output of the map tasks is passed to the reducer, which processes and combines that data to produce the final output.
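The mapper and reducer roles can be sketched with a word count that runs in memory in plain Python. This is only a model of the idea, not the Hadoop API: in a real cluster the map tasks run on many nodes and the framework does the grouping step between them and the reducers. All function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values emitted by the mappers under their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce one after another on in-memory input."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Calling `run_job(["big data", "big cluster"])` counts `big` twice and the other two words once each. Because every mapper call sees only one line, the map step could be spread over many machines without changing the result.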
Hive is an Apache framework that acts as a SQL wrapper over MapReduce programs. Hive provides a SQL-like language (HiveQL) that the Hadoop system understands.
Most of the time, data analysis is done by database developers, who may not be familiar with Java programming. In that case, the Hive tool is useful.
A database developer writes Hive queries in the Hive tool. These queries invoke the underlying MapReduce jobs, which process the data, and the result is finally returned to the Hive tool.
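To give a feel for how a query maps onto map and reduce steps, here is a small plain-Python sketch of what a simple aggregate query conceptually does. The `sales` table, its columns, and the sample rows are hypothetical, and this is not how Hive is implemented internally; it only illustrates the idea that the grouping column becomes the key and the aggregate becomes the reducer.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical query this sketch mirrors:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;

# Hypothetical sample rows standing in for the `sales` table.
SALES = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 50},
    {"region": "north", "amount": 80},
]


def mapper(row: dict) -> Tuple[str, int]:
    """The GROUP BY column becomes the key, the aggregated column the value."""
    return row["region"], row["amount"]


def reducer(region: str, amounts: List[int]) -> Tuple[str, int]:
    """SUM(amount) for one group."""
    return region, sum(amounts)


def group_by_sum(rows: Iterable[dict]) -> Dict[str, int]:
    """Apply the mapper to every row, group by key, then reduce each group."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for row in rows:
        key, value = mapper(row)
        grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())
```

The database developer only writes the one-line query; the mapper and reducer above are the kind of work that gets generated on their behalf.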
The advantage of Hive is that no Java programming is needed in the Hadoop environment.
In some cases, a SQL query does not perform well, or a database function we need is not available in Hive. Then we have to write a custom extension (for example, a user-defined function) and register it with Hive. This is a one-time task.
Underneath all of these is Hadoop Common, the core set of shared libraries and utilities that the other Hadoop modules rely on to process large distributed data sets.
Let me take a scenario from before Apache Hadoop was introduced to the software world.
Let me explain the data processing use cases of one big data company.
A retail company has 20,000 stores around the world. Each store holds data about its products in different regions. This data is stored in different data sets, including several popular databases, in a multi-software environment. The company needs licenses for different software, including databases, as well as hardware.
Each month, the company wants to process the data store by store to find the best-selling product and the most loyal customers in each region, so that it can make better offers.
Assume that the 20,000 stores hold data on all products, customer details, and customer purchase information for each store.
The target use cases are:
Find the most popular product sold last Christmas at Store A, so that this year we can offer customers discounts on that product.
Find the top 10 products sold.
Select the top 100 customers per store to give them more offers.
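The "top N per store" use cases above can be sketched in a few lines of plain Python. The record layout `(store, product, quantity)` and the function name are assumptions made for this example; the same shape works for customers if the product field is replaced by a customer id and the quantity by the amount spent.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def top_products_per_store(purchases: Iterable[Tuple[str, str, int]],
                           n: int = 10) -> Dict[str, List[Tuple[str, int]]]:
    """Return the n best-selling products for each store.

    `purchases` is an iterable of (store, product, quantity) records,
    a hypothetical layout chosen for this sketch.
    """
    totals: Dict[str, Counter] = defaultdict(Counter)
    for store, product, quantity in purchases:
        # Sum quantities per product within each store (the "reduce" step).
        totals[store][product] += quantity
    # Counter.most_common sorts by count, highest first.
    return {store: counter.most_common(n) for store, counter in totals.items()}
```

On a single machine this only works while the purchase records fit in memory, which is exactly the limit the scenario runs into as more stores are added.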
From a technical point of view, we need a large data infrastructure to store this data and to process it. Processing both normalized and unorganized data requires data warehousing tools, which are costly. The storage must also be reliable, because any failure there means data loss.
Over time, data processing becomes more complex, and license and maintenance costs rise.
Suppose 10,000 more stores are added: the data grows and more nodes are added to the current infrastructure, but overall system performance degrades as the nodes are added.
Apache Hadoop solves the above problems.
This post has been a very basic introduction to what Hadoop is. Hopefully, you now have enough information to get started.
If you have any questions, please feel free to leave a comment and I will get back to you.