
Tuesday, 17 January 2012

Learn the Basics of HDFS in Hadoop

HDFS is the distributed file system used by Apache Hadoop. Hadoop uses HDFS to store very large data sets, on the order of petabytes. HDFS splits the data up and distributes it across different machines in a clustered architecture; because the data is spread over multiple machines, it remains highly available for processing. HDFS also runs on low-cost commodity hardware.

How is data stored in HDFS?


Hadoop runs on a group of machines, called a cluster, and processes huge amounts of data across them. Each cluster has a set of nodes called data nodes, and each data node holds a set of blocks. A given piece of data is duplicated across different nodes in the cluster; this duplication is what we call replication. Because the same data is available on several nodes or machines, you don't lose it when one of them goes down. The number of nodes on which the data is replicated (the replication factor) is configurable in Hadoop. To coordinate all these nodes there is a name node, which keeps in sync with the data nodes in the cluster to track their health and stores metadata about the nodes as well as the blocks.
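As a small illustration, here is a minimal sketch in Java of reading that replication factor from the cluster configuration. It assumes the standard Hadoop client libraries are on the classpath; dfs.replication is the standard property name, and the 3 used as a fallback here is simply Hadoop's usual out-of-the-box default.

    import org.apache.hadoop.conf.Configuration;

    public class ReplicationFactor {
        public static void main(String[] args) {
            // Loads core-site.xml / hdfs-site.xml from the classpath, if present
            Configuration conf = new Configuration();

            // "dfs.replication" controls how many data nodes each block is copied to;
            // 3 is Hadoop's usual default and serves only as a fallback value here
            int replication = conf.getInt("dfs.replication", 3);
            System.out.println("Each block will be replicated " + replication + " times");
        }
    }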

If the data grows rapidly, we can add more nodes without bringing down the whole system or losing data; in networking terminology, this is what we call a scalable system. Hadoop takes care not to lose data while new machines are being added to the existing ones, and after they have joined the cluster.
As you know, a cluster has many nodes; if one node fails, Hadoop handles the scenario without losing data and keeps serving the required work as expected.

HDFS stores data in files, and those files ultimately live on the file system of the underlying operating system.

HDFS is suitable for storing large amounts of data, in the tera- and petabyte range, which is then processed using MapReduce for OLAP-style (batch analytics) workloads.

Let us take a scenario where you have a 12-page PDF file that you want to store in HDFS.
Assume that each page occupies one block; the mapping would be different in a real system, where blocks have a fixed size (64 MB by default in Hadoop at the time).

The name node holds three items for each file: (file name, number of blocks, block IDs).
1. File name: the name under which the page is stored in the HDFS namespace, for example /pdf/page1
2. Number of blocks: the count of blocks in which this file is stored in HDFS
3. Block IDs: references, kept by the name node, to the blocks that hold the file's data

The data nodes hold the page data in different blocks.
Page 1 is replicated in three blocks, which have the IDs 1, 5 and 6.
These blocks sit on different machines.
Here is a summary of the pages stored on different nodes in the cluster, written as (file name, number of blocks, {block IDs}); a toy sketch of this table in code follows below.

Page1, 3, {1, 5, 6}
Page2, 3, {4, 1, 2}
Page3, 3, {3, 4, 1}
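To make the bookkeeping concrete, here is a toy model of that metadata table in Java. This is purely illustrative and not HDFS's real internal data structure; the class name and layout are made up for this example.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class NameNodeMetadataSketch {
        public static void main(String[] args) {
            // Map of file name -> block IDs; the number of blocks is the list size.
            // This mirrors the (file name, number of blocks, {block IDs}) table above.
            Map<String, List<Integer>> blockIndex = new HashMap<>();
            blockIndex.put("/pdf/page1", Arrays.asList(1, 5, 6));
            blockIndex.put("/pdf/page2", Arrays.asList(4, 1, 2));
            blockIndex.put("/pdf/page3", Arrays.asList(3, 4, 1));

            for (Map.Entry<String, List<Integer>> entry : blockIndex.entrySet()) {
                System.out.println(entry.getKey() + ", "
                        + entry.getValue().size() + ", " + entry.getValue());
            }
        }
    }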

Because of this replication, the data is never lost even when a data node goes down. The name node and the data nodes communicate over the TCP protocol.

HDFS is written entirely in Java. Data stored in HDFS can be managed using shell commands as well as the Java HDFS APIs provided by Apache. The commands are executed on top of the underlying operating system and in turn call the Java APIs to interact with the file system.
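As a quick taste of the Java API, here is a minimal sketch that writes our hypothetical /pdf/page1 file and reads back its replication factor. It assumes the Hadoop client libraries are on the classpath and that fs.default.name (fs.defaultFS in newer versions) in the configuration points at a running cluster; the path is just the one from the example above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPageExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write one "page" of our hypothetical PDF into HDFS
            Path page = new Path("/pdf/page1");
            FSDataOutputStream out = fs.create(page);
            out.writeUTF("contents of page 1");
            out.close();

            // The name node tracks how many copies of each block exist
            short replication = fs.getFileStatus(page).getReplication();
            System.out.println("/pdf/page1 is replicated " + replication + " times");

            fs.close();
        }
    }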

Hope this gave you a little taste of the ocean that is HDFS.

Please leave a comment if you like this post.