How To Install Hadoop Cluster

Introduction

Hadoop is a open source software use for distribute computing. It is reliable, scalable and perfectly use for big data purpose. Hadoop cluster commonly consist of NameNode, secondary NameNode, resource manager and DataNode.

NameNode itself store block metadata on the file call fsimage. Secondary NameNode is a NameNode helper. It log changes to fsimage (checkpoint) but do not store the actual fsimage file. Secondary NameNode update frequently and update NameNode fsimage by combining update logs with fsimage to achieve most recent fsimage. NameNode and DataNode are run on different machine to achieve best performances.

Resource manager is call YARN (Yet Another Resource Negotiator). It’s job to allocate memory and CPU available to the process. Apache Yarn consist of resources manager, slave daemon (NodeManager) and application master. Resource manage keep track, allocate resources and NodeManager.

NodeManager is manage process, run and provide resources to the container. Only one NodeManager per slave node.

Application master coordinates tasks within application and demand resources to perform tasks.

yarn

Amazon EC2 Instance

I am going to use Amazon EC2 instance on this tutorial. EC2 instance is web service provide by Amazon. By the time this article write they are offer free tier service. Please be caution about cost and not exceed free tier limit (we are not responsible for any cost occur).

  • Sign-in to Amazon web service, service => compute => EC2.

ec2

  • Launch instance => choose AMI =>Ubuntu 64 => general purpose => t2.micro(free tier eligible)

ami64

  • Configure instance details => add storage => 8 GiB

addstorage

  • Add tags “key” = “name” and “value” = “hadoop node”
  • Configure security group. Create new security group, name it and allow SSH from 0.0.0.0/0 (all traffic) to access the machine.
  • Launch, select new security pair, name it “hadoop_key” and download to local machine.
  • View instance. After a while you can check status of your instance to available.

Hadoop cluster

Hadoop cluster for this article very much like drawing below.

yarn

Remote access

Configuring remote access over secure network SSH is crucial for seamless communication between nodes. We would like every node able to communicate without go through login process again. The key is to set predefine configuration utilizing a keyfiles on each nodes.