In two previous posts I described the installation process for the 2.4.0 and 0.20 releases of Hadoop for the students of my class on Big scale analytics.

I opted for a VM-based solution, so that most of the hardware and OS issues students would face would be limited to installing and configuring the VM manager. For the record, I am running Mac OS X 10.10.5 and relying on VirtualBox 5.0.10.

First of all, I downloaded the ISO image for Ubuntu server 14.04 from the Ubuntu server download page and created a Linux-Ubuntu based VM in VirtualBox with 1GB RAM, an 8GB VDI-based HD (dynamically allocated), and a DVD preloaded with the Ubuntu server 14.04 ISO image. Then I ran the VM and accepted all default installation options, except for the keyboard layout (I use an Italian keyboard). I did not install any additional software, with the exception of manual package installation support.

Once the system was up and running, I installed Hadoop following a mix of the instructions in the tutorials by Michael Noll, BigData Handler, and Rasesh Mori, as detailed below.

#### Disable IPV6

Hadoop and IPV6 do not agree on the meaning of the 0.0.0.0 address, thus it is advisable to disable IPV6 by adding the following lines at the end of /etc/sysctl.conf (after having switched back to the boss user):
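The lines in question (the standard sysctl switches for disabling IPv6 on Ubuntu) are:

```
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```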

After a system reboot the output of cat /proc/sys/net/ipv6/conf/all/disable_ipv6 should be 1, meaning that IPV6 is actually disabled.

#### Setup SSH

Hadoop uses SSH to manage and communicate with the machines running its daemons, thus the corresponding server should be installed:
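On Ubuntu server this amounts to installing the openssh-server package:

```shell
# install the SSH server (sudo privileges required)
sudo apt-get install openssh-server
```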

and hadoop-user must be associated with a key pair, which is subsequently granted access to the local machine:
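A sketch of the usual two commands, to be executed as hadoop-user (an empty passphrase keeps the subsequent logins password-less):

```shell
# generate an RSA key pair with an empty passphrase
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# authorize the new public key for login on the local machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```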

Now hadoop-user should be able to access localhost via ssh without providing a password:
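The check is a plain login (the first connection will ask to accept the host fingerprint):

```shell
ssh localhost
```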

#### Download and install Hadoop

Download hadoop-2.7.1.tar.gz (the link points to a suggested Apache mirror, thus feel free to switch to a nearer one), unpack it, and move the result to /usr/local, adding a symlink with the friendlier name hadoop and changing ownership of the directory contents to the hadoop-user user:
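A sketch of the sequence; the archive.apache.org URL is used here as a stable fallback in place of the mirror link:

```shell
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar xzf hadoop-2.7.1.tar.gz
# move the unpacked tree to /usr/local and add a friendlier symlink
sudo mv hadoop-2.7.1 /usr/local
sudo ln -s /usr/local/hadoop-2.7.1 /usr/local/hadoop
# hand the installation over to the dedicated user
sudo chown -R hadoop-user /usr/local/hadoop-2.7.1
```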

#### Setup the dedicated user environment

Switch to the hadoop-user user and add the following lines at the end of ~/.bashrc:
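The exact set of variables is not reproduced here; a minimal sketch consistent with the rest of this post (HADOOP_HOME pointing at the symlink, JAVA_HOME at /usr, and the Hadoop bin and sbin directories on the PATH) is:

```
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```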

In order to have the new environment variables in place, reload .bashrc:
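That is:

```shell
source ~/.bashrc
```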

#### Configure Hadoop

Before being able to actually use the Hadoop file system it is necessary to modify some configuration files inside /usr/local/hadoop/etc/hadoop. All such files follow an XML format, and the updates concern the top-level configuration node (likely empty after the Hadoop installation). Specifically:

• in yarn-site.xml:
• in core-site.xml:
• in mapred-site.xml (likely to be created through cp mapred-site.xml.template mapred-site.xml):
• in hdfs-site.xml:
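The original snippets are not reproduced here; the following is a minimal pseudo-distributed configuration consistent with the description in this post (the property values, and notably the two HDFS directory paths, are assumptions):

```xml
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
```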

This also requires manually creating the two directories specified in the last two value XML nodes:
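Assuming the hypothetical paths /usr/local/hadoop/hadoop_data/hdfs/namenode and /usr/local/hadoop/hadoop_data/hdfs/datanode were used in hdfs-site.xml, the corresponding commands would be:

```shell
# -p creates the intermediate directories as needed
mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
```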

Finally, set the JAVA_HOME variable to /usr in /usr/local/hadoop/etc/hadoop/hadoop-env.sh.
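That is, the line exporting JAVA_HOME in hadoop-env.sh should read:

```
export JAVA_HOME=/usr
```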

#### Formatting the distributed file system

The last step consists in formatting the file system, an operation to be executed as hadoop-user:
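With the bin directory on the PATH, the formatting command is:

```shell
hdfs namenode -format
```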

the (hopefully) successful result of this operation is reported within the (quite verbose) output: search for the text "successfully formatted"!

#### A few more steps and… that’s it!

Hadoop is now installed. Invoking the scripts start-dfs.sh and start-yarn.sh respectively starts the distributed file system and the mapreduce daemons:
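With the sbin directory on the PATH, the two invocations are simply:

```shell
start-dfs.sh
start-yarn.sh
# jps should now list, among others, NameNode, DataNode,
# SecondaryNameNode, ResourceManager and NodeManager
jps
```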

Although it is possible to write directly to the root directory of the Hadoop file system, it is advisable to create a user directory for hadoop-user, because all relative paths will refer precisely to this directory:
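Following the HDFS convention of per-user directories under /user, a sketch of the commands is:

```shell
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hadoop-user
```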

An absence of output from these command invocations means a successful directory creation, which also ensures that the distributed file system component of Hadoop has been correctly installed. To also test the mapreduce component it is possible to run one of the example jobs distributed along with Hadoop:
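For instance, the pi job from the examples jar shipped with the 2.7.1 release (here with 2 maps of 5 samples each; the specific job and arguments are a choice of this sketch):

```shell
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5
```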

Finally, to stop the hadoop daemons, simply invoke stop-dfs.sh and stop-yarn.sh.