Installing Hadoop 2.4.0 from scratch (2014 version)
Tags: Hadoop – Ubuntu – VirtualBox
In a previous post I described how to set up a single-node Hadoop cluster in an Ubuntu server running on a virtual machine. Short story: it's related to my course on Big scale analytics; refer to the original post for the details. As software upgrades are a fact of life, I decided to update that tutorial to the more recent 2.4.0 release of Hadoop.
I opted for a VM-based solution, so that most of the hardware and OS issues students would face would be limited to installing and configuring the VM manager. For the record, I am running Mac OS X 10.9.5 and relying on VirtualBox 4.2.8.
First of all, I downloaded the ISO image for Ubuntu server 14.04 from the Ubuntu server download page and created a Linux/Ubuntu-based VM in VirtualBox with 1GB RAM (those who read my previous post will note an increase in the server's RAM, due to the fact that the default amount of 512MB led to Hadoop crashes during simple experiments), an 8GB VDI-based HD (dynamically allocated), and a DVD preloaded with the Ubuntu server 14.04 ISO image. Then I ran the VM and followed all default installation options, except for the keyboard layout (I use an Italian keyboard). I did not install any additional software, with the exception of manual package installation support.
Some details about the examples: the host name is …; there is an administrator user with login name boss (that is, boss is a sudoer); three dots (...) in a console are used to skip verbose output. Finally, a dollar sign ($) occurring at the beginning of a line denotes the bash prompt.
Setting up the environment
First of all, we need to be sure we are working on an up-to-date system. This will likely be the case if the ISO image refers to the current version of Ubuntu server; just to be sure, log in as the boss user and refresh the installed packages.
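A minimal sketch, using the standard apt-get commands:

$ sudo apt-get update
...
$ sudo apt-get upgrade
...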
Moreover, it is advisable not to run Hadoop services through a general-purpose user, so the next step consists in adding a group hadoop and a user hadoop-user belonging to that group (for the purposes of this tutorial, all information requested by adduser can be left blank, except the password).
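A sketch of these commands, using the standard addgroup and adduser tools:

$ sudo addgroup hadoop
...
$ sudo adduser --ingroup hadoop hadoop-user
...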
The mentioned tutorials suggest a potentially unsafe procedure to install the JDK through apt-get, thus it is advisable to opt for a manual installation.
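As a sketch, assuming the JDK archive has already been downloaded from the Oracle site (the archive and directory names below depend on the actual JDK version, and are purely illustrative):

$ tar xzf jdk-7u60-linux-x64.tar.gz
$ sudo mv jdk1.7.0_60 /usr/local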
Finally, a couple of environment variables should be set up so that the Java executables are in $PATH and Hadoop knows where Java has been installed: this is easily accomplished by adding a couple of export lines at the end of /etc/profile (to be edited through su), as sketched below. Once these variables are in place it is easy to check that Java has been properly installed.
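A sketch of such lines, assuming the illustrative JDK location used above:

export JAVA_HOME=/usr/local/jdk1.7.0_60
export PATH=$PATH:$JAVA_HOME/bin

The check itself is straightforward:

$ java -version
...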
All communications with Hadoop are encrypted via SSH, thus the corresponding server should be installed. Then hadoop-user must be associated with a key pair, which is subsequently granted access to the local machine, so that in the end hadoop-user is able to ssh into the local host without providing a password.
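A sketch of the whole procedure, using openssh-server and an empty-passphrase RSA key (the standard choices for this kind of setup):

$ sudo apt-get install openssh-server
...
$ su - hadoop-user
$ ssh-keygen -t rsa -P ""
...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
...

The first connection will ask to confirm the host fingerprint; subsequent ones should not request a password.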
Hadoop and IPv6 do not agree on the meaning of the 0.0.0.0 address, thus it is advisable to disable IPv6 by adding a few lines at the end of /etc/sysctl.conf (after having switched back to the administrator user). After a system reboot, the output of cat /proc/sys/net/ipv6/conf/all/disable_ipv6 should be 1, meaning that IPv6 is actually disabled.
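The lines in question are the ones commonly used to disable IPv6 system-wide:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

And, after the reboot:

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1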
Download and install Hadoop
Download the Hadoop 2.4.0 distribution, unpack it, and move the result into /usr/local, adding a symlink using the more friendly name hadoop and changing ownership of the directory contents to hadoop-user.
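A sketch of these steps, assuming the archive is fetched from the Apache archive site (any mirror will do):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.4.0/hadoop-2.4.0.tar.gz
...
$ tar xzf hadoop-2.4.0.tar.gz
$ sudo mv hadoop-2.4.0 /usr/local
$ sudo ln -s /usr/local/hadoop-2.4.0 /usr/local/hadoop
$ sudo chown -R hadoop-user:hadoop /usr/local/hadoop-2.4.0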
Set up the dedicated user environment
Switch to the hadoop-user user and add the following lines at the end of its ~/.bashrc; in order to have the new environment variables in place, reload the file after editing it.
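A sketch of the lines in question, assuming the locations used above (bin and sbin respectively contain the Hadoop executables and the startup scripts):

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

The reload is a one-liner:

$ source ~/.bashrc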
Get back to the administrator user, then open /usr/local/hadoop/etc/hadoop/hadoop-env.sh, uncomment the line setting JAVA_HOME, and set its value to the JDK directory.
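For instance, assuming the illustrative JDK location used above:

export JAVA_HOME=/usr/local/jdk1.7.0_60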
Before being able to actually use the Hadoop file system, it is necessary to modify some configuration files inside /usr/local/hadoop/etc/hadoop. All such files follow the same XML format, and the updates should concern the top-level configuration node (likely empty after the Hadoop installation); mapred-site.xml, in particular, will likely have to be created through cp mapred-site.xml.template mapred-site.xml.
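As an illustration, the settings commonly used for a single-node Hadoop 2.x deployment (the host and port below are assumptions of this sketch, not values from the original post) point the default file system to a local HDFS instance in core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

select YARN as the mapreduce framework in mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

and enable the shuffle auxiliary service in yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>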
This also requires manually creating the two directories specified in the corresponding value XML nodes.
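For instance, assuming the two directories live under /usr/local/hadoop/data (illustrative paths; setting dfs.replication to 1 is the usual choice for a single node), hdfs-site.xml would contain:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/data/datanode</value>
  </property>
</configuration>

and the directories would be created through

$ sudo mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode
$ sudo chown -R hadoop-user:hadoop /usr/local/hadoop/data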
Formatting the distributed file system
The last step consists in formatting the file system, an operation to be performed as hadoop-user through the command sketched below; the (hopefully) successful result is reported in one of its last output lines.
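Assuming the PATH set up in hadoop-user's environment:

$ hdfs namenode -format
...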
A few more steps and… that’s it!
Hadoop is now installed. Invoking the scripts start-dfs.sh and start-yarn.sh respectively starts the distributed file system and the mapreduce daemons.
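Both scripts should be run as hadoop-user:

$ start-dfs.sh
...
$ start-yarn.sh
...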
Although it is possible to directly write on the root directory of the Hadoop file system, it is more advisable to create the user directory for hadoop-user, because all relative paths will refer precisely to this directory:
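Following the usual HDFS convention of per-user home directories under /user:

$ hdfs dfs -mkdir -p /user/hadoop-user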
An absence of output from these command invocations means a successful directory creation, which also ensures that the distributed file system component of Hadoop has been correctly installed. To also test the mapreduce component, it is possible to run one of the example jobs distributed along with Hadoop:
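For instance, the pi estimation example bundled with the distribution can be launched as follows (the jar path matches the 2.4.0 layout; the two trailing arguments set the number of maps and of samples per map):

$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar pi 2 5
...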
Finally, to stop the Hadoop daemons, simply invoke the two stop scripts.
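These mirror the startup scripts:

$ stop-yarn.sh
...
$ stop-dfs.sh
...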