We wanted to use Giraph (http://giraph.apache.org/) for large-scale graph analysis. Because Giraph depends on Hadoop and HDFS, we first need to set up a Hadoop cluster and then deploy Giraph on it.


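The quick start guide at http://giraph.apache.org/quick_start.html covers Hadoop version 1. We are going to use Hadoop 2, in which the old MapReduce runtime (JobTracker and TaskTrackers) is replaced by the YARN framework for resource management; MapReduce still exists there, but as an application running on YARN.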

Setup Hadoop

For the local Hadoop YARN setup I followed (not step for step, since some config properties changed between versions):


These are the steps for a single-node cluster setup:


Yogin-Patel:~ yoginpatel$ pwd
Yogin-Patel:hadoop yoginpatel$ sudo wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.5/hadoop-2.6.5.tar.gz

Yogin-Patel:hadoop yoginpatel$ tar xvf hadoop-2.6.5.tar.gz --gzip

Yogin-Patel:hadoop yoginpatel$ mv hadoop-2.6.5 hadoop

Yogin-Patel:hadoop yoginpatel$ cd hadoop

Yogin-Patel:hadoop yoginpatel$ pwd

Yogin-Patel:hadoop yoginpatel$ ls
LICENSE.txt NOTICE.txt  README.txt  bin         etc         include     lib         libexec     logs        sbin        share
Yogin-Patel:hadoop yoginpatel$

Since I installed it in my home directory under my personal user on a Mac, this is what I added for paths in ~/.bash_profile:

export HADOOP_HOME=/Users/yoginpatel/hadoop
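The later steps call hdfs, start-dfs.sh and start-yarn.sh without a path, which suggests the Hadoop bin and sbin directories were on PATH as well. A minimal sketch of the extra profile lines, assuming that layout (the HADOOP_CONF_DIR and PATH lines are assumptions, not copied from the original profile):

```shell
# Install location from the export above.
export HADOOP_HOME=/Users/yoginpatel/hadoop
# Assumed: point Hadoop at its config directory explicitly.
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
# Assumed: make hdfs, yarn, start-dfs.sh and start-yarn.sh callable by name.
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

After editing the profile, run `source ~/.bash_profile` or open a new terminal so the variables take effect.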


Next, create a data directory for HDFS.

On my system I created /mnt1/hadoop3. Since /mnt1 is not owned by the local user yoginpatel, I had to give hadoop3 777 permissions.
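The directory step can be sketched as a small helper. The function name prepare_data_dir is hypothetical, and the 777 mode simply mirrors what I did locally; handing ownership to the local user with chown would be a narrower alternative.

```shell
# Create an HDFS data directory and open up its permissions.
# $1 = directory to create (/mnt1/hadoop3 in the setup above)
prepare_data_dir() {
  dir="$1"
  mkdir -p "$dir" || return 1
  # 777 is very permissive; it only suits a local sandbox machine.
  chmod 777 "$dir"
}

# Example: prepare_data_dir /mnt1/hadoop3
# When the parent directory is not writable by the current user,
# run the equivalent mkdir and chmod commands with sudo instead.
```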

Yogin-Patel:hadoop3 yoginpatel$ cd ~/hadoop/etc/hadoop/
Yogin-Patel:hadoop yoginpatel$ ls
capacity-scheduler.xml     hadoop-env.sh              httpfs-env.sh              kms-env.sh                 mapred-env.sh              ssl-server.xml.example
configuration.xsl          hadoop-metrics.properties  httpfs-log4j.properties    kms-log4j.properties       mapred-queues.xml.template yarn-env.cmd
container-executor.cfg     hadoop-metrics2.properties httpfs-signature.secret    kms-site.xml               mapred-site.xml            yarn-env.sh
core-site.xml              hadoop-policy.xml          httpfs-site.xml            log4j.properties           slaves                     yarn-site.xml
hadoop-env.cmd             hdfs-site.xml              kms-acls.xml               mapred-env.cmd             ssl-client.xml.example
Yogin-Patel:hadoop yoginpatel$

Configure HDFS:

Edit the file at hadoop/etc/hadoop/hdfs-site.xml (we basically need to configure the data directory).

Content from my local setup, for reference:

Yogin-Patel:hadoop yoginpatel$ cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
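The property block of this file is not included in the listing above. As a rough sketch only, a single-node hdfs-site.xml that points HDFS at the data directory could be generated like this. The property names (dfs.replication, dfs.namenode.name.dir, dfs.datanode.data.dir) are standard Hadoop 2 settings, but the values, the namenode/datanode subdirectory names and the write_hdfs_site helper are assumptions, not my original config.

```shell
# Write a minimal hdfs-site.xml for a single-node setup.
# $1 = Hadoop config directory, $2 = HDFS data directory
write_hdfs_site() {
  conf_dir="$1"; data_dir="$2"
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- One node, so keep a single copy of each block (assumed value). -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- Subdirectory names below are hypothetical. -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://$data_dir/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://$data_dir/datanode</value>
  </property>
</configuration>
EOF
}

# Example: write_hdfs_site "$HADOOP_HOME/etc/hadoop" /mnt1/hadoop3
```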




Configure Yarn:

Edit the file at hadoop/etc/hadoop/yarn-site.xml.

Content from my local setup follows. The limits reflect the constraints of my local machine; a real deployment should allocate more memory.

Yogin-Patel:hadoop yoginpatel$ cat yarn-site.xml
<?xml version="1.0"?>

<!-- Site specific YARN configuration properties -->
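The properties themselves are not included in the listing above. A hedged sketch of what a memory-limited single-node yarn-site.xml might contain is given here. The property names are standard YARN settings; the memory value passed in, the write_yarn_site helper and the choice to move the ResourceManager UI to port 8082 (the Hadoop default is 8088) are assumptions made to match the rest of these notes.

```shell
# Write a minimal yarn-site.xml for a small single-node setup.
# $1 = Hadoop config directory, $2 = memory in MB the NodeManager may hand out
write_yarn_site() {
  conf_dir="$1"; node_mb="$2"
  cat > "$conf_dir/yarn-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <!-- Total memory available to containers on this node (hypothetical value). -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>$node_mb</value>
  </property>
  <!-- Largest single container request the scheduler will grant. -->
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>$node_mb</value>
  </property>
  <!-- Assumed: serve the ResourceManager UI on 8082 instead of the default 8088. -->
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>localhost:8082</value>
  </property>
</configuration>
EOF
}

# Example: write_yarn_site "$HADOOP_HOME/etc/hadoop" 4096
```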







Start hadoop:

$ hdfs namenode -format

(the scripts below are in $HADOOP_HOME/sbin)
$ start-dfs.sh
$ start-yarn.sh

This starts the Hadoop cluster. This was a one-node setup on localhost; the config would change for a multi-node setup. http://localhost:8082 should show the ResourceManager UI.
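One way to confirm the daemons actually came up is to inspect the output of jps, the JDK tool that lists running Java processes. The helper below is a sketch: missing_daemons is a hypothetical name, and the list of five daemons assumes a default single-node start via start-dfs.sh and start-yarn.sh.

```shell
# Print each expected Hadoop daemon that is absent from `jps` output read on stdin.
missing_daemons() {
  jps_output="$(cat)"
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # jps prints "<pid> <main class>"; compare the class name as a whole field.
    printf '%s\n' "$jps_output" \
      | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
      || printf '%s\n' "$daemon"
  done
}

# Example: jps | missing_daemons   (no output means all five are running)
```

If a daemon is listed as missing, its log file under $HADOOP_HOME/logs is the first place to look.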


Setup Giraph:

No prebuilt Giraph distribution is available for YARN, so we need to build it from source.

Check out the Giraph source:

Yogin-Patel:~ yoginpatel$ pwd

$ mkdir giraph
$ cd giraph

$ git clone https://github.com/apache/giraph.git

Yogin-Patel:giraph yoginpatel$ ls
CHANGELOG                README                   checkstyle.xml           giraph-accumulo          giraph-debugger          giraph-hbase             license-header.txt
CODE_CONVENTIONS         bin                      conf                     giraph-block-app         giraph-dist              giraph-hcatalog          pom.xml
LICENSE.txt              checkstyle-relaxed-8.xml dev-support              giraph-block-app-8       giraph-examples          giraph-parent.iml        src
NOTICE                   checkstyle-relaxed.xml   findbugs-exclude.xml     giraph-core              giraph-gora              giraph-rexster           target
Yogin-Patel:giraph yoginpatel$

Build Giraph

There is a bug in the source that makes the build fail under the Hadoop YARN profile. To fix it, we need to change pom.xml:

-> remove STATIC_SASL_SYMBOL from line 1270 of pom.xml (a sample diff: https://github.com/yogin16/giraph/commit/acd536a0de748510a849392df18089ece69988c3)
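The same edit can be scripted. This is a sketch under assumptions: strip_sasl_symbol is a hypothetical helper, it assumes the symbol appears as one entry of a comma-separated value on the given line, and the line number can differ between checkouts, so locate it first with `grep -n STATIC_SASL_SYMBOL pom.xml`.

```shell
# Remove the STATIC_SASL_SYMBOL entry from a single line of a pom.xml.
# $1 = path to pom.xml, $2 = line number (1270 in the checkout described above)
strip_sasl_symbol() {
  pom="$1"; line="$2"
  # Drop the symbol and a preceding comma, if any, on that line only.
  # -i.bak edits in place and keeps a pom.xml.bak backup (works with GNU and BSD sed).
  sed -i.bak "${line}s/,\{0,1\}STATIC_SASL_SYMBOL//" "$pom"
}

# Example: strip_sasl_symbol pom.xml 1270
```

Restricting the substitution to one line leaves any other build profile that mentions the symbol untouched.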

$ mvn -Phadoop_yarn -Dhadoop.version=2.6.5 clean package -DskipTests

On a successful build, a target jar should have been created:

$ ls giraph/giraph-core/target/giraph-1.3.0-SNAPSHOT-for-hadoop-2.6.5-jar-with-dependencies.jar

Add the Giraph jars created by the build above to the Hadoop classpath:

cp giraph-1.3.0-SNAPSHOT-for-hadoop-2.6.5-jar-with-dependencies.jar ~/hadoop/share/hadoop/yarn/lib/
cp giraph-examples-1.3.0-SNAPSHOT-for-hadoop-2.6.5-jar-with-dependencies.jar ~/hadoop/share/hadoop/yarn/lib/
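The two cp commands above are run from the directories that hold each jar. A sketch that copies both from the root of the checkout is shown here; install_giraph_jars is a hypothetical helper, and it assumes the examples jar is built under giraph-examples/target in the same way the core jar is built under giraph-core/target.

```shell
# Copy the Giraph jars-with-dependencies from a source checkout into a Hadoop lib directory.
# $1 = root of the Giraph checkout, $2 = destination directory
install_giraph_jars() {
  src="$1"; dest="$2"
  for jar in "$src"/giraph-core/target/giraph-*-jar-with-dependencies.jar \
             "$src"/giraph-examples/target/giraph-examples-*-jar-with-dependencies.jar; do
    # An unmatched glob stays literal, so stop on anything that is not a real file.
    [ -f "$jar" ] || { echo "missing: $jar" >&2; return 1; }
    cp "$jar" "$dest"/ || return 1
  done
}

# Example: install_giraph_jars ~/giraph/giraph ~/hadoop/share/hadoop/yarn/lib
```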

Restart Hadoop

This works for a single node. A clustered setup may need different steps.

$ stop-all.sh

$ start-dfs.sh
$ start-yarn.sh


