03 Jan / 2011

Amazon Elastic Map Reduce (EMR): Beyond Basics

These are some handy scripts to launch / monitor / manage Amazon EMR jobs.

The code for all this is at GitHub:  https://github.com/sujee/amazon-emr-beyond-basics

Environment

I run these commands from a Linux EC2 instance.  It doesn't have
to be a 'powerful' instance, as it doesn't do much work, so an
m1.small type is fine.  The following needs to be installed: the Amazon
elastic-mapreduce command line client and s3cmd (both are used by the scripts below).

Input Paths

For testing MR jobs on a local Hadoop instance, we might use an input
path like 'hdfs://localhost:9000/input'.

For running on EMR, we can use S3 as input:
s3://my_bucket/input

Since Hadoop supports reading from S3 natively, an S3 input path works just
like an HDFS URL.

So how do we do this without hard-coding the path into the code?  We
pass it as a command line argument.  The following examples
illustrate how to pass two arguments on the command line.

HADOOP on development machine:

hadoop jar my.jar   my.TestMR   hdfs://localhost:9000/input   hdfs://localhost:9000/output


HADOOP cluster running on EC2:

hadoop jar my.jar   my.TestMR   s3://my_bucket/input   s3://my_bucket/output

EMR:

elastic-mapreduce   --create --name "MyJob"   --num-instances "5"  --master-instance-type "m1.large"  --slave-instance-type "c1.xlarge"  --jar s3://my_bucket/my.jar  --main-class my.TestMR   --arg s3://my_bucket/input   --arg s3://my_bucket/output


Here is skeleton code that takes the input and output paths as command line
arguments.
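
Below is a minimal sketch of such a driver, not necessarily the exact code from the repo (it reuses the class name my.TestMR from the commands above; the full code is in the GitHub repo linked at the top):

package my;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestMR extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "TestMR");
        job.setJarByClass(TestMR.class);
        // set your Mapper / Reducer / output key-value classes here

        // args[0] and args[1] can be hdfs://, s3:// or s3n:// URLs;
        // Hadoop picks the right FileSystem implementation from the scheme
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new TestMR(), args));
    }
}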

Input arguments as Property files

As we have seen, we can supply any number of arguments on the command line.  But when doing so, we are also hardcoding the parameters' positions: the first parameter is input_dir, the second is output_dir, etc.  This can lead to inflexible programs.  What if we want to override only the third argument?  We still have to supply all the arguments.  And if we need to pass in a lot of arguments, this method is not very handy.

Let's put all our arguments into a property file and feed that to our program.  Java property files are plain text files containing one key=value pair per line.
Here is an example:

my.input=xxxx
my.output=xxxx
my.db.host=xxxx
my.db.dbname=xxxxx
my.db.user=xxxxx
my.db.pass=xxxx
my.memcached.host=xxxx
my.foo=xxxx
my.bar=xxxxx

Here the properties are prefixed with 'my' just so we don't end up overriding any system properties by accident.

Next we should place this property file  in HDFS or S3.

HDFS:
hadoop  dfs -copyFromLocal   my.conf  hdfs://localhost:9000/my.conf

EMR:
copy this file to your bucket using s3cmd or the Firefox S3 Organizer

Now provide this file as an argument to the MR job:

HADOOP on development machine:
hadoop jar my.jar   my.TestMR   hdfs://localhost:9000/my.conf



EMR:
elastic-mapreduce   --create --name "MyJob"   --num-instances "5"  --master-instance-type "m1.large"  --slave-instance-type "c1.xlarge"  --jar s3://my_bucket/my.jar --main-class my.TestMR    --arg  s3://my_bucket/my.conf

Here is how we access the property file in our MR job
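
What follows is a minimal sketch, using a hypothetical helper method loadProperties and assuming the driver skeleton shown earlier (the full code is in the GitHub repo linked at the top).  The first command line argument is the URL of the property file; because we open it through Hadoop's FileSystem API, the same code works whether the file lives in HDFS or in S3.

// needs: java.io.IOException, java.io.InputStream, java.util.Properties,
//        org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path

private Properties loadProperties(String propFileUrl) throws IOException {
    Path propFile = new Path(propFileUrl);
    // picks the HDFS or S3 filesystem based on the URL scheme
    FileSystem fs = propFile.getFileSystem(getConf());

    Properties props = new Properties();
    InputStream in = fs.open(propFile);
    try {
        props.load(in);
    } finally {
        in.close();
    }
    return props;
}

Inside run() we can then read our own keys, and optionally copy them into the job Configuration so that mappers and reducers can see them too:

Properties props = loadProperties(args[0]);
String input  = props.getProperty("my.input");
String output = props.getProperty("my.output");

for (String key : props.stringPropertyNames()) {
    getConf().set(key, props.getProperty(key));
}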

Multiple JAR files

EMR allows uploading a single JAR file.  What if we need extra
JARs, like a JSON library?  We need to repackage the extra JARs
into a single JAR.  We then upload this combined JAR to our S3 bucket, so
it can be used for launching MR jobs.

Launching and Monitoring EMR Jobs

There is a Web UI to submit a MapReduce job and monitor its progress.  We will look at an alternative: launching an EMR job from the command line and monitoring its progress.

The following scripts do this.

The launch script is split into two parts.  The first part is configurable.  The second part is generic and does not need to be changed much; that is why the script is split this way.  The bottom script can be called from any launch script.

Some explanations:

  • instances : we control the instance type (--instance-type) and
    the number of instances (--num-instances).  This is a great feature of
    EMR.  We can requisition a cluster that fits our needs.  For
    example, if a small job needs only 5 instances we get 5.  If a
    larger job needs 20 instances we can get 20.

    To specify different machine types for the name node and the
    slave nodes, we use '--master-instance-type' and
    '--slave-instance-type'.  The name node doesn't do much; the slave
    nodes do the heavy lifting.  So in this case we make the name node
    'm1.large' and the slaves 'c1.xlarge'.

  • logging : with '--log-uri' we save the logs to S3

The script 'emr-wait-for-completion.sh' is linked below.  This script is
called from our run script.

https://github.com/sujee/amazon-emr-beyond-basics/blob/master/emr-wait-for-completion.sh

Here is how the script is launched

sh ./run-emr-testMR.sh   [input arguments]

or, to run it in the background:

nohup sh ./run-emr-testMR.sh  > emr.out 2>&1 &

The output will look like this:

20110129.090002  > run-emr-testMR.sh : starting....
=== j-J6USL8HDRX93 launched....
=== Job started RUNNING in  302  seconds.  status : RUNNING
j-J6USL8HDRX93      RUNNING        ec2-50-16-29-197.compute-1.amazonaws.com          TestMR__20110129-090002
Task tracker interface : http://ec2-50-16-29-197.compute-1.amazonaws.com:9100
Namenode interface : http://ec2-50-16-29-197.compute-1.amazonaws.com:9101
20110129.104454  > ./emr-wait-for-completion.sh : finished in 1-hours-27-mins.  status: SHUTTING_DOWN

Here is what it does:

  • when the job starts, it prints out the NameNode and TaskTracker status URLs

  • it monitors the job progress every minute

  • it copies the logs created by EMR into a directory under '/var/logs/hadoop-logs'.  We do this so we can track the progress from our own machine.  This directory can be made accessible via a webserver

  • we use s3cmd to transfer the files

  • the script terminates when our EMR job completes (success or failure)


Configuring Hadoop Cluster

We launched the EMR job with the default configuration.  What if we want to tweak Hadoop settings?  The following script shows how to do that.

https://github.com/sujee/amazon-emr-beyond-basics/blob/master/run-emr-testMR2.sh

We use a 'bootstrap action' and supply config-core-site.xml and config-mapred-site.xml
(these are just examples, not recommended settings).

Logging and Debugging

We can track our job's progress at the TaskTracker interface (on port 9100
of the master node), e.g. http://ec2-50-16-29-197.compute-1.amazonaws.com:9100

In order to view the NameNode and TaskTracker web pages, you need to
have access to machines launched with the 'ElasticMapReduce-master'
security group.  I usually give my IP address access to ports
1-65535.

Note, however, that accessing the output of individual mappers requires an SSH
tunnel setup.  More on this later.

Also, all these logs will go away after our cluster terminates.
This is why we are copying the logs to our machine using the 's3cmd sync'
command.  This way we can go back and check our logs for debugging
purposes.

More:

So far we have looked at some handy scripts and tips for working with
Amazon EMR.  If you have any feedback, please leave a comment below.

The mrjob framework, developed and open sourced by Yelp engineers, is also worth a look.  It is a Python-based framework.

Sujee Maniyam
Sujee is a founder and principal at Elephant Scale, where he provides consulting and training on Big Data technologies.
