10 Apr / 2011

Hbase Map Reduce Example

This is a tutorial on how to run a map reduce job on Hbase.  This
covers version 0.20 and later.

Recommended Readings:

Hbase home

Hbase MapReduce wiki

Hbase MapReduce package documentation

Hbase MapReduce intro by Lars George – a great introduction to Hbase map reduce

 

Version Difference

The Hadoop map reduce API changed around v0.20, and so did the Hbase map reduce
package:

org.apache.hadoop.hbase.mapred : older API, pre v0.20

org.apache.hadoop.hbase.mapreduce : newer API, post v0.20

We will be using the newer API.
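
Concretely, the Hbase map reduce classes we will be importing throughout this tutorial all come from the newer package:

// new-API Hbase map reduce classes used in this tutorial
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;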

Frequency Counter

For this tutorial, let's say our Hbase has records of
web_access_logs.  We record each web page access by a user.
To keep things simple, we only log the user_id and the page
they visit.  You can imagine all sorts of other stats being gathered as well,
such as ip_address, referrer_page, etc.

The schema looks like this:

userID_timestamp => {
    details => {
        page:
    }
}

To make the row-key unique, we append a timestamp at the end, making up a
composite key.
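
As a quick aside, here is a minimal sketch of how such a composite key can be built with Hbase's Bytes utility class. (Note: the 'Importer1' class shown later actually appends an incrementing counter instead of a real timestamp, but the idea is the same.)

import org.apache.hadoop.hbase.util.Bytes;

// composite row key: 4-byte userID followed by an 8-byte timestamp
int userId = 42;
long timestamp = System.currentTimeMillis();
byte[] rowKey = Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestamp));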

So sample data might look like this:

row details:page
user1_t1 a.html
user2_t2 b.html
user3_t4 a.html
user1_t5 c.html
user1_t6 b.html
user2_t7 c.html
user4_t8 a.html

We want to count how many times we have seen each user.  The
result we want is:

user count (frequency)
user1 3
user2 2
user3 1
user4 1

So we will write a map reduce program.  It is similar to the popular
word-count example, with a couple of differences: our input source is an Hbase
table, and the output is also sent to an Hbase table.

First, code access & Hbase setup


The code is in a Git repository on GitHub: http://github.com/sujee/hbase-mapreduce

You can get it with:

git clone git://github.com/sujee/hbase-mapreduce.git

This is an Eclipse project. To compile it, define HBASE_HOME to point to your Hbase install directory.

Let's also set up our Hbase tables:

0) For map reduce to run, Hadoop needs to know about the Hbase classes.
Edit 'hadoop/conf/hadoop-env.sh':

# Extra Java CLASSPATH elements.  add hbase jars
export HADOOP_CLASSPATH=/hadoop/hbase/hbase-0.20.3.jar:/hadoop/hbase/hbase-0.20.3-test.jar:/hadoop/hbase/conf:/hadoop/hbase/lib/zookeeper-3.2.2.jar

Change this to reflect your Hbase installation.

Instructions for modifying the Hbase-related Hadoop configuration are here: http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html

1) restart Hadoop in pseudo-distributed (single server) mode

2) restart Hbase in pseudo-distributed (single server) mode

3) create the tables:

hbase shell
    create 'access_logs', 'details'
    create 'summary_user', {NAME=>'details', VERSIONS=>1}

‘access_logs’ is the table that has ‘raw’ logs and will serve as our
Input Source for mapreduce.  ‘summary_user’ table is where we will
write out the final results.
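
If you prefer creating the tables from Java rather than the shell, here is a rough sketch using the HBaseAdmin API of the same Hbase version (the table and column-family names match the shell commands above):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTables {
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // raw logs table with a single column family 'details'
        HTableDescriptor accessLogs = new HTableDescriptor("access_logs");
        accessLogs.addFamily(new HColumnDescriptor("details"));
        admin.createTable(accessLogs);

        // summary table, keeping only one version per cell
        HTableDescriptor summary = new HTableDescriptor("summary_user");
        HColumnDescriptor details = new HColumnDescriptor("details");
        details.setMaxVersions(1);
        summary.addFamily(details);
        admin.createTable(summary);
    }
}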

Some Test Data …

So let's get some sample data into our tables.  The 'Importer1'
class will fill 'access_logs' with some sample data.

package hbase_mapred1;

import java.util.Random;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * writes random access logs into hbase table
 * 
 *   userID_count => {
 *      details => {
 *          page
 *      }
 *   }
 * 
 * @author sujee ==at== sujee.net
 *
 */
public class Importer1 {

    public static void main(String[] args) throws Exception {
        
        String [] pages = {"/", "/a.html", "/b.html", "/c.html"};
        
        HBaseConfiguration hbaseConfig = new HBaseConfiguration();
        HTable htable = new HTable(hbaseConfig, "access_logs");
        htable.setAutoFlush(false);
        htable.setWriteBufferSize(1024 * 1024 * 12);
        
        int totalRecords = 100000;
        int maxID = totalRecords / 1000;
        Random rand = new Random();
        System.out.println("importing " + totalRecords + " records ....");
        for (int i=0; i < totalRecords; i++)
        {
            int userID = rand.nextInt(maxID) + 1;
            byte [] rowkey = Bytes.add(Bytes.toBytes(userID), Bytes.toBytes(i));
            String randomPage = pages[rand.nextInt(pages.length)];
            Put put = new Put(rowkey);
            put.add(Bytes.toBytes("details"), Bytes.toBytes("page"), Bytes.toBytes(randomPage));
            htable.put(put);
        }
        htable.flushCommits();
        htable.close();
        System.out.println("done");
    }
}

Go ahead and
run ‘Importer1’ in Eclipse.

In hbase shell, let's see how our data looks:

hbase(main):004:0> scan 'access_logs', {LIMIT => 5}

ROW                                COLUMN+CELL
\x00\x00\x00\x01\x00\x00\x00r      column=details:page, timestamp=1269330405067, value=/
\x00\x00\x00\x01\x00\x00\x00\xE7   column=details:page, timestamp=1269330405068, value=/a.html
\x00\x00\x00\x01\x00\x00\x00\xFC   column=details:page, timestamp=1269330405068, value=/a.html
\x00\x00\x00\x01\x00\x00\x01a      column=details:page, timestamp=1269330405068, value=/b.html
\x00\x00\x00\x01\x00\x00\x02\xC6   column=details:page, timestamp=1269330405068, value=/a.html
5 row(s) in 0.0470 seconds

 

About Hbase Mapreduce

Let's take a minute and examine the Hbase map reduce classes.

A Hadoop mapper can take in (KEY1, VALUE1) and output
(KEY2, VALUE2).  The reducer can take (KEY2, VALUE2) and
output (KEY3, VALUE3).

[image: map reduce data flow]

(image credit : http://www.larsgeorge.com/2009/05/hbase-mapreduce-101-part-i.html)

Hbase provides convenient Mapper & Reducer classes –
org.apache.hadoop.hbase.mapreduce.TableMapper
and org.apache.hadoop.hbase.mapreduce.TableReducer. These classes extend the Hadoop Mapper and Reducer classes, and they make it easier to read from & write to Hbase tables.


TableMapper:

Hbase TableMapper is an abstract class extending Hadoop Mapper.

The source can be found at :
HBASE_HOME/src/java/org/apache/hadoop/hbase/mapreduce/TableMapper.java

package org.apache.hadoop.hbase.mapreduce;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Mapper;

public abstract class TableMapper<KEYOUT, VALUEOUT>
extends Mapper<ImmutableBytesWritable, Result, KEYOUT, VALUEOUT> {

}

Notice how TableMapper parameterizes the Mapper class.

Param          Class                    Comment
KEYIN (k1)     ImmutableBytesWritable   fixed. This is the row key of the current row being processed
VALUEIN (v1)   Result                   fixed. This is the value (Result) of the row
KEYOUT (k2)    user specified           customizable
VALUEOUT (v2)  user specified           customizable

 

The input key/value for TableMapper is fixed.  We are free to
customize output key/value classes.  This is a noticeable
difference compared to writing a straight hadoop mapper.
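
To make the parameterization concrete, a bare-bones mapper reading from an Hbase table might look like the sketch below. The input types are dictated by TableMapper; Text and IntWritable are just example output choices, not classes used later in this tutorial.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Input key/value are fixed by TableMapper: (ImmutableBytesWritable, Result).
// We only choose the output types -- here Text / IntWritable as an example.
class ExampleTableMapper extends TableMapper<Text, IntWritable> {
    @Override
    public void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        // rowKey is the Hbase row key; row holds all the cells of that row
        context.write(new Text(Bytes.toString(rowKey.get())), new IntWritable(1));
    }
}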

TableReducer

src  :
HBASE_HOME/src/java/org/apache/hadoop/hbase/mapreduce/TableReducer.java

package org.apache.hadoop.hbase.mapreduce;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public abstract class TableReducer<KEYIN, VALUEIN, KEYOUT>
extends Reducer<KEYIN, VALUEIN, KEYOUT, Writable> {
}

 

Let's look at the parameters:

Param          Class            Comment
KEYIN (k2)     user-specified   same class as the K2 output from the mapper
VALUEIN (v2)   user-specified   same class as the V2 output from the mapper
KEYOUT (k3)    user-specified   customizable
VALUEOUT (v3)  Writable         fixed by the parent class

TableReducer can take any KEY2 / VALUE2 class and emit any KEY3 class, but the output VALUE3 class must be a Writable.
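
For illustration only (the real reducer for this tutorial comes later), here is a skeleton showing how the type parameters line up. In practice the Writable you emit is usually a Put that gets written to the target table.

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.io.IntWritable;

// KEYIN / VALUEIN must match the mapper's output key/value classes.
// KEYOUT is whatever we want to use as the row key of the target table.
// The output value class is fixed by the parent class as Writable
// (in practice a Put or Delete that gets written to the table).
class ExampleTableReducer extends TableReducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {
    // reduce(ImmutableBytesWritable key, Iterable<IntWritable> values, Context context) goes here
}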

 

Back to Frequency Counting

We will extend TableMapper and TableReducer with our custom classes.

Mapper

Input:
  ImmutableBytesWritable – the row key (userID + timestamp)
  Result – the contents of the row

Output:
  ImmutableBytesWritable – the userID
  IntWritable – always ONE

Reducer

Input:
  ImmutableBytesWritable (userID) – the K2 output from the mapper
  Iterable<IntWritable> – all the ONEs emitted for this key (V2 from the mapper), combined into a list

Output:
  ImmutableBytesWritable (userID, same as the input key) – this will be KEYOUT (k3), and it will serve as the rowkey for the output Hbase table
  IntWritable (total of all the ONEs for this key) – this will be VALUEOUT (v3), and it will be the PUT value for the Hbase table

In the mapper we extract the userID from the composite rowkey (userID +
timestamp).  Then we just emit the userID and ONE – as in the number
one.
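
Since the userID occupies the first four bytes of the row key, slicing it back out is easy. Here is a minimal sketch, assuming the row key layout produced by 'Importer1' (a 4-byte userID followed by a 4-byte counter); the mapper code below does essentially the same thing with an ImmutableBytesWritable.

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

// build a sample composite key: userID = 7, counter = 12345
byte[] rowKey = Bytes.add(Bytes.toBytes(7), Bytes.toBytes(12345));

// wrap only the first SIZEOF_INT bytes -- that is the userID portion
ImmutableBytesWritable userKey = new ImmutableBytesWritable(rowKey, 0, Bytes.SIZEOF_INT);

// Bytes.toInt reads the first 4 bytes of the array, giving back the userID (7)
int userId = Bytes.toInt(rowKey);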

Visualizing Mapper output

   (user1, 1)
   (user2, 1)
   (user1, 1)
   (user3, 1)

The map-reduce framework collects identical output keys together and
sends them to the reducer.  This is why we see a 'list' or 'iterable'
of values for each userID key at the reducer.  In the reducer, we simply add
all the values and emit <userID, total count>.
Visualizing the input to the reducer:

   (user1, [1, 1])
   (user2, [1])
   (user3, [1])

And the output of reducer:

   (user1, 2)
   (user2, 1)
   (user3, 1)

Ok, now onto the code.

Frequency Counter Map Reduce Code

package hbase_mapred1;

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;

/**
 * counts the number of userIDs
 * 
 * @author sujee ==at== sujee.net
 * 
 */
public class FreqCounter1 {

    static class Mapper1 extends TableMapper<ImmutableBytesWritable, IntWritable> {

        private int numRecords = 0;
        private static final IntWritable one = new IntWritable(1);

        @Override
        public void map(ImmutableBytesWritable row, Result values, Context context) throws IOException {
            // extract userKey from the compositeKey (userId + counter)
            ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);
            try {
                context.write(userKey, one);
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
            numRecords++;
            if ((numRecords % 10000) == 0) {
                context.setStatus("mapper processed " + numRecords + " records so far");
            }
        }
    }

    public static class Reducer1 extends TableReducer<ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {

        public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }

            Put put = new Put(key.get());
            put.add(Bytes.toBytes("details"), Bytes.toBytes("total"), Bytes.toBytes(sum));
            System.out.println(String.format("stats :   key : %d,  count : %d", Bytes.toInt(key.get()), sum));
            context.write(key, put);
        }
    }
    
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        Job job = new Job(conf, "Hbase_FreqCounter1");
        job.setJarByClass(FreqCounter1.class);
        Scan scan = new Scan();
        String columns = "details"; // comma separated
        scan.addColumns(columns);
        scan.setFilter(new FirstKeyOnlyFilter());
        TableMapReduceUtil.initTableMapperJob("access_logs", scan, Mapper1.class, ImmutableBytesWritable.class,
                IntWritable.class, job);
        TableMapReduceUtil.initTableReducerJob("summary_user", Reducer1.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

Code Walk-through

    • Since our mapper/reducer code is pretty compact, we have it all in one file

 

  • At line 26 :
    static class Mapper1 extends TableMapper<ImmutableBytesWritable, IntWritable> {

we configure the class types emitted from the mapper. Remember, the map input types are already defined for us by TableMapper (as ImmutableBytesWritable and Result).

 

 

  • At line 34:

 

ImmutableBytesWritable userKey = new ImmutableBytesWritable(row.get(), 0, Bytes.SIZEOF_INT);

we are extracting userID from the composite key (userID + timestamp = INT + INT). This will be the key that we will emit.

 

 

  • At line 36:
            context.write(userKey, one);
    

    Here is where we EMIT our output. Notice we always output ONE (which is IntWritable(1)).

 

 

 

  • At line 46, we configure our reducer to accept the values emitted from the mapper (ImmutableBytesWritable, IntWritable)

 

 

 

  • At line 52:

 

            for (IntWritable val : values) {
                sum += val.get();

we simply aggregate the count. Since each count is ONE, the sum equals the number of values.

 

 

  • At line 56:

 

            Put put = new Put(key.get());
            put.add(Bytes.toBytes("details"), Bytes.toBytes("total"), Bytes.toBytes(sum));
            context.write(key, put);

Here we see the familiar Hbase PUT being created. The key being used is USERID (passed on from mapper, and used unmodified here). The value is SUM. This PUT will be saved into our target Hbase Table (‘summary_user’).

Notice, however, that we don't write directly to the output table. This is handled by the super class 'TableReducer'.

 

 

  • Finally, let's look at the job setup.

 

        HBaseConfiguration conf = new HBaseConfiguration();
        Job job = new Job(conf, "Hbase_FreqCounter1");
        job.setJarByClass(FreqCounter1.class);
        Scan scan = new Scan();
        String columns = "details"; // comma separated
        scan.addColumns(columns);
        scan.setFilter(new FirstKeyOnlyFilter());
        TableMapReduceUtil.initTableMapperJob("access_logs", scan, Mapper1.class, ImmutableBytesWritable.class,
                IntWritable.class, job);
        TableMapReduceUtil.initTableReducerJob("summary_user", Reducer1.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

We set up the Hbase configuration, the Job and the Scan. Optionally, we also configure the scanner with which columns to read. Then we use 'TableMapReduceUtil' to set up the mapper class:

         TableMapReduceUtil.initTableMapperJob(
                "access_logs",  // table to read data from
                scan,  // scanner
                Mapper1.class,   // map class
                ImmutableBytesWritable.class,  // mapper output KEY class 
                IntWritable.class,   // mapper output VALUE class
                job  // job
                );

Similarly, we set up the reducer:

      TableMapReduceUtil.initTableReducerJob(
                        "summary_user", // table to write to
                        Reducer1.class, // reducer class 
                        job);           // job

 

Running the Job

Single Server mode

We can just run the code from Eclipse: run 'FreqCounter1' from Eclipse. (You may need to increase the JVM memory using -Xmx300m in the launch configuration.)

Output looks like this:

...
10/04/09 15:08:32 INFO mapred.JobClient:  map 0% reduce 0%
10/04/09 15:08:37 INFO mapred.LocalJobRunner: mapper processed 10000 records so far
10/04/09 15:08:40 INFO mapred.LocalJobRunner: mapper processed 30000 records so far
...
10/04/09 15:08:55 INFO mapred.JobClient:  map 100% reduce 0%
...
stats :   key : 1,  count : 999
stats :   key : 2,  count : 1040
stats :   key : 3,  count : 986
stats :   key : 4,  count : 983
stats :   key : 5,  count : 967
...
10/04/09 15:08:56 INFO mapred.JobClient:  map 100% reduce 100%

Alright… we see the mapper progressing and then the 'frequency output' from our reducer! Neat!!

 

Running this on an Hbase cluster (multiple machines)

For this we need to make a JAR file of our classes.

Open a terminal and navigate to the directory of the project.

jar cf freqCounter.jar -C classes .

This will create a jar file 'freqCounter.jar'. Use this jar file with the 'hadoop jar' command to launch the MR job:

hadoop jar freqCounter.jar hbase_mapred1.FreqCounter1

You can track the progress of the job in the JobTracker web UI: http://localhost:50030

You can also monitor the program output there.

Checking The Result

Let's do a scan of the results table:

hbase(main):002:0> scan 'summary_user', {LIMIT => 5}

ROW                COLUMN+CELL
\x00\x00\x00\x00   column=details:total, timestamp=1269330349590, value=\x00\x00\x04\x0A
\x00\x00\x00\x01   column=details:total, timestamp=1270856929004, value=\x00\x00\x03\xE7
\x00\x00\x00\x02   column=details:total, timestamp=1270856929004, value=\x00\x00\x04\x10
\x00\x00\x00\x03   column=details:total, timestamp=1270856929004, value=\x00\x00\x03\xDA
\x00\x00\x00\x04   column=details:total, timestamp=1270856929005, value=\x00\x00\x03\xD7

5 row(s) in 0.0750 seconds

OK, looks like we have our frequency counts.  But they are all displayed
as raw bytes.  Let's write a quick scanner to print out a more user
friendly display.

package hbase_mapred1;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrintUserCount {

    public static void main(String[] args) throws Exception {

        HBaseConfiguration conf = new HBaseConfiguration();
        HTable htable = new HTable(conf, "summary_user");

        Scan scan = new Scan();
        ResultScanner scanner = htable.getScanner(scan);
        Result r;
        while ((r = scanner.next()) != null) {
            byte[] key = r.getRow();
            int userId = Bytes.toInt(key);
            byte[] totalValue = r.getValue(Bytes.toBytes("details"), Bytes.toBytes("total"));
            int count = Bytes.toInt(totalValue);

            System.out.println("key: " + userId + ",  count: " + count);
        }
        scanner.close();
        htable.close();
    }
}

 

Running this will print out output like …

key: 0,  count: 1034
key: 1,  count: 999
key: 2,  count: 1040
key: 3,  count: 986
key: 4,  count: 983
key: 5,  count: 967
key: 6,  count: 987
...
...

That’s it

thanks!

Sujee Maniyam
Sujee is a founder and principal at Elephant Scale, where he provides consulting and training on Big Data technologies.

52 Comments:


  • By Jean-Daniel Cryans 11 Apr 2010

    Excellent tutorial Sujee!

    With the filter, someone should also use scan.setCaching so speed up the job.

    J-D

  • By Lars George 13 Apr 2010

    Hi Sujee!

    Excellent post indeed, thank you for taking the time! Also please consider linking it into the HBase Wiki, so that other can find it easily (if you have not done so already) – simply register and edit the appropriate page. Much appreciated!

    Lars

  • By Renato Marroquin 14 Apr 2010

    Hey Sujee!! Really nice and helpful post, thanks a lot!!! Though I have a little doubt, I am still learning Hadoop MapReduce and I was wondering why I can’t see the jobs I am executing on my localhost:50030?? I tried to executed from eclipse using the cluster I set up, but I got an error that said

    Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

    any idea??? Thanks a lot in advance (=

    Renato

  • By admin 14 Apr 2010

    @Renato
    Your hadoop installation doesn’t know about Hbase classes.

    Check ‘step 0’ under ‘Hbase setup’ section. Or you can follow this:
    http://hadoop.apache.org/hbase/docs/r0.20.3/api/org/apache/hadoop/hbase/mapreduce/package-summary.html
    Once you modify your file, make sure to DISTRIBUTE the hadoop-env.sh across the cluster.

  • By Renato Marroquin 14 Apr 2010

    Hi I tried this:
    http://wiki.apache.org/hadoop/Hbase/MapReduce
    with no success. Though I am able to execute my job regularly from the command line, and as a java program, any ideas or suggestions are highly appreciated. Thanks in advanced.

    Renato M.

  • By admin 14 Apr 2010

    @Renato,
    the end goal is to be able to run in command line using
    hadoop jar your.jar your.mr.class.name
    so it will run on all cluster machines.
    So if you can do this, that is pretty good 🙂

    Not sure why you aren’t able to run it within Eclipse…

  • By Amol 16 Apr 2010

    Hi,
    I really like this post and has helped me in my project..thanks for the same..
    I am now motivated to make a HBase admin UI interface…
    do ping me if u want to help/get involved..
    thanks once again

  • By hadoop_learner 02 Sep 2010

    Worked perfectly. This tutorial was so helpful. Thank you so much.

    Even the official Apache tutorial in the Hadoop docs use the deprecated APIs, it’s nice to see an up to date tutorial like this one.

  • By Mark Kerzner 04 Feb 2011

    Hi, Sujee,

    thank for your tutorial. That is the only one that I found that does what I need: read from the HBase in MR and write back in MR! It works and uses the latest API. What more can a person want?

    Cheers,
    Mark

  • By Mark Kerzner 04 Feb 2011

    By the way, your timestamp is not unique, because a few records are created for the same time stamp – or my computer is too fast 🙂 , but you get this:

    x00x00x00x02x00 column=details:page, timestamp=1296845269865, value=/a.htm
    x00x07i l
    x00x00x00x02x00 column=details:page, timestamp=1296845269865, value=/a.htm
    x00x07j l

  • By admin 04 Feb 2011

    @Mark,
    the row key = random_user_id + incrementing_counter
    so it will be unique.

    timestamps are automatically created by Hbase for versioning cell data. I think it is milli-second accuracy for now. And it is possible you get a few entries with the same TS. But it is okay in most cases (in this case too)

    glad you found the article useful

    • By najeeb 18 May 2012

      “To make row-key unique, we have in a timestamp at the end making up a composite key.”
      But i didnt c tis in the code. u r nly appending the increment count right?/
      Hw can i append TS?

  • By Shrikant Bang 13 Feb 2011

    Thanks for great tutorials !

  • By Phuong Nguyen 24 Feb 2011

    Dear Sujee,

    Thanks for your great tutorial. One may need to download google-collect-1.0-rc1.jar and jsr305.jar
    I am glad that I found your tutorial too.

  • By Swagat 06 Mar 2011

    Hi Sujee,

    Really nice post, thanks a lot !!! Its really helpful to understand job creation and configuration.But i m getting some exception while executing through eclipse.I m using HBase 0.20.6 APIs. HDFS is in a cluster(multi machines)

    Exception in thread “main” java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException
    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:478)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
    at org.apache.hadoop.mapreduce.Job.connect(Job.java:476)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:464)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
    at com.exilant.hbase.FreqCounter1.main(FreqCounter1.java:81)
    Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:319)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:330)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:254)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:399)
    … 8 more

    any suggestion will be a great help for me.

    Thanks
    Swagat

    • By admin 08 Mar 2011

      @swagat,
      I am guessing Eclipse can not find some Hadoop / Hbase jars. Try running the program as a Map-Reduce job (use hadoop command to submit) from command line. That might fix the issue

  • By Misty 04 May 2011

    Can you please guide me how to use HBase in Java programs, I tried running your example but it gives this error

    11/05/02 17:44:40 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 0 time(s).
    11/05/02 17:44:42 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 1 time(s).
    11/05/02 17:44:44 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 2 time(s).
    11/05/02 17:44:46 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 3 time(s).
    11/05/02 17:44:48 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 4 time(s).
    11/05/02 17:44:50 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 5 time(s).
    11/05/02 17:44:52 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 6 time(s).
    11/05/02 17:44:54 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 7 time(s).
    11/05/02 17:44:56 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 8 time(s).
    11/05/02 17:44:58 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:60000. Already tried 9 time(s).
    11/05/02 17:44:59 INFO client.HConnectionManager$TableServers: Attempt 0 of 10 failed with . Retrying after sleep of 2000

    please let me know where am i going wrong. please suggest a simple way to use HBase. I just want to use HBase for simple java programming

  • By Misty 04 May 2011

    I tried your example, and these are the errors I am getting

    The project cannot be built until build path errors are resolved hbase_mapreduce Unknown Java Problem
    Unbound classpath variable: ‘C:cygwin/usr/local/hbase-0.90.2/conf’ in project ‘hbase_mapreduce’ hbase_mapreduce Build path Build Path Problem
    Unbound classpath variable: ‘C:cygwin/usr/local/hbase-0.90.2/hbase-0.20.3.jar’ in project ‘hbase_mapreduce’ hbase_mapreduce Build path Build Path Problem
    Unbound classpath variable: ‘C:cygwin/usr/local/hbase-0.90.2/lib/commons-logging-1.0.4.jar’ in project ‘hbase_mapreduce’ hbase_mapreduce Build path Build Path Problem
    Unbound classpath variable: ‘C:cygwin/usr/local/hbase-0.90.2/lib/hadoop-0.20.1-hdfs127-core.jar’ in project ‘hbase_mapreduce’ hbase_mapreduce Build path Build Path Problem
    Unbound classpath variable: ‘C:cygwin/usr/local/hbase-0.90.2/lib/log4j-1.2.15.jar’ in project ‘hbase_mapreduce’ hbase_mapreduce Build path Build Path Problem
    Unbound classpath variable: ‘C:cygwin/usr/local/hbase-0.90.2/lib/zookeeper-3.2.2.jar’ in project ‘hbase_mapreduce’ hbase_mapreduce Build path Build Path Problem

    Please let me know where I am going wrong

  • By Big Lep 10 Jun 2011

    This is an awesome tutorial. Gave me a good feel for what HBase brings to Hadoop. Good stuff!

  • By B Anil Kumar 15 Jun 2011

    Thank you Sujee, Really useful.

    It’s working fine. But If I check the table “access_logs”, It is present only in one of the machine(cluster of 3).

    How can I make it as distributed?

  • By B Anil Kumar 15 Jun 2011

    Thank you Sujee, Really useful.

    I did this, in a cluster which contains 3 Region Servers

    It’s working fine. But If I check the table “access_logs”, It is present only in one of the Region Server.

    How can I make it as distributed?

  • By admin 15 Jun 2011

    @Anil
    the table will split as it exceeds the ‘region max size’ (default 256M, I think).

  • By Amarjeet 11 Jul 2011

    Hi Sujee,

    Thanks for great tutorial!!

    I am able to run the mapreduce application using HDFS and HBase both.

    But now I want to use in memory data to be used as input to MapReduce application.

    e.g. – I have few data maps in my memory and want to use those in spite of Hbase table as input.

    Could you please guide me for this??

    Timely reply will be appreciated very much. 🙂

    Thanks in advance!!

  • By admin 11 Jul 2011

    @Amarjeet
    not sure what do you mean by using-memory data as Mapreduce input. are you talking about doing ‘data joins’ ?
    If that is the case, then you might try this:
    – load your data xref data into a memory cache like memcached (this can be done in the driver code of mapreduce)
    – in your mapper, access memcached to xref

  • By Amarjeet 12 Jul 2011

    Hi Sujee,

    by memory data I dont mean ‘DATA JOINS’.

    I Mean:
    lets take this example in your blog –
    you are reading input from Hbase table – ‘access_logs’ and then process it and put your output in ‘summary’ table.

    What I want is:
    I wish I could take input from some Map or list (present in-memory only or create myself in code).

    I hope you understand now.

  • By giri 28 Jul 2011

    Hi Sujee ,

    really very good doc …..

    thanks

  • By dus 24 Jan 2012

    Thanks, very good tut!

  • By Manu 20 Feb 2012

    I am getting below error. I saw in most of the places in blogs that it is resolved. But no clear details on how it got solved. Any help on this is really appreciated.
    Below is the error on running hbase -mapreduce program
    12/02/20 02:19:22 INFO mapred.JobClient: Task Id : attempt_201202182211_0032_m_000000_2, Status : FAILED
    Error: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:819)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:864)
    at org.apache.hadoop.mapreduce.JobContext.getCombinerClass(JobContext.java:207)
    at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1307)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.(MapTask.java:980)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.(MapTask.java:673)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)

    • By Sujee Maniyam 20 Feb 2012

      @Manu
      you need to Hbase jar files to your HADOOP_CLASSPATH.

      edit $HADOOP_HOME/conf/hadoop-env.sh and edit the line

      # Extra Java CLASSPATH elements. add hbase jars
      export HADOOP_CLASSPATH=/hadoop/hbase/hbase-0.20.3.jar:/hadoop/hbase/hbase-0.20.3-test.jar:/hadoop/hbase/conf:/hadoop/hbase/lib/zookeeper-3.2.2.j

      • By amsal 24 Feb 2012

        i m also having almost same error
        java.lang.runtimeexception : java.lang.classnotfoundexception :hbase_mapred1.FreqCounter$mapper1
        at org.apache.hadoop.conf.configuration.getclass……….

        i have also checked my jar file using
        jar tf freqcounter.jar
        it contains hbase_mapred1/FreqCounter$mapper1.class
        plz help me….
        i have also set my hadoop-env.sh in hadoop/conf

        plz help…….

        n one more thing
        when i see hbase at http://localhost:60010/master.jsp i only see one regionserver i.e master
        when i start hbase without stopping it, the regionservers on slaves again start whereas om master it says regionserver is running a process XXXX stop it first….
        do you have any idea why it is so??
        plz help….!!!
        thanx in advance

  • By amsal 08 Mar 2012

    hi
    i have created a table in hbase with 12 columns in each row and each column has 8 qualifiers.when i try to read complete row it returns correct value for 1:1 in row 1 but returns null for 1:2
    it reads all the columns correctly from 2 to 10….
    plz help how to solve this problem
    i m using this code for reading….it is inside for loop thar runs from 1 to 10..

    train[0][i] = Double.parseDouble(Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)),Bytes.toBytes(“1″))));

    train[1][i] = Double.parseDouble (Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)), Bytes.toBytes(“2″))));

    train[2][i] = Double.parseDouble (Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)), Bytes.toBytes(“3″))));

    train[3][i] = Double.parseDouble (Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)), Bytes.toBytes(“4″))));

    train[4][i] = Double.parseDouble (Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)), Bytes.toBytes(“5″))));

    train[5][i] = Double.parseDouble (Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)), Bytes.toBytes(“6″))));

    train[6][i] = Double.parseDouble (Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)), Bytes.toBytes(“7″))));

    train[7][i] = Double.parseDouble (Bytes.toString (r.getValue (Bytes.toBytes(Integer.toString(i)), Bytes.toBytes(“8″))));

  • By rum 18 Mar 2012

    Awesome tutorial!! Works like a charm! Loved the step by step approach…Great one for beginners.. 🙂

  • By Arati Patro 12 Apr 2012

    Hi,

    Can you please how explain you created a composite row key.
    I have been trying to create an HBase Table with a composite row key that has a number and a timestamp. The Put Constructor you have used takes in two byte values thereby converting the timestamp to a byte value as well. While I am able to do this, I’m unable to perform any operation on the timestamp or the number individually as the entire row key is inserted as one entity, Could you please guide further on how a partial key scan may be performed.
    Thanks in advance.

    • By Sujee Maniyam 26 May 2012

      Bytes utility class (part of hbase) is your friend

      int userId = 123;
      long timestamp = System.currentTimeMillis();
      byte [] compositeKey = Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestamp));

  • By ramya 17 Apr 2012

    Hey
    Can you help me out a lil more on the map function. I understood the basics of mapreduce but can’t get how to utilize it when you are doing more complex processing.
    How do we actually access the individual columns in a map function? I know that a map returns a row. how do we break it down so that we can compare values of two or more columns which maybe of different column families?

    For example, a User has a visitor and I want to know if that visitor is one of his friends?

    Please help.Just started learning hbase.Having a problem is visualizing how the data appears in a map reduce.

  • By Ramanjaneylu 29 Apr 2012

    Hi sujee,
    Wonderful doc which helped me a lot.

    I want some suggestions like,
    1) how to display output of reduce functions into JTable.
    2) I have a jar file, after execute that jar file a text box will display. how the data which we entered in text box will get output for that data.

    Please guide me, Its urgent

    Thanks & Regards,
    V.Ramanjaneylu.

  • By k A r T h I k 18 May 2012

    Perfect!
    It works. I’ve used your program in the following enviroment:
    Hadoop 1.0.2
    HBase 0.92.1

    I observed a few changes:
    –HBaseConfiguration hbaseConfig = new HBaseConfiguration();
    ++Configuration conf = HBaseConfiguration.create();

    –String columns = “details”; // comma seperated
    –scan.addColumns(columns);
    ++scan.addFamily(Bytes.toBytes(“details”));

    Need to replace — with ++ statements in your program.
    Thanks a lot.

    • By Sujee Maniyam 18 May 2012

      Hbase APi changed a bit recently. Thanks for pointing this out. I will update the code.
      regards
      Sujee Maniyam (admin)

  • By Sheriffo Ceesay 26 May 2012

    Great tutorial, do you have any post or any recommended post using joins in a similar way.

  • By kamalnath 06 Jun 2012

    Thank you… this is the best article i have read on hbase till now …thanks a ton for sharing

  • By Balakrishnan Chandrasekaran 27 Aug 2012

    Very good Tutorial. Thank you very much Sujee.
    Exactly what a new bie is looking for.

  • By Taymoor 30 Aug 2012

    Hi Sujee,
    I have installed cdh4.0.1 on my ubuntu 12.04 LTS, having hadoop version 2.0.0-cdh4.0.1 and Hbase HBase 0.92.1-cdh4.0.1.

    I am running your example but found import errors i fixed these but now on running Importer1.java is giving too many exceptions i get rid many of these but stuck here
    “Exception in thread “main” java.lang.NoClassDefFoundError: com/google/common/collect/Maps
    at org.apache.hadoop.metrics2.lib.MetricsRegistry.(MetricsRegistry.java:42)”

    Please help me the jars required or something that can run this example.

    One more thing i am doing all in pseudo distributed mode and every thing is OK, also created schemas “access_logs” and “summary_user”

    Waiting for your reply
    Thanks

  • By Sujee Maniyam 31 Aug 2012

    I am suspecting the missing class is from Google’s Guava library. Check for it in $HADOOP_HOME/lib/guava-xxx.jar or $HBASE_HOME/lib.
    add it to your classpath and try again

  • By Madhu 28 Oct 2012

    Hi Sujee,
    I have CDH4 2 cluster installation with Hbase. I am trying to run your mapred example. I am getting class not found error, which looks incorrect because I ran with verbose option. I saw this class was loaded.
    Please, help me. Thanks
    —————————————————————————–
    INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/jdk1.6.0_37/jre/lib/amd64/server:/usr/java/jdk1.6.0_37/jre/lib/amd64:/usr/java/jdk1.6.0_37/jre/../lib/amd64:/usr/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
    INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
    INFO zookeeper.ZooKeeper: Client environment:java.compiler=
    INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
    INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
    INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.18-308.16.1.el5
    INFO zookeeper.ZooKeeper: Client environment:user.name=cdh
    INFO zookeeper.ZooKeeper: Client environment:user.home=/home/cdh
    INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/cdh
    INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cloudera-cdh-1.etouch.net:2181 sessionTimeout=60000 watcher=hconnection
    INFO zookeeper.ClientCnxn: Opening socket connection to server mycompany:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
    INFO zookeeper.ClientCnxn: Socket connection established to mycompany:2181, initiating session
    INFO zookeeper.ClientCnxn: Session establishment complete on server mycompany:2181, sessionid = 0x13a9d1329770059, negotiated timeout = 60000
    INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 9772@cloudera-cdh-1
    INFO mapreduce.TableOutputFormat: Created table instance for summary_user
    WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
    WARN conf.Configuration: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
    INFO mapred.JobClient: Running job: job_201210260538_0022
    INFO mapred.JobClient: map 0% reduce 0%
    INFO mapred.JobClient: Task Id : attempt_201210260538_0022_m_000000_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException: Class mapreduce.FreqCounter1$Mapper1 not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1571)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:191)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
    Caused by: java.lang.ClassNotFoundException: Class mapreduce.FreqCounter1$Mapper1 not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1477)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1569)
    … 8 more

  • By johnpaulci 07 Jan 2013

    execellent tutorial for integrating hbase with hadoop ..
    But I encounter the following warning and error when compiling the Importer1.java

    root@Lonetzo:/usr/local/hadoop-1.0.1/hbase-mapreduce/src/hbase_mapred1# javac -cp /usr/local/hadoop-1.0.1/hbase/hbase-0.94.1.jar:/usr/local/hadoop-1.0.1/hbase/hbase-0.94.1-tests.jar:/usr/local/hadoop-1.0.1/hbase/lib/zookeeper-3.4.3.jar:/usr/local/hadoop-1.0.1/hadoop-core-1.0.1.jar Importer1.java -Xlint
    Importer1.java:16: warning: [deprecation] HBaseConfiguration() in HBaseConfiguration has been deprecated
    HBaseConfiguration hbaseConfig = new HBaseConfiguration();
    ^
    1 warning
    root@Lonetzo:/usr/local/hadoop-1.0.1/hbase-mapreduce/src/hbase_mapred1# java Importer1
    Exception in thread “main” java.lang.NoClassDefFoundError: Importer1 (wrong name: hbase_mapred1/Importer1)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:787)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:447)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:476)

    How to resolve this error. ..

  • By Atul Patil 05 Feb 2013

    Nice Tutorial………keep it man……………..

  • By ramesh 02 May 2013

    Hi, i need help regarding the hadoop and hbase database, i have installed hadoop database in ubuntu OS, i have to connect to jsp login page, what is the procedure to connect to the hadoop database,please let me know……..Is it require JDBC connectivity for hadoop database?

  • By Arti 23 May 2015

    Excellent tutorial!!

  • By Deepak Nayak 15 Nov 2015

    Hi,
    My requirement is to read file from aws s3 and write to hbase table. So, I wrote a mapper which reads from s3 and then writes a “Put” to context, and then I have a initTableReduceJob. This worked, but instead of “Put” I want to write “Increment”, but when I tried it says something on the line that it is not 64 byte count type. So, how do I initialize a column family so it would let me write “Increment”. I can do this in shell, I just create a table with column family and I can write increment.

  • By Santhosh 16 Jun 2016

    Hi,
    I could see in the above example, while executing MR job, you have not given the ‘input’ and ‘output’ directories of HDFS. In case of Hbase, what these will be pointing to?

  • By Santhosh 16 Jun 2016

    I got this… my bad !!!!

  • By Annad 14 Jul 2016

    Awesome post Sujee! I was searching every where for a Map Reduce example on hbase table. You gave a very neat example and explained every line of it, great job man! You can add this example in hbase wiki ! great job again!
