12 Sep / 2011

Performance testing an HBase cluster

So you have set up a new HBase cluster and want to ‘take it for a spin’.  Here is how, without writing a lot of code yourself.

A) hbase PerformanceEvaluation

class : org.apache.hadoop.hbase.PerformanceEvaluation
jar : hbase-*-tests.jar

This is a handy class that comes with the distribution.  It can do reads and writes against HBase, spawning a MapReduce job to run the operations in parallel.  There is also an option to run the operations in threads instead of MapReduce.

Let's find out the usage:

# hbase org.apache.hadoop.hbase.PerformanceEvaluation

Usage: java org.apache.hadoop.hbase.PerformanceEvaluation \
  [--miniCluster] [--nomapred] [--rows=ROWS] <command> <nclients>
….
[snipped]
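
Two of those flags are worth noting before we run anything: --nomapred runs the clients as threads inside a single JVM instead of launching a MapReduce job, and --rows overrides the 1 million rows each client writes by default.  A smaller, thread-based smoke test might look like this (flag names taken from the usage above; the row count is just an example):

# hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --rows=100000 randomWrite 5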

So let's run a randomWrite test:

# time hbase org.apache.hadoop.hbase.PerformanceEvaluation  randomWrite 5

  • We are running 5 clients.  By default, this runs in MapReduce mode.
  • Each client inserts 1 million rows (the default), about 1 GB of data (1,000 bytes per row), so the total data size is 5 GB (5 x 1 GB).
  • Typically there are 10 maps per client, so we will see 50 (5 x 10) map tasks.

You can watch the progress on the console and also in the JobTracker web UI (http://job_tracker:50030).

Once this test is complete, it will print out summaries:

… <output clipped>
….
Hbase Performance Evaluation
Row count=5242850
Elapsed Time in milliseconds = 1789049
…..

real    3m21.829s
user    0m2.944s
sys     0m0.232s

I actually like to look at the elapsed REAL time (which I measure using the Unix ‘time’ command), and then do this calculation:

rows written = 5242850 (~5 million)
total time = 3 min 21 sec = 201 seconds

write throughput
= 5242850 rows / 201 seconds = 26083.8 rows / sec
= 5 GB of data / 201 seconds = 5 * 1000 MB / 201 sec = 24.87 MB / sec
insert time = 201 seconds / 5242850 rows = 0.038 ms / row
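
If you run this often, the arithmetic is easy to script; here is a minimal sketch using bc (plug in your own row count and elapsed seconds):

ROWS=5242850      # "Row count" from the summary output
SECS=201          # elapsed real time from the unix 'time' command
echo "scale=1; $ROWS / $SECS" | bc          # rows per second
echo "scale=2; 5 * 1000 / $SECS" | bc       # MB per second (5 GB written)
echo "scale=3; $SECS * 1000 / $ROWS" | bc   # milliseconds per row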

This should give you a good idea of the cluster throughput.

Now, let's do a READ benchmark:

# time hbase org.apache.hadoop.hbase.PerformanceEvaluation  randomRead 5

and you can calculate the read throughput the same way.

B) YCSB

YCSB is a performance testing tool released by Yahoo.  It has an HBase mode, which we will use.

First, read the excellent tutorial by Lars George on using YCSB with HBase, and follow his instructions for setting up HBase and YCSB (I won't repeat them here).
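
One step from that setup is worth calling out because the load command below depends on it: the YCSB HBase client writes into an existing table (the default table name is 'usertable', changeable with -p table=...), so create it first with the column family you plan to pass via -p columnfamily.  Something like this in the hbase shell should do:

# hbase shell
> create 'usertable', 'family'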

YCSB ships with a few ‘workloads’.  I am going to run ‘workloada’, which is a 50/50 mix of reads and writes.

Step 1) loading the workload data:
java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -load -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p recordcount=10000000  -threads 10 -s > load.dat

  • -load : we are loading the data
  • -P workloads/workloada : we are using workloada
  • -p recordcount=10000000 : 10 million rows
  • -threads 10 : use 10 threads to parallelize the inserts
  • -s : print progress to stderr (the console) every 10 seconds
  • > load.dat : save the client's output into this file

Examine the file ‘load.dat’.  Here are the first few lines:

YCSB Client 0.1
Command line: -load -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p recordcount=10000000 -threads 10 -s
[OVERALL], RunTime(ms), 786364.0
[OVERALL], Throughput(ops/sec), 12716.757125199018
[INSERT], Operations, 10000000
[INSERT], AverageLatency(ms), 0.5551727
[INSERT], MinLatency(ms), 0
[INSERT], MaxLatency(ms), 34580
[INSERT], 95thPercentileLatency(ms), 0
[INSERT], 99thPercentileLatency(ms), 1
[INSERT], Return=0, 10000000
[INSERT], 0, 9897989
[INSERT], 1, 99298

The important numbers are in the [OVERALL] lines.  One interesting stat is how many ops were performed each second (the throughput, about 12,716 ops/sec here).  You can also see the runtime in ms (~786 seconds).
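
Before running the transaction phase, it is worth a quick sanity check that the rows actually landed in HBase.  From the hbase shell (assuming the default 'usertable'), a LIMITed scan comes back quickly, while a full count walks all 10 million rows:

# hbase shell
> scan 'usertable', {LIMIT => 10}
> count 'usertable'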

Step 2) running the workload
The previous step loaded the data.  Now let's run the workload.

java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=10000000 -s -threads 10 > a.dat

The differences from the load command are:

  • -t : transaction mode (mixed reads and writes)
  • -p operationcount : specifies how many operations to run
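
As an aside, the other bundled workloads can be run the same way by pointing -P at a different file; for example workloadb (a read-heavy mix, roughly 95% reads / 5% updates) would look something like this, writing to its own output file:

java -cp build/ycsb.jar:db/hbase/lib/* com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloadb -p columnfamily=family -p operationcount=10000000 -s -threads 10 > b.dat

For now, though, let's stay with workloada.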

Now let's examine a.dat:

YCSB Client 0.1
Command line: -t -db com.yahoo.ycsb.db.HBaseClient -P workloads/workloada -p columnfamily=family -p operationcount=10000000 -threads 10 -s
[OVERALL], RunTime(ms), 2060800.0
[OVERALL], Throughput(ops/sec), 4852.484472049689
[UPDATE], Operations, 5002015
[UPDATE], AverageLatency(ms), 0.6575520065413638
[UPDATE], MinLatency(ms), 0
[UPDATE], MaxLatency(ms), 28364
[UPDATE], 95thPercentileLatency(ms), 0
[UPDATE], 99thPercentileLatency(ms), 0
[UPDATE], Return=0, 5002015
[UPDATE], 0, 4986514
[UPDATE], 1, 15075
[UPDATE], 2, 0
[UPDATE], 3, 2
….
….[snip]
….
[READ], Operations, 4997985
[READ], AverageLatency(ms), 3.3133978993534394
[READ], MinLatency(ms), 0
[READ], MaxLatency(ms), 2868
[READ], 95thPercentileLatency(ms), 13
[READ], 99thPercentileLatency(ms), 24
[READ], Return=0, 4997985
[READ], 0, 333453
[READ], 1, 1866771
[READ], 2, 1197919

Here is how to read it:

  • Overall details are printed at the top
  • then the UPDATE stats are shown
  • followed by many lines of latency histogram buckets (millisecond bucket, count) for UPDATE
  • scroll down further (or search for READ) to find the READ stats
    • we can see the average latency is about 3.31 ms
    • The percentiles are interesting too.  We can satisfy 95% of requests within 13 ms.  Pretty good; almost as fast as an RDBMS.
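
These output files get long because YCSB prints one histogram line per latency bucket, so a quick grep is handy for pulling out just the headline numbers from a.dat (or load.dat):

# grep OVERALL a.dat
# grep Latency a.dat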

So in this tutorial I have demonstrated some quick ways of running performance evaluations on an HBase cluster.

Sujee Maniyam
Sujee is a founder and principal at Elephant Scale, where he provides consulting and training on Big Data technologies.

6 Comments:


  • By vamshi 18 Sep 2011

    Hi, this post is very good. But I have one basic doubt: where does the data inserted by the clients end up in HBase? How can we see that inserted data in HBase? One more thing: during the randomRead/randomWrite runs, the NameNode, TaskTracker and JobTracker web UIs are not showing anything, but I can see the results and progress on the console.
    Can you help me check the above-mentioned things?
    Thank you

    • By Sujee Maniyam 20 Sep 2011

      1) Using the hbase shell you can see the data in a table.
      On a terminal, bring up the hbase shell:
      # hbase shell
      > scan 'table'
      this will print out all entries in the table.

      If you just want to see a few, then use LIMIT:
      > scan 'table', {LIMIT => 10}
      see the first 10 rows

      2) To monitor activity, use the HBase UI (http://hmaster:60010)
      you will see requests for each region server

  • By minsbrz 09 May 2012

    Hi. This is very great performance to me.
    Can I see the configuration of Hadoop & HBase?
    We have been testing these tools, but our test results are not the same,
    so we want to see the detailed options.
    Please give me advice. Thanks for reading.

    • By minsbrz 14 May 2012

      I have been testing HBase for performance like you,
      but the result was not the same as yours. How can I do that?
      My test result is about 4000 TPS on 4 clusters,
      and memory-only writes are 10000 TPS.
      Could you tell me the details of your xml properties and the write status (to memory or to disk: flush, compaction and so on)?
      Thanks in advance for your advice.

  • By bonsonnoise 17 Jul 2015

    Hi,
    What is the configuration of your cluster? How many servers? Where do the RegionServers, the NameNode and the JobTracker run (on the same server)?
    I would like to compare my results with yours under the same conditions.
    Thank you.

  • By vmwalla 16 Jun 2016

    Sujee,
    Thanks for sharing. Is it possible to run this test and point it to a specific namespace? I am on a shared cluster and in HBase limited to a namespace that our admin has created.
