OpenTSDB - an interesting monitoring project to (possibly) take some schema ideas from.

raw HBase

That is, without Phoenix or other things on top.

HBase Shell

Java client

The fastest option; see the scan time test results below.

REST interface

In Python, use the requests library.

Rest interface reference

Filter parameters


<Scanner batch="5">
  <filter>
    {
      "type": "FamilyFilter",
      "op": "EQUAL",
      "comparator": {
        "type": "BinaryComparator",
        "value": "dGVzdHJvdw\u003d\u003d"
      }
    }
  </filter>
</Scanner>

Yes, JSON-like syntax inside XML. The REST interface also supports application/protobuf instead of XML, but the filter definition probably still uses the same format. Note that binary values are base64-encoded.
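As a stdlib-only sketch, the scanner body above can be generated like this; the filter JSON is embedded, XML-escaped, inside the filter element, and the binary comparator value is base64-encoded:

```python
import base64
from xml.sax.saxutils import escape

def scanner_xml(value, batch=5):
    # base64-encode the binary comparator value, as REST requires
    b64 = base64.b64encode(value).decode("ascii")
    filter_json = (
        '{"type": "FamilyFilter", "op": "EQUAL", '
        '"comparator": {"type": "BinaryComparator", "value": "%s"}}' % b64
    )
    # escape the JSON so it is safe inside the XML <filter> element
    return '<Scanner batch="%d"><filter>%s</filter></Scanner>' % (
        batch, escape(filter_json))

body = scanner_xml(b"testrow")  # "testrow" encodes to dGVzdHJvdw==
```

This body would then be POSTed (e.g. with requests) to the REST server's scanner endpoint for the table, with Content-Type: text/xml.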

Thrift interface

Rather slow compared to direct access from Java.

HBase from Pig


STORE data INTO 'ssbtest' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:etime, cf:value, cf:color, cf:url, cf:status');

The first value in the tuple is used as the row key. To make a compound key, either nest several calls to CONCAT or write a UDF.

HBase from MapReduce jobs


Scan time test

data: 1 month of aggregated transfer data, 1.7M rows
key format: bTtttVvvvVvvvSsssSsssSsssSsssDdddDdddDdddDddd
salt = hash mod 16; 10-minute time bins; vo, src, dst
scan parameters: caching = 1000, batch = 100
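A minimal sketch of building the salted key described above; the field widths (8 bytes for vo, 16 each for src and dst) and the byte-sum hash are illustrative assumptions, not the actual implementation:

```python
import struct

TIME_BIN = 600      # 10-minute time bins
SALT_BUCKETS = 16   # salt = hash mod 16

def make_key(ts, vo, src, dst):
    # bTtttVvvv...Ssss...Dddd...: 4-byte binned timestamp,
    # then fixed-width vo/src/dst fields padded with zero bytes
    body = struct.pack(">I", ts - ts % TIME_BIN)
    body += vo.encode("ascii").ljust(8, b"\x00")
    body += src.encode("ascii").ljust(16, b"\x00")
    body += dst.encode("ascii").ljust(16, b"\x00")
    salt = sum(body) % SALT_BUCKETS  # toy stand-in for the real hash
    return bytes([salt]) + body

key = make_key(1372637400, "atlas", "GRIF", "GRIF")
```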

java client on cluster (dashb-ai-410):

start      stop       vo       src    dst    rows     prep   scan     total
1072637400 1672810200 ""       ""     ""     1770598                  22659 ms
1372637400 1372810200 ""       ""     ""     105812   6 ms   831 ms   837 ms
1372637400 1372810200 "atlas"  ""     ""     68970    5 ms   624 ms   629 ms
1372637400 1372810200 "atlas1" ""     ""     0        5 ms   103 ms   108 ms
1372637400 1372810200 "atlas"  "GRIF" ""     11       5 ms   134 ms   139 ms
1372637400 1372810200 "atlas"  "GRIF" ""     11       5 ms   134 ms   139 ms
1372637400 1372810200 "atlas"  ""     "GRIF" 1418     5 ms   270 ms   275 ms
1072637400 1672810200 ""       ""     "GRIF" 26456    8 ms   4743 ms  4751 ms
1072637400 1672810200 ""       "GRIF" ""     1400     10 ms  1611 ms  1621 ms
1072637400 1672810200 "atlas"  "GRIF" ""     417      9 ms   1195 ms  1204 ms
1372637400 1372810200 "atlas"  "GRIF" "GRIF" 2        10 ms  142 ms   152 ms

java client on desktop - not tested (exception when creating the scan)

thrift client on cluster (dashb-ai-410)

start      stop       vo      src  dst  rows   walltime
1372637400 1372810200 "atlas" ""   ""   68970  11.0466649532



Query language reference


  • Fast
  • SQL-like
  • Some utility functions can be useful even when using HBase directly
    • Key salt buckets
    • Java type to binary conversion
    • But if we are not using Phoenix, we should probably implement our own bucket hash and use generic HBase type conversion.
  • Needs to be installed on the cluster
  • JDBC interface, so Java client only
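A rough sketch of what "our own bucket hash" could look like; the hash formula here is an arbitrary illustration (Phoenix uses its own), the point being that reads from a salted table must fan out into one scan range per bucket:

```python
SALT_BUCKETS = 16

def bucket(key_bytes):
    # deterministic toy hash over the unsalted key bytes
    h = 0
    for b in key_bytes:
        h = (h * 31 + b) & 0x7FFFFFFF
    return h % SALT_BUCKETS

def salted_scan_ranges(start, stop):
    # a range scan over a salted table must be repeated once per
    # salt bucket, prefixing start/stop rows with the bucket byte
    return [(bytes([b]) + start, bytes([b]) + stop)
            for b in range(SALT_BUCKETS)]

ranges = salted_scan_ranges(b"\x00start", b"\x00stop")
```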

Phoenix from Pig

Should work but fails with an error:

STORE data INTO 'hbase://ssbtest' USING com.salesforce.phoenix.pig.PhoenixHBaseStorage('dashb-ai-410', '-batchSize 10');

Source of the class in question.


Custom filters

Any filter can be used from the HBase shell, provided it's in the classpath:

hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}

To be usable through Thrift or REST, a filter needs to provide

createFilterFromArguments(ArrayList<byte[]> filterArguments) 

This provides a way to create a filter from string parameters, which is exactly what Thrift/REST pass. To debug this method for your custom filter from the HBase shell, register the filter (so that HBase can look up the filter class name from a short name) and try a scan like this:

hbase> org.apache.hadoop.hbase.filter.ParseFilter.registerFilter("FuzzyRowFilterFromText", "org.apache.hadoop.hbase.filter.FuzzyRowFilterFromText")
hbase> scan 'SSB3', {FILTER => "(FuzzyRowFilterFromText ( '\x00\x00\x00\x00\x00\x00ssb_metric_18\x00\x00\x00VICTORIA-LCG2\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', '\x01\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01') AND (PageFilter (10)) )" }

To use the custom filter with other interfaces, it needs to be registered too:

Adding custom filters to Thrift server configuration.

In CDH4 Cloudera manager:

Services->Hbase1->Configuration->View and edit->Default Category->HBaseThrift Server (Default)->Advanced->HBase Thrift Server Configuration Safety Valve for hbase-site.xml:


This is not implemented for REST yet.

Filters with skip hints

Skip hints are easy to imagine but harder to understand and implement. Filters that provide them are faster than simple filters, which just check one row at a time.


FuzzyRowFilter is a rowkey filter that allows filtering by subsets of the rowkey bytes. Pros and cons:

  • Fast: provides skip hints to the scanner
  • Structurally simple: just provide keyvalues and byte masks
  • Part of HBase
  • Not very friendly: construct keyvalues and byte masks yourself
  • Does not support ranges for rowkey parts, only discrete values.
    • It's OK in our use-case, we only have one part of the key that needs range filtering: timestamp. We want it to be at the front of the key anyway, so we can use scan's startRow and stopRow to filter by timestamp. Then this filter would only be used to filter by resource and by metric, which don't need range filtering.
  • Does not allow for variable-length rowkey parts.
    • Either we use fixed length of resource and metric strings
    • Or we use binary resource id and metric id to save space
  • Does not implement createFilterFromArguments
    • This is pretty easy to fix with a subclass that implements this method
  • Consequently can't be used with REST or Thrift out of the box (they need createFilterFromArguments)
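Constructing the key/mask pair (as in the shell example above) can be sketched like this; the field widths are illustrative, and mask byte 0 marks a position that must match while 1 marks a "don't care" position:

```python
def fuzzy_pair(fields):
    # fields: list of (value_or_None, width); None means "any value"
    key, mask = bytearray(), bytearray()
    for value, width in fields:
        if value is None:
            key += b"\x00" * width   # placeholder bytes, ignored
            mask += b"\x01" * width  # 1 = fuzzy (don't care)
        else:
            key += value.ljust(width, b"\x00")  # fixed, zero-padded
            mask += b"\x00" * width             # 0 = must match
    return bytes(key), bytes(mask)

# wildcard 6-byte salt+time prefix, fixed 16-byte metric, wildcard rest
key, mask = fuzzy_pair([(None, 6), (b"ssb_metric_18", 16), (None, 19)])
```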


SkipScanFilter is a rowkey filter that filters by key columns in Phoenix tables. It is used internally by Phoenix queries. Pros and cons:

  • Fast: provides skip hints to the scanner
  • Supports ranges for key column values.
  • Supports variable-length and empty key columns.
  • Easier to use: for each key column provide a set of value ranges
    • But key columns are defined in Phoenix
  • Part of Phoenix
  • Needs Phoenix metadata (key column parameters) to construct the filter
  • Does not implement createFilterFromArguments
    • Which is harder to implement ourselves because of depending on Phoenix metadata

Coprocessors

Observers and endpoints are roughly analogous to triggers and stored procedures.

Endpoints can't be called through Thrift out of the box.

Observers are used by Phoenix to do aggregation, but are not really designed for that. Phoenix coprocessors are too tightly coupled with Phoenix metadata and utilities to be usable from other clients.



UDF (user-defined functions)

UDFs are used to do any data processing within Pig more complex than restructuring bags and straightforward aggregation.

Python UDFs are actually executed by Jython, so take care. They are not convenient for constructing rowkeys because of Unicode issues.


Cloudera CDH4


Used by the manager stack

Used by actual servers

Thrift was moved from its default port 9090 to 7182 (the manager port) because we already have a lot of ports open. Thrift is running on host 414 and the manager on host 410, so there is no conflict for now.

Topic revision: r8 - 2013-08-14 - IvanKadochnikov