Using HBase TableIndexed from Thrift with unique keys
Posted 2009-09-30.

HBase is primarily a sorted distributed hash map, but it does support secondary keys through a contrib package called Transactional HBase. The secondary keys are provided by a component called TableIndexed. A good general walkthrough is "Secondary indexes in HBase" on Rajeev Sharma’s blog. This post specifically addresses how to use secondary indexes outside of the Java API and how to handle unique keys.
Our scenario is this: my articles table will use a row key that starts with the timestamp. This is a very common scenario in HBase because it is usually the most natural way to access information. At least some articles also have a secondary key that I want to be able to do lookups by. Let’s have a concrete example for further discussion.
    row key    key:id    content:title
    1234       abc       ...
These keys are always unique; this is guaranteed at the application level. I would like to be able to fetch an article using the secondary key. All pretty straightforward, except I want to do this outside of Java using just Thrift, and I would prefer to do it using only get() because that makes it easier to write and easier for anyone coming along to read later. Scanners may be powerful, but it’s not intuitive that a get using a unique secondary index would use them.
The first thing we need is an index specification that can create the unique key that we want. By default, HBase provides a SimpleIndexKeyGenerator that creates row keys that start with the value of the secondary key; in our example, the index row key would be abc1234. This supports having multiple rows that match the secondary key (for example, abc5678), but it means that you have to use a scanner to get at the information.
I’ve written a UniqueIndexKeyGenerator that would use the exact value from the primary table’s secondary key as the row key for the index table. That is, for our example, the row key in “articles-index” would be “abc”. I’ve also added a forUniqueIndex static method to IndexSpecification to make it easier to call this from the jruby shell.
    /**
     * Construct an index spec for a single column that has only unique values.
     * @param indexId the name of the index
     * @param indexedColumn the column to index
     * @return the IndexSpecification
     */
    public static IndexSpecification forUniqueIndex(String indexId, byte[] indexedColumn) {
      return new IndexSpecification(indexId, new byte[][] { indexedColumn }, null,
          new UniqueIndexKeyGenerator(indexedColumn));
    }
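To make the difference between the two key layouts concrete, here is a minimal sketch in plain Ruby. It is not HBase code: a Hash stands in for the index table, and the helper names (simple_index_key, unique_index_key, prefix_lookup, exact_lookup) are made up for illustration. It only mirrors the layouts described above, abc1234 versus abc.

```ruby
# Illustrative only: plain Ruby stand-ins for the two key layouts, not HBase code.

# Layout described for SimpleIndexKeyGenerator: indexed value followed by the base row key.
def simple_index_key(indexed_value, base_row_key)
  indexed_value + base_row_key
end

# Layout described for UniqueIndexKeyGenerator: the indexed value alone.
def unique_index_key(indexed_value, _base_row_key)
  indexed_value
end

# A Hash stands in for the index table: index row key => base table row key.
SIMPLE_INDEX = {
  simple_index_key('abc', '1234') => '1234',
  simple_index_key('abc', '5678') => '5678'
}
UNIQUE_INDEX = { unique_index_key('abc', '1234') => '1234' }

# Non-unique layout: the lookup has to walk every key that starts with the value,
# which is the job a scanner does against the real index table.
def prefix_lookup(index, value)
  index.keys.sort.select { |key| key.start_with?(value) }.map { |key| index[key] }
end

# Unique layout: a single exact-match get is enough.
def exact_lookup(index, value)
  index[value]
end
```

With the unique layout the lookup is one exact-match read, which is why a plain get() over Thrift is sufficient later on.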
I then add a create_index method directly in the shell. The shell is flexible enough that you have access to things like @configuration, which is used by all the other methods. Also, I was unsure whether I could do an import inside a method, but it worked just fine. The idea is that this could eventually be added to the shell, and calling it would fail with a class-not-found error if you didn’t have the transactional jar in your classpath.

    def create_index(table_name, index_name, column)
      import org.apache.hadoop.hbase.client.tableindexed.IndexedTableAdmin
      import org.apache.hadoop.hbase.client.tableindexed.IndexSpecification
      @iadmin ||= IndexedTableAdmin.new(@configuration)
      spec = IndexSpecification.for_unique_index(index_name, column.to_java_bytes)
      @iadmin.addIndex(table_name.to_java_bytes, spec)
    end
Using this, I can create my tables in the shell easily.
    create 'a', 'key', 'content'
    create_index 'a', 'index', 'key:id'
This creates a table “a” with two column families, “key” and “content”. It creates an index called “index” on the “id” column in the “key” column family. Internally,
Accessing this over Thrift is now very simple. You can look at query.rb, but these are the important sections. First, we need to make sure we have Thrift hooked up.

    require 'rubygems'
    require 'hbase'

    transport = Thrift::BufferedTransport.new(Thrift::Socket.new('127.0.0.1', 9090))
    protocol = Thrift::BinaryProtocol.new(transport)
    client = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)
    transport.open

Thrift is required by hbase, so we don’t need to require it separately. I’ve opened a connection to my own machine on port 9090, the default for the Thrift server. I don’t know the details of the above; it’s just voodoo I picked up on some other site. My main loop lets me query a bunch of secondary keys interactively.

    STDIN.each do |id|
      id.strip!
      row_keys = client.get index, id, '__INDEX__:ROW'
      row_key_cell = row_keys[0]
      if row_key_cell
        row_key = row_key_cell.value
        puts "Found row key #{row_key}"
        value = client.get table, row_key, column
        puts "Found item #{value[0].value}"
      else
        puts "unable to find '#{id}' in index"
      end
    end
This is the important part. First, we get an id; this is the value we are trying to match to key:id in the table. The first thing we do is get the contents of “__INDEX__:ROW” from the index table where the row key matches the id. In my case, there’s always only one, but the API doesn’t know that, so it returns a list of cells. The cells have a column specifier, a timestamp, and what we care about, a value. This value is the row key of the primary table. If we found anything in the secondary index, we then do a get on the primary table.
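The two-step lookup can also be pulled out of the loop into a small helper. The sketch below is one way to do that, under a few assumptions: FakeClient, Cell, fetch_by_secondary_key, and the sample title are all hypothetical and exist only so the snippet runs without an HBase cluster. The only thing it relies on is the get(table, row, column) shape used in the loop, returning a list of objects that respond to value. The table names follow the “articles” / “articles-index” example from earlier.

```ruby
# A Struct standing in for the Thrift cell; only `value` is used here.
Cell = Struct.new(:value)

# Hypothetical in-memory client exposing the same get(table, row, column) shape
# as the calls in the loop. It is not part of the HBase Thrift bindings.
class FakeClient
  def initialize(tables)
    @tables = tables
  end

  # Returns a list of cells, empty when the row or column is missing.
  def get(table, row, column)
    value = @tables.fetch(table, {}).fetch(row, {})[column]
    value ? [Cell.new(value)] : []
  end
end

# Two gets: the index table first, then the primary table.
# Returns nil when the id is not in the index.
def fetch_by_secondary_key(client, index, table, column, id)
  index_cell = client.get(index, id, '__INDEX__:ROW').first
  return nil unless index_cell

  cell = client.get(table, index_cell.value, column).first
  cell && cell.value
end

# Sample data matching the example row: row key 1234, key:id abc.
CLIENT = FakeClient.new({
  'articles-index' => { 'abc' => { '__INDEX__:ROW' => '1234' } },
  'articles' => { '1234' => { 'key:id' => 'abc', 'content:title' => 'An example title' } }
})
```

Against a real server, the same helper should work with the Thrift client from the connection snippet passed in place of CLIENT, provided the index table name matches what TableIndexed actually created.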
This post describes work in progress. Once I’ve actually got everything up and running in production under load, I’ll create some patches to submit to HBase. I think adding UniqueIndexKeyGenerator greatly simplifies one common use of TableIndexed, and adding support for TableIndexed to the shell makes it easy to manage schema.
The code is now available in HBASE-1885.