- https://wiki.mozilla.org/Socorro:HBase
- http://blog.cloudera.com/blog/2011/02/log-event-processing-with-hbase/
- Column families
- Example: a common column family Socorro uses is "ids:", and a common column in that family is "ids:ooid" (family "ids", qualifier "ooid"). Another column is "ids:hang".
- The table schema enumerates the column families that are part of it. Each column family carries metadata about compression, the number of value versions retained, and caching.
- A column family can store tens of thousands of values with different column qualifier names.
- Retrieving data from multiple column families requires at least one block access (disk or memory) per column family. Accessing multiple columns in the same family requires only one block access.
- If you specify just the column family name when retrieving data, the values for all columns in that column family will be returned.
- If a record does not contain a value for a particular column in the set of columns you query for, no "null" is returned; the returned row simply has no entry for that column.
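
The retrieval rules above can be sketched with a small in-memory model. This is not the HBase client API; `get_columns`, the sample row, and its values are hypothetical and exist only to illustrate family-wide retrieval and the absence of "null" entries.

```python
# Toy model of one row: a flat map from "family:qualifier" to value.
# Hypothetical data; this is not the HBase client API.
row = {
    "ids:ooid": "abc123",
    "ids:hang": "0",
    "other:example": "x",
}


def get_columns(row, requested):
    """Return the stored cells that match the requested columns.

    A request that names only a family (for example "ids:") returns every
    column in that family. A requested column with no stored value is
    simply absent from the result; no None placeholder is added.
    """
    result = {}
    for name in requested:
        if name.endswith(":"):
            # Family-only request: collect all columns in that family.
            result.update({k: v for k, v in row.items() if k.startswith(name)})
        elif name in row:
            result[name] = row[name]
    return result


whole_family = get_columns(row, ["ids:"])
sparse = get_columns(row, ["ids:ooid", "ids:missing"])
```

Here `whole_family` holds both "ids:" columns, while `sparse` has an entry for "ids:ooid" only; the key "ids:missing" does not appear at all.
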
- Manipulating a row
- All manipulations are performed using a rowkey.
- Setting a column to a value creates the row if it doesn't exist, or updates the column if it already exists.
- Deleting a non-existent row or column is a no-op.
- Counter column increments are atomic and very fast. StumbleUpon has some counters that they increment hundreds of times per second.
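
The row-manipulation semantics above can be modeled in a few lines. This is a sketch under stated assumptions: `ToyTable` and its methods are hypothetical names, not HBase calls, and the lock only stands in for the atomicity that counter increments provide.

```python
import threading


class ToyTable:
    """In-memory stand-in for a table; rows are dicts keyed by rowkey."""

    def __init__(self):
        self.rows = {}
        self._lock = threading.Lock()

    def put(self, rowkey, column, value):
        # Creates the row if needed, otherwise overwrites the column.
        self.rows.setdefault(rowkey, {})[column] = value

    def delete(self, rowkey, column=None):
        # Deleting something that does not exist is a no-op, not an error.
        if rowkey not in self.rows:
            return
        if column is None:
            del self.rows[rowkey]
        else:
            self.rows[rowkey].pop(column, None)

    def increment(self, rowkey, column, amount=1):
        # The lock models the atomic read-modify-write of a counter column.
        with self._lock:
            cells = self.rows.setdefault(rowkey, {})
            cells[column] = cells.get(column, 0) + amount
            return cells[column]


table = ToyTable()
table.put("r1", "ids:ooid", "abc")   # creates row "r1"
table.put("r1", "ids:ooid", "def")   # updates the existing column
table.delete("no-such-row")          # no-op
table.delete("r1", "ids:no-such")    # no-op
table.increment("c1", "counters:hits")
hits = table.increment("c1", "counters:hits")
```

Every operation takes a rowkey as its first argument, mirroring the point that all manipulations go through the rowkey.
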
- Tables are always ordered by their rowkeys
- Scanning a range of a table based on a rowkey prefix or a start and end range is fast.
- Retrieving a row by its key is fast.
- Searching for a row requires a rowkey structure that you can easily do a range scan on, or a reverse index table.
- A full scan of a table that contains billions of items is slow (although, unlike in an RDBMS, it isn't likely to cause performance problems).
- If you continually insert rows that have similar rowkey prefixes, all of those writes land on a single RegionServer. In excess, this hotspot is unpleasant.
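
Why prefix and range scans are cheap on a table sorted by rowkey can be illustrated with a sorted list and binary search. The rowkeys and the helper names `scan_range` and `scan_prefix` below are hypothetical, and the sketch assumes plain ASCII keys; it models the idea only and does not use the HBase client.

```python
import bisect

# Rowkeys kept in sorted order, as a table keeps them. Hypothetical keys.
rowkeys = sorted([
    "110301aaa", "110301bbb", "110302ccc", "110302ddd", "110303eee",
])


def scan_range(keys, start, stop):
    """Return keys with start <= key < stop using two binary searches."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, stop)
    return keys[lo:hi]


def scan_prefix(keys, prefix):
    """Prefix scan expressed as a range scan.

    Assumes plain ASCII keys, so appending a very high character gives an
    upper bound that sorts after every key sharing the prefix.
    """
    return scan_range(keys, prefix, prefix + "\uffff")


one_day = scan_prefix(rowkeys, "110302")
two_days = scan_range(rowkeys, "110301", "110303")
```

Both helpers locate the boundaries of the result without touching rows outside the range, which is what a full scan cannot avoid. The same ordering explains the last bullet: keys that share a prefix sit next to each other, so a stream of inserts with similar prefixes keeps hitting the same part of the table.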