I "discovered" some issues when implementing the happybase functionality on top of the Bigtable API. (I put discovered in quotes, because some of the issues may just be that I don't grok how to do the same thing with the Bigtable API).
These were mostly discovered because I wrote a system test for happybase that could work both with HBase and with the Bigtable backend. It can be switched from one to another by changing the USING_HBASE boolean.
Many other differences have been enumerated in the documentation for our custom Bigtable happybase package.
Issues / Differences
- When committing a batch of mutations, the `happybase` method `Batch.send()` uses Thrift/HBase's `mutateRows` / `mutateRowsTs` method to send all mutations at once. With the Bigtable API this is not possible; we have to commit row-by-row. (This comes up in the system test as well.)
- Bigtable garbage collection is not as immediate as HBase's. In HBase, a column with one `max_version` immediately evicts the old value when a new one is added. Similarly, with a TTL of 3 seconds, after sleeping for 3.5 seconds the value has been evicted. Neither of these occurs (at least consistently) in Bigtable. (I don't really see this as a problem, but users coming from HBase may have different expectations.)
- A row scan with `sorted_columns` is not possible in Bigtable.
- Using an HBase filter string is not possible in Bigtable. (Also, some of the filter string concepts don't map to Bigtable filters, e.g. `KeyOnlyFilter`.)
- The Bigtable `Mutation.DeleteFromRow` mutation does not support timestamps. Even attempting to send one conditionally (via `CheckAndMutateRowRequest`) deletes the entire row.
- Bigtable can't use a timestamp with column family deletes, since `Mutation.DeleteFromFamily` does not include a timestamp range.
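The row-by-row commit constraint in the first item above can be sketched as follows. This is a minimal model, not the actual `happybase` implementation; `commit_row` is a stand-in (an assumption) for whatever per-row commit call the Bigtable client exposes:

```python
from collections import OrderedDict

def send_batch(mutations, commit_row):
    """Group (row_key, column, value) mutations by row and commit row-by-row.

    HBase's Thrift mutateRows sends everything in one RPC; with the
    Bigtable API each row must be committed separately, so a failure
    partway through can leave earlier rows committed and later ones not.
    Returns the number of per-row commits issued.
    """
    by_row = OrderedDict()  # preserve first-seen row order
    for row_key, column, value in mutations:
        by_row.setdefault(row_key, []).append((column, value))
    for row_key, row_mutations in by_row.items():
        commit_row(row_key, row_mutations)  # one RPC per row
    return len(by_row)
```

Note that the per-row commits mean the batch loses the (partial) atomicity of a single `mutateRows` call.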
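Because of the lazy garbage collection noted above, a system test shared between backends can't assert immediately after a TTL elapses the way it can against HBase. One workaround is a small polling helper (a hypothetical helper sketched here, not part of `happybase`):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Useful when Bigtable GC has not yet evicted expired cells: instead of
    sleeping a fixed 3.5 seconds for a 3-second TTL, keep re-reading until
    the expected state appears (or give up after the timeout).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one final check at the deadline
```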
Differences that are Upgrades
- Writes to HBase (via Thrift) with a timestamp just drop the timestamp, whereas the Bigtable API respects them.
- The Thrift API fails to retrieve the TTL information from a column family, while the Bigtable API succeeds in returning this information. (We have to work around this in a few system tests.)
- When the Thrift API does a row read with columns `cf1` and `cf1:qual1` (in that order), only the results from `cf1:qual1` are returned, even though they are a subset of all the columns in the column family `cf1`. If the columns are given in the opposite order (`cf1:qual1`, then `cf1`), the correct results are returned. In Cloud Bigtable, it works as expected in either order. (We use a union filter: one branch has only `family_name_regex_filter='cf1'` and the other has that combined with `column_qualifier_regex_filter='qual1'`.) (This happens for both single-row and multi-row reads.)
- HBase `counter_get` doesn't actually populate the data, even though the docstring says: "This method retrieves the current value of a counter column. If the counter column does not exist, this function initialises it to 0."
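The union filter described in the column-ordering item above can be modeled in plain Python. This is a toy model of the filter semantics, not the Bigtable client API: the union of "everything in `cf1`" with "`cf1` chained with qualifier `qual1`" matches every cell in `cf1`, which is why Cloud Bigtable returns the full family regardless of the order the columns were requested in.

```python
import re

# Each "filter" is a predicate over a (family, qualifier) cell coordinate.

def family_filter(pattern):
    """Model of family_name_regex_filter: match on the column family."""
    rx = re.compile(pattern)
    return lambda family, qualifier: rx.search(family) is not None

def qualifier_filter(pattern):
    """Model of column_qualifier_regex_filter: match on the qualifier."""
    rx = re.compile(pattern)
    return lambda family, qualifier: rx.search(qualifier) is not None

def chain(*filters):
    """A cell passes a chain filter only if it passes every sub-filter."""
    return lambda family, qualifier: all(f(family, qualifier) for f in filters)

def union(*filters):
    """A cell passes a union filter if it passes any sub-filter."""
    return lambda family, qualifier: any(f(family, qualifier) for f in filters)

# Requesting columns ['cf1', 'cf1:qual1'] builds a union of the two branches;
# since the qual1 branch is a subset of the cf1 branch, order doesn't matter.
requested = union(
    family_filter('cf1'),
    chain(family_filter('cf1'), qualifier_filter('qual1')),
)
```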
Neither Good/Bad
- HBase reads (via `Table.row`, `Table.rows`, `Table.cells`, `Table.scan`) all use exclusive end timestamps, which matches the behavior of a Bigtable `TimestampRange`. On the other hand, HBase deletes use inclusive end timestamps, while Bigtable deletes still use a `TimestampRange` (though only for deleting specific columns, since column family and row deletes can't send a timestamp range, as noted above). We address this by simply incrementing the passed-in timestamp by 1 millisecond (the lowest allowed granularity).
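The 1-millisecond adjustment above can be sketched as follows (the helper names here are hypothetical, chosen for illustration):

```python
def read_range_end_ms(timestamp_ms):
    """happybase read timestamps are already exclusive, matching a
    Bigtable TimestampRange end, so no adjustment is needed."""
    return timestamp_ms

def delete_range_end_ms(timestamp_ms):
    """happybase delete timestamps are inclusive, while a Bigtable
    TimestampRange end is exclusive; add 1 ms (the lowest allowed
    granularity) so the range still covers the final millisecond."""
    return timestamp_ms + 1
```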
I "discovered" some issues when implementing the
happybasefunctionality on top of the Bigtable API. (I put discovered in quotes, because some of the issues may just be that I don't grok how to do the same thing with the Bigtable API).These were mostly discovered because I wrote a system test for
happybasethat could work both with HBase and with the Bigtable backend. It can be switched from one to another by changing theUSING_HBASEboolean.Many other differences have been enumerated in the documentation for our custom Bigtable
happybasepackage.Issues / Differences
happybasemethodBatch.send()uses Thrift/HBase'smutateRows/mutateRowsTsmethod to send all mutations at once. With the Bigtable API, this is not possible, we have to commit row-by-row. (This comes up in the system test as well.)max_versionimmediately evicts the old value when a new one is added. Similarly, with a TTL of 3 seconds, after sleeping for 3.5 seconds, the value has been evicted. Neither of these occur (at least consistently in Bigtable). (I don't really see this as a problem, but users from HBase may have different expectations)sorted_columnsis not possible in Bigtable.KeyOnlyFilter)Mutation.DeleteFromRowmutation does not support timestamps (also). Even attempting to send one conditionally (viaCheckAndMutateRowRequest) deletes the entire row.Mutation.DeleteFromFamilydoes not include a timestamp range.Differences that are Upgrades
Writes to HBase (via Thrift) with a timestamp just drop the timestamp whereas the Bigtable API respects them
The Thrift API fails to retrieve the TTL information from a column family while the Bigtable API succeeds in returning this information. (We have to work-around this in a few system tests.)
When Thrift API does a row read with columns
cf1andcf1:qual1(in that order) only the results fromcf1:qual1are returned (even though they are a subset of all the columns in the column familycf1). If the columns are given in the opposite order (cf1:qual1thencf1) the correct results are returned. In Cloud Bigtable, it works as expected in either order. (We use a union filter, one which has onlyfamily_name_regex_filter='cf1'and another which has that combined withcolumn_qualifier_regex_filter='qual1'.) (This happen for a single row read and multiple rows.)HBase
counter_getdoesn't actually populate the data even though the docstring says:Neither Good/Bad
Table.row,Table.rows,Table.cells,Table.scan) all use exclusive end timestamps, which makes the behavior of a BigtableTimestampRange. On the other hand, HBase deletes use inclusive end timestamps, while Bigtable deletes are still using aTimestampRange(only for deleting specific columns those, as column family or row deletes can't send a timestamp range, as referenced above). We address this just by incrementing the passed in timestamp by 1 millisecond (which is the lowest allowed granularity).