I was looking into more detail at how HBase compactions work, and given my experience collecting metrics for Lily, and also inspired by this blog post on Lucene, I thought it would be nice to do some tests and make some graphs of the HBase flush and compact process.
When something is written to HBase, it is first written to an in-memory store (memstore), once this memstore reaches a certain size, it is flushed to disk into a store file (everything is also written immediately to a log file for durability). The store files created on disk are immutable. Sometimes the store files are merged together, this is done by a process called compaction.
This buffer-flush-merge strategy is a common pattern, for example also used by Lucene, and is described in a paper called the Log-Structured Merge-Tree (never read it though). In contrast to Lucene though, were one usually has only one IndexWriter, HBase typically has hundreds of regions open.
There are two kinds of compactions:
- the minor compactions: these are triggered each time a memstore is flushed, and will merge some of the store files, determined by an algorithm described below.
- the major compactions: these run about every 24 hours (after the currently oldest store file was written), and merge together all store files into one. The 24 hours is adjusted with a random margin of up to 20% to avoid many major compactions happening at the same time. Major compactions can also be triggered manually, via the API or the shell.
There is another difference between minor and major compactions: major compactions process delete markers, max versions, etc, while minor compactions don’t. This is because delete markers might also affect data in the non-merged files, so it is only possible to do this when merging all files.
If, after a compaction, a newly written store file is greater than a certain size (default 256 MB, see property hbase.hregion.max.filesize), the region will be split into two new regions.
The basic compaction process
The illustration below shows a simple scenario: in a loop we perform a put of a row with a single 1k value. The table has one column family and only one region. The flush size size of the memstore has been set to 100MB (hbase.hregion.memstore.flush.size), the maximum file size (hbase.hregion.max.filesize) has been set high enough to avoid splits from happening. All tests in this blog have been done on a single node (my laptop).
The main goal here is to illustrate how the minor compaction process decides when and what to merge.
We see that the memstore grows gradually until it reaches its flush size of 100MB, when this happens it is flushed to disk in a new store file. This process repeats itself until we have 3 store files. At this point, a compaction is performed which merges the files together into one file.
The second compaction is only performed after 4 store files are written. How does HBase decide this?
The algorithm is basically as follows:
- Run over the set of all store files, from oldest to youngest
- If there are more than 3 (hbase.hstore.compactionThreshold) store files left and the current store file is 20% larger then the sum of all younger store files, and it is larger than the memstore flush size, then we go on to the next, younger, store file and repeat step 2.
- Once one of the conditions in step two is not valid anymore, the store files from the current one to the youngest one are the ones that will be merged together. If there are less than the compactionThreshold, no merge will be performed. There is also a limit which prevents more than 10 (hbase.hstore.compaction.max) store files to be merged in one compaction.
If you need more detail, have a look at the source of the Store.compact() method.
Now, with this knowledge, we can see why the second compaction does not run after flush 5: the oldest file was about ~240MB, the two younger ones together ~160MB, 240 > 160 * 1.2
When the second compaction runs after flush 6, it does merge all the files together, instead of just the 3 youngest, since ~240 > 3*80*1.2 is not true. After flush 9, we have ~480 > 3*80*1.2, so the oldest store file is not included in the merge.
With the default configuration of HBase, if a store file would be 480MB, the region will be split. I have on purpose augmented the limit to illustrate what happens if there are more store files, though I could also have lowered the memstore flush size.
If you wonder how the store file merging behaves over longer time, below is the result of a longer run with 157 flushes of 10 MB. Click on the image below to see it bigger (10000px by 800px).
The largest number of store files at any time is 7 (see number 143 in the chart), though the oldest (biggest) one is only merged occasionally. It is a good thing that the number of store files stays small, because when you read a row from HBase it needs to check all the store files and the memstore (this can be optimized through bloomfilters or by specifyingtimestamp ranges).
The number 7 might make you think of the property hbase.hstore.blockingStoreFiles, which has a default value of 7, but I augmented it for this test to be sure it did not have any influence.
Compactions and deletes
When you perform a delete in HBase, nothing gets deleted immediately, rather a delete marker (a.k.a. thombstone) is written. This is because HBase does not modify files once they are written.
The deletes are processed during the major compaction process, at which point the data they hide and the delete marker itself will not be present in the merged file.
Lets illustrate this with an extreme scenario. The following graphs are from a scenario where, in a loop, we put a row and then immediately delete the row. So conceptually, if we look at the table during the test, it will contain at most one row. Nevertheless, the store files will only grow.
What we see is:
- the memstore grows and flushes three times, three store files are created
- after the third store file is written, a compaction is performed
- the result of the compaction is a new store file of size 0.
This is a surprising result: only major compactions process deletes, and major compactions supposedly only run every 24 hours.
What happens is that if the set of store files selected for compaction is the set of all store files, then HBase decides to do a major compaction. In fact, it should, otherwise the 24H check, which looks at the time of the oldest store file, would mistakenly assume that this one file was the result of a major compaction.
Besides explicit deletes, there are other cases where store files will shrink after a major compaction: cells that are removed because there are more than max-versions versions of it, or when using the TTL feature.
Multiple column families
Each column family has its own memstore and its own set of store files. But all the column families within a table stick together. They will all flush their memstore at the same time (this might change). They are also splitted in the same regions, so when one of the column families grows to large, the other ones, how small they might be, will also split.
The illustration below shows a scenario where a table with two column families is used. In family 2, we only do a put for every 10 puts we do in family 1. So family 1 will grow much faster than family 2 (family 2 is sparse family).
The memstore and storefile count graphs show the data for both families counted together. Note how at each memstore flush, the number of store files increases with two.
You can also see how a second compaction is performed for family 2 but not for family 1. This is because all store files in family 2 are smaller than the memstore flush size of 100 MB.
Now lets do a scenario with a table that has multiple regions.
For this test, a table was created with 10 initials regions, divided so that random UUIDs would be evenly spread over these regions. This time the memstore flush size was set to 30 MB.
As the puts are performed, the memstores of each region should fill up evenly, and so we expect them to all reach the flush size at about the same time, and thus to flush at about the same time. However, in the HBase region server there is only one thread which performs flushes, so they happen in sequence. This means that the memstores of some regions will keep growing above their limit. The property hbase.hregion.memstore.block.multiplier controls how many times a memstore can grow above its limit before HBase blocks updates. The default is 2, in our example here this means the memstore would be allowed to grow to 2*30=60MB maximum.
Similarly, there is one thread for doing the compactions and splits, so these are performed in sequence too. Still, it is a period where resources are used which would otherwise have been usable to serve client requests. If you would run a similar scenario a cluster, the different nodes could still run there compactions in parallel. In such a scenario, were all regions are filled very evenly and decide to do compactions at about the same time, you can see some negative spikes when observing the hbaseRequestCount metric. In the chart here you can notice that memstore grows a bit less steep during the compactions.
All the above illustrations were quite synthetic. On a real system, there will often be much more regions, and more than one table, and there will often not be enough memory to let memory stores grow to their configured flush size.
HBase uses by default 40% of the heap (see property hbase.regionserver.global.memstore.upperLimit) for all memstores of all regions of all column families of all tables. If this limit is reached, it starts flushing some memstores until the memory used by memstores is below at least 35% of heap (lowerLimit property).
Let’s simulate this scenario by again using 10 regions, but now with an upper limit of 100 MB. The regionserver in this test has only 1GB of memory, so 400MB available for the memstores. Given that we have 10 memstores (1 table, 1 column family, 10 regions), and that we will fill them evenly, we expect the first flushes to happen when the memstores reach 400/10=40MB (instead of going on to their 100MB flush size).
And, hurary huray, the theory is confirmed by the plot below.
Some final notes:
- during flushes & compactions, HBase keeps processing put and get requests, always giving a consistent view of the data
- certain properties, like the flush size, can be configured on the level of the table
The charts were made with gnuplot, afterwards annotated with Inkscape. The metrics were collected through HBaseAdmin’s ClusterStatus and JMX. The store file sizes were read directly from HDFS. I set hbase.regionserver.msginterval to 1000 to get fine-grained data from ClusterStatus. Lily’s clientmetrics package was used to collect & output the metrics, and to generate the necessary input files for gnuplot.