An Adventure in InnoDB Table Compression (for read-only tables)
Mysql 10-Aug-2017

An Adventure in InnoDB Table Compression (for read-only tables)

In my last post about big MySQL deployments, I am quickly mentioning that InnoDB compression is allowing dividing disk usage by about 4.3 on a 200+ TiB dataset.  In this post, I will give more information about this specific use case of InnoDB table compression and I will share some statistics and learnings on this system and subject.  Note that I am not covering InnoDB page compression which is a new feature of MySQL 5.7 (also known as hole punching).

Before talking about the results, I will describe the system where I have implemented InnoDB compression and give the following warning:
The compression statistics shared here are from a very specific system.
Take them with a grain of salt and test to see how compression benefits your systems.

The system where I implemented InnoDB table compression contains more than one million read-only tables.  Some time ago, those tables were stored in a single MySQL instance, without using InnoDB compression (the size was 200+ TiB) and without using MySQL replication.  The storage was external (disk array accessed via a SAN) and was supporting snapshots.  At the very beginning, the amount of data (and tables) was much smaller, it slowly grew to its current size and it is still growing.  Because administration was becoming tedious, the system was sharded on servers with local storage and replication was used for redundancy.  To reduce the number of shards needed to store all this data, I implemented InnoDB table compression.

One particularity of this system is that the data arrives in MySQL in chunks (tables) that are never updated after being stored (after being fully written, the tables are read-only).  In the new sharded system, the data is written on the master in an InnoDB tables (non-compressed) and it reaches slaves via classical MySQL replication.  Once the table is fully written, a cron job running on all nodes of the replication hierarchy compresses the data locally (not writing to binary logs).

The compression script is trying many KEY_BLOCK_SIZE for finding the one resulting in the smallest footprint on disk.  The script works as follows for each table to compress:

  1. it copies the table twice in a temporary schema (let's call those copies A and B),
  2. it compresses table B with KEY_BLOCK_SIZE of 8,
  3. if compression does not succeed, it keeps table A then exits,
  4. if table B (just compressed) is larger than table A, it keeps table A then exits,
  5. it swaps tables A and B,
  6. it goes back to step #2 using KEY_BLOCK_SIZE of 4, 2 and 1.

The compression script stores the compression state of a table and the resulting sizes of each step in a log table having following structure:

> show create table compression_logs\G
*************************** 1. row ***************************
       Table: compression_logs
Create Table: CREATE TABLE `compression_logs` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `schema_name` varchar(64) NOT NULL,
  `table_name` varchar(64) NOT NULL,
  `initial_size` bigint(20) unsigned NOT NULL,
  `state` enum([...]) NOT NULL,
  `copied_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_8_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_4_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_2_size` bigint(20) unsigned DEFAULT NULL,
  `kbs_1_size` bigint(20) unsigned DEFAULT NULL,
  `final_size` bigint(20) unsigned DEFAULT NULL,
  `final_kbs` tinyint(4) DEFAULT NULL
  PRIMARY KEY (`id`)
1 row in set (0.03 sec)

Note that finding the size on disk of an InnoDB table is not straightforward.  Looking at the size of the ibd file (lsdu or stat) does not work as some data might still be in the Buffer Pool (not yet flushed to disk).  Also, the output of SHOW TABLE STATUS or the content of information_schema.TABLEonly shows an approximation of the data and index lengths.  I used the command FLUSH TABLES ... FOR EXPORT for solving that.  Below is an extract from the manual, which is exactly what we need to be able to use lsdu or stat for getting the table size:

[FLUSH TABLES ... FOR EXPORT] ensures that changes to the named tables have been flushed to disk so that binary table copies can be made while the server is running.

I will come back later below to compression failing and to table size going the wrong direction (growing instead of shrinking after compression).  Now, let's look at the compression statistics.  For this post, I am only looking at a single shard of the new system (I am assuming that all shards have similar behavior).  The query below gives a good overview of the InnoDB table compression benefits on the amount of data on disk:

> SELECT nb_tables, initial_gib, final_gib,
    ->   ROUND(initial_gib/final_gib, 2) AS ratio
    -> FROM (SELECT COUNT(*) AS nb_tables,
    ->              ROUND(SUM(initial_size) / POW(2,30), 2) AS initial_gib,
    ->              ROUND(SUM(final_size)   / POW(2,30), 2) AS final_gib
    ->       FROM compression_logs) AS c;
| nb_tables | initial_gib | final_gib | ratio |
|    248670 |    20701.84 |   4566.50 |  4.53 |
1 row in set (0.14 sec)
What we have here is a very respectable MySQL instance:
  • it contains 248,670 tables,
  • uncompressed, those tables would use 20.21 TiB of disk space,
  • after compression, those tables are using 4.45 TiB of disk space,
  • and the average compression ratio is 4.53.

The results above have been obtained with the default value of innodb_compression_level (6).  Using the highest value (9) might give better results in the general case but the tests I did on that system do not show significant improvements (I might share more about that in another post).  So I kept the default for all the shards of this system.

An average compression ratio of 4.53 is very good but it might vary for different KEY_BLOCK_SIZE.  Let's look at more detailed data with this query (a final_kbs of 16 means no compression):

> SELECT final_kbs, nb_tables, initial_gib, final_gib,
    ->   ROUND(initial_gib/final_gib, 2) AS ratio
    -> FROM (SELECT final_kbs, COUNT(final_kbs) AS nb_tables,
    ->              ROUND(SUM(initial_size) / POW(2,30), 2) AS initial_gib,
    ->              ROUND(SUM(final_size)   / POW(2,30), 2) AS final_gib
    ->       FROM compression_logs GROUP BY final_kbs) AS c;
| final_kbs | nb_tables | initial_gib | final_gib | ratio |
|         1 |     40863 |     1940.72 |    265.15 |  7.32 |
|         2 |    152215 |    14937.60 |   3033.26 |  4.92 |
|         4 |     38862 |     3387.78 |   1048.73 |  3.23 |
|         8 |     13896 |      434.14 |    217.76 |  1.99 |
|        16 |      2834 |        1.60 |      1.60 |  1.00 |
5 rows in set (0.23 sec)

As we see above, we end-up with tables of all KEY_BLOCK_SIZE/final_kbs but at this point, we do not know if compression stopped because ALTER is failing or because the table size grew with the next KEY_BLOCK_SIZE.  The query below is telling us when the next KEY_BLOCK_SIZE resulted in a bigger table:

> SELECT final_kbs, COUNT(final_kbs) AS nb_tables,
    ->   ROUND(SUM(initial_size)         / POW(2,30), 2) AS initial_gib,
    ->   ROUND(SUM(final_size)           / POW(2,30), 2) AS final_gib,
    ->   ROUND(SUM(next_size-final_size) / POW(2,30), 2) AS diff_gib
    -> FROM (SELECT final_kbs, initial_size, final_size,
    ->              CASE final_kbs WHEN 16 THEN kbs_8_size
    ->                             WHEN  8 THEN kbs_4_size
    ->                             WHEN  4 THEN kbs_2_size
    ->                             ELSE         kbs_1_size END AS next_size
    ->       FROM compression_logs WHERE final_kbs <> 1) AS c
    -> WHERE next_size > final_size GROUP BY final_kbs;
| final_kbs | nb_tables | initial_gib | final_gib | diff_gib |
|         2 |    112278 |     9339.80 |   1955.21 |   609.37 |
|         4 |     38862 |     3387.78 |   1048.73 |   137.83 |
|         8 |     13896 |      434.14 |    217.76 |    51.82 |
|        16 |      2834 |        1.60 |      1.60 |     9.17 |
4 rows in set (0.29 sec)

The result above shows that, if we want the smallest footprint on disk, it is important to use the right KEY_BLOCK_SIZE:

  • if using 8 instead of uncompressed (final_kbs of 16), 9.17 GiB is wasted (this might not look like much but it is 5.7 times the initial size),
  • if 4 instead of 8, 51.82 GiB is wasted, which is an increase of 23%,
  • and the increase is 31% for 2 instead of 1.

The attentive reader might have spotted a gap between the two tables above.  In the first tables, we have 152,215 tables that are compressed with a KEY_BLOCK_SIZE of 2 and in the second tables, we have only 112,278 tables.  This is because some tables failed to alter with a KEY_BLOCK_SIZE of 1.  Let's confirm that with the following query.

> SELECT COUNT(final_kbs) as nb_tables,
    ->   ROUND(SUM(initial_size) / POW(2,30), 2) AS initial_gib,
    ->   ROUND(SUM(final_size)   / POW(2,30), 2) AS final_gib
    -> FROM compression_logs
    -> WHERE final_kbs = 2 AND kbs_1_size IS NULL
    -> GROUP BY final_kbs;
| nb_tables | initial_gib | final_gib |
|     39937 |     5597.80 |   1078.05 |
1 row in set (0.18 sec)
So 39,937 tables are failing ALTER with a KEY_BLOCK_SIZE of 1. The manual is saying that compression could fail if a row does not fit in a block, the exact being:
Generally, MySQL requires that each B-tree page in an InnoDB table can accommodate at least two records. For compressed tables, this requirement has been relaxed.  Leaf pages of B-tree nodes (whether of the primary key or secondary indexes) only need to accommodate one record, but that record must fit, in uncompressed form, in the per-page modification log.
However, I know that at least one of those tables has a row size smaller than 1 KiB.  I opened Bug#84439 for that.

I hope above gave you an idea of a real-world use case of InnoDB table compression.  Before ending this post, here are a few additional comments/thoughts on this work:
  • The script compresses table in a temporary schema and swaps the compressed with the original table (note that this is a cross-schema rename operation).  I found Bug#84038 that results in spamming the error log on MySQL restart for each cross-schema rename operation.  This should be fixed in MySQL 5.7.19 (I have not verified myself yet).
  • I already mentioned in a previous post that I tested MyISAM and myisampack and that this is giving a better compression ratio (I wrote a post Why we still need MyISAM about this).
  • Finally, I am curious how MyRocks would compress this data.  I might be able to find time to test this and write about it.