MySQL Compressed Binary Logs

On a busy server, the binary logs can end up being one of the largest contributors to the amount of disk space used. That means higher I/O, larger backups (you are backing up your binary logs, right?), potentially more network traffic when replicas fetch the logs, and so on. In general, binary logs compress well, so it has been a long-time wish for a feature that allows you to compress the logs while MySQL is still using them. Starting from MySQL 8.0.20 that is now possible. I will take a look at the new feature in this post.

Configuration

The binary log compression feature is controlled by two variables, one for enabling the feature and another to specify the compression level. These are summarized in the below table.

Variable Name                             | Default Value | Allowed Values
------------------------------------------+---------------+---------------
binlog_transaction_compression           | OFF           | OFF/ON
binlog_transaction_compression_level_zstd | 3             | 1-22

The two option names are fairly self-explanatory. binlog_transaction_compression specifies whether compression is enabled, and binlog_transaction_compression_level_zstd specifies the compression level. The higher the level, the better the compression (in principle – see also the tests later in the blog) at the cost of increased CPU usage.

Both options can be set dynamically at both the global and session scopes. However, it is not allowed to change the session value in the middle of a transaction. If you do, you get an error like this:

mysql> SET SESSION binlog_transaction_compression = ON;
ERROR: 1766 (HY000): The system variable binlog_transaction_compression cannot be set when there is an ongoing transaction.
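
For reference, a minimal sketch of enabling the feature dynamically for the whole instance could look like the following (add the same settings to the configuration file to make them persistent across restarts):

-- Enable compression for transactions written to the binary log from now on
SET GLOBAL binlog_transaction_compression = ON;
-- Optionally adjust the compression level (1-22, default 3)
SET GLOBAL binlog_transaction_compression_level_zstd = 3;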

While it is easy to configure binary log compression, there are some limitations you should be aware of.

Limitations

The short version of the limitations is that only transactional row events are compressed. Statement-based events, GTID information, rotation events, row-based events for non-transactional tables, etc. are not compressed. Also, if a transaction includes both transactional and non-transactional changes, neither will be compressed.

If you use all default values for your binary logs and use the InnoDB storage engine (the default), then compression will work. For the complete list of limitations, see Binary Log Transaction Compression in the manual.

Information

This also means that each transaction is compressed on its own. See the examples later in the blog for more on what that means.

As I usually preach, monitoring is key to understanding your system. What does the binary log compression feature support in terms of monitoring?

Monitoring

There are two ways to monitor the performance of the binary log compression feature: a Performance Schema table with compression statistics and new stage events.

The binary_log_transaction_compression_stats table in the Performance Schema includes compression statistics since the last restart of MySQL (or the last time the table was truncated). The table has two rows for the binary log, one for compressed events and one for events that are not compressed. Replicas will similarly have two rows for the relay log. An example of the content, together with the binary logs at the same time, is:

mysql> SELECT * FROM binary_log_transaction_compression_stats\G
*************************** 1. row ***************************
                            LOG_TYPE: BINARY
                    COMPRESSION_TYPE: ZSTD
                 TRANSACTION_COUNTER: 15321
            COMPRESSED_BYTES_COUNTER: 102796461
          UNCOMPRESSED_BYTES_COUNTER: 252705572
              COMPRESSION_PERCENTAGE: 59
                FIRST_TRANSACTION_ID: 74470a0c-8ea4-11ea-966e-080027effed8:30730
  FIRST_TRANSACTION_COMPRESSED_BYTES: 313
FIRST_TRANSACTION_UNCOMPRESSED_BYTES: 363
         FIRST_TRANSACTION_TIMESTAMP: 2020-05-07 19:26:37.744437
                 LAST_TRANSACTION_ID: 74470a0c-8ea4-11ea-966e-080027effed8:46058
   LAST_TRANSACTION_COMPRESSED_BYTES: 712
 LAST_TRANSACTION_UNCOMPRESSED_BYTES: 1627
          LAST_TRANSACTION_TIMESTAMP: 2020-05-07 19:38:14.149782
*************************** 2. row ***************************
                            LOG_TYPE: BINARY
                    COMPRESSION_TYPE: NONE
                 TRANSACTION_COUNTER: 20
            COMPRESSED_BYTES_COUNTER: 5351
          UNCOMPRESSED_BYTES_COUNTER: 5351
              COMPRESSION_PERCENTAGE: 0
                FIRST_TRANSACTION_ID: 74470a0c-8ea4-11ea-966e-080027effed8:30718
  FIRST_TRANSACTION_COMPRESSED_BYTES: 116
FIRST_TRANSACTION_UNCOMPRESSED_BYTES: 116
         FIRST_TRANSACTION_TIMESTAMP: 2020-05-07 19:26:37.508155
                 LAST_TRANSACTION_ID: 74470a0c-8ea4-11ea-966e-080027effed8:31058
   LAST_TRANSACTION_COMPRESSED_BYTES: 116
 LAST_TRANSACTION_UNCOMPRESSED_BYTES: 116
          LAST_TRANSACTION_TIMESTAMP: 2020-05-07 19:30:30.840767
2 rows in set (0.0026 sec)

mysql> SHOW BINARY LOGS;
+---------------+-----------+-----------+
| Log_name      | File_size | Encrypted |
+---------------+-----------+-----------+
| binlog.000142 |       240 | No        |
| binlog.000143 |      4933 | No        |
| binlog.000144 |  28238118 | No        |
| binlog.000145 |  24667167 | No        |
| binlog.000146 |  39221771 | No        |
| binlog.000147 |  11944631 | No        |
| binlog.000148 |       196 | No        |
+---------------+-----------+-----------+
7 rows in set (0.0005 sec)

The example is generated with GTIDs enabled, and all of the binary logs except binlog.000142 have been created after the last restart.

This shows that there have been 15,341 transactions (74470a0c-8ea4-11ea-966e-080027effed8:30718-46058), of which 15,321 have been compressed with an average of 59% compression (compressing 252,705,572 bytes to 102,796,461 bytes). There are also statistics for the first and most recent transaction that was compressed. Similarly, there have been 20 “transactions” that could not be compressed.
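
If you just want the headline number, a quick sketch of getting the total bytes saved directly from the same table could look like this (the FORMAT_BYTES() function is available as a built-in from MySQL 8.0.16):

SELECT LOG_TYPE, COMPRESSION_TYPE, TRANSACTION_COUNTER,
       FORMAT_BYTES(UNCOMPRESSED_BYTES_COUNTER - COMPRESSED_BYTES_COUNTER) AS bytes_saved
  FROM performance_schema.binary_log_transaction_compression_stats;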

Knowing the compression rate is one thing; you also need to know the price you pay in terms of the overhead of performing the compression and decompression. That can be monitored using two stage events in the Performance Schema:

  • stage/sql/Compressing transaction changes.
  • stage/sql/Decompressing transaction changes.

(Yes, the period is part of the name.)

Neither is enabled by default, and for the instruments to be collected, you also need to enable the events_stages_current consumer. For example, to enable these in the configuration file:

[mysqld]
performance-schema-instrument = "stage/sql/%Compressing transaction changes.=ON"
performance-schema-consumer-events-stages-current = ON
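
If you prefer not to restart MySQL, a sketch of enabling the same instrumentation dynamically (the change is not persisted across restarts):

-- Enable the compression/decompression stage instruments
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES', TIMED = 'YES'
 WHERE NAME LIKE 'stage/sql/%ompressing transaction changes.';

-- Enable the consumer that records the current stage per thread
UPDATE performance_schema.setup_consumers
   SET ENABLED = 'YES'
 WHERE NAME = 'events_stages_current';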

Warning

There is some overhead from enabling instrumentation of stages. Make sure you test the impact first if you are considering enabling these on a production system.

You can compare that with the wait/io/file/sql/binlog event (enabled by default) which is the time spent doing I/O. For example:

mysql> SELECT EVENT_NAME, COUNT_STAR,
              FORMAT_PICO_TIME(SUM_TIMER_WAIT) AS total_latency,
              FORMAT_PICO_TIME(MIN_TIMER_WAIT) AS min_latency,
              FORMAT_PICO_TIME(AVG_TIMER_WAIT) AS avg_latency,
              FORMAT_PICO_TIME(MAX_TIMER_WAIT) AS max_latency
         FROM performance_schema.events_stages_summary_global_by_event_name
        WHERE EVENT_NAME LIKE 'stage/sql/%transaction changes.'\G
*************************** 1. row ***************************
   EVENT_NAME: stage/sql/Compressing transaction changes.
   COUNT_STAR: 15321
total_latency: 6.10 s
  min_latency: 22.00 ns
  avg_latency: 397.96 us
  max_latency: 982.12 ms
*************************** 2. row ***************************
   EVENT_NAME: stage/sql/Decompressing transaction changes.
   COUNT_STAR: 0
total_latency:   0 ps
  min_latency:   0 ps
  avg_latency:   0 ps
  max_latency:   0 ps
2 rows in set (0.0008 sec)

mysql> SELECT *
         FROM sys.io_global_by_wait_by_latency
        WHERE event_name = 'sql/binlog'\G
*************************** 1. row ***************************
   event_name: sql/binlog
        total: 27537
total_latency: 4.83 min
  avg_latency: 10.51 ms
  max_latency: 723.92 ms
 read_latency: 138.25 us
write_latency: 290.69 ms
 misc_latency: 4.82 min
   count_read: 37
   total_read: 1002 bytes
     avg_read:   27 bytes
  count_write: 16489
total_written: 99.26 MiB
  avg_written: 6.16 KiB
1 row in set (0.0015 sec)

In this case, I have used the sys.io_global_by_wait_by_latency view as a shortcut as it automatically makes the latencies human readable. For the latencies of the compression/decompression stages, the FORMAT_PICO_TIME() function converts them.

In this example, MySQL has spent 6.10 seconds compressing transactions for the binary logs, averaging just under 400 microseconds per transaction. For comparison, 4.8 minutes have been spent doing I/O on the binary log files.

You should check the time spent on writing and reading the binary log files before enabling compression, so you can determine the change in performance. You should also check the change in CPU usage.

MySQL 8 Query Performance Tuning

Book

If you want to learn more about MySQL performance tuning, then I have written MySQL 8 Query Performance Tuning published by Apress.

The book is available from Apress (print and DRM free ePub+PDF), Amazon (print and Kindle), Barnes & Noble (print), and others.

The above output showed a compression of 59%, but how does it fare for different kinds of workloads?

Examples – Workloads

In order to determine how well the compression works, I have performed a series of tasks and compared the size of the binary logs with and without compression. For comparison, I have also tried to manually compress the binary logs from the series of tests without compression to see the optimal compression (as opposed to the per-transaction compression used by MySQL). The tests have been performed with the default configuration except the settings necessary for the given test.

The following workloads have been performed:

  • Bulk load: Loading the employees sample database.
  • Bulk update: Updating the salary column of all rows in the employees.salaries table: UPDATE employees.salaries SET salary = salary + 1.
  • Bulk load: Populating four tables with 100,000 rows using sysbench as preparation for the oltp_read_write benchmark.
  • OLTP workload: Running the sysbench oltp_read_write benchmark with --events=15000 using the four tables from the previous test.
  • Single row deletes: Deleting all 100,000 rows in one of the tables from the sysbench test. The rows are deleted one by one which represents a worst case scenario for the compression as the transactions are very small and there is only the before image in the binary log for each deleted row.

The single row deletes can be performed easily using MySQL Shell, for example in the Python language mode:

from datetime import datetime

for i in range(100000):
    if i % 5000 == 0:
        print('{0}: i = {1}'.format(datetime.now(), i))
    session.run_sql('DELETE FROM sbtest.sbtest1 WHERE id = {0}'.format(i+1))
print('{0}: i = 100000'.format(datetime.now()))

Each task generates uncompressed binary logs in the range of 25 MiB to 82 MiB.

The tests were performed with the following settings:

  • No compression
  • Compression enabled
  • Encryption enabled but no compression
  • Encryption and compression enabled
  • No compression in MySQL + compressing with zstd

The compression with zstd was chosen as MySQL uses the Zstandard compression algorithm. If you want to try yourself, you can download the Zstandard source code from Facebook's GitHub repository which also includes the compilation instructions.

The resulting size of the binary log in bytes for each combination can be found in the following table.

Test               | Normal   | Compress | Encrypt  | Encrypt + Compress | zstd
-------------------+----------+----------+----------+--------------------+---------
Load employees     | 66378639 | 28237933 | 66379151 | 28238634           | 26892387
Bulk Update        | 85698320 | 24667051 | 85698832 | 24667677           | 24779953
sysbench prepare   | 76367740 | 39221052 | 76368252 | 39221806           | 39775067
sysbench run       | 26190200 | 11933713 | 26190712 | 11936561           |  8468625
Single Row Deletes | 47300156 | 39492400 | 47300668 | 39637684           | 16525219

That the encrypted binary log compresses as well as the unencrypted one suggests that the compression is done before the encryption. (Encrypted data does not compress well.) Because there is no difference whether encryption is enabled, only the normal (uncompressed), compressed, and zstd results will be discussed further. The binary log sizes can also be seen in this figure:

Binary log sizes for different compression schemes.

Unsurprisingly, the bulk loads and the bulk update perform the best, with the compressed binary log being 29% to 51% of the uncompressed size. The sysbench OLTP workload also compresses well, ending up at 46% of the size. Also unsurprisingly, the single row deletes do not compress nearly as well, with the compressed binary log being 83% of the size of the uncompressed binary log.

When comparing the binary logs compressed by MySQL with those compressed manually with zstd, the file sizes are around the same for the bulk loads, reflecting that for large transactions, compressing on a per-transaction basis works as well as compressing the whole file. As the transaction size becomes smaller, the relative efficiency of the per-transaction compression decreases, which is particularly visible for the single row deletes.

The other factor to consider is the compression level.

Examples – Compression Level

There are some oddities when it comes to the compression level, and the short story is that there is no need to change the setting.

The first oddity is that the allowed values are 1-22 but zstd only supports levels 1-19. There is nothing in the MySQL documentation explaining the difference.

The second oddity is that there is effectively no change in the size of the compressed binary logs by changing the value of binlog_transaction_compression_level_zstd as can be seen from the table:

Level/Test | MySQL Load | MySQL OLTP | zstd Load | zstd OLTP
-----------+------------+------------+-----------+----------
1          | 28238142   | 11935450   | 34483207  | 8531545
3          | 28237933   | 11933713   | 26892387  | 8468625
11         | 28238128   | 11902669   | 24737194  | 6639524
19         | 28238246   | 11937664   | 18867187  | 5724300
22         | 28238125   | 11910022   |           |

For comparison, the results of compressing the binary log manually with zstd at levels 1, 3, 11, and 19 are included. “Load” is for loading the employees database and “OLTP” is the sysbench oltp_read_write benchmark. The data can also be seen in this figure:

The binary log size as a function of the compression level.

As can be seen, there is essentially no difference in the file size irrespective of the compression level used in MySQL, whereas for zstd the file size reduces as expected with increased compression level. For level 1 with the load test, MySQL even compresses significantly better than zstd. It is as if the compression level is never applied in MySQL.

Information

One possible explanation is that Zstandard supports training the algorithm for a given type of data (creating a dictionary). This particularly helps improve the compression of small amounts of data. I do not know whether MySQL uses a dictionary and, if so, whether that makes all compression levels perform roughly the same.

Conclusion

The new binary log transaction compression works well and can be a great way to reduce the amount of I/O, disk usage, and network usage. Should you enable it?

Most likely you will benefit from enabling binary log compression unless you are very strapped for CPU. From these tests, it is likely the disk space occupied by binary logs can be roughly cut in half, but as always it is workload dependent and you should test with your workload.

Advice

You can also enable compression at the protocol level for the transmission of the binary logs. There is no reason to enable both binary log transaction compression and protocol compression.
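
For reference, a sketch of what enabling protocol compression for the replication connection looks like on a replica; the MASTER_COMPRESSION_ALGORITHMS and MASTER_ZSTD_COMPRESSION_LEVEL options are available from MySQL 8.0.18 (check the manual for your version before relying on this):

-- On the replica; stop replication while changing the connection settings
STOP SLAVE;
CHANGE MASTER TO
       MASTER_COMPRESSION_ALGORITHMS = 'zstd',
       MASTER_ZSTD_COMPRESSION_LEVEL = 3;
START SLAVE;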

On the other hand, there is no reason at present to change the compression level.

Reduce MySQL Core Dump Size by Excluding the InnoDB Buffer Pool

When things go horribly wrong and a process crashes, one of the most powerful tools for investigating the cause of the crash is a core dump. As the amount of memory allocated to processes such as MySQL has increased – in some cases approaching 1 TiB of memory – enabling core dumps can cause problems of their own. MySQL Server 8.0.14 and later supports an option to reduce the size of the core dump, which will be discussed in this blog.

Typically the largest single user of memory for MySQL is the InnoDB buffer pool. This is used to cache the data and indexes for tables using the InnoDB storage engine (the default). It is rarely important to know what is stored in the InnoDB buffer pool when investigating a crash, so a good way to reduce the core dump size is to exclude the buffer pool. In MySQL 8.0.14 the innodb_buffer_pool_in_core_file configuration option was added for this purpose. It defaults to ON, which means the buffer pool is included in the core dumps (the same behaviour as in older releases).

Contribution

Thanks to Facebook for the contribution that implemented the innodb_buffer_pool_in_core_file option.

Let's look a bit more into core dumps, how to enable them, why you may want to exclude the InnoDB buffer pool, and an example.

Information

The innodb_buffer_pool_in_core_file option only applies on Linux with kernel 3.4 and later. For this reason the rest of the blog assumes Linux is used.

Enabling Core Dumps

Core dumps can be enabled by including the core-file option in the MySQL configuration. It is disabled by default. However, on Linux/Unix it is in general not enough to enable core-file to get a core dump in case of a crash as the system will also limit the size of a core dump. Typically this limit is 0 by default.

The details of the steps required to enable core dumps depend on the Linux distribution used. The following will discuss the core-file option and the limit on the core size. However, the exact steps required may differ for your system and possibly include additional steps.

Enabling Core Dumps in MySQL

As mentioned, the option in MySQL to enable core dumps is core-file. This can only be set using the MySQL configuration file (my.cnf or my.ini). For example:

[mysqld]
core-file

After you have updated the configuration file, the change will take effect the next time MySQL is restarted.
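
Once MySQL is back up, you can verify that the option is in effect; the core_file system variable is read only and reflects whether core dumps are enabled:

-- Returns ON when the core-file option is set in the configuration
SHOW GLOBAL VARIABLES LIKE 'core_file';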

Core Size Limit

Linux includes various limitations on the resources a process can use. This helps improve the stability of the system as a whole, but the default limits may be too restrictive for something like a database host where one process uses most of the resources. One of the limitations is the size of the core dump and typically the default limit is 0.

You can verify the current limits for your process using the proc file system, for example (assuming only a single mysqld process on the system):

shell$ sudo cat /proc/$(pidof mysqld)/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             10240                12288                processes 
Max open files            10240                65536                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       15611                15611                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

There are various ways to change the limits for MySQL and which one to use depends on how you start MySQL. If you start MySQL from the command-line, you can simply use the ulimit command first:

shell$ ulimit -c unlimited                                                                                                                                                                                                        

# Start MySQL
                                                                                                                                               
shell$ sudo cat /proc/$(pidof mysqld)/limits | grep 'core file size'
Max core file size        unlimited            unlimited            bytes

This has set the limit for the core file size to unlimited.

Warning

Do not blindly set the size to unlimited. If you configure the limit too high, you may run out of disk space preventing MySQL from restarting.

You can also configure the limit in the /etc/security/limits.conf file or in a new file in /etc/security/limits.d/. This is a better way to persist a setting and allows you for example to configure the limits for a given user. For example to set the core dump file size to unlimited for the mysql user:

mysql    soft    core    unlimited
mysql    hard    core    unlimited

However, on distributions using systemd (including Oracle Linux 7, Red Hat Enterprise Linux (RHEL) 7, and CentOS 7), systemd completely ignores /etc/security/limits.conf. Instead you need to use the service file for the process. For MySQL this is the /usr/lib/systemd/system/mysqld.service file, or if you have multiple instances on one host, /usr/lib/systemd/system/mysqld@.service. If you use Debian, replace mysqld with mysql. The option to set in the service file is LimitCORE, for example:

[Service]
LimitCORE = infinity

If you use mysqld_safe to start MySQL, you can use the core-file-size option (in the [mysqld_safe] group in the MySQL configuration file or on the command-line) to change the limit. This requires that the hard limit is high enough for the requested limit.

Problems with Core Dumps

Over time the amount of memory allocated to MySQL has grown. This particularly applies to the InnoDB buffer pool – as databases handle more data, there is also a greater need to cache the most used data and indexes in memory. Today, some MySQL instances have total memory allocations approaching 1TiB.

When a core dump happens, the memory allocated to the process is written out to disk. For processes using a large amount of memory, this may take some time. Furthermore, there is a possibility that the system may run out of disk space, particularly if repeated crashes occur. The disk space usage can particularly be an issue when MySQL is running as a service, as it is likely the core file in that case is written to the data directory. From a support perspective, it is also less than trivial to share a core dump that is several hundred gigabytes large.

Warning

Remember that the core dump is a snapshot of the process memory. This means that any data loaded into memory will be present in the core dump. This is particularly something to consider if you store sensitive data in your database. Even if you have transparent data encryption (TDE) enabled, the data in memory will be decrypted!

Excluding the InnoDB buffer pool from the core dump does reduce the amount of data in the core dump, but some data will still be included, for example because it is stored in memory for sorting or as part of a query result.

The InnoDB buffer pool is in most cases the largest contributor – it can easily contribute 75% or more of the total core dump size. So, excluding the buffer pool can significantly reduce the size of the core dump. I will give an example at the end of the blog.
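
A quick way to check both the current setting and how much memory the buffer pool accounts for is a query like the following (sys.format_bytes() is only used here to make the size human readable):

SELECT @@global.innodb_buffer_pool_in_core_file AS include_in_core,
       sys.format_bytes(@@global.innodb_buffer_pool_size) AS buffer_pool_size;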

Example

To finish off this blog, let's look at an example. In this case, MySQL has been configured with innodb_buffer_pool_size = 2G and data has been loaded, so most of the buffer is in use:

mysql> SELECT SUM(FREE_BUFFERS)
         FROM information_schema.INNODB_BUFFER_POOL_STATS;
+-------------------+
| SUM(FREE_BUFFERS) |
+-------------------+
|                 0 |
+-------------------+
1 row in set (0.02 sec)

And from the output of top:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11142 mysql     20   0 4121468 2.261g      0 S   4.0 58.6   2:38.50 mysqld 

When MySQL is killed with signal 11 (segmentation fault), a core dump is created. With innodb_buffer_pool_in_core_file left at its default of ON, the core dump is almost 3 gigabytes (see the output at the end of the blog).

The new feature is enabled by adding innodb_buffer_pool_in_core_file = OFF to the configuration file or by persisting the setting:

mysql> SET PERSIST innodb_buffer_pool_in_core_file = OFF;
Query OK, 0 rows affected (0.02 sec)

The change takes effect immediately. The core dump will now be much smaller – roughly the size of the buffer pool smaller:

shell$ ls -lh /var/lib/mysql/core.*
-rw-------. 1 mysql mysql 2.8G Feb 11 20:30 /var/lib/mysql/core.11142
-rw-------. 1 mysql mysql 759M Feb 11 20:47 /var/lib/mysql/core.14456

Here core.11142 is with innodb_buffer_pool_in_core_file enabled and core.14456 with the option disabled.

Tip

Want to know more? There are more details in the reference manual at Excluding Buffer Pool Pages from Core Files and in the blog from the server team: MySQL 8.0: Excluding the Buffer Pool from a Core File.

InnoDB Progress Information

MySQL has since version 5.7 had support for progress information for some queries. As promised in my previous post, I will here discuss how you can use that to get progress information for ALTER TABLE on InnoDB tables.

Background and Setup

Progress information is implemented through the Performance Schema using the stage events. In version 8.0.12 there are currently seven stages that can provide this information for ALTER TABLE statements on InnoDB tables. In MySQL 8, it is easy to list the stages capable of reporting progress information by using the setup_instruments Performance Schema table:

mysql> SELECT NAME, ENABLED, TIMED
         FROM performance_schema.setup_instruments
        WHERE NAME LIKE 'stage/innodb/alter table%'
              AND PROPERTIES = 'progress';
+------------------------------------------------------+---------+-------+
| NAME                                                 | ENABLED | TIMED |
+------------------------------------------------------+---------+-------+
| stage/innodb/alter table (end)                       | YES     | YES   |
| stage/innodb/alter table (flush)                     | YES     | YES   |
| stage/innodb/alter table (insert)                    | YES     | YES   |
| stage/innodb/alter table (log apply index)           | YES     | YES   |
| stage/innodb/alter table (log apply table)           | YES     | YES   |
| stage/innodb/alter table (merge sort)                | YES     | YES   |
| stage/innodb/alter table (read PK and internal sort) | YES     | YES   |
+------------------------------------------------------+---------+-------+
7 rows in set (0.00 sec)

This also shows how the setup_instruments table in MySQL 8 has some additional information about the instruments such as properties and documentation (not included in the output). Adding this information is still work in progress.

MySQL 5.7 does not provide as easy a way to obtain the instruments providing progress information. Instead you need to consult the reference manual. However, the principle in using the feature is the same.

As you can see, all of the instruments are enabled and timed by default. What is not enabled by default, however, is the consumer that can make the information available:

mysql> SELECT NAME, ENABLED,
              sys.ps_is_consumer_enabled(NAME) AS EnabledWithHierarchy
         FROM performance_schema.setup_consumers
        WHERE NAME = 'events_stages_current';
+-----------------------+---------+----------------------+
| NAME                  | ENABLED | EnabledWithHierarchy |
+-----------------------+---------+----------------------+
| events_stages_current | NO      | NO                   |
+-----------------------+---------+----------------------+
1 row in set (0.01 sec)

Since the consumers form a hierarchical system, the sys schema function ps_is_consumer_enabled() is used to show whether the consumer is enabled taking the whole hierarchy into consideration.

In order to use the progress information, you need to enable the events_stages_current consumer. This is the consumer that is responsible for keeping the performance_schema.events_stages_current table up to date, i.e. recording the current (or latest, if there is no current) stage for each thread. With the default Performance Schema settings, the rest of the hierarchy is enabled. To enable events_stages_current and verify it will be consuming instruments, you can use the following queries:

mysql> UPDATE performance_schema.setup_consumers
          SET ENABLED = 'YES'
        WHERE NAME = 'events_stages_current';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> SELECT NAME, ENABLED,
              sys.ps_is_consumer_enabled(NAME) AS EnabledWithHierarchy
         FROM performance_schema.setup_consumers
        WHERE NAME = 'events_stages_current';
+-----------------------+---------+----------------------+
| NAME                  | ENABLED | EnabledWithHierarchy |
+-----------------------+---------+----------------------+
| events_stages_current | YES     | YES                  |
+-----------------------+---------+----------------------+
1 row in set (0.00 sec)

That is it. Now you can monitor the progress of the queries that use the stages with progress information.

Warning

The more parts of the Performance Schema that are enabled and the more fine-grained the monitoring, the higher the overhead. Stages are not the worst with respect to overhead; nevertheless it is recommended you keep an eye on the effect of enabling the events_stages_current consumer.

Monitoring Progress

The base for monitoring the progress information is the performance_schema.events_stages_current table. There are two columns of interest for this discussion:

  • WORK_COMPLETED: The amount of work that is reported to have been completed.
  • WORK_ESTIMATED: The estimated amount of work that needs to be done.

For InnoDB ALTER TABLE the estimated amount of work is for the entire operation. That said, the estimate may be revised during the process, so the calculated percentage may occasionally decrease over time. However, in general the percentage (100% * WORK_COMPLETED/WORK_ESTIMATED) will increase steadily until the operation completes at 100%.
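
As a minimal sketch of that calculation, you can query events_stages_current directly while an ALTER TABLE is running in another session (stages without progress information have NULL in the two work columns):

SELECT THREAD_ID, EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED,
       ROUND(100 * WORK_COMPLETED / WORK_ESTIMATED, 2) AS CompletedPct
  FROM performance_schema.events_stages_current
 WHERE WORK_ESTIMATED IS NOT NULL;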

To learn more about how the progress information works, the following pages in the manual are recommended:

For now, let's look at an example.

Example

For the example, the salaries table in the employees sample database will be used. The table is sufficiently large that it will be possible to query the progress while adding a column using the INPLACE algorithm. As discussed in MySQL 8.0.12: Instant ALTER TABLE, it is possible to add a column instantly, but for the purpose of this example, the INPLACE algorithm illustrates the progress information feature better. The query that will be executed is:

ALTER TABLE salaries ADD COLUMN new_col int NOT NULL DEFAULT 0, ALGORITHM=INPLACE;

The performance_schema.events_stages_current table can be joined with the performance_schema.events_statements_current table to show the query and progress. For example:

mysql> SELECT stmt.THREAD_ID, stmt.SQL_TEXT, stage.EVENT_NAME AS State,
              stage.WORK_COMPLETED, stage.WORK_ESTIMATED,
              ROUND(100*stage.WORK_COMPLETED/stage.WORK_ESTIMATED, 2) AS CompletedPct
         FROM performance_schema.events_statements_current stmt
              INNER JOIN performance_schema.events_stages_current stage
                      ON stage.THREAD_ID = stmt.THREAD_ID
                         AND stage.NESTING_EVENT_ID = stmt.EVENT_ID\G
*************************** 1. row ***************************
     THREAD_ID: 63857
      SQL_TEXT: ALTER TABLE salaries ADD COLUMN new_col int NOT NULL DEFAULT 0, ALGORITHM=INPLACE
         State: stage/innodb/alter table (read PK and internal sort)
WORK_COMPLETED: 8906
WORK_ESTIMATED: 27351
  CompletedPct: 32.56
1 row in set (0.00 sec)

There is another way though. Instead of using the performance_schema.events_stages_current table directly, an easier way is to use the sys.session view. This is an advanced process list that includes much more information than the usual SHOW PROCESSLIST statement, including progress information. The performance of sys.session has been improved by more than an order of magnitude in MySQL 8 through the addition of indexes to the Performance Schema tables, making it highly useful.

Querying the sys.session view for sessions showing progress information while the ALTER TABLE is in progress returns an output similar to the following example:

mysql> SET @sys.statement_truncate_len = 85;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT thd_id, conn_id, db, command, state, current_statement,
              statement_latency, progress, current_memory, program_name
         FROM sys.session
        WHERE progress IS NOT NULL\G
*************************** 1. row ***************************
           thd_id: 63857
          conn_id: 63818
               db: employees
          command: Query
            state: alter table (merge sort)
current_statement: ALTER TABLE salaries ADD COLUMN new_col int NOT NULL DEFAULT 0, ALGORITHM=INPLACE
statement_latency: 4.22 s
         progress: 49.39
   current_memory: 464.27 KiB
     program_name: MySQLWorkbench
1 row in set (0.06 sec)

In the example, the @sys.statement_truncate_len user variable is set to 85. By default the sys schema ensures the current statement is at most 64 characters long. In order to avoid truncation in this case, the truncate length is increased to 85.

The example output shows that the progress is at around 49%. It is important to note that this is an estimate and not an exact number. The ALTER TABLE is performing a merge sort at the time, and the query has been running for 4.22 seconds.

A couple of other interesting columns are included. It can be seen that the connection is using 464 KiB of memory at the time. In MySQL 8 memory instrumentation is enabled by default (in MySQL 5.7 you need to enable it yourself). Additionally, the name of the program executing the query is MySQLWorkbench, that is, the query originates from MySQL Workbench.
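
For MySQL 5.7, a sketch of enabling the memory instrumentation at runtime could look like this (only memory allocated after enabling is accounted for; use the performance-schema-instrument option to enable it from startup):

-- MySQL 5.7: enable all memory instruments dynamically
UPDATE performance_schema.setup_instruments
   SET ENABLED = 'YES'
 WHERE NAME LIKE 'memory/%';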

So, next time you plan a large ALTER TABLE operation, consider enabling the events_stages_current consumer, so you can follow the progress.

MySQL 8.0.12: Instant ALTER TABLE

There are many nice changes included in the MySQL 8.0.12 release that was published a couple of days ago. One of the most exciting is the ability to make instant schema changes to tables. This blog will look into why I think that is a stand-out change.

I will not go into details with the implementation other than noting that the new data dictionary in MySQL 8.0 has allowed for a nice implementation of the patch that was contributed by the Tencent Games DBA Team. If you are interested in learning more about the implementation, I will recommend you to read the blog by Bin Su (MySQL 8.0: InnoDB now supports Instant ADD COLUMN) and the worklog implementing the feature (WL#11250: Support Instant Add Column).

Contribution

Thanks to the Tencent Games DBA Team who contributed the patch for this feature.

Why Do We Need Instant ALTER TABLE?

As you may know, InnoDB has since version 5.6 supported online schema changes. So, a fair question is why that is no longer good enough. The thing is that while online schema changes are very nice and allow you to make changes to the data (DML statements) while the schema change is made, there are still some problems:

  • Online ALTER TABLE still requires a metadata lock for the duration of the operation. That is, it is not possible to make other schema changes until the ALTER TABLE has completed.
  • In replication setups, the SQL thread handling the online schema change will still block other changes coming through the replication channel from occurring. This means that an online schema change that takes an hour suddenly makes the replication lag an hour.
  • This is compounded in chained replication setups where the delay on the last instances in the topology is multiplied by the number of times the change has been replicated. Suddenly this schema change that did not really affect the replication master becomes a major pain and delays the deployment of the application upgrade that relies on the schema changes.
  • Even though the change is online, it still is heavy on resources: disk for storing the extra copy of the table when the change cannot be made in-place, CPU and disk activity that may cause other queries to become slow, etc.
  • There is a limit to the amount of DML changes that can be made during the schema change. If you make too many changes, the schema change will fail.

So, in short, online schema changes are only the first stop. Instant schema changes are the future.

Which Changes Can Be Made Instantly?

While it would be great if all schema changes could be made instantly, unfortunately that is not the case. Just as not all schema changes can be made online, there are limitations to the new instant feature. In short, the changes that can be made instantly must only affect the metadata of the table. The metadata is stored in the data dictionary. The changes that can be made with the instant ALTER TABLE feature as per 8.0.12 are:

  • Adding a new column as the last column in the table.
  • Adding a generated virtual column.
  • Dropping a generated virtual column.
  • Setting a default value for an existing column.
  • Dropping the default value for an existing column.
  • Changing the list of values allowed for a column with the ENUM or SET data types. A requirement is that the storage size does not change for the column.
  • Change whether the index type is set explicitly for an existing index.

And who knows, maybe later the feature can be extended to cover more changes. There are also a few limitations that are good to be aware of:

  • The row format cannot be COMPRESSED.
  • The table cannot have a fulltext index.
  • Tables in the data dictionary cannot use the instant algorithm.
  • Temporary tables are not supported.

How to Ensure You are Using the Expected Algorithm?

One problem with schema changes is that there are different algorithms depending on the schema change. Currently there are three different algorithms:

  • INSTANT: the change completes very quickly (yes not quite instantly) as only the metadata in the data dictionary needs to be updated.
  • INPLACE: the changes are made within the existing table, i.e. a complete table copy is avoided.
  • COPY: the table is copied into a new copy with the new definition.

By default, MySQL chooses the algorithm doing the least work. That is, INSTANT if that is supported, otherwise INPLACE if that is supported, and finally COPY. Additionally, there is the concept of locking which can be set to either NONE, SHARED, or EXCLUSIVE.

So, how do you ensure you are not ending up with a copying ALTER TABLE taking exclusive locks when you thought the operation was going to be instant? The answer is to explicitly set the algorithm and lock type. That way, MySQL will throw an error if your schema change is not compatible with the requested algorithm. The same principle can be used to force a full rebuild of the table if you for example have a small table and don’t see a reason to worry about instantly added columns (see also more later).

I will give some examples of specifying the ALGORITHM and LOCK options to ALTER TABLE later. However, we first need an example table to play with.

Tip

Always specify the ALGORITHM and LOCK options explicitly to avoid unpleasant surprises.
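
To illustrate the principle (using a hypothetical table t1), requesting an algorithm the change does not support makes MySQL reject the statement instead of silently falling back to a more expensive one:

-- Dropping a column cannot be done instantly in 8.0.12,
-- so MySQL rejects this statement with an error
ALTER TABLE t1 DROP COLUMN c1, ALGORITHM=INSTANT;

-- Requesting a supported algorithm (and lock level) lets the change proceed
ALTER TABLE t1 DROP COLUMN c1, ALGORITHM=INPLACE, LOCK=NONE;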

Creating an Example Table

For the purpose of the upcoming examples of instant schema changes, it is necessary to have a table to play with. The table will be the testtbl table in the my_schema schema. The table can be created and populated with sample data by using MySQL Shell – this allows us to take advantage of the support for scripting. The table will have one million rows.

Note

To limit the amount of scrolling, the prompt in MySQL Shell has been changed to just show the mode (JS, PY, SQL).

JS> \connect root@localhost
Creating a session to 'root@localhost'
Please provide the password for 'root@localhost': ********
Save password for 'root@localhost'? [Y]es/[N]o/Ne[v]er (default No):
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 17 (X protocol)
Server version: 8.0.12 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.

JS> \sql
Switching to SQL mode... Commands end with ;

SQL> CREATE SCHEMA my_schema;
Query OK, 1 row affected (0.0509 sec)

SQL> \use my_schema
Default schema set to `my_schema`.
Fetching table and column names from `my_schema` for auto-completion... Press ^C to stop.

SQL> CREATE TABLE testtbl (
 ...   id int unsigned NOT NULL auto_increment,
 ...   val varchar(36) NOT NULL,
 ...   PRIMARY KEY (id)
 ... ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.0497 sec)
SQL> \py
Switching to Python mode...

Py> \use my_schema
Default schema `my_schema` accessible through db.

Py> db.testtbl
<Table:testtbl>

Py> import uuid

Py> for i in range(1000):
...     session.start_transaction()
...     stmt = db.testtbl.insert("val")
...     for j in range(1000):
...         stmt = stmt.values(uuid.uuid1().hex)
...     stmt.execute()
...     session.commit()
...
Query OK, 0 rows affected (0.0029 sec)

Py> db.testtbl.select("COUNT(*)")
+----------+
| COUNT(*) |
+----------+
|  1000000 |
+----------+
1 row in set (0.2476 sec)

Note how this takes advantage of the feature in MySQL Shell where the db object in JavaScript and Python mode has the tables as properties. Thus it is possible to refer to the testtbl table in the my_schema schema as db.testtbl after using \use my_schema to read the database objects. The data is inserted in 1000 batches of 1000 rows using a double loop.

Now it is time to play with schema changes.

Schema Changes Examples

For comparison, let’s first add a column using the old algorithms. First, add the column val2 using the copying algorithm – note this is not supported online, so a shared lock is taken:

SQL> ALTER TABLE testtbl
 ...       ADD COLUMN val2 varchar(36) DEFAULT NULL,
 ...       ALGORITHM=COPY, LOCK=SHARED;
Query OK, 1000000 rows affected (5.3952 sec)

Note

The test is made on a laptop – the timings themselves are not relevant, but they can be used to compare the time taken for the three algorithms.

So that took around 5.4 seconds. That is not too bad, but we can do much better than that. Let’s try the in-place algorithm. Not only will that require less work (though still a fair bit), it is also possible to perform the change online.

SQL> ALTER TABLE testtbl
 ...       ADD COLUMN val3 varchar(36) DEFAULT NULL,
 ...       ALGORITHM=INPLACE, LOCK=NONE;
Query OK, 0 rows affected (1.7511 sec)

So, this is a factor of three faster despite the table having an extra column compared to the original table. And it is possible to execute DML statements while the change is made. That is a big win. But let’s move on to the finale: instant ALTER TABLE:

SQL> ALTER TABLE testtbl
 ...       ADD COLUMN val4 varchar(36) DEFAULT NULL,
 ...       ALGORITHM=INSTANT;
Query OK, 0 rows affected (0.0490 sec)

Wow – that’s less than 5/100 of a second – that’s instant enough for me. Note how the LOCK option is not specified here. The instant algorithm is always as lock free as possible, so the LOCK option is not allowed.

There is a little more to it than this. As mentioned by Bin Su, there are some trade-offs, such as the maximum row size not being checked upfront. So, in some cases it can be useful to know whether a table contains instantly added columns. This can be seen through the INNODB_TABLES view in the Information Schema on the data dictionary:

Py> schema = session.get_schema("information_schema")
Query OK, 1000 items affected (0.0091 sec)

Py> inno_tables = schema.get_table("INNODB_TABLES")
Query OK, 0 rows affected (0.0002 sec)

Py> inno_tables.select("NAME", "INSTANT_COLS").where("INSTANT_COLS > 0")
+-------------------+--------------+
| NAME              | INSTANT_COLS |
+-------------------+--------------+
| my_schema/testtbl |            4 |
+-------------------+--------------+
1 row in set (0.0405 sec)

Now what is that? 4 columns? But we only added one of the columns using the instant algorithm. What the INSTANT_COLS column shows is how many columns existed before the first instant column was added. In the example the columns id, val, val2, and val3 existed, then val4 was added using the instant algorithm. For tables that have never had any columns added instantly, the value of INSTANT_COLS is 0.
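
If you prefer plain SQL over the X DevAPI, the same check can for example be made as:

-- List tables that have had columns added instantly
SELECT NAME, INSTANT_COLS
  FROM information_schema.INNODB_TABLES
 WHERE INSTANT_COLS > 0;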

Want to Read More?

This blog has just been an introduction to the new MySQL 8.0.12 feature of instant schema changes. I will recommend you to read Bin Su’s blog as well as the documentation in the MySQL reference manual to fully understand the feature:

Conclusion

The new feature allowing instant schema changes is a great way to avoid time consuming operations. Particularly in replication setups where the replicated statement will block while it applies, the difference between an online and an instant schema change is like night and day.

MySQL 8.0.12 can be downloaded from MySQL Community Downloads or you can install it through one of the “native installers” such as MySQL Installer for Microsoft Windows or one of our Linux repositories (Yum, APT, SUSE). If you are a customer and need the commercial version of MySQL Server, I will recommend you to download it from Patches & Updates in My Oracle Support (MOS); there is also a 30 days trial version available from Oracle Software Delivery Cloud.

Please do not hesitate to provide feedback. Bugs and feature requests can be logged in the MySQL Bugs database.