Working Around MySQL Cluster Push Down Limitations Using Subqueries

This post was originally published on the MySQL Support Team Blog at https://blogs.oracle.com/mysqlsupport/entry/working_around_mysql_cluster_push on 5 August 2016.

I worked on an issue recently where a query was too slow when executed in MySQL Cluster. The issue was that Cluster has some restrictions when it comes to push down conditions.

As an example of this, consider the following query using the employees sample database. The query takes a look at the average salary based on how many years the employee has been with the company. As the latest hire date in the database is in January 2000, the query uses 1 February 2000 as the reference date.

Initially the query performs like (performance is with two data nodes and all nodes in the same virtual machine on a laptop, so the timings are not necessarily representative of a production system, though the improvements should be repeatable):

mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
              COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
         FROM salaries
              STRAIGHT_JOIN employees ON employees.emp_no = salaries.emp_no
        WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
        GROUP BY LengthOfService;
+-----------------+-------------+------------+
| LengthOfService | NoEmployees | AvgSalary  |
+-----------------+-------------+------------+
|               0 |        1230 | 53051.4211 |
|               1 |        3667 | 54488.7478 |
|               2 |        5893 | 55670.2121 |
|               3 |        8305 | 57162.7269 |
|               4 |       10647 | 58557.9498 |
|               5 |       12804 | 60132.0652 |
|               6 |       15267 | 61769.0817 |
|               7 |       17540 | 62992.4075 |
|               8 |       19290 | 64468.6666 |
|               9 |       21737 | 66047.8462 |
|              10 |       24056 | 67671.1557 |
|              11 |       26488 | 68748.2062 |
|              12 |       28294 | 70409.9716 |
|              13 |       30249 | 72006.3509 |
|              14 |       32572 | 73478.7101 |
|              15 |         297 | 74331.7798 |
+-----------------+-------------+------------+
16 rows in set (23.89 sec)

The straight join is needed as the performance is better than leaving the join order up to the optimizer.

The schema for the two tables in use is:

mysql> SHOW CREATE TABLE employees\G
*************************** 1. row ***************************
       Table: employees
Create Table: CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1
1 row in set (0.01 sec)

mysql> SHOW CREATE TABLE salaries\G
*************************** 1. row ***************************
       Table: salaries
Create Table: CREATE TABLE `salaries` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `emp_no` (`emp_no`),
  CONSTRAINT `FK_130_154` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=ndbcluster DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Why this poor performance? Looking at the EXPLAIN plan is the first step:

mysql> EXPLAIN EXTENDED SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService, COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary FROM salaries STRAIGHT_JOIN employees ON employees.emp_no = salaries.emp_no WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01' GROUP BY LengthOfService\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
         type: ALL
possible_keys: PRIMARY,emp_no
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2844047
     filtered: 100.00
        Extra: Using where with pushed condition ((`employees`.`salaries`.`from_date` <= '2000-02-01') and (`employees`.`salaries`.`to_date` >= '2000-02-01')); Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: employees.salaries.emp_no
         rows: 1
     filtered: 100.00
        Extra: NULL
2 rows in set, 3 warnings (0.00 sec)

The EXPLAIN plan itself does not look bad – the index usage is as expected. However note the 3 warnings – one is the usual rewritten query after the application of rewriting and optimizer rules, but the two other gives more information why the performance is not what would be expected:

mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
  Level: Note
   Code: 1003
Message: Table 'salaries' is not pushable: GROUP BY cannot be done using index on grouped columns.
*************************** 2. row ***************************
  Level: Note
   Code: 1003
Message: Table 'employees' is not pushable: GROUP BY cannot be done using index on grouped columns.
*************************** 3. row ***************************
...

Here it can be seen that the tables are not pushable, because they are involved in the GROUP BY.

This gave me an idea. A couple of years ago, I wrote a post on the MySQL Server Blog about MySQL Server Blog: Better Performance for JOINs Not Using Indexes. These materialized temporary tables can also include auto keys that can improve the join performance significantly. What if a similar tactics is used with the above query?

Changing the join on the employees table to join a subquery:

mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
              COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
         FROM salaries
              STRAIGHT_JOIN (SELECT * FROM employees) AS employees ON employees.emp_no = salaries.emp_no
        WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
        GROUP BY LengthOfService;
...
16 rows in set (6.64 sec)

That is more than a factor 3 improvement.

The new EXPLAIN plan is:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: salaries
         type: ALL
possible_keys: PRIMARY,emp_no
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2844047
        Extra: Using where with pushed condition; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ref
possible_keys: <auto_key0>
          key: <auto_key0>
      key_len: 4
          ref: employees.salaries.emp_no
         rows: 10
        Extra: NULL
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: employees
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 300024
        Extra: NULL
3 rows in set (0.01 sec)

Note how <auto_key0> is used for the join on the derived table. This is what helps make the rewrite work.

Actually we can do a bit better. The above subquery selects all columns from the employees table even though only the emp_no and hire_date columns are used. So the next rewrite is:

mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
              COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
         FROM salaries
              STRAIGHT_JOIN (SELECT emp_no, hire_date FROM employees) AS employees ON employees.emp_no = salaries.emp_no
        WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
        GROUP BY LengthOfService;
...
16 rows in set (5.74 sec)

The improvement of only including the columns needed will vary. For example if the internal temporary table ends up being converted into an on-disk table because of the extra data, or a covering index no longer can be used can increase the importance of only choosing the columns needed.

A final slight improvement can be gained by also converting the salaries table into a subquery:

mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
              COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
         FROM (SELECT emp_no, salary
                 FROM salaries
                WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
              ) AS salaries
              STRAIGHT_JOIN (SELECT emp_no, hire_date FROM employees) AS employees ON employees.emp_no = salaries.emp_no
        GROUP BY LengthOfService;
...
16 rows in set (5.36 sec)

The total speed up is from a little under 24 seconds to a little over 5 second or more than a factor 4.

Note: the above rewrite will not work out of the box in MySQL Cluster 7.5. The reason is that it is based on MySQL Server 5.7 where the optimizer can perform more advanced rewrites than it was possible on MySQL 5.6. So by default the above subqueries will be rewritten to normal joins and you end up with the original query again. To use the tactics of subqueries in Cluster 7.5, it is necessary first to disable the optimizer switch allowing the optimizer to rewrite the subqueries to normal joins:

mysql> SET optimizer_switch='derived_merge=off';