Working Around MySQL Cluster Push Down Limitations Using Subqueries
This post was originally published on the MySQL Support Team Blog at https://blogs.oracle.com/mysqlsupport/entry/working_around_mysql_cluster_push on 5 August 2016.
I worked on an issue recently where a query was too slow when executed in MySQL Cluster. The issue was that Cluster has some restrictions when it comes to push down conditions.
As an example of this, consider the following query using the employees sample database. The query takes a look at the average salary based on how many years the employee has been with the company. As the latest hire date in the database is in January 2000, the query uses 1 February 2000 as the reference date.
Initially the query performs like (performance is with two data nodes and all nodes in the same virtual machine on a laptop, so the timings are not necessarily representative of a production system, though the improvements should be repeatable):
mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
FROM salaries
STRAIGHT_JOIN employees ON employees.emp_no = salaries.emp_no
WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
GROUP BY LengthOfService;
+-----------------+-------------+------------+
| LengthOfService | NoEmployees | AvgSalary |
+-----------------+-------------+------------+
| 0 | 1230 | 53051.4211 |
| 1 | 3667 | 54488.7478 |
| 2 | 5893 | 55670.2121 |
| 3 | 8305 | 57162.7269 |
| 4 | 10647 | 58557.9498 |
| 5 | 12804 | 60132.0652 |
| 6 | 15267 | 61769.0817 |
| 7 | 17540 | 62992.4075 |
| 8 | 19290 | 64468.6666 |
| 9 | 21737 | 66047.8462 |
| 10 | 24056 | 67671.1557 |
| 11 | 26488 | 68748.2062 |
| 12 | 28294 | 70409.9716 |
| 13 | 30249 | 72006.3509 |
| 14 | 32572 | 73478.7101 |
| 15 | 297 | 74331.7798 |
+-----------------+-------------+------------+
16 rows in set (23.89 sec)
The straight join is needed as the performance is better than leaving the join order up to the optimizer.
The schema for the two tables in use is:
mysql> SHOW CREATE TABLE employees\G
*************************** 1. row ***************************
Table: employees
Create Table: CREATE TABLE `employees` (
`emp_no` int(11) NOT NULL,
`birth_date` date NOT NULL,
`first_name` varchar(14) NOT NULL,
`last_name` varchar(16) NOT NULL,
`gender` enum('M','F') NOT NULL,
`hire_date` date NOT NULL,
PRIMARY KEY (`emp_no`)
) ENGINE=ndbcluster DEFAULT CHARSET=latin1
1 row in set (0.01 sec)
mysql> SHOW CREATE TABLE salaries\G
*************************** 1. row ***************************
Table: salaries
Create Table: CREATE TABLE `salaries` (
`emp_no` int(11) NOT NULL,
`salary` int(11) NOT NULL,
`from_date` date NOT NULL,
`to_date` date NOT NULL,
PRIMARY KEY (`emp_no`,`from_date`),
KEY `emp_no` (`emp_no`),
CONSTRAINT `FK_130_154` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=ndbcluster DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Why this poor performance? Looking at the EXPLAIN
plan is the first step:
mysql> EXPLAIN EXTENDED SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService, COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary FROM salaries STRAIGHT_JOIN employees ON employees.emp_no = salaries.emp_no WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01' GROUP BY LengthOfService\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: salaries
type: ALL
possible_keys: PRIMARY,emp_no
key: NULL
key_len: NULL
ref: NULL
rows: 2844047
filtered: 100.00
Extra: Using where with pushed condition ((`employees`.`salaries`.`from_date` <= '2000-02-01') and (`employees`.`salaries`.`to_date` >= '2000-02-01')); Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: employees
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: employees.salaries.emp_no
rows: 1
filtered: 100.00
Extra: NULL
2 rows in set, 3 warnings (0.00 sec)
The EXPLAIN
plan itself does not look bad – the index usage is as expected. However note the 3 warnings – one is the usual rewritten query after the application of rewriting and optimizer rules, but the two other gives more information why the performance is not what would be expected:
mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: Table 'salaries' is not pushable: GROUP BY cannot be done using index on grouped columns.
*************************** 2. row ***************************
Level: Note
Code: 1003
Message: Table 'employees' is not pushable: GROUP BY cannot be done using index on grouped columns.
*************************** 3. row ***************************
...
Here it can be seen that the tables are not pushable, because they are involved in the GROUP BY.
This gave me an idea. A couple of years ago, I wrote a post on the MySQL Server Blog about MySQL Server Blog: Better Performance for JOINs Not Using Indexes. These materialized temporary tables can also include auto keys that can improve the join performance significantly. What if a similar tactics is used with the above query?
Changing the join on the employees table to join a subquery:
mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
FROM salaries
STRAIGHT_JOIN (SELECT * FROM employees) AS employees ON employees.emp_no = salaries.emp_no
WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
GROUP BY LengthOfService;
...
16 rows in set (6.64 sec)
That is more than a factor 3 improvement.
The new EXPLAIN
plan is:
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: salaries
type: ALL
possible_keys: PRIMARY,emp_no
key: NULL
key_len: NULL
ref: NULL
rows: 2844047
Extra: Using where with pushed condition; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: PRIMARY
table: <derived2>
type: ref
possible_keys: <auto_key0>
key: <auto_key0>
key_len: 4
ref: employees.salaries.emp_no
rows: 10
Extra: NULL
*************************** 3. row ***************************
id: 2
select_type: DERIVED
table: employees
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 300024
Extra: NULL
3 rows in set (0.01 sec)
Note how <auto_key0>
is used for the join on the derived table. This is what helps make the rewrite work.
Actually we can do a bit better. The above subquery selects all columns from the employees table even though only the emp_no
and hire_date
columns are used. So the next rewrite is:
mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
FROM salaries
STRAIGHT_JOIN (SELECT emp_no, hire_date FROM employees) AS employees ON employees.emp_no = salaries.emp_no
WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
GROUP BY LengthOfService;
...
16 rows in set (5.74 sec)
The improvement of only including the columns needed will vary. For example if the internal temporary table ends up being converted into an on-disk table because of the extra data, or a covering index no longer can be used can increase the importance of only choosing the columns needed.
A final slight improvement can be gained by also converting the salaries
table into a subquery:
mysql> SELECT FLOOR(DATEDIFF('2000-02-01', hire_date)/365) AS LengthOfService,
COUNT(DISTINCT employees.emp_no) AS NoEmployees, AVG(salary) AS AvgSalary
FROM (SELECT emp_no, salary
FROM salaries
WHERE salaries.from_date <= '2000-02-01' AND salaries.to_date >= '2000-02-01'
) AS salaries
STRAIGHT_JOIN (SELECT emp_no, hire_date FROM employees) AS employees ON employees.emp_no = salaries.emp_no
GROUP BY LengthOfService;
...
16 rows in set (5.36 sec)
The total speed up is from a little under 24 seconds to a little over 5 second or more than a factor 4.
Note: the above rewrite will not work out of the box in MySQL Cluster 7.5. The reason is that it is based on MySQL Server 5.7 where the optimizer can perform more advanced rewrites than it was possible on MySQL 5.6. So by default the above subqueries will be rewritten to normal joins and you end up with the original query again. To use the tactics of subqueries in Cluster 7.5, it is necessary first to disable the optimizer switch allowing the optimizer to rewrite the subqueries to normal joins:
mysql> SET optimizer_switch='derived_merge=off';