Tracking Foreign Keys
The other day, I was reading a blog by Magnus Hagander about tracking foreign keys throughout a schema in PostgreSQL. I thought it was a good idea, so I decided to look at how you can track foreign key in MySQL.
The way I decided to do it was to start out with a table, then find all tables referencing the table by a foreign key. From this basic (and simple as it will be shown) query, it is possible to create a chain of relations. The key table for the queries is information_schema.KEY_COLUMN_USAGE
which has information about all the foreign key relationships.
However, before getting that far, it is necessary to have some tables to work with.
Sample Schema
A small schema, but with relatively complex foreign keys relations, can be created with the following statements:
DROP SCHEMA IF EXISTS db1;
CREATE SCHEMA db1;
CREATE TABLE db1.a (
a_id int,
PRIMARY KEY (a_id)
);
CREATE TABLE db1.b (
b_id1 int,
b_id2 int,
a_id int,
PRIMARY KEY (b_id1, b_id2),
FOREIGN KEY (a_id) REFERENCES a (a_id)
);
CREATE TABLE db1.c (
c_id int,
b_id1 int,
b_id2 int,
PRIMARY KEY (c_id),
FOREIGN KEY (b_id1, b_id2) REFERENCES b (b_id1, b_id2)
);
CREATE TABLE db1.d (
d_id int,
a_id int,
c_id int,
PRIMARY KEY (d_id),
FOREIGN KEY (a_id) REFERENCES a (a_id),
FOREIGN KEY (c_id) REFERENCES c (c_id)
);
CREATE TABLE db1.e (
e_id int,
a_id int,
PRIMARY KEY (e_id),
FOREIGN KEY (a_id) REFERENCES a (a_id)
);
CREATE TABLE db1.f (
f_id int,
b_id1 int,
b_id2 int,
c_id int,
PRIMARY KEY (f_id),
FOREIGN KEY (b_id1, b_id2)
REFERENCES b (b_id1, b_id2),
FOREIGN KEY (c_id) REFERENCES c (c_id)
);
The base table is a
. There are three tables, b
, d
, and e
, with a direct foreign key to table a
. Tables c
and f
in turn references table b
through a two column foreign key, and tables d
and f
references table c
. So there are quite a few paths to get to table a
from another table. Before looking at how the relationship can be found from the information_schema.KEY_COLUMN_USAGE
, it is worth creating a visual representation of the schema.
MySQL Workbench EER Diagrams
A long standing feature of MySQL Workbench is its ability to create enhanced entity–relationship (EER) diagrams. This shows a box with information about the columns and indexes of each table in a schema. Additionally there are lines connecting tables related by foreign keys. So, an EER diagram includes what we are looking for – the chain of foreign keys.
You can create an ERR diagram by clicking on Database and then choose Reverse Engineer … from the menu in MySQL Workbench:
Alternatively use CTRL + R. You can do this from the homepage or from a database connection. Either way, you are taken to the connection options screen where you can choose an already defined connection or create a new one – this works the same as when you connect to a MySQL instance to execute queries:
When you continue, MySQL Workbench will connect to MySQL Server and get a list of the schemas available. Once you get to the Select Schemas page, you can choose the schema or schemas you want to create the EER diagram for. In this case choose the db1
schema (unless you created the tables in another schema):
For this example, you can use the defaults for the rest of the screens. On the Select Objects screen, you can optionally choose to select a subset of tables for the diagram. On the same screen, you choose whether you want to place the imported objects into a diagram (enabled by default); you want to do that for this example.
At the end, you have the diagram. You can move the tables around to place them as you like. One example of the diagram is:
This makes it easy to see the relations between the tables.
But what do you do, if you want to analyze the relationship in a program or for some other reason have the relationships in a text format? Let's look at that.
Querying the Foreign Key Relationship
As mentioned, the base table for looking at foreign key relations is the information_schema.KEY_COLUMN_USAGE
table. It has the following definition:
mysql> DESC information_schema.KEY_COLUMN_USAGE;
+-------------------------------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------------------+------------------+------+-----+---------+-------+
| CONSTRAINT_CATALOG | varchar(64) | YES | | NULL | |
| CONSTRAINT_SCHEMA | varchar(64) | YES | | NULL | |
| CONSTRAINT_NAME | varchar(64) | NO | | NULL | |
| TABLE_CATALOG | varchar(64) | YES | | NULL | |
| TABLE_SCHEMA | varchar(64) | YES | | NULL | |
| TABLE_NAME | varchar(64) | YES | | NULL | |
| COLUMN_NAME | varchar(64) | YES | | NULL | |
| ORDINAL_POSITION | int(10) unsigned | NO | | NULL | |
| POSITION_IN_UNIQUE_CONSTRAINT | binary(0) | YES | | NULL | |
| REFERENCED_TABLE_SCHEMA | binary(0) | YES | | NULL | |
| REFERENCED_TABLE_NAME | binary(0) | YES | | NULL | |
| REFERENCED_COLUMN_NAME | binary(0) | YES | | NULL | |
+-------------------------------+------------------+------+-----+---------+-------+
12 rows in set (0.01 sec)
In MySQL 8.0 this is a view on the new data dictionary, so effectively a plain InnoDB query and it is fast to query. In MySQL 5.7 and earlier, querying it requires opening the tables which can be slow and all tables must be opened. If you have many tables and they are not cached in the table caches yet, querying KEY_COLUMN_USAGE
can be slow and impact the general performance.
Basic Query – Single Column per Foreign Key
The three columns prefixed with REFERENCED_
contains the information about a foreign key. For example, for the tables used in this blog, if you want to know which tables have a direct foreign key to table a
in the db1
schema, you can query KEY_COLUMN_USAGE
with a WHERE
clause on REFERENCED_TABLE_SCHEMA
and REFERENCED_TABLE_NAME
like:
mysql> SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME,
REFERENCED_TABLE_SCHEMA, REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = 'db1'
AND REFERENCED_TABLE_NAME = 'a';
+--------------+------------+-------------+-------------------------+-----------------------+------------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | REFERENCED_TABLE_SCHEMA | REFERENCED_TABLE_NAME | REFERENCED_COLUMN_NAME |
+--------------+------------+-------------+-------------------------+-----------------------+------------------------+
| db1 | b | a_id | db1 | a | a_id |
| db1 | d | a_id | db1 | a | a_id |
| db1 | e | a_id | db1 | a | a_id |
+--------------+------------+-------------+-------------------------+-----------------------+------------------------+
3 rows in set (0.01 sec)
So, the tables b
, d
, and e
has a foreign key to a_id
in the db1.a
table, and the column name for each of the three tables is also called a_id
. This is just as expected.
The query works great for finding the immediate relations where the foreign key only includes a single column. However, for cases where there are multiple columns in the foreign key, there will be two rows for each referencing table. So what to do?
Basis Query – Multiple Columns per Foreign Key
To avoid having one row per column in a multi-column foreign key, you need to perform an aggregation. You can for example use the GROUP_CONCAT()
to generate a single value combining the column names. In MySQL 8.0, you can also consider creating a JSON array by using the JSON_ARRAYAGG()
function:
mysql> SELECT TABLE_SCHEMA, TABLE_NAME,
JSON_ARRAYAGG(COLUMN_NAME) ColumnNames
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = 'db1'
AND REFERENCED_TABLE_NAME = 'b'
GROUP BY TABLE_SCHEMA, TABLE_NAME;
+--------------+------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | ColumnNames |
+--------------+------------+--------------------+
| db1 | c | ["b_id1", "b_id2"] |
| db1 | f | ["b_id1", "b_id2"] |
+--------------+------------+--------------------+
2 rows in set (0.01 sec)
This queries the foreign keys to the b
tables. The c
and f
tables have a foreign key using the b_id1
and b_id2
columns.
This query result also means that the c
and f
tables are related to the a
table through the b
table. Would it not be great, if there was a single query that could provide the foreign key chains? Well, in MySQL 8 you can get this using a common table expression (CTE).
Querying Foreign Key Chains – Step by Step
The query will use a recursive common table expression. This requires a seed query and a recursive query (that works on the rows generated in the previous iteration). A good seed query is similar to what we had for the basis query. However, to make it possible to aggregate all of the steps in the chain, the chain will be generated as a JSON array with each part of the chain being a JSON object. The seed query becomes:
mysql> SELECT 1 AS Level, TABLE_SCHEMA, TABLE_NAME,
JSON_ARRAY(
JSON_OBJECT('Level', 1,
'TableSchema', TABLE_SCHEMA,
'TableName', TABLE_NAME,
'ColumnNames', JSON_ARRAYAGG(COLUMN_NAME)
)
) AS ReferenceChain
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = 'db1' AND REFERENCED_TABLE_NAME = 'a'
GROUP BY TABLE_SCHEMA, TABLE_NAME;
+-------+--------------+------------+---------------------------------------------------------------------------------+
| Level | TABLE_SCHEMA | TABLE_NAME | ReferenceChain |
+-------+--------------+------------+---------------------------------------------------------------------------------+
| 1 | db1 | b | [{"Level": 1, "TableName": "b", "ColumnNames": ["a_id"], "TableSchema": "db1"}] |
| 1 | db1 | d | [{"Level": 1, "TableName": "d", "ColumnNames": ["a_id"], "TableSchema": "db1"}] |
| 1 | db1 | e | [{"Level": 1, "TableName": "e", "ColumnNames": ["a_id"], "TableSchema": "db1"}] |
+-------+--------------+------------+---------------------------------------------------------------------------------+
3 rows in set (0.01 sec)
Now, you can take each of these relations and look for tables having a foreign key to them, and so forth. That is the recursive part of the query. There is one complication though: GROUP BY
is not allowed in the recursive part. The workaround is to use a subquery:
mysql> WITH RECURSIVE `reference` (Level, TABLE_SCHEMA, TABLE_NAME, ReferenceChain) AS (
SELECT 1, TABLE_SCHEMA, TABLE_NAME,
JSON_ARRAY(
JSON_OBJECT('Level', 1,
'TableSchema', TABLE_SCHEMA,
'TableName', TABLE_NAME,
'ColumnNames', JSON_ARRAYAGG(COLUMN_NAME)
)
)
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = 'db1' AND REFERENCED_TABLE_NAME = 'a'
GROUP BY TABLE_SCHEMA, TABLE_NAME
UNION
SELECT r.Level+1, ko.TABLE_SCHEMA, ko.TABLE_NAME,
JSON_ARRAY_APPEND(
ReferenceChain, '$',
(SELECT JSON_OBJECT('Level', r.Level+1,
'TableSchema', TABLE_SCHEMA,
'TableName', TABLE_NAME,
'ColumnNames', JSON_ARRAYAGG(COLUMN_NAME)
)
FROM information_schema.KEY_COLUMN_USAGE ki
WHERE ki.TABLE_SCHEMA = ko.TABLE_SCHEMA
AND TABLE_NAME = ko.TABLE_NAME
AND ki.REFERENCED_TABLE_SCHEMA = ko.REFERENCED_TABLE_SCHEMA
AND ki.REFERENCED_TABLE_NAME = ko.REFERENCED_TABLE_NAME)
)
FROM `reference` r
INNER JOIN information_schema.KEY_COLUMN_USAGE ko ON
ko.REFERENCED_TABLE_SCHEMA = r.TABLE_SCHEMA
AND ko.REFERENCED_TABLE_NAME = r.TABLE_NAME
)
SELECT Level, TABLE_SCHEMA, TABLE_NAME,
JSON_PRETTY(ReferenceChain) AS ReferenceChain
FROM `reference`\G
*************************** 1. row ***************************
Level: 1
TABLE_SCHEMA: db1
TABLE_NAME: b
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
}
]
*************************** 2. row ***************************
Level: 1
TABLE_SCHEMA: db1
TABLE_NAME: d
ReferenceChain: [
{
"Level": 1,
"TableName": "d",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
}
]
*************************** 3. row ***************************
Level: 1
TABLE_SCHEMA: db1
TABLE_NAME: e
ReferenceChain: [
{
"Level": 1,
"TableName": "e",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
}
]
*************************** 4. row ***************************
Level: 2
TABLE_SCHEMA: db1
TABLE_NAME: c
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
},
{
"Level": 2,
"TableName": "c",
"ColumnNames": [
"b_id1",
"b_id2"
],
"TableSchema": "db1"
}
]
*************************** 5. row ***************************
Level: 2
TABLE_SCHEMA: db1
TABLE_NAME: f
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
},
{
"Level": 2,
"TableName": "f",
"ColumnNames": [
"b_id1",
"b_id2"
],
"TableSchema": "db1"
}
]
*************************** 6. row ***************************
Level: 3
TABLE_SCHEMA: db1
TABLE_NAME: d
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
},
{
"Level": 2,
"TableName": "c",
"ColumnNames": [
"b_id1",
"b_id2"
],
"TableSchema": "db1"
},
{
"Level": 3,
"TableName": "d",
"ColumnNames": [
"c_id"
],
"TableSchema": "db1"
}
]
*************************** 7. row ***************************
Level: 3
TABLE_SCHEMA: db1
TABLE_NAME: f
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
},
{
"Level": 2,
"TableName": "c",
"ColumnNames": [
"b_id1",
"b_id2"
],
"TableSchema": "db1"
},
{
"Level": 3,
"TableName": "f",
"ColumnNames": [
"c_id"
],
"TableSchema": "db1"
}
]
7 rows in set (0.02 sec)
Here the ARRAY_APPEND()
function is used to add the next part of the chain to ReferenceChain
. The query relies on that the UNION
is a UNION DISTINCT
by default, so for the cases where there are two columns in the foreign key, the second (duplicate) row is automatically filtered out. For the main query, JSON_PRETTY()
is used to make it easier to read the JSON document. If you are using the query in an application, this is not needed.
You can stop here. The result is correct. However, you may think there are more rows than you would expect. For example the chain a
→ b
is there on its own (1st row) even though there are also tables with foreign keys to b
. If you want to include subchains in the result, then you are all set. If you want to filter chains out that are part of another chain, a little more work is needed.
To filter out chains that are also included in subsequent rows, it is in one way or another necessary to keep track of whether a row has any child rows (i.e. that a subsequent row is generated based on the row). One way to do this is to have a serialized form of the chain, however the disadvantage is that you don't know how long a string you need to store that (and the string length must be specified in the seed query). Another option is to generate an ID for each row – for example using the UUID()
function. Then in rows generated from the row make a reference to the parent row. This is the option used here.
A disadvantage of this approach is that for tables with more then one column in the foreign key, the two rows generated are no longer identical. So, it is necessary to handle this in the main query. However, it is now easy to only include the end of the chains as these will not have another row with the parent ID set to the row's ID. To find this, use a LEFT OUTER JOIN
and look for rows where the optional row returns a NULL
ID (that is, a row was not found).
Final Query
The final query thus becomes:
mysql> WITH RECURSIVE `reference` (ID, ParentID, Level,
TABLE_SCHEMA, TABLE_NAME,
ReferenceChain) AS (
SELECT UUID(), CAST(NULL AS CHAR(36)), 1,
TABLE_SCHEMA, TABLE_NAME,
JSON_ARRAY(
JSON_OBJECT('Level', 1,
'TableSchema', TABLE_SCHEMA,
'TableName', TABLE_NAME,
'ColumnNames', JSON_ARRAYAGG(COLUMN_NAME)
)
)
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = 'db1'
AND REFERENCED_TABLE_NAME = 'a'
GROUP BY TABLE_SCHEMA, TABLE_NAME
UNION
SELECT UUID(), r.ID, r.Level+1,
ko.TABLE_SCHEMA, ko.TABLE_NAME,
JSON_ARRAY_APPEND(
ReferenceChain, '$',
(SELECT JSON_OBJECT('Level', r.Level+1,
'TableSchema', TABLE_SCHEMA,
'TableName', TABLE_NAME,
'ColumnNames', JSON_ARRAYAGG(COLUMN_NAME)
)
FROM information_schema.KEY_COLUMN_USAGE ki
WHERE ki.TABLE_SCHEMA = ko.TABLE_SCHEMA
AND TABLE_NAME = ko.TABLE_NAME
AND ki.REFERENCED_TABLE_SCHEMA = ko.REFERENCED_TABLE_SCHEMA
AND ki.REFERENCED_TABLE_NAME = ko.REFERENCED_TABLE_NAME)
)
FROM `reference` r
INNER JOIN information_schema.KEY_COLUMN_USAGE ko ON
ko.REFERENCED_TABLE_SCHEMA = r.TABLE_SCHEMA
AND ko.REFERENCED_TABLE_NAME = r.TABLE_NAME
)
SELECT DISTINCT r.Level AS NumLevels,
JSON_PRETTY(r.ReferenceChain) AS ReferenceChain
FROM `reference` r
LEFT OUTER JOIN `reference` rc ON rc.ParentID = r.ID
WHERE rc.ID IS NULL\G
*************************** 1. row ***************************
NumLevels: 1
ReferenceChain: [
{
"Level": 1,
"TableName": "d",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
}
]
*************************** 2. row ***************************
NumLevels: 1
ReferenceChain: [
{
"Level": 1,
"TableName": "e",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
}
]
*************************** 3. row ***************************
NumLevels: 2
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
},
{
"Level": 2,
"TableName": "f",
"ColumnNames": [
"b_id1",
"b_id2"
],
"TableSchema": "db1"
}
]
*************************** 4. row ***************************
NumLevels: 3
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
},
{
"Level": 2,
"TableName": "c",
"ColumnNames": [
"b_id1",
"b_id2"
],
"TableSchema": "db1"
},
{
"Level": 3,
"TableName": "d",
"ColumnNames": [
"c_id"
],
"TableSchema": "db1"
}
]
*************************** 5. row ***************************
NumLevels: 3
ReferenceChain: [
{
"Level": 1,
"TableName": "b",
"ColumnNames": [
"a_id"
],
"TableSchema": "db1"
},
{
"Level": 2,
"TableName": "c",
"ColumnNames": [
"b_id1",
"b_id2"
],
"TableSchema": "db1"
},
{
"Level": 3,
"TableName": "f",
"ColumnNames": [
"c_id"
],
"TableSchema": "db1"
}
]
5 rows in set (0.03 sec)
The DISTINCT
in the main part of the query ensures that duplicates due to multiple columns in the foreign key are filtered out.
Thus, this schema has five unique chains leading to the a
tables. You can also verify this from the EER diagram – for reference, here it is again:
Leave a Reply