Using SQLAlchemy with MySQL 8

I few months ago, I wrote about using the Django framework with MySQL 8. There are also other Python frameworks that are worth considering. In this blog, I will look at using SQLAlchemy with MySQL 8.

In order for you to be able to use MySQL 8 with SQLAlchemy, you need three pieces of software: MySQL Server, MySQL Connector/Python, and SQLAlchemy. I will go through the installations, then I will look at a code example.

Information

The examples in this blog uses MySQL Server 8.0.15, MySQL Connector/Python 8.0.15, and SQLAlchemy 1.2.18. It should be possible to reuse the instructions with other MySQL versions as well, except in older MySQL versions you need to explicitly use the utf8mb4 character set.

If you do not want to read through the whole blog, the main thing for using MySQL and MySQL Connector/Python with SQLAlchemy is to create the engine with dialect set to mysql and the driver to mysqlconnector:

engine = sqlalchemy.create_engine(
    'mysql+mysqlconnector://<user>:<password>@<host>:<port>/<default_db>...')

Installing MySQL Server

There are several ways to install MySQL Server and which one is the best depends on your circumstances and preferences. For the sake of this blog, I will show how MySQL Server can be installed on Oracle Linux/RHEL/CentOS 7 using RPMs and on Microsoft Windows using MySQL Installer. For more options, see the installation chapter in the reference manual. Let’s look at the Linux installation first.

RPM Install on Enterprise Linux

MySQL provides repositories for several Linux distributions including the Oracle Linux/RHEL/CentOS family. This makes it easy to install MySQL. The step to install the repository definition is:

shell$ sudo yum install https://dev.mysql.com/get/mysql80-community-release-el7-1.noarch.rpmLoaded plugins: langpacks, ulninfo
Repository ol7_developer_EPEL is listed more than once in the configuration
mysql80-community-release-el7-1.noarch.rpm                                     |  25 kB  00:00:00     
Examining /var/tmp/yum-root-Ts4OzC/mysql80-community-release-el7-1.noarch.rpm: mysql80-community-release-el7-1.noarch
Marking /var/tmp/yum-root-Ts4OzC/mysql80-community-release-el7-1.noarch.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package mysql80-community-release.noarch 0:el7-1 will be installed
--> Finished Dependency Resolution
 
Dependencies Resolved
 
======================================================================================================
 Package                       Arch       Version   Repository                                   Size
======================================================================================================
Installing:
 mysql80-community-release     noarch     el7-1     /mysql80-community-release-el7-1.noarch      31 k
 
Transaction Summary
======================================================================================================
Install  1 Package
 
Total size: 31 k
Installed size: 31 k
Is this ok [y/d/N]: y
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : mysql80-community-release-el7-1.noarch                                             1/1
  Verifying  : mysql80-community-release-el7-1.noarch                                             1/1
 
Installed:
  mysql80-community-release.noarch 0:el7-1
 
Complete!

Now, you can install MySQL Server. There are several RPMs to choose from and which you need depends on which other features you need to use. A common set of RPMs can be installed as:

shell$ sudo yum install mysql-community-server mysql-community-libs \
            mysql-community-libs-compat mysql-community-common mysql-community-client
...

Note

If you have another MySQL installation, it will be upgraded to the latest release (at the time of writing 8.0.15).

On the first start, the data directory will be initialized:

shell$ sudo systemctl start mysqld

To keep a fresh installation secure, a random password has been set for the root user. This can be found from the MySQL error log:

shell$ sudo grep password /var/log/mysqld.log 
2018-11-05T08:05:09.985857Z 5 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: KWNfuA!1r:PF

Use this password to connect to MySQL and update the password (please use a strong password):

shell$ mysql --user=root --password
Enter password:

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 19
Server version: 8.0.15 MySQL Community Server - GPL

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> ALTER USER root@localhost IDENTIFIED BY 'Kij0@jDi~Faf4';
Query OK, 0 rows affected (0.13 sec)

MySQL is now ready for use. Before continuing, I will show an example of installing MySQL on Microsoft Windows.

Microsoft Windows

On Microsoft Windows an easy way to install MySQL is to use the MySQL Installer. The installer can be downloaded from the MySQL download site. The MySQL Installer can be used to install most MySQL products. If you start MySQL Installer for the first time, you will be taken directly to the screen to choose the products to install; if you already have installed products, you will need to choose to add new products.

On the Select Products and Features screen, choose to install MySQL Server 8.0 (MySQL Installer will list the latest release from the list of available products):

Installing MySQL Server 8.0.13 from MySQL Installer
Installing MySQL Server from MySQL Installer – The screen shot is for 8.0.13, but other than the version number, it is the same for 8.0.15.

Optionally, you can filter the list of products. Feel free to choose other products you want. MySQL Notifier can be useful for starting and stopping MySQL, if you do not plan to have MySQL running at all times. You can also install MySQL Connector/Python this way, however for this blog a different method will be used.

Follow the installation wizard. For this blog, the default choices will work, though during the configuration you may want to ensure Open Windows Firewall ports for network access is unchecked unless you need remote access.

Preparing MySQL Server

While MySQL is now ready to work with SQLAlchemy, you will likely want to do a few more preparation steps. Here creating the MySQL user and schema (database) used by your application will be covered.

An example of creating the user pyuser@localhost and give it all privileges to the sqlalchemy schema and to create the sqlalchemy schema is:

mysql> CREATE USER pyuser@localhost IDENTIFIED BY 'Py@pp4Demo';
Query OK, 0 rows affected (0.01 sec)

mysql> GRANT ALL ON sqlalchemy.* TO pyuser@localhost;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE SCHEMA sqlalchemy;
Query OK, 1 row affected (0.01 sec)

This will allow the pyuser user to connect from the same host as MySQL Server is installed by authenticating with the password Py@app4Demo.

Installing MySQL Connector/Python and SQLAlchemy

Both MySQL Connector/Python and SQLAlchemy can be installed in a platform independent way using the pip command. Since Python 2.7 is soon end of life, I will assume Python 3.6 in this blog. (MySQL Connector/Python 8.0.13 and later also supports Python 3.7.)

If you do not have Python 3.6 installed on Oracle Linux/RHEL/CentOS 7, you can easily install it for example from for EPEL repository. Assuming you have configured the EPEL repository, the following steps install Python 3.6, enable pip, and update pip to the latest version:

shell$ yum install python36
shell$ python3.6 -m ensurepip
shell$ python3.6 -m pip install --upgrade pip

You can now use python3.6 to invoke Python 3.6. In the following, replace python with python3.6 if you have installed Python 3.6 in this way.

To install the latest MySQL Connector/Python release (currently 8.0.15):

PS> python -m pip install mysql-connector-python
Collecting mysql-connector-python
  Using cached https://files.pythonhosted.org/packages/31/45/ef8cf013918108f508a1a1bb5539abaff5f78f3a569f7fa30232967713c9/mysql_connector_python-8.0.15-cp36-cp36m-win_amd64.whl

Collecting protobuf>=3.0.0 (from mysql-connector-python)
  Downloading https://files.pythonhosted.org/packages/5d/5c/476f473c2efc0a8d9fd7185e6c08dcbd21c469698e2a80487fa054b8c5ba/protobuf-3.7.0-cp36-cp36m-win_amd64.whl (1.1MB)
    100% |████████████████████████████████| 1.1MB 6.6MB/s
Requirement already satisfied: six>=1.9 in c:\users\myuser\appdata\local\programs\python\python36\lib\site-packages (from protobuf>=3.0.0->mysql-connector-python)
 (1.11.0)
Requirement already satisfied: setuptools in c:\users\myuser\appdata\local\programs\python\python36\lib\site-packages (from protobuf>=3.0.0->mysql-connector-pytho
n) (28.8.0)
Installing collected packages: protobuf, mysql-connector-python
Successfully installed mysql-connector-python-8.0.15 protobuf-3.7.0

Similar for installing SQLAlchemy:

PS> python -m pip install SQLAlchemy
Collecting SQLAlchemy
  Downloading https://files.pythonhosted.org/packages/21/ed/7eb53305b43ca51774a458d7c292f3bc7664d7a9bbb5bac4149fa34756b9/SQLAlchemy-1.2.18.tar.gz (5.7MB)
    100% |████████████████████████████████| 5.7MB 3.3MB/s
Installing collected packages: SQLAlchemy
  Running setup.py install for SQLAlchemy ... done

That’s it. Now you are ready to use SQLAlchemy with MySQL Connector/Python 8 and MySQL Server 8.

SQLAlchemy Example

With MySQL and SQLAlchemy installed, it is straight forward to use SQLAlchemy in you application. Here, I will just show a very simple example based on the tutorial in the official SQLAlchemy manual.

The first step is to import the necessary parts of SQLAlchemy. In this example, only sqlalchemy itself and declarative_base from sqlalchemy.ext.declarative are required:

import sqlalchemy
from sqlalchemy.ext.declarative import declarative_base

The second step is to define MySQL as the engine and that you want to connect using MySQL Connector/Python:

# Define the MySQL engine using MySQL Connector/Python
engine = sqlalchemy.create_engine(
    'mysql+mysqlconnector://pyuser:Py@pp4Demo@localhost:3306/sqlalchemy',
    echo=True)

The definition of MySQL and MySQL Connector Python happens in line 6. mysql defines that you are using MySQL as the database (the dialect), and mysqlconnector tells that you want to use MySQL Connector/Python as the driver. The rest of the line defines the connection options. In this case you have user:password@host:port/default_db. You can add more options if you need it. So, in this case the following options have been set:

  • User: pyuser
  • Password: Py@app4Demo
  • Host: localhost
  • Port: 3306
  • Default database: sqlalchemy

Avoid

Never hard code the connection parameters into your application. Particularly the password is an absolutely no go. It is done here to keep the example simple, but doing so in an actual application makes deployments hard and pose a severe security issue.

See also the documentation of the engine configuration in the SQLAlchemy manual for more information.

The argument echo=True makes SQLAlchemy print each SQL statement it executes. This can be useful when testing.

The third step is to define and create a table – in this example the users table:

# Define and create the table
Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(length=50))
    fullname = sqlalchemy.Column(sqlalchemy.String(length=50))
    nickname = sqlalchemy.Column(sqlalchemy.String(length=50))

    def __repr__(self):
        return "<User(name='{0}', fullname='{1}', nickname='{2}')>".format(
                            self.name, self.fullname, self.nickname)

Base.metadata.create_all(engine)

MySQL requires that you specify the maximum number of characters for varchar columns, which is the data type used when specifying sqlalchemy.String(). So, the length argument is passed. (The length argument is also the first argument to sqlalchemy.String(), so you do not need to specify explicitly that it is the length.)

The call to Base.metadata.create_all() tells SQLAlchemy to create the underlying database table, if it does not already exist. Otherwise the existing table will be used.

The fourth step is to add a user. This requires a session. Once the session is created, you can add the user to it:

# Create a session
Session = sqlalchemy.orm.sessionmaker()
Session.configure(bind=engine)
session = Session()

# Add a user
jwk_user = User(name='jesper', fullname='Jesper Wisborg Krogh', nickname='&#x1f42c;')
session.add(jwk_user)
session.commit()

There are two things, I will like you to pay attention to here. When defining the jwk_user in line 31, the nickname is set to be the dolphin emoji. This emoji requires four bytes in UTF-8 (0xF09F90AC). In older versions of MySQL (5.7 and earlier), you would need to explicitly change the character set to handle UTF-8. In MySQL 8, however, the four byte variant of UTF-8 (utf8mb4) is the default, so the dolphin emoji will work out of the box.

Tip

If you are interested in more information about the recommendation of which character set to use in MySQL, I wrote a blog about that last year: Which Character Set Should You Use in MySQL?

The other thing is that in order to persist the new user, you need to call session.commit(). The session works as a transaction here, so the changes will not be persisted until it is explicitly committed.

The fifth – and final – step is to query the data just saved:

# Query the user
our_user = session.query(User).filter_by(name='jesper').first()
print('\nOur User:')
print(our_user)
print('Nick name in hex: {0}'.format(our_user.nickname.encode('utf-8')))

The query finds the first user with the name set to “jesper”. Because the dolphin emoji tends not to be displayed correctly in many shells, the byte sequence in hex is also printed, so it can be confirmed it is indeed the dolphin emoji that was retrieved.

The output of the entire program is:

2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'sql_mode'
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine {}
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'lower_case_table_names'
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine {}
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine SELECT DATABASE()
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine {}
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS CHAR(60)) AS anon_1
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine {}
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS CHAR(60)) AS anon_1
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine {}
2019-03-03 14:02:57,265 INFO sqlalchemy.engine.base.Engine DESCRIBE `users`
2019-03-03 14:02:57,280 INFO sqlalchemy.engine.base.Engine {}
2019-03-03 14:02:57,280 INFO sqlalchemy.engine.base.Engine ROLLBACK
2019-03-03 14:02:57,280 INFO sqlalchemy.engine.base.Engine
CREATE TABLE users (
        id INTEGER NOT NULL AUTO_INCREMENT,
        name VARCHAR(50),
        fullname VARCHAR(50),
        nickname VARCHAR(50),
        PRIMARY KEY (id)
)


2019-03-03 14:02:57,280 INFO sqlalchemy.engine.base.Engine {}
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine COMMIT
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine INSERT INTO users (name, fullname, nickname) VALUES (%(name)s, %(fullname)s, %(nickname)s)
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine {'name': 'jesper', 'fullname': 'Jesper Wisborg Krogh', 'nickname': '&#x1f42c;'}
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine COMMIT
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine SELECT users.id AS users_id, users.name AS users_name, users.fullname AS users_fullname, users.nickname AS users_nickname
FROM users
WHERE users.name = %(name_1)s
 LIMIT %(param_1)s
2019-03-03 14:02:57,312 INFO sqlalchemy.engine.base.Engine {'name_1': 'jesper', 'param_1': 1}

Our User:
<User(name='jesper', fullname='Jesper Wisborg Krogh', nickname='&#x1f42c;')>
Nick name in hex: b'\xf0\x9f\x90\xac'

As it can be seen from the last two lines, the dolphin emoji was saved and retrieved correctly.

Complete Example Code

For completeness, here is the entire example program:

import sqlalchemy
from sqlalchemy.ext.declarative import declarative_base

# Define the MySQL engine using MySQL Connector/Python
engine = sqlalchemy.create_engine(
    'mysql+mysqlconnector://pyuser:Py@pp4Demo@localhost:3306/sqlalchemy',
    echo=True)

# Define and create the table
Base = declarative_base()
class User(Base):
    __tablename__ = 'users'

    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    name = sqlalchemy.Column(sqlalchemy.String(length=50))
    fullname = sqlalchemy.Column(sqlalchemy.String(length=50))
    nickname = sqlalchemy.Column(sqlalchemy.String(length=50))

    def __repr__(self):
        return "<User(name='{0}', fullname='{1}', nickname='{2}')>".format(
                            self.name, self.fullname, self.nickname)

Base.metadata.create_all(engine)

# Create a session
Session = sqlalchemy.orm.sessionmaker()
Session.configure(bind=engine)
session = Session()

# Add a user
jwk_user = User(name='jesper', fullname='Jesper Wisborg Krogh', nickname='&#x1f42c;')
session.add(jwk_user)
session.commit()

# Query the user
our_user = session.query(User).filter_by(name='jesper').first()
print('\nOur User:')
print(our_user)
print('Nick name in hex: {0}'.format(our_user.nickname.encode('utf-8')))

Enjoy using MySQL, MySQL Connector/Python, and SQLAlchemy.

Which Character Set Should You Use in MySQL?

MySQL supports a long list of character sets, and one of the strong points are that you can use different character sets per column. This is very flexible, but which character set should you use?

If you do not want to read the whole blog the summary is: Use utf8mb4, particularly in MySQL 8.0 with one of the UCA 9.0.0 based collations.

Before I reach this conclusion, let’s start out looking at what a character set is.

What is a Character Set?

Characters are wonderful things that can take all kinds of shapes ranging from the familiar (for English speakers at least) Latin characters as those used to write this blog, over Arabic (مرحبا بالعالم) and Asian characters such as simplified Chinese (你好,世界) to emojis (❤️🐬). This is all cool, but computers do not deal in characters, they work in bits, so how are all of these wonderful characters represented?

The answer is that character sets are used to encode the characters into bits. Given the number of known characters (Unicode 11.0 contains 137374 characters). Just to cover the possibility of those characters, 18 bits are needed (so effectively 3 bytes). That is significant.

In the good old days when computers were new, it was mostly English that was used, and using 18 bits per character was just not feasible. A popular home computer in the 1980s was the Commodore 64 was an 8-bit computer with a total of 64kB of RAM – you just could not get away with using multiple bytes per character.

In the early days a popular character set was ASCII which uses 7 bits to encode each character. This allows for 128 unique characters which is ok for plain English but does not even allow for all of the Western European accented characters to be included. One answer to that are the Latin character sets, such as ISO 8859-1 (Latin 1) and Windows-1252. These use 8 bits and thus supports 256 characters for each character set. Together they allow complete coverage for the languages using the Latin letters, but you have to choose the right one for the language you are writing, and in all cases 8 bit does not help much if you need to write something in Chinese that has thousands of characters.

The current solution is to use Unicode characters. The Unicode Consortium defines code points for the characters included. As mentioned, in Unicode 11.0 there are just over 137000 characters. This covers most usages (yes, there are actually still characters missing). There is a bit more to it than that, however. One thing is the code points defined by the Unicode Consortium, but these still need bits representations for computers. There are several solutions to that with UTF-8 being the most commonly used nowadays.

UTF-8 is a variable width encoding that uses one to four bytes to represent a Unicode code point. Part of the success of UTF-8 is that the characters used in ASCII have the same encoding in UTF-8 – that is, if you have a document in ASCII, you can just say it is now in UTF-8 and all works well. That made it easy for the large number of ASCII based websites to migrate to use Unicode. The rest of the characters use one or more bytes. While the compatibility with ASCII has been great for adaptation and it helps keep the size of documents in English down, it also has some downsides. One is that some other languages uses more bytes that needed; another is that scanning a string is relatively expensive as you must decode each character to know where the next character begins.

Before moving on, let’s have a quick look at some actual encodings in different character sets, to see the difference (UCS2 is a two-byte fixed width Unicode character set):

CharacterASCIILatin-1UTF-8UCS-2
A4141410041
ÅC5C38500C5
E7958C754C
🐬F09F90AC

The table shows how there clearly are some relations between these character sets, but also how only UTF-8 can represent all of the four test characters. So, from this comparison the strengths of UTF-8 is starting to show, but also the weaknesses. The character Å requires two bytes in UTF-8 (and UCS-2) but only one in Latin-1. The Chinese character 界 requires three bytes in UTF-8 but only two in UCS-2.

So, what does this mean in the context of MySQL? Let’s take a look.

MySQL Character Sets

Until MySQL 8.0 the default character set in MySQL was Latin-1 (named latin1). This was a convenient character set in many ways, for example it was fixed width, so finding the Nth character in a string was fast and it could store text for most Western European languages. However as discussed, Latin-1 is not what is used in this day and age – the World has moved on to UTF-8. So, in MySQL 8.0 the change was made to make utf8mb4 the default character set.

Stop a minute – what is utf8mb4? How does that differ from UTF-8 that was discussed in the previous section? Well, it is the same thing. Unfortunately, when UTF-8 was first implemented in MySQL, it was limited to three bytes (in MySQL 8.0 called utf8mb3). This predates my involvement with MySQL, but a guess for this choice is that it is related to internal temporary tables which in MySQL 5.7 and earlier uses the MEMORY storage engine when possible. Internal temporary tables are for example used to store the result of subquery and for sorting. The MEMORY storage engine only supports fix width columns, so a varchar(10) column would be treated as a char(10) column in an in-memory internal temporary table. With utf8mb4 that would mean 40 bytes, with the choice of a 3-byte implementation it would mean 30 bytes. Furthermore, until the emergence of emojis, it was rarely required to use more than three bytes in UTF-8.

Anyway, MySQL has support for a wide range of character sets to suite your specific need. In MySQL 8 there are a total of 41 character sets to choose from. The details of the character sets can be found in the information_schema.CHARACTER_SETS table:

mysql-sql> SELECT *
             FROM information_schema.CHARACTER_SETS
            ORDER BY CHARACTER_SET_NAME;
+--------------------+----------------------+---------------------------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION                     | MAXLEN |
+--------------------+----------------------+---------------------------------+--------+
| armscii8           | armscii8_general_ci  | ARMSCII-8 Armenian              |      1 |
| ascii              | ascii_general_ci     | US ASCII                        |      1 |
| big5               | big5_chinese_ci      | Big5 Traditional Chinese        |      2 |
| binary             | binary               | Binary pseudo charset           |      1 |
| cp1250             | cp1250_general_ci    | Windows Central European        |      1 |
| cp1251             | cp1251_general_ci    | Windows Cyrillic                |      1 |
| cp1256             | cp1256_general_ci    | Windows Arabic                  |      1 |
| cp1257             | cp1257_general_ci    | Windows Baltic                  |      1 |
| cp850              | cp850_general_ci     | DOS West European               |      1 |
| cp852              | cp852_general_ci     | DOS Central European            |      1 |
| cp866              | cp866_general_ci     | DOS Russian                     |      1 |
| cp932              | cp932_japanese_ci    | SJIS for Windows Japanese       |      2 |
| dec8               | dec8_swedish_ci      | DEC West European               |      1 |
| eucjpms            | eucjpms_japanese_ci  | UJIS for Windows Japanese       |      3 |
| euckr              | euckr_korean_ci      | EUC-KR Korean                   |      2 |
| gb18030            | gb18030_chinese_ci   | China National Standard GB18030 |      4 |
| gb2312             | gb2312_chinese_ci    | GB2312 Simplified Chinese       |      2 |
| gbk                | gbk_chinese_ci       | GBK Simplified Chinese          |      2 |
| geostd8            | geostd8_general_ci   | GEOSTD8 Georgian                |      1 |
| greek              | greek_general_ci     | ISO 8859-7 Greek                |      1 |
| hebrew             | hebrew_general_ci    | ISO 8859-8 Hebrew               |      1 |
| hp8                | hp8_english_ci       | HP West European                |      1 |
| keybcs2            | keybcs2_general_ci   | DOS Kamenicky Czech-Slovak      |      1 |
| koi8r              | koi8r_general_ci     | KOI8-R Relcom Russian           |      1 |
| koi8u              | koi8u_general_ci     | KOI8-U Ukrainian                |      1 |
| latin1             | latin1_swedish_ci    | cp1252 West European            |      1 |
| latin2             | latin2_general_ci    | ISO 8859-2 Central European     |      1 |
| latin5             | latin5_turkish_ci    | ISO 8859-9 Turkish              |      1 |
| latin7             | latin7_general_ci    | ISO 8859-13 Baltic              |      1 |
| macce              | macce_general_ci     | Mac Central European            |      1 |
| macroman           | macroman_general_ci  | Mac West European               |      1 |
| sjis               | sjis_japanese_ci     | Shift-JIS Japanese              |      2 |
| swe7               | swe7_swedish_ci      | 7bit Swedish                    |      1 |
| tis620             | tis620_thai_ci       | TIS620 Thai                     |      1 |
| ucs2               | ucs2_general_ci      | UCS-2 Unicode                   |      2 |
| ujis               | ujis_japanese_ci     | EUC-JP Japanese                 |      3 |
| utf16              | utf16_general_ci     | UTF-16 Unicode                  |      4 |
| utf16le            | utf16le_general_ci   | UTF-16LE Unicode                |      4 |
| utf32              | utf32_general_ci     | UTF-32 Unicode                  |      4 |
| utf8               | utf8_general_ci      | UTF-8 Unicode                   |      3 |
| utf8mb4            | utf8mb4_0900_ai_ci   | UTF-8 Unicode                   |      4 |
+--------------------+----------------------+---------------------------------+--------+
41 rows in set (0.0010 sec)

Notice how there are several UTF and other Unicode character sets including utf8 and utf8m4. The utf8 character set is the 3-byte implementation. The name utf8 has now been deprecated in favour of utf8mb3 to make it specific that you are using an implementation that can at most use three bytes per character.

The table has four columns: The character set name, the default collation, a description, and the maximum number of bytes per character. The first and the two last of the columns are clear enough, but what is the collation? Let’s discuss that.

What is a Collation?

It is all well and good that you now know how to encode characters, but one of the main uses of databases is to compare data. Are two values the same or is one larger than the other? For numbers this is in general well defined (floating point comparisons and release numbers are examples where it is less clear), for example 10 is greater than 5.

However, for strings it is not that simple. Different languages have different rules for the order of the characters, and there are the questions whether an accented character should be considered the same as the base character and whether upper and lower case should be considered the same. An example is that in Danish and Norwegian, the character Ø comes before Å, but in Swedish Å comes before Ö (Ø and Ö are essentially the same letter).

The rules that define the order of the characters and whether to take accents and capitalization into account is called collations. There are many collations for the same character set. In MySQL 8.0 there are alone 73 collations to choose from for the utf8mb4 character set (more later about utf8mb4).

The collations that can be used with a given character set can be found in the information_schema.COLLATIONS table. For example, to see the 73 collations available for utf8mb4:

mysql-sql> > SELECT *
              FROM information_schema.COLLATIONS
             WHERE CHARACTER_SET_NAME = 'utf8mb4'
             ORDER BY COLLATION_NAME;
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| COLLATION_NAME             | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| utf8mb4_0900_ai_ci         | utf8mb4            | 255 | Yes        | Yes         |       0 | NO PAD        |
| utf8mb4_0900_as_ci         | utf8mb4            | 305 |            | Yes         |       0 | NO PAD        |
| utf8mb4_0900_as_cs         | utf8mb4            | 278 |            | Yes         |       0 | NO PAD        |
| utf8mb4_bin                | utf8mb4            |  46 |            | Yes         |       1 | PAD SPACE     |
| utf8mb4_croatian_ci        | utf8mb4            | 245 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_cs_0900_ai_ci      | utf8mb4            | 266 |            | Yes         |       0 | NO PAD        |
| utf8mb4_cs_0900_as_cs      | utf8mb4            | 289 |            | Yes         |       0 | NO PAD        |
| utf8mb4_czech_ci           | utf8mb4            | 234 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_danish_ci          | utf8mb4            | 235 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_da_0900_ai_ci      | utf8mb4            | 267 |            | Yes         |       0 | NO PAD        |
| utf8mb4_da_0900_as_cs      | utf8mb4            | 290 |            | Yes         |       0 | NO PAD        |
| utf8mb4_de_pb_0900_ai_ci   | utf8mb4            | 256 |            | Yes         |       0 | NO PAD        |
| utf8mb4_de_pb_0900_as_cs   | utf8mb4            | 279 |            | Yes         |       0 | NO PAD        |
| utf8mb4_eo_0900_ai_ci      | utf8mb4            | 273 |            | Yes         |       0 | NO PAD        |
| utf8mb4_eo_0900_as_cs      | utf8mb4            | 296 |            | Yes         |       0 | NO PAD        |
| utf8mb4_esperanto_ci       | utf8mb4            | 241 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_estonian_ci        | utf8mb4            | 230 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_es_0900_ai_ci      | utf8mb4            | 263 |            | Yes         |       0 | NO PAD        |
| utf8mb4_es_0900_as_cs      | utf8mb4            | 286 |            | Yes         |       0 | NO PAD        |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4            | 270 |            | Yes         |       0 | NO PAD        |
| utf8mb4_es_trad_0900_as_cs | utf8mb4            | 293 |            | Yes         |       0 | NO PAD        |
| utf8mb4_et_0900_ai_ci      | utf8mb4            | 262 |            | Yes         |       0 | NO PAD        |
| utf8mb4_et_0900_as_cs      | utf8mb4            | 285 |            | Yes         |       0 | NO PAD        |
| utf8mb4_general_ci         | utf8mb4            |  45 |            | Yes         |       1 | PAD SPACE     |
| utf8mb4_german2_ci         | utf8mb4            | 244 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_hr_0900_ai_ci      | utf8mb4            | 275 |            | Yes         |       0 | NO PAD        |
| utf8mb4_hr_0900_as_cs      | utf8mb4            | 298 |            | Yes         |       0 | NO PAD        |
| utf8mb4_hungarian_ci       | utf8mb4            | 242 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_hu_0900_ai_ci      | utf8mb4            | 274 |            | Yes         |       0 | NO PAD        |
| utf8mb4_hu_0900_as_cs      | utf8mb4            | 297 |            | Yes         |       0 | NO PAD        |
| utf8mb4_icelandic_ci       | utf8mb4            | 225 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_is_0900_ai_ci      | utf8mb4            | 257 |            | Yes         |       0 | NO PAD        |
| utf8mb4_is_0900_as_cs      | utf8mb4            | 280 |            | Yes         |       0 | NO PAD        |
| utf8mb4_ja_0900_as_cs      | utf8mb4            | 303 |            | Yes         |       0 | NO PAD        |
| utf8mb4_ja_0900_as_cs_ks   | utf8mb4            | 304 |            | Yes         |      24 | NO PAD        |
| utf8mb4_latvian_ci         | utf8mb4            | 226 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_la_0900_ai_ci      | utf8mb4            | 271 |            | Yes         |       0 | NO PAD        |
| utf8mb4_la_0900_as_cs      | utf8mb4            | 294 |            | Yes         |       0 | NO PAD        |
| utf8mb4_lithuanian_ci      | utf8mb4            | 236 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_lt_0900_ai_ci      | utf8mb4            | 268 |            | Yes         |       0 | NO PAD        |
| utf8mb4_lt_0900_as_cs      | utf8mb4            | 291 |            | Yes         |       0 | NO PAD        |
| utf8mb4_lv_0900_ai_ci      | utf8mb4            | 258 |            | Yes         |       0 | NO PAD        |
| utf8mb4_lv_0900_as_cs      | utf8mb4            | 281 |            | Yes         |       0 | NO PAD        |
| utf8mb4_persian_ci         | utf8mb4            | 240 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_pl_0900_ai_ci      | utf8mb4            | 261 |            | Yes         |       0 | NO PAD        |
| utf8mb4_pl_0900_as_cs      | utf8mb4            | 284 |            | Yes         |       0 | NO PAD        |
| utf8mb4_polish_ci          | utf8mb4            | 229 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_romanian_ci        | utf8mb4            | 227 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_roman_ci           | utf8mb4            | 239 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_ro_0900_ai_ci      | utf8mb4            | 259 |            | Yes         |       0 | NO PAD        |
| utf8mb4_ro_0900_as_cs      | utf8mb4            | 282 |            | Yes         |       0 | NO PAD        |
| utf8mb4_ru_0900_ai_ci      | utf8mb4            | 306 |            | Yes         |       0 | NO PAD        |
| utf8mb4_ru_0900_as_cs      | utf8mb4            | 307 |            | Yes         |       0 | NO PAD        |
| utf8mb4_sinhala_ci         | utf8mb4            | 243 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_sk_0900_ai_ci      | utf8mb4            | 269 |            | Yes         |       0 | NO PAD        |
| utf8mb4_sk_0900_as_cs      | utf8mb4            | 292 |            | Yes         |       0 | NO PAD        |
| utf8mb4_slovak_ci          | utf8mb4            | 237 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_slovenian_ci       | utf8mb4            | 228 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_sl_0900_ai_ci      | utf8mb4            | 260 |            | Yes         |       0 | NO PAD        |
| utf8mb4_sl_0900_as_cs      | utf8mb4            | 283 |            | Yes         |       0 | NO PAD        |
| utf8mb4_spanish2_ci        | utf8mb4            | 238 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_spanish_ci         | utf8mb4            | 231 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_sv_0900_ai_ci      | utf8mb4            | 264 |            | Yes         |       0 | NO PAD        |
| utf8mb4_sv_0900_as_cs      | utf8mb4            | 287 |            | Yes         |       0 | NO PAD        |
| utf8mb4_swedish_ci         | utf8mb4            | 232 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_tr_0900_ai_ci      | utf8mb4            | 265 |            | Yes         |       0 | NO PAD        |
| utf8mb4_tr_0900_as_cs      | utf8mb4            | 288 |            | Yes         |       0 | NO PAD        |
| utf8mb4_turkish_ci         | utf8mb4            | 233 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_unicode_520_ci     | utf8mb4            | 246 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_unicode_ci         | utf8mb4            | 224 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_vietnamese_ci      | utf8mb4            | 247 |            | Yes         |       8 | PAD SPACE     |
| utf8mb4_vi_0900_ai_ci      | utf8mb4            | 277 |            | Yes         |       0 | NO PAD        |
| utf8mb4_vi_0900_as_cs      | utf8mb4            | 300 |            | Yes         |       0 | NO PAD        |
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
73 rows in set (0.0023 sec)

The most interesting in this discussion is the name. As you can see there is a pattern to the names, and it is possible to identify up to four parts in the name

  • First the character set name.
  • Then which language it is for, for example ja for Japanese. A special “language” is binary which mean each byte is compared directly one by one.
  • Then whether it is a UCA 9.0.0 based collation. The UCA 9.0.0 based collations have 0900 in the name.
  • Then up to three modifiers depending on whether it is accent and/or case sensitive or insensitive.

There are a total of five accent and case modifiers:

  • ai: accent insensitive, ä is equal to a.
  • as: accent sensitive, ä is not equal to a.
  • ci: case insensitive, a is equal to A.
  • cs: case sensitive, a is not equal to A.
  • ks: kana sensitive (only for Japanese)

This brings us on to MySQL 8.0 and character sets and collations.

MySQL 8.0

Good job of making it this far. You are almost at the end now.

As part of the work to make utf8mb4 the default character set in MySQL 8.0, a large amount of work was put into making the MySQL work more efficiently with the UTF-8 character set and to make it more useful for the users. Some of the changes include:

  • The UCA 9.0.0 collations (with 0900 in the name) are new in MySQL 8.0. MySQL works faster with these new collations than the old collations.
  • In-memory internal temporary tables now use the new Temptable storage engine by default. It supports variable width columns making it much more efficient with UTF-8 than earlier versions of MySQL. From version 8.0.12 (just released) the Temptable storage engine also supports BLOBs.

Another thing to be aware of is that the new X DevAPI that for example allows NoSQL access to both the MySQL Document store with JSON documents and to SQL tables expect all query results to be UTF-8.

For these reasons it is recommended to use utf8mb4 with one of the new UCA 9.0.0 collations for most uses in MySQL 8.0. The default collation for utf8mb4 in MySQL 8.0 is utf8mb4_0900_ai_ci. This is a good collation for the general use cases, but feel free to use one of the more specific collations if that works better for your application. For example, the utf8mb4_ja_0900_as_cs or utf8mb4_ja_0900_as_cs_ks if you need correct Japanese comparisons (such as the fix for the infamous Sushi = Beer bug).

Character sets and collations are big topics. If you are interested to know more about MySQL 8.0 and character sets, there is a series of blogs on the MySQL Server Blog, for example:

Additionally, the MySQL reference manual has a chapter with 15 sections about character sets and collations: Chapter 10 Character Sets, Collations, Unicode.