Which Character Set Should You Use in MySQL?
MySQL supports a long list of character sets, and one of the strong points are that you can use different character sets per column. This is very flexible, but which character set should you use?
If you do not want to read the whole blog the summary is: Use utf8mb4, particularly in MySQL 8.0 with one of the UCA 9.0.0 based collations.
Before I reach this conclusion, let’s start out looking at what a character set is.
What is a Character Set?
Characters are wonderful things that can take all kinds of shapes ranging from the familiar (for English speakers at least) Latin characters as those used to write this blog, over Arabic (مرحبا بالعالم) and Asian characters such as simplified Chinese (你好，世界) to emojis (❤️🐬). This is all cool, but computers do not deal in characters, they work in bits, so how are all of these wonderful characters represented?
The answer is that character sets are used to encode the characters into bits. Given the number of known characters (Unicode 11.0 contains 137374 characters). Just to cover the possibility of those characters, 18 bits are needed (so effectively 3 bytes). That is significant.
In the good old days when computers were new, it was mostly English that was used, and using 18 bits per character was just not feasible. A popular home computer in the 1980s was the Commodore 64 was an 8-bit computer with a total of 64kB of RAM – you just could not get away with using multiple bytes per character.
In the early days a popular character set was ASCII which uses 7 bits to encode each character. This allows for 128 unique characters which is ok for plain English but does not even allow for all of the Western European accented characters to be included. One answer to that are the Latin character sets, such as ISO 8859-1 (Latin 1) and Windows-1252. These use 8 bits and thus supports 256 characters for each character set. Together they allow complete coverage for the languages using the Latin letters, but you have to choose the right one for the language you are writing, and in all cases 8 bit does not help much if you need to write something in Chinese that has thousands of characters.
The current solution is to use Unicode characters. The Unicode Consortium defines code points for the characters included. As mentioned, in Unicode 11.0 there are just over 137000 characters. This covers most usages (yes, there are actually still characters missing). There is a bit more to it than that, however. One thing is the code points defined by the Unicode Consortium, but these still need bits representations for computers. There are several solutions to that with UTF-8 being the most commonly used nowadays.
UTF-8 is a variable width encoding that uses one to four bytes to represent a Unicode code point. Part of the success of UTF-8 is that the characters used in ASCII have the same encoding in UTF-8 – that is, if you have a document in ASCII, you can just say it is now in UTF-8 and all works well. That made it easy for the large number of ASCII based websites to migrate to use Unicode. The rest of the characters use one or more bytes. While the compatibility with ASCII has been great for adaptation and it helps keep the size of documents in English down, it also has some downsides. One is that some other languages uses more bytes that needed; another is that scanning a string is relatively expensive as you must decode each character to know where the next character begins.
Before moving on, let’s have a quick look at some actual encodings in different character sets, to see the difference (UCS2 is a two-byte fixed width Unicode character set):
The table shows how there clearly are some relations between these character sets, but also how only UTF-8 can represent all of the four test characters. So, from this comparison the strengths of UTF-8 is starting to show, but also the weaknesses. The character Å requires two bytes in UTF-8 (and UCS-2) but only one in Latin-1. The Chinese character 界 requires three bytes in UTF-8 but only two in UCS-2.
So, what does this mean in the context of MySQL? Let’s take a look.
MySQL Character Sets
Until MySQL 8.0 the default character set in MySQL was Latin-1 (named latin1). This was a convenient character set in many ways, for example it was fixed width, so finding the Nth character in a string was fast and it could store text for most Western European languages. However as discussed, Latin-1 is not what is used in this day and age – the World has moved on to UTF-8. So, in MySQL 8.0 the change was made to make utf8mb4 the default character set.
Stop a minute – what is utf8mb4? How does that differ from UTF-8 that was discussed in the previous section? Well, it is the same thing. Unfortunately, when UTF-8 was first implemented in MySQL, it was limited to three bytes (in MySQL 8.0 called utf8mb3). This predates my involvement with MySQL, but a guess for this choice is that it is related to internal temporary tables which in MySQL 5.7 and earlier uses the MEMORY storage engine when possible. Internal temporary tables are for example used to store the result of subquery and for sorting. The MEMORY storage engine only supports fix width columns, so a varchar(10) column would be treated as a char(10) column in an in-memory internal temporary table. With utf8mb4 that would mean 40 bytes, with the choice of a 3-byte implementation it would mean 30 bytes. Furthermore, until the emergence of emojis, it was rarely required to use more than three bytes in UTF-8.
Anyway, MySQL has support for a wide range of character sets to suite your specific need. In MySQL 8 there are a total of 41 character sets to choose from. The details of the character sets can be found in the information_schema.CHARACTER_SETS table:
mysql-sql> SELECT * FROM information_schema.CHARACTER_SETS ORDER BY CHARACTER_SET_NAME; +--------------------+----------------------+---------------------------------+--------+ | CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION | MAXLEN | +--------------------+----------------------+---------------------------------+--------+ | armscii8 | armscii8_general_ci | ARMSCII-8 Armenian | 1 | | ascii | ascii_general_ci | US ASCII | 1 | | big5 | big5_chinese_ci | Big5 Traditional Chinese | 2 | | binary | binary | Binary pseudo charset | 1 | | cp1250 | cp1250_general_ci | Windows Central European | 1 | | cp1251 | cp1251_general_ci | Windows Cyrillic | 1 | | cp1256 | cp1256_general_ci | Windows Arabic | 1 | | cp1257 | cp1257_general_ci | Windows Baltic | 1 | | cp850 | cp850_general_ci | DOS West European | 1 | | cp852 | cp852_general_ci | DOS Central European | 1 | | cp866 | cp866_general_ci | DOS Russian | 1 | | cp932 | cp932_japanese_ci | SJIS for Windows Japanese | 2 | | dec8 | dec8_swedish_ci | DEC West European | 1 | | eucjpms | eucjpms_japanese_ci | UJIS for Windows Japanese | 3 | | euckr | euckr_korean_ci | EUC-KR Korean | 2 | | gb18030 | gb18030_chinese_ci | China National Standard GB18030 | 4 | | gb2312 | gb2312_chinese_ci | GB2312 Simplified Chinese | 2 | | gbk | gbk_chinese_ci | GBK Simplified Chinese | 2 | | geostd8 | geostd8_general_ci | GEOSTD8 Georgian | 1 | | greek | greek_general_ci | ISO 8859-7 Greek | 1 | | hebrew | hebrew_general_ci | ISO 8859-8 Hebrew | 1 | | hp8 | hp8_english_ci | HP West European | 1 | | keybcs2 | keybcs2_general_ci | DOS Kamenicky Czech-Slovak | 1 | | koi8r | koi8r_general_ci | KOI8-R Relcom Russian | 1 | | koi8u | koi8u_general_ci | KOI8-U Ukrainian | 1 | | latin1 | latin1_swedish_ci | cp1252 West European | 1 | | latin2 | latin2_general_ci | ISO 8859-2 Central European | 1 | | latin5 | latin5_turkish_ci | ISO 8859-9 Turkish | 1 | | latin7 | latin7_general_ci | ISO 8859-13 Baltic | 1 | | macce | macce_general_ci | Mac Central European | 1 | | macroman | macroman_general_ci | Mac West European | 1 | | sjis | sjis_japanese_ci | Shift-JIS Japanese | 2 | | swe7 | swe7_swedish_ci | 7bit Swedish | 1 | | tis620 | tis620_thai_ci | TIS620 Thai | 1 | | ucs2 | ucs2_general_ci | UCS-2 Unicode | 2 | | ujis | ujis_japanese_ci | EUC-JP Japanese | 3 | | utf16 | utf16_general_ci | UTF-16 Unicode | 4 | | utf16le | utf16le_general_ci | UTF-16LE Unicode | 4 | | utf32 | utf32_general_ci | UTF-32 Unicode | 4 | | utf8 | utf8_general_ci | UTF-8 Unicode | 3 | | utf8mb4 | utf8mb4_0900_ai_ci | UTF-8 Unicode | 4 | +--------------------+----------------------+---------------------------------+--------+ 41 rows in set (0.0010 sec)
Notice how there are several UTF and other Unicode character sets including utf8 and utf8m4. The utf8 character set is the 3-byte implementation. The name utf8 has now been deprecated in favour of utf8mb3 to make it specific that you are using an implementation that can at most use three bytes per character.
The table has four columns: The character set name, the default collation, a description, and the maximum number of bytes per character. The first and the two last of the columns are clear enough, but what is the collation? Let’s discuss that.
What is a Collation?
It is all well and good that you now know how to encode characters, but one of the main uses of databases is to compare data. Are two values the same or is one larger than the other? For numbers this is in general well defined (floating point comparisons and release numbers are examples where it is less clear), for example 10 is greater than 5.
However, for strings it is not that simple. Different languages have different rules for the order of the characters, and there are the questions whether an accented character should be considered the same as the base character and whether upper and lower case should be considered the same. An example is that in Danish and Norwegian, the character Ø comes before Å, but in Swedish Å comes before Ö (Ø and Ö are essentially the same letter).
The rules that define the order of the characters and whether to take accents and capitalization into account is called collations. There are many collations for the same character set. In MySQL 8.0 there are alone 73 collations to choose from for the utf8mb4 character set (more later about utf8mb4).
The collations that can be used with a given character set can be found in the information_schema.COLLATIONS table. For example, to see the 73 collations available for utf8mb4:
mysql-sql> > SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' ORDER BY COLLATION_NAME; +----------------------------+--------------------+-----+------------+-------------+---------+---------------+ | COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE | +----------------------------+--------------------+-----+------------+-------------+---------+---------------+ | utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD | | utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD | | utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD | | utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE | | utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | PAD SPACE | | utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | | Yes | 0 | NO PAD | | utf8mb4_cs_0900_as_cs | utf8mb4 | 289 | | Yes | 0 | NO PAD | | utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | PAD SPACE | | utf8mb4_danish_ci | utf8mb4 | 235 | | Yes | 8 | PAD SPACE | | utf8mb4_da_0900_ai_ci | utf8mb4 | 267 | | Yes | 0 | NO PAD | | utf8mb4_da_0900_as_cs | utf8mb4 | 290 | | Yes | 0 | NO PAD | | utf8mb4_de_pb_0900_ai_ci | utf8mb4 | 256 | | Yes | 0 | NO PAD | | utf8mb4_de_pb_0900_as_cs | utf8mb4 | 279 | | Yes | 0 | NO PAD | | utf8mb4_eo_0900_ai_ci | utf8mb4 | 273 | | Yes | 0 | NO PAD | | utf8mb4_eo_0900_as_cs | utf8mb4 | 296 | | Yes | 0 | NO PAD | | utf8mb4_esperanto_ci | utf8mb4 | 241 | | Yes | 8 | PAD SPACE | | utf8mb4_estonian_ci | utf8mb4 | 230 | | Yes | 8 | PAD SPACE | | utf8mb4_es_0900_ai_ci | utf8mb4 | 263 | | Yes | 0 | NO PAD | | utf8mb4_es_0900_as_cs | utf8mb4 | 286 | | Yes | 0 | NO PAD | | utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 | | Yes | 0 | NO PAD | | utf8mb4_es_trad_0900_as_cs | utf8mb4 | 293 | | Yes | 0 | NO PAD | | utf8mb4_et_0900_ai_ci | utf8mb4 | 262 | | Yes | 0 | NO PAD | | utf8mb4_et_0900_as_cs | utf8mb4 | 285 | | Yes | 0 | NO PAD | | utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE | | utf8mb4_german2_ci | utf8mb4 | 244 | | Yes | 8 | PAD SPACE | | utf8mb4_hr_0900_ai_ci | utf8mb4 | 275 | | Yes | 0 | NO PAD | | utf8mb4_hr_0900_as_cs | utf8mb4 | 298 | | Yes | 0 | NO PAD | | utf8mb4_hungarian_ci | utf8mb4 | 242 | | Yes | 8 | PAD SPACE | | utf8mb4_hu_0900_ai_ci | utf8mb4 | 274 | | Yes | 0 | NO PAD | | utf8mb4_hu_0900_as_cs | utf8mb4 | 297 | | Yes | 0 | NO PAD | | utf8mb4_icelandic_ci | utf8mb4 | 225 | | Yes | 8 | PAD SPACE | | utf8mb4_is_0900_ai_ci | utf8mb4 | 257 | | Yes | 0 | NO PAD | | utf8mb4_is_0900_as_cs | utf8mb4 | 280 | | Yes | 0 | NO PAD | | utf8mb4_ja_0900_as_cs | utf8mb4 | 303 | | Yes | 0 | NO PAD | | utf8mb4_ja_0900_as_cs_ks | utf8mb4 | 304 | | Yes | 24 | NO PAD | | utf8mb4_latvian_ci | utf8mb4 | 226 | | Yes | 8 | PAD SPACE | | utf8mb4_la_0900_ai_ci | utf8mb4 | 271 | | Yes | 0 | NO PAD | | utf8mb4_la_0900_as_cs | utf8mb4 | 294 | | Yes | 0 | NO PAD | | utf8mb4_lithuanian_ci | utf8mb4 | 236 | | Yes | 8 | PAD SPACE | | utf8mb4_lt_0900_ai_ci | utf8mb4 | 268 | | Yes | 0 | NO PAD | | utf8mb4_lt_0900_as_cs | utf8mb4 | 291 | | Yes | 0 | NO PAD | | utf8mb4_lv_0900_ai_ci | utf8mb4 | 258 | | Yes | 0 | NO PAD | | utf8mb4_lv_0900_as_cs | utf8mb4 | 281 | | Yes | 0 | NO PAD | | utf8mb4_persian_ci | utf8mb4 | 240 | | Yes | 8 | PAD SPACE | | utf8mb4_pl_0900_ai_ci | utf8mb4 | 261 | | Yes | 0 | NO PAD | | utf8mb4_pl_0900_as_cs | utf8mb4 | 284 | | Yes | 0 | NO PAD | | utf8mb4_polish_ci | utf8mb4 | 229 | | Yes | 8 | PAD SPACE | | utf8mb4_romanian_ci | utf8mb4 | 227 | | Yes | 8 | PAD SPACE | | utf8mb4_roman_ci | utf8mb4 | 239 | | Yes | 8 | PAD SPACE | | utf8mb4_ro_0900_ai_ci | utf8mb4 | 259 | | Yes | 0 | NO PAD | | utf8mb4_ro_0900_as_cs | utf8mb4 | 282 | | Yes | 0 | NO PAD | | utf8mb4_ru_0900_ai_ci | utf8mb4 | 306 | | Yes | 0 | NO PAD | | utf8mb4_ru_0900_as_cs | utf8mb4 | 307 | | Yes | 0 | NO PAD | | utf8mb4_sinhala_ci | utf8mb4 | 243 | | Yes | 8 | PAD SPACE | | utf8mb4_sk_0900_ai_ci | utf8mb4 | 269 | | Yes | 0 | NO PAD | | utf8mb4_sk_0900_as_cs | utf8mb4 | 292 | | Yes | 0 | NO PAD | | utf8mb4_slovak_ci | utf8mb4 | 237 | | Yes | 8 | PAD SPACE | | utf8mb4_slovenian_ci | utf8mb4 | 228 | | Yes | 8 | PAD SPACE | | utf8mb4_sl_0900_ai_ci | utf8mb4 | 260 | | Yes | 0 | NO PAD | | utf8mb4_sl_0900_as_cs | utf8mb4 | 283 | | Yes | 0 | NO PAD | | utf8mb4_spanish2_ci | utf8mb4 | 238 | | Yes | 8 | PAD SPACE | | utf8mb4_spanish_ci | utf8mb4 | 231 | | Yes | 8 | PAD SPACE | | utf8mb4_sv_0900_ai_ci | utf8mb4 | 264 | | Yes | 0 | NO PAD | | utf8mb4_sv_0900_as_cs | utf8mb4 | 287 | | Yes | 0 | NO PAD | | utf8mb4_swedish_ci | utf8mb4 | 232 | | Yes | 8 | PAD SPACE | | utf8mb4_tr_0900_ai_ci | utf8mb4 | 265 | | Yes | 0 | NO PAD | | utf8mb4_tr_0900_as_cs | utf8mb4 | 288 | | Yes | 0 | NO PAD | | utf8mb4_turkish_ci | utf8mb4 | 233 | | Yes | 8 | PAD SPACE | | utf8mb4_unicode_520_ci | utf8mb4 | 246 | | Yes | 8 | PAD SPACE | | utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE | | utf8mb4_vietnamese_ci | utf8mb4 | 247 | | Yes | 8 | PAD SPACE | | utf8mb4_vi_0900_ai_ci | utf8mb4 | 277 | | Yes | 0 | NO PAD | | utf8mb4_vi_0900_as_cs | utf8mb4 | 300 | | Yes | 0 | NO PAD | +----------------------------+--------------------+-----+------------+-------------+---------+---------------+ 73 rows in set (0.0023 sec)
The most interesting in this discussion is the name. As you can see there is a pattern to the names, and it is possible to identify up to four parts in the name
- First the character set name.
- Then which language it is for, for example ja for Japanese. A special “language” is binary which mean each byte is compared directly one by one.
- Then whether it is a UCA 9.0.0 based collation. The UCA 9.0.0 based collations have 0900 in the name.
- Then up to three modifiers depending on whether it is accent and/or case sensitive or insensitive.
There are a total of five accent and case modifiers:
- ai: accent insensitive, ä is equal to a.
- as: accent sensitive, ä is not equal to a.
- ci: case insensitive, a is equal to A.
- cs: case sensitive, a is not equal to A.
- ks: kana sensitive (only for Japanese)
This brings us on to MySQL 8.0 and character sets and collations.
Good job of making it this far. You are almost at the end now.
As part of the work to make utf8mb4 the default character set in MySQL 8.0, a large amount of work was put into making the MySQL work more efficiently with the UTF-8 character set and to make it more useful for the users. Some of the changes include:
- The UCA 9.0.0 collations (with 0900 in the name) are new in MySQL 8.0. MySQL works faster with these new collations than the old collations.
- In-memory internal temporary tables now use the new Temptable storage engine by default. It supports variable width columns making it much more efficient with UTF-8 than earlier versions of MySQL. From version 8.0.12 (just released) the Temptable storage engine also supports BLOBs.
Another thing to be aware of is that the new X DevAPI that for example allows NoSQL access to both the MySQL Document store with JSON documents and to SQL tables expect all query results to be UTF-8.
For these reasons it is recommended to use utf8mb4 with one of the new UCA 9.0.0 collations for most uses in MySQL 8.0. The default collation for utf8mb4 in MySQL 8.0 is utf8mb4_0900_ai_ci. This is a good collation for the general use cases, but feel free to use one of the more specific collations if that works better for your application. For example, the utf8mb4_ja_0900_as_cs or utf8mb4_ja_0900_as_cs_ks if you need correct Japanese comparisons (such as the fix for the infamous Sushi = Beer bug).
Character sets and collations are big topics. If you are interested to know more about MySQL 8.0 and character sets, there is a series of blogs on the MySQL Server Blog, for example:
- Sushi = Beer ?! An introduction of UTF8 support in MySQL 8.0 (by Manyi Lu)
- New collations in MySQL 8.0.0 (by Xing Zhang)
- MySQL 8.0 Collations: The devil is in the details. (by Bernt Marius Johnsen)
- MySQL 8.0.1: Accent and case sensitive collations for utf8mb4 (by Xing Zhang)
- MySQL 8.0.1: Japanese collation for utf8mb4 (by Xing Zhang)
- MySQL 8.0: Kana-sensitive collation for Japanese (By Xing Zhang)
- Debugging Character Set Issues by Example (by Bernt Marius Johnsen)
Additionally, the MySQL reference manual has a chapter with 15 sections about character sets and collations: Chapter 10 Character Sets, Collations, Unicode.