UTF-8 and MySQL

The UTF-8 encoding can represent any of the 1,112,064 valid code points in the Unicode standard using one to four 8-bit bytes (known as octets in the Unicode Standard). UTF-8 stands for UCS Transformation Format 8-bit and has quickly become the standard text encoding for most internet-based communication.
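To make the one-to-four-byte range concrete, here is a minimal Python sketch that encodes a few sample characters and counts the resulting bytes. The helper name `utf8_length` and the sample characters are only illustrative choices.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


# Illustrative samples: ASCII letter, accented Latin letter, euro sign, emoji.
samples = ["A", "é", "€", "😀"]

for ch in samples:
    # Show the code point in U+XXXX notation next to its encoded length.
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")
```

The four samples need one, two, three and four bytes respectively, so only the last one falls outside what a three-byte limit can hold.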

Although MySQL has support for UTF-8, its “utf8” character set only allows up to three bytes per character. This is enough to cover the BMP (Basic Multilingual Plane), which contains virtually all characters in common use. However, a growing number of characters outside the BMP have recently become very popular, and these require four bytes: they include emoji and several other icons that can be useful in chat and messaging systems.
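If you want to know whether a piece of text contains such characters before it reaches the database, a check along these lines may help. It is a small sketch, not part of any MySQL tooling; the function name `non_bmp_chars` is hypothetical. It relies on the fact that the BMP ends at code point U+FFFF, so anything above that needs four bytes in UTF-8.

```python
BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def non_bmp_chars(text: str) -> list:
    """Return the characters in `text` that lie outside the BMP."""
    return [ch for ch in text if ord(ch) > BMP_MAX]


message = "Lunch at noon? 🍕"
outside = non_bmp_chars(message)

if outside:
    # These characters would not fit in a three-byte-per-character column.
    print("Needs four-byte support:", [f"U+{ord(ch):X}" for ch in outside])
else:
    print("Fits in the BMP")
```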

Such characters cannot be saved into a MySQL column that uses “utf8”. Depending on the server's SQL mode, the statement is either rejected with an “Incorrect string value” error or accepted with a warning, in which case the text you are trying to update or insert is truncated at the first occurrence of a four-byte sequence. The solution to this came with MySQL 5.5.3, which added a new character set known as “utf8mb4”. This is a superset of the existing “utf8”, with full support for the specification, including four-byte characters.
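The truncation is easy to underestimate, because everything after the offending character is lost too, not just the character itself. The following Python sketch only simulates that effect on a string so you can see what would survive; it is not MySQL's actual code, and the function name `truncate_at_four_byte` is made up for this illustration.

```python
def truncate_at_four_byte(text: str) -> str:
    """Simulate truncation at the first character that needs four UTF-8 bytes.

    This mimics the visible effect only; it is not how MySQL is implemented.
    """
    for index, ch in enumerate(text):
        if len(ch.encode("utf-8")) == 4:
            # Drop the four-byte character and everything that follows it.
            return text[:index]
    return text


original = "Great job 👍 see you tomorrow"
stored = truncate_at_four_byte(original)
print(repr(stored))
```

In this example the whole "see you tomorrow" part disappears along with the emoji, which is why switching the affected tables and the client connection to “utf8mb4” is the safer route.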

Here is a quick list of links with more information on the topic:

http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
http://mathiasbynens.be/notes/mysql-utf8mb4
http://stackoverflow.com/questions/7814293/how-to-insert-utf-8-mb4-characteremoji-in-ios5-in-mysql
http://stackoverflow.com/questions/10153529/emoji-on-mysql-and-php-why-some-symbol-yes-other-not
http://stackoverflow.com/questions/10580186/how-to-display-emoji-char-in-html
