The website this new customer represents has international appeal, has been around for a number of years (with years' worth of data to go with it) and has been hosted on several different hosting providers in the past. This website has also seen many code changes over the years including significant changes to the underlying platform, and some turn-over of developers working on it.
When our new customer came on-board with Blue Box Group, the agreement we came to was that BBG would handle most of the technical details behind the transition process including moving their entire data set (and associated databases) onto BBG hardware. They warned us up-front that they were concerned about “character set issues,” since they'd had to deal with these the last time they changed hosting providers. All of this is pretty standard fare when we bring on a new customer from another hosting provider, so I wasn't too worried.
I was careful to take the above into consideration when we moved the tens of gigabytes of database dumps, hundreds of gigabytes of data files, and everything else to their new cluster. Everything looked good after we'd gotten it all set up, and when it came time for the cut-over we expected a very smooth transition. To be fair, the transition was smooth, but a few days after the site went live we began to notice that some, rare pages on the site appeared to have suffered some kind of character set corruption somewhere in the process of transition. I was baffled at first until I took a closer look at those bits of data in the database from which these “corrupt” pages were being generated and began to understand what our new customer had meant by “character set issues.”
After several days, a lot of Googling, reading of (mostly unhelpful) support mailing lists and much experimentation I felt I had accumulated a pretty solid understanding of how things had gotten the way they did in our customer's database. What's more, unlike most of the other blog articles and other web pages you'll probably read on this, I felt I had discovered a relatively simple procedure for getting out of this situation. Since I wasn't able to find any other authoritative source on this on the internet (and in fact, most of the other sources I've seen have said you really don't want to be in this situation-- while offering little help as to what to do about it if you're already there), I thought writing a public document on the subject might help some other systems administrators out there who do find themselves unexpectedly in the middle of MySQL Character Set Hell.
A full discussion of character sets and encoding is beyond the scope of this document. (If you want more background, I recommend checking out the wikipedia article for a good place to start here: http://en.wikipedia.org/wiki/Character_encoding ) However, a basic understanding of character set encoding technology and the specific character sets covered in this document is going to be required in order for the problem description and subsequent solution to make sense.
In the early days of computing the ASCII character set became the industry standard that most computer systems used for representing characters. It's a 7-bit encoding scheme which defines all the most commonly used characters in English text, as well as some non-printable control characters.
Latin-1 (ISO-8859-1) is a slight extension of US-ASCII in that it is 8-bit (ie. each character is exactly one byte or octet), and defines some characters used by non-English languages which nevertheless share much of the character set with English. (This includes many European languages which employ accented characters and the like.) Latin-1 also defines several other commonly-used symbols and glyphs. Latin-1 is one of several 8-bit extensions to ASCII, and the one most commonly in use as a default by US computers.
The major drawback of using Latin-1 is that it has only 256 possible characters, and therefore has no way of effectively representing characters in languages whose alphabets differ significantly from English (this includes languages like Russian, Greek, Hebrew, Arabic, almost all Asian languages and just about every language ever invented by humans). However, Latin-1 and US-ASCII remain the most commonly used character sets for all computer-related functions. That is, except for web pages.
In 2008, UTF-8 surpassed ASCII as the most commonly used character encoding in use on the World Wide Web (source: http://en.wikipedia.org/wiki/ASCII ). Adoption of this character encoding scheme seems to be a growing trend for most applications in the information technology industry. A likely reason for this is because it is possible to represent the alphabet of nearly every written language ever invented using just this character set-- which really helps when designing software that is meant to have an international user-base. (The thought here is that your non-English-speaking users will prefer to represent words in their native languages using characters from their native alphabets.)
UTF-8 is a variable length multi-byte character encoding scheme in that each character may be represented using 1, 2, 3 or 4 bytes or octets of data. The total possible size of this character set is around 10 million characters, with around 1 million defined so far. It can be used to encode any Unicode character, including the characters in nearly every language known to humankind, as well as a bunch of symbols and other glyphs that aren't letters at all.
The main advantage of using UTF-8 over other multi-byte character encoding schemes (which can also encode Unicode characters) is that UTF-8 is backwards-compatible with ASCII. That is, a valid series of ASCII bytes is also a valid set of UTF-8 bytes and represents the exact same characters.
One obvious advantage to using UTF-8 instead of Latin-1 or ASCII (especially for web pages, where international appeal is usually desired) is that once you've converted over to using UTF-8, you really don't have to worry about character set encoding issues again: For just about any human language that has an alphabet or character set, there will be a valid UTF-8 encoding of those characters. (That's not to say there aren't other caveats and implications about which you need to be aware when using UTF-8, but we'll get to those a little later on.) Internationalization of your software becomes much less of a problem when using UTF-8.
For more good background information on UTF-8, check out its wikipedia article.
MySQL by default uses the Latin-1 character set for all entities within the software or its data sets. In recent years, using UTF-8 has become a viable (and recommended) alternative, but probably in order not to confuse people with changing behavior, the developers of MySQL have decided to leave the default at Latin-1 for now. I believe this will change with MySQL version 6 (which seems like a good time to make everyone go through the task of converting from Latin-1 to UTF-8).
MySQL has the ability to convert characters between Latin-1 and UTF-8 and will, in fact, do this transparently in many cases. As it turns out, this can actually be a problem which can lead to the exact sort of character set hell this document tries to help the reader recover from.
In terms of the way MySQL stores textual fields (like CHARs, VARCHARs, and TEXT columns)-- as far as I've been able to determine, at the raw storage level there is no difference between one of these fields and a BLOB (which is just raw binary data), except that text fields have a “collation” applied to them. Documentation on what this means is somewhat lacking, but as far as I can tell, “collation” in this case doesn't exactly mean the character set in use in the text fields in the table so much as how MySQL will attempt to interpret the binary data stored in there. This means that it is possible to get ourselves into interesting situations where MySQL will store UTF-8 encoded characters in a table with Latin-1 collation. Also, since not every sequence of bytes is a valid string of UTF-8 characters, it's possible to have invalid characters in a UTF-8 collated table.
Through experimentation I've been able to determine that if you have a Latin-1 collated table with UTF-8 encoded characters in it, then attempting to take the most direct approach to fix this and simply alter the collation from Latin-1 to UTF-8 on the table causes MySQL to “helpfully” convert the characters in the text fields in the table from Latin-1 to UTF-8 for you. The conversion algorithm, unfortunately, is not intelligent enough to figure out when there already are perfectly valid UTF-8 characters in the text fields and instead tries to convert each byte of a multi-byte character into its UTF-8 equivalent. This results in what I will call “double-encoded UTF-8 characters” in your text fields, and looks like a bunch of gobbledygook. Do this several times (usually in an attempt to fix the problem you're actually just making worse) and you can have triple- and quadruple-encoded characters (which look like really long strings of gobbledygook which nevertheless have noticeable patterns in them). Fortunately this is all reversible.
Oh-- and if this weren't enough, MySQL can have different character encoding settings for the server, client, table, field, query results, replication, etc. What's more, MySQL will transparently convert characters for you between character sets depending on how they're used and accessed. (This can make troubleshooting these character set problems “interesting” to say the least.) Further, the terminal software or web server or web browser you're using to access the data might also be using its own character set (and transparently converting characters for you)...
You may be asking yourself, if you're an English-speaking person writing English software which will only be used by an English-speaking user-base, why you should care about using the UTF-8 character encoding scheme instead of US-ASCII or Latin-1. It certainly seems like there are enough places where it's easy to screw something up that it ought to be better to avoid the problem entirely and stick with Latin-1. There are several reasons you want to deal with this problem now:
Believe me: Even if you believe you'll never need anything more than the 256 characters in the Latin-1 character set, it's better to do it with UTF-8 now and save yourself the headache months or years from now.
It's actually somewhat difficult for me to define exactly what it is that quantifies MySQL character set hell. I've alluded to some of the problems that make it a nasty situation to be in above; But just to try to create a more complete list of the symptoms describing what I mean here:
Further, while the above can be fiendish problems the following secondary factors you may encounter when troubleshooting them can make it a truly hellish position to be in (and get out of):
... and the list goes on. The above are just some of the things I found attempting to resolve character encoding issues for one customer on one moderately-sized data set. I have no doubt there are more symptoms and stumbling blocks that can be encountered when dealing with MySQL Character Set Hell.
If you've read this far, you've probably already gathered that the only quasi-permanent solution to the problem of MySQL Character Set Hell is to use the UTF-8 (and only the UTF-8) character encoding everywhere: It's the direction all content on the internet is headed, and universal adoption on PCs and server systems probably won't be too far behind. The rest of this document proceeds with the assumption that this is where you want to be at the end of the process. The procedure for getting there consists of a few steps:
The way I fixed the above issues was done mostly using standard linux terminal software and the mysql command-line client. Specifically, I used gnome-terminal under Fedora linux, however any terminal software which understands how to display the UTF-8 character set should work. If you have to tweak any of your terminal's settings to get it to talk UTF-8 by default, you should probably do this. (The current version of gnome-terminal with Fedora speaks UTF-8 by default, so I didn't have to do anything special here.)
Beyond that, it would be a good idea to consult the documentation for your application software (or the interpreter it uses, be that php, perl, rails, etc.) to determine how you force it to use UTF-8 as its default character set for all things MySQL related. (Since Blue Box Group specializes in Ruby on Rails, I'll talk specifically about how to do this with Rails, but for other application platforms you'll need to discover how to do this on your own.)
Note that these instructions do not cover using any kind of GUI to fix your MySQL character set issues: A GUI introduces another layer of indirection from the database, which can further complicate the troubleshooting process (and may introduce yet another place a character set conversion may take place). The goal here is to get MySQL talking UTF-8 at the most fundamental level and in my mind the most direct way to do this is with the mysql command-line client.
Also, these instructions assume you have complete administrative control over your own mysql database. If you run into MySQL Character Set Hell in a shared environment where you don't have complete control over your databases, you'll probably have to work with whoever does have administrative control of your database to get these problems fixed. Obviously, YMMV here.
To configure both your mysql server daemon and client to speak UTF-8 by default, add lines like the following to every section of your my.cnf and restart your mysql daemon. If you use a mysql client from a different machine, make sure your client machine also has a my.cnf file which defines at least the [client] section and has the following configuration line:
default-character-set = utf8
You can verify both server and client are speaking UTF-8 by running the following query on the mysql command line (which should yield the following results):
mysql> show variables like 'char%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ 8 rows in set (0.00 sec)
If you see latin1 mentioned in there anywhere, double-check your configuration, and / or make sure you've restarted your mysql daemon.
If you're running Rails version 2.0 or later, this version of Rails uses the UTF-8 character encoding scheme by default for talking to the MySQL database. If you're using Rails 1.2, you need to add the last line below to the appropriate sections of your database.yml file in order to get your application to speak UTF-8 to the database:
production: adapter: mysql database: mydatabase username: mydbuser password: mydbpassword host: mydbhost encoding: utf8
If you are using a different application platform, you'll need to check its documentation to determine how to get your mysql connection to speak UTF-8. If you've written your own functions to do MySQL queries, you can often make sure these queries use UTF-8 by issuing the following query after opening each MySQL connection:
set names utf8;
And of course, to test your application to make sure that it's really talking UTF-8 to the database, you can always execute and examine the results of the show variables like 'char%'; query above.
Since this step is the first place we're actually going to be changing data within the database, the very first thing we'll want to do here is make a backup of the database. Note that since in MySQL Character Set Hell, a mysqldump of the database is likely to have syntax errors in it, I recommend making a backup of the raw database files. In order for this backup to have clean data in it, you'll need to shut down your mysql daemon while making the backu If you're server is a RedHat- or CentOS-like operating system using vendor-supplied RPMs for the MySQL packages (and default paths in your my.cnf), the following set of commands ought to do this for you:
# service mysql stop # cd /var/lib # tar zcvf mysql-backup.tar.gz mysql # service mysql start
You'll want to make sure your database server has enough space under /var/lib on it to store the above backup, of course.
Please note that if your data set does not contain a mix of single- and double-encoded UTF-8 characters in a Latin-1 character set (ie. you either have one or the other) then there are safer conversion techniques than what I'm about to document below. (Specifically, see this page:). This step assumes that your data set contains tables with a mix of single- and double- (and triple- and quadruple-) encoded UTF-8 characters in a single Latin-1 collated table.
The procedure for converting your tables to UTF-8 collation will vary depending on which sources you'll read, but the way that worked for me was to do a carefully-crafted mysqldump, alter the dump slightly, then re-import the dump back over the database itself. To minimize the chances of clobbering data in tables that are already properly UTF-8 collated, this really only needs to be done on tables using the Latin-1 collation to begin with. Do discover which tables these are, you can run:
mysql> show table status where collation like '%latin1%' \G *************************** 1. row *************************** Name: UserInfo Engine: InnoDB Version: 10 Row_format: Compact Rows: 2 Avg_row_length: 8192 Data_length: 16384 Max_data_length: 0 Index_length: 65536 Data_free: 0 Auto_increment: 3 Create_time: 2008-06-01 22:48:11 Update_time: NULL Check_time: NULL Collation: latin1_swedish_ci Checksum: NULL Create_options: Comment: InnoDB free: 0 kB 9 rows in set (0.00 sec) mysql>
(Yes, “latin1_swedish_ci” is the name MySQL uses for Latin-1 collation.) You'll need to run the above for all databases on the MySQL server, and compile a list of affected tables. For each of these tables, run:
# mysqldump -u dbuser -pdbpass --opt --skip-set-charset \ --default-character-set=latin1 --skip-extended-insert \ databasename --tables tablename > database.table.sql # perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' \ database.table.sql # cat database.table.sql | mysql -u dbuser -pdbpass databasename
Let me analyze what each of the above three commands does for you:
An astute observer here will note that we told mysql above to import a UTF-8 encoded dump that was actually encoded using Latin-1. In my experimentation, I found that this actually does seem to work, and I suspect this is because of the following reasons:
I should also point out here that other sources of documentation will recommend actually converting the mysqldump to the UTF-8 character set using the iconv utility. If you have a relatively clean dataset, this may work for you. For the data set I had to work with, it did not.
Anyway, before going any further in these instructions, it's a good idea to check the collation on your converted tables:
mysql> show table status \G *************************** 1. row *************************** Name: UserInfo Engine: InnoDB Version: 10 Row_format: Compact Rows: 2 Avg_row_length: 8192 Data_length: 16384 Max_data_length: 0 Index_length: 65536 Data_free: 0 Auto_increment: 3 Create_time: 2008-06-01 22:48:11 Update_time: NULL Check_time: NULL Collation: utf8_general_ci Checksum: NULL Create_options: Comment: InnoDB free: 0 kB 9 rows in set (0.00 sec) mysql>
Before going any further in these instructions, I highly recommend taking a close look at your application. At this point, everything everywhere should be speaking UTF-8, and all your tables should have the UTF-8 collation. In an ideal world, especially if you started with a relatively clean data set, this may be all you need to do to make everything peachy in your application. So take a moment to look at your application (and specifically content generated from rows with non-ASCII characters in them) and verify you really need to go further in this how-to.
Ok, so from here on out, we're going to assume that you do still have problems displaying characters as they should be in your application, and that this is due to having double-encoded characters in your UTF-8 collated tables. (Note that if your terminal and mysql command-line client are speaking UTF-8, you can verify the problems lie with the data itself with a few SELECT statements targeted at rows containing the data.)
As a few final notes before moving on just to drive the point home:
The instructions in this section of this document will detail how to fix double- (and triple- and quadruple-)-encoded character issues for one column in all rows of one table in the database. These steps will need to be repeated for any affected columns in all tables in your database. Scripting this is left as an exercise for the reader (or as an incentive to become a customer of Blue Box Group) )
For the examples I'll be using below, we'll use the following example name conventions. Obviously, you'll need to replace these with the actual names that apply to your database:
At this point, only those rows which have non-ASCII characters in text-containing fields (like CHAR, VARCHAR, TEXT, etc.) should have problems with double-encoded UTF-8 characters in them. The really nice part about UTF-8 here (and tables containing UTF-8 characters) is that any non-ASCII characters are by design multi-byte characters. We can exploit this fact to find tables and fields with rows containing non-ASCII characters.
mysql> select count(*) from mytable where LENGTH(myfield) != CHAR_LENGTH(myfield); +----------+ | count(*) | +----------+ | 669 | +----------+ 1 row in set (0.28 sec)
If the above query returns a count of zero, this means that either:
Anyway, you will want to go through all text-like fields in all tables of your database and find any which contain rows that have multi-byte characters in them like this. These rows can potentially have double-encoded UTF-8 characters in them and will need conversion as detailed below. I suggest compiling these into a list you can automatically parse with a script which will do the remainder of the steps below.
*Also note that it is very important at this stage to make note of the field type and any associated parameters (eg. the length of the VARCHAR, etc.).* I suggest taking a close look at the table's description to get this information.
Now that we know which fields in which tables potentially have double-encoded UTF-8 characters in them, copy just those rows into a temporary table. The goal here is to minimize the chances of blowing away large chunks of the data you want to keep, so we only need to run the remainder of our conversion steps on rows that are potential candidates for double-encoded UTF-8 characters:
mysql> create table temptable (select * from mytable where LENGTH(myfield) != CHAR_LENGTH(myfield)); Query OK, 669 rows affected (1.15 sec) Records: 669 Duplicates: 0 Warnings: 0
In order to do this, we're actually going to convert the field from UTF-8 to Latin-1 and use a round-about way of changing it back to UTF-8 again. This may seem counter-intuitive, but I'll explain why this works after listing the actual commands to run:
mysql> alter table temptable modify temptable.myfield mytype character set latin1; Query OK, 669 rows affected, 127 warnings (0.05 sec) Records: 669 Duplicates: 0 Warnings: 0 mysql> alter table temptable modify temptable.myfield blob; Query OK, 669 rows affected (0.17 sec) Records: 669 Duplicates: 0 Warnings: 0 mysql> alter table temptable modify temptable.myfield mytype character set utf8; Query OK, 669 rows affected, 552 warnings (0.05 sec) Records: 669 Duplicates: 0 Warnings: 0
Let me explain what the above does:
Ok, so about those rows that already had properly-encoded UTF-8 characters in them: If MySQL encounters a multi-byte character in UTF-8 encoding, it essentially has two options when converting this character to Latin-1:
There's also another case I haven't discussed yet which you might encounter here: If the series of bytes that should be the UTF-8 field contains any sequence of bytes that does not actually represent UTF-8 characters (ie. corrupt characters or starting data that was never Latin-1 or UTF-8 at all to begin with), then my experimentation seems to show that MySQL will truncate these fields at the first such sequence of bytes. This appears to be how it will also handle queries on tables with UTF-8 collation where the series of bytes actually stored in the fields do not describe a proper UTF-8 string. Anyway, the end result of this “corrupt starting data” is that in the end, MySQL will truncate it at the first erroneous byte sequence whenever it's accessed.
And this is another key behavioral feature of MySQL's character handling algorithms we can exploit: At the end of step 4C above, any fields which had already contained only properly-encoded UTF-8 characters in them, or which contained an erroneous sequence of bytes will be converted to strings which contain only single-byte characters.
Since these are exactly the fields we do not want to do any kind of conversion on, it's easy enough now to exclude them from our list of rows in the temporary table to be fixed in the “real” table:
mysql> delete from temptable where LENGTH(myfield) = CHAR_LENGTH(myfield); Query OK, 549 rows affected (0.01 sec)
Since our temporary table should now include only those rows which had double-encoded UTF-8 characters in them, and since those rows have been “fixed” as it were, all we need to do now is to update said rows in the original table. Note that it would probably be a good idea to take a look at the “fixed” rows with a few select statements before you do this:
mysql> replace into mytable (select * from temptable); Query OK, 120 rows affected (0.05 sec) Records: 120 Duplicates: 0 Warnings: 0
At this point, the field we just updated should contain only valid, properly-encoded UTF-8 characters in it (unless, of course, you had triple- and quadruple-encoded characters). Especially before you run a script to do steps 4B through 4E above for all your affected columns in all your affected tables, now would be a very good time to check how those affected characters from those fields look in your application, in selects on the mysql command line client, etc. to verify the fixes we did above actually did fix things.
It's entirely possible at this point that there still might be some character set issues which have defied our above attempts at using MySQL's character encoding conversion algorithms to fix. There are any number of ways this may have happened, especially if you have a rather old data set that has been used with several different application front-ends or has been moved across several different platforms or been touched by a lot of different developers who may have taken different approaches to dealing with the problem of MySQL Character Set Hell. The hope here is that by following the above steps we should have reduced the number of rows which need fixing to only a handful, where manually fixing them is no longer a losing proposition.
If you find yourself at this stage, I recommend reading through Derek Sivers' article on doing this as one way to fix these lingering problems: http://www.oreillynet.com/lpt/wlg/9022
Let's see... I said at some point above that if you're new to the UTF-8 world, there are other issues about which you need to be aware when using this character set encoding scheme. Here are a few major ones:
...and this list also goes on. Despite this, the more universal adoption of UTF-8 as the “standard” character set for nearly all human alphabets is probably a good thing. (And yes, your application is going to have to deal with these 21st century problems sometime– It's better to deal with these sooner rather than later, eh.)
The one thing you'll probably see in all the documentation you'll find online about dealing with MySQL character set encoding issues is that it's better to start with everything set up right from day one, and not have to go through a painful conversion. As they say, hindsight is better than foresight, and that definitely applies to the problems associated with MySQL character set encoding.
If the steps I've done my best to explain above still seem too difficult to follow, then the only thing I can say about this is that this is a situation where hiring an expert DBA, or a well-trained, experienced support team is money well spent. This is a deceptively difficult position to be in, and it's a good idea to have people who know what they're doing to help you get out of it.
In any case, if you've read this far, I thank you and wish you the best of luck in climbing out of MySQL Character Set Hell.
Have you read the news?
Rumor has it this cloud thing is taking off.