Here’s a problem many of us have encountered. You have a latin1 table defined like the one below, and your application stores utf8 data in the column over a latin1 connection. The utf8 bytes are written verbatim, so a naive character-set conversion would double-encode them. Now your development team has decided to use utf8 everywhere, but the migration must keep the stored data valid with little to no downtime.
CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

master> SET NAMES latin1;
master> INSERT INTO t (c) VALUES ('¡Celebración!');
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)

master> SET NAMES utf8;
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
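To see what is really going on, it helps to look at the raw bytes outside of MySQL. Here is a small Python sketch (not from the original post) showing that the stored hex is simply the UTF-8 encoding of the string — the latin1 connection plus latin1 column passed the bytes through untouched — and that converting the column's character set directly would double-encode them:

```python
# The string the application sent over the latin1 connection.
s = "¡Celebración!"

# latin1 connection + latin1 column: MySQL stores the bytes as-is,
# so the column holds the raw UTF-8 encoding of the string.
stored = s.encode("utf-8")
print(stored.hex().upper())  # C2A143656C656272616369C3B36E21, matching HEX(c)

# A direct ALTER TABLE ... CONVERT TO CHARACTER SET utf8 would treat those
# bytes as latin1 characters ('Â¡Celebra...') and re-encode them to UTF-8,
# i.e. double encoding: the output begins C3 82 C2 A1 instead of C2 A1.
double_encoded = stored.decode("latin-1").encode("utf-8")
print(double_encoded.hex().upper())  # starts with C382C2A1...
```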
One approach, described in the manual, is to convert the TEXT column to BLOB, then convert the table character set to utf8 and change the column back to TEXT, like this:
master> ALTER TABLE t CHANGE c c BLOB;
master> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE c c TEXT;
master> SET NAMES utf8;
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
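The reason the two-step conversion works is that the hop through BLOB discards the character-set label without touching the bytes, and the hop back to TEXT in a utf8 table merely re-labels the same bytes as UTF-8. A rough Python equivalent of that byte flow (again, a sketch, not part of the post):

```python
# The bytes sitting in the latin1 column (already valid UTF-8).
stored = bytes.fromhex("C2A143656C656272616369C3B36E21")

# Step 1 (TEXT -> BLOB): the character-set label is dropped; bytes unchanged.
# Step 2 (BLOB -> TEXT in a utf8 table): the same bytes are reinterpreted
# as UTF-8, recovering the original string with no re-encoding.
print(stored.decode("utf-8"))  # ¡Celebración!
```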
All good so far. But if the tables are big enough that these ALTERs would disrupt your application significantly, this becomes a problem. The old trick of using slaves now comes into play. In a nutshell, you convert the TEXT column to BLOB on a slave first, then switch your application to use this slave as its primary. Any utf8 data written via replication or from the application will be stored and retrieved without issues, whether the connection character set is latin1 or anything else, because binary types have no associated character set. Let me show you:
slave> SET NAMES latin1;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)

slave> ALTER TABLE t CHANGE c c BLOB;
slave> SET NAMES latin1;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
2 rows in set (0.00 sec)

slave> SET NAMES utf8;
slave> INSERT INTO t (c) VALUES ('¡Celebración!');
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  5 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
3 rows in set (0.00 sec)
As you can see, while the column is a BLOB, I have no problem reading or storing utf8 data in it. After your application has been configured to use this slave with a utf8 connection, you can convert the column back to TEXT and the table to the utf8 character set.
slave> ALTER TABLE t CONVERT TO CHARACTER SET utf8, CHANGE c c TEXT;
slave> SET NAMES utf8;
slave> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  4 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
|  5 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
3 rows in set (0.00 sec)
Some caveats, though. First, you cannot replicate from the BLOB (or utf8) column back to the latin1 column — that would just produce double encoding — so you will have to discard the original master. Second, while the column is a BLOB or any other binary type and the column is indexed, you may see different results when the index is used. This is because binary data is ordered by its raw byte values, not by a character collation. Here is an example:
master> SHOW CREATE TABLE t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` blob,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

master> SET NAMES latin1;
master> INSERT INTO t (c) VALUES ('¡Celebración!'), ('Férrêts being fërøcîóúß'), ('Voyage à Montreal');
master> SELECT c FROM t ORDER BY c;
+---------------------------------+
| c                               |
+---------------------------------+
| ¡Celebración!                   |
| Férrêts being fërøcîóúß         |
| Voyage à Montreal               |
+---------------------------------+
3 rows in set (0.00 sec)

master> ALTER TABLE t CHANGE c c BLOB;
master> SELECT c FROM t ORDER BY c;
+---------------------------------+
| c                               |
+---------------------------------+
| Férrêts being fërøcîóúß         |
| Voyage à Montreal               |
| ¡Celebración!                   |
+---------------------------------+
3 rows in set (0.00 sec)
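The reordering after the BLOB conversion can be reproduced outside MySQL by sorting on the raw UTF-8 bytes. A Python sketch (not from the post): multi-byte UTF-8 characters like '¡' start with a byte of 0xC2 or higher, so in a byte-wise sort they land after every ASCII letter.

```python
rows = ["¡Celebración!", "Férrêts being fërøcîóúß", "Voyage à Montreal"]

# BLOB/BINARY indexes compare raw byte values: 'F' is 0x46, 'V' is 0x56,
# while '¡' encodes as 0xC2 0xA1, so it sorts last.
by_bytes = sorted(rows, key=lambda s: s.encode("utf-8"))
print(by_bytes)  # ['Férrêts being fërøcîóúß', 'Voyage à Montreal', '¡Celebración!']
```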
See how the results are now ordered differently?
What’s your utf8 horror? Share with us in the comments below.
UPDATE: The above is how the process looks without downtime or extended blocking of the table, but there are other ways. One of them is creating a copy of the original table converted to utf8 and doing an INSERT INTO .. SELECT using the CAST or CONVERT functions, like below.
master> SHOW CREATE TABLE t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` mediumtext,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

master> SHOW CREATE TABLE x \G
*************************** 1. row ***************************
       Table: x
Create Table: CREATE TABLE `x` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` mediumtext,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

master> SET NAMES latin1;
master> SELECT * FROM t;
+----+-----------------+
| id | c               |
+----+-----------------+
|  1 | ¡Celebración!   |
|  2 | a               |
|  3 | A               |
|  4 | 東京            |
+----+-----------------+
4 rows in set (0.00 sec)

master> INSERT INTO x SELECT id, CONVERT(c USING BINARY) FROM t;
master> SELECT * FROM x;
+----+---------------+
| id | c             |
+----+---------------+
|  1 | ebraci        |
|  2 | a             |
|  3 | A             |
|  4 | ??            |
+----+---------------+
4 rows in set (0.00 sec)

master> SET NAMES utf8;
master> SELECT * FROM x;
+----+-----------------+
| id | c               |
+----+-----------------+
|  1 | ¡Celebración!   |
|  2 | a               |
|  3 | A               |
|  4 | 東京            |
+----+-----------------+
4 rows in set (0.00 sec)
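The CONVERT(c USING BINARY) step is what prevents double encoding here: casting to binary strips the latin1 label, so the raw UTF-8 bytes land in the utf8 column unchanged. A Python sketch of that byte flow (not from the post), using the 東京 row — which also shows why the latin1 connection displays it as '??' (latin1 simply has no encoding for those characters):

```python
# Bytes as stored in the latin1 table: already valid UTF-8.
stored = "東京".encode("utf-8")  # E6 9D B1 E4 BA AC

# CONVERT(c USING BINARY) hands these bytes to the utf8 column, which
# labels them as UTF-8 without any re-encoding:
print(stored.decode("utf-8"))  # 東京

# Reading the value over a latin1 connection cannot work, since latin1
# has no representation for these characters (MySQL shows '??'):
print("東京".encode("latin-1", errors="replace"))  # b'??'
```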
Another method is to copy the FRM file of the same table structure, but in utf8, and replace your original table’s FRM file. Since the data is already stored as utf8, you should be able to read it on a utf8 connection. However, you will have to rebuild the indexes on affected columns, as they were originally sorted as latin1. In my tests, though, there was no difference before and after rebuilding the index, so YMMV. To demonstrate, still with the same two tables as before: on the filesystem I replaced t.frm with a copy of x.frm, then did a FLUSH TABLES. Afterwards, t looked like this:
master> SHOW CREATE TABLE t \G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` mediumtext,
  PRIMARY KEY (`id`),
  KEY `c` (`c`(255))
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Now, attempting to read the data on a latin1 connection causes truncation:
master> SET NAMES latin1;
master> SELECT * FROM t ORDER BY c;
+----+---------------+
| id | c             |
+----+---------------+
|  3 | a             |
|  4 | A             |
|  2 | ebraci        |
|  1 | ??            |
+----+---------------+
4 rows in set (0.00 sec)
But on utf8, I am now able to read it fine:
master> SET NAMES utf8;
master> SELECT * FROM t ORDER BY c;
+----+-----------------+
| id | c               |
+----+-----------------+
|  3 | a               |
|  4 | A               |
|  2 | ¡Celebración!   |
|  1 | 東京            |
+----+-----------------+
4 rows in set (0.00 sec)
Rebuilding the secondary key on the c column makes no difference to the results either.
master> ALTER TABLE t DROP KEY c, ADD KEY (c(255));
master> SELECT * FROM t ORDER BY c;
+----+-----------------+
| id | c               |
+----+-----------------+
|  3 | a               |
|  4 | A               |
|  2 | ¡Celebración!   |
|  1 | 東京            |
+----+-----------------+
4 rows in set (0.00 sec)
UPDATE: Apparently, the last method will not work for InnoDB tables, because the character collation is stored in the data dictionary too, as my colleague Alexander Rubin pointed out. But not all is lost: you can still rebuild the table with pt-online-schema-change without blocking it.
The post utf8 data on latin1 tables: converting to utf8 without downtime or double encoding appeared first on MySQL Performance Blog.