You know everything about how MySQL treats the UTF8 and LATIN1 charsets and how the collation table impacts selection and insertion of data, right?
Great, then stop reading :)
I don't, and since I'm in the process of setting up a new version of the Opera accounts database, I really don't want to screw things up. I tried to fully understand how MySQL works in this respect (charsets, collations, etc…), but reading the documentation and memorizing it wasn't very easy. Plus, there are thousands of blog posts on the matter, not always 100% accurate.
So I thought I'd better get hands-on, and I wrote a kind of database test suite.
Now this test suite is hooked up to the main project builds on Jenkins. Here's a sample output:
[...]
[workspace] $ /bin/sh -xe /tmp/hudson3255767718598715423.sh
+ ./bin/run-dbtest-suite
basedir=/var/lib/jenkins/jobs/auth-db/workspace
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/__initdb__.my ...........................................
1..2
ok 1 - Using utf8tests database
ok 2 - Server charset is latin1
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/collation-utf8_bin.my ...................................
1..6
ok 1 - All our records are there. No duplicate key error.
ok 2 - utf8_bin collation does not collate a/â/à/A/...
ok 3 - utf8_bin collation does not collate a/â/à/A/...
ok 4 - utf8_bin collation does not collate a/â/à/A/...
ok 5 - Query for mixed-case username does not return lowercase username
ok 6 - Query for upper-case username does not return lowercase username
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/collation-utf8_general_ci.my ............................
1..7
ok 1 - Collation for t007 is utf8_general_ci
ok 2 - utf8_general_ci collation normalizes accents, diacritics and the like
ok 3 - A and Å are collated to the same character in the utf8_general_ci table
ok 4 - å and Å are collated to the same character in the utf8_general_ci table
ok 5 - lower/upper case chars are collated in the utf8_general_ci table
ok 6 - lower/upper case chars are collated in the utf8_general_ci table
ok 7 - We are allowed to insert all records just because there is no unique constraint
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/collation-utf8_unicode_ci.my ............................
1..7
ok 1 - Collation for t005 is utf8_unicode_ci
ok 2 - utf8_unicode_ci collation normalizes accents, diacritics and the like
ok 3 - A and Å are collated to the same character in the utf8_unicode_ci table
ok 4 - å and Å are collated to the same character in the utf8_unicode_ci table
ok 5 - lower/upper case chars are collated in the utf8_unicode_ci table
ok 6 - lower/upper case chars are collated in the utf8_unicode_ci table
ok 7 - We are allowed to insert all records just because there is no unique constraint
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/default-table-charset.my ................................
1..3
ok 1 - Default character set is utf8 when no charset is specified (from server)
ok 2 - Default character set is utf8 when "CHARSET utf8" specified in the CREATE TABLE
ok 3 - Default character set is utf8 when "CHARSET utf8" and "COLLATE" specified in the CREATE TABLE
ok
...
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/username-with-utf8-chars.my .............................
1..5
ok 1 - We have some UTF-8 encoded string in our hands (hex)
ok 2 - We have some UTF-8 encoded string in our hands (charset)
ok 3 - Can select back UTF-8 content from a CHARSET utf8 table
ok 4 - Given string is exactly 24 bytes long (length)
ok 5 - Given string is exactly 8 (wide) characters long (char_length)
ok
All tests successful.
Files=11, Tests=80, 0.739731 wallclock secs ( 0.05 usr 0.02 sys + 0.07 cusr 0.01 csys = 0.15 CPU)
Result: PASS
Recording test results
Finished: SUCCESS
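The collation-* files above all follow the same idea: insert strings that differ only by accents or case, and check what the chosen collation does to them. If you just want to see the difference interactively, a quick comparison like the following (my ad-hoc illustration, not part of the suite) shows it: under utf8_bin nothing collates, while utf8_general_ci and utf8_unicode_ci fold accents and case together.

SET NAMES utf8;

SELECT 'a' = 'â' COLLATE utf8_bin        AS bin_equal,        -- 0: utf8_bin does not collate accents
       'a' = 'â' COLLATE utf8_general_ci AS general_ci_equal, -- 1: accents fold together
       'A' = 'å' COLLATE utf8_unicode_ci AS unicode_ci_equal; -- 1: case and accents fold together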
And here's an example of a "sanity check" test case, which doesn't do much:
-- Check that we can insert and retrieve UTF-8 content correctly

BEGIN;

SET NAMES utf8;

SELECT tap.plan(5);

USE auth_utf8tests;

SET @username = '今日话题今日话题';
SET @encoded = 'E4BB8AE697A5E8AF9DE9A298E4BB8AE697A5E8AF9DE9A298';

SELECT tap.eq(
    HEX(@username),
    @encoded,
    'We have some UTF-8 encoded string in our hands (hex)'
);

SELECT tap.eq(
    CHARSET(@username),
    'utf8',
    'We have some UTF-8 encoded string in our hands (charset)'
);

INSERT INTO t001 (f1) VALUES (@username);

SELECT tap.eq(
    (SELECT HEX(f1) FROM t001 WHERE f1 = @username),
    @encoded,
    'Can select back UTF-8 content from a CHARSET utf8 table'
);

SELECT tap.eq(
    (SELECT LENGTH(f1) FROM t001 WHERE f1 = @username),
    24,
    'Given string is exactly 24 bytes long (length)'
);

SELECT tap.eq(
    (SELECT CHAR_LENGTH(f1) FROM t001 WHERE f1 = @username),
    8,
    'Given string is exactly 8 (wide) characters long (char_length)'
);

-- Finish the tests and clean up.
CALL tap.finish();
ROLLBACK;
This SQL test code uses mytap. As you can see, the SELECT tap.* calls are just the equivalents of Perl's TAP testing functions: SELECT tap.eq() is the equivalent of Test::More::is(), and so on.
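One thing the listing above doesn't show is the table itself: t001 is created by the suite's setup scripts, which I'm not reproducing here. Something along these lines (my assumption, not the actual DDL) is all the test needs:

-- Hypothetical definition of t001; the real setup script is not shown in this post.
CREATE TABLE t001 (
    f1 VARCHAR(64)      -- must hold the 8-character / 24-byte UTF-8 test string
) CHARSET utf8;         -- the utf8 table charset is what the LENGTH/CHAR_LENGTH checks rely on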
Another, more interesting test case is the following:
--
-- Verify how the utf8_unicode_ci collation works
--

BEGIN;

SET NAMES utf8;

SELECT tap.plan(12);

USE auth_utf8tests;

[...]

SELECT tap.eq(
    (SELECT TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_SCHEMA=SCHEMA() AND TABLE_NAME='t015'),
    'utf8_unicode_ci',
    'Collation for t015 is utf8_unicode_ci'
);

[...]

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'testuser1a' ORDER BY id),
    '10',
    'utf8_unicode_ci collation normalizes accents, diacritics and the like'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'testuser1Å' ORDER BY id),
    '10',
    'A and Å are collated to the same character in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'testuser1å' ORDER BY id),
    '10',
    'å and Å are collated to the same character in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'TestUser1A' ORDER BY id),
    '10',
    'lower/upper case chars are collated in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'TESTUSER1A' ORDER BY id),
    '10',
    'lower/upper case chars are collated in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT COUNT(*) FROM t015),
    1,
    'We are allowed to insert only 1 record, because the others collate to the same string'
);

-- Finish the tests and clean up.
CALL tap.finish();
ROLLBACK;
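Again, the DDL for t015 is not shown here, but for that last assertion to hold the table needs the utf8_unicode_ci collation and a unique index on username. My assumption is something like:

-- Hypothetical definition of t015; the real setup script is not shown in this post.
CREATE TABLE t015 (
    id       INT NOT NULL PRIMARY KEY,
    username VARCHAR(64) NOT NULL,
    note     VARCHAR(64),
    UNIQUE KEY (username)               -- the duplicate key errors below come from this index
) CHARSET utf8 COLLATE utf8_unicode_ci;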
An interesting thing that I didn't know how to do at the beginning was how to trap errors. I left that part out of the test code above to keep it simple, but here it is:
DELIMITER //

DROP PROCEDURE IF EXISTS populate_table //

CREATE PROCEDURE populate_table ()
BEGIN

    DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' BEGIN
        SELECT tap.ok(
            1,
            'We should get dupkey errors when inserting data with collation utf8_unicode_ci'
        );
    END;

    INSERT INTO t015 (id,username,note) VALUES (10, 'testuser1a', 'plain');
    INSERT INTO t015 (id,username,note) VALUES (20, 'testuser1â', 'circumflex a');
    INSERT INTO t015 (id,username,note) VALUES (30, 'testuser1à', 'a grave');
    INSERT INTO t015 (id,username,note) VALUES (40, 'testuser1Å', 'A ring');
    INSERT INTO t015 (id,username,note) VALUES (50, 'TestUser1A', 'mixed case');
    INSERT INTO t015 (id,username,note) VALUES (60, 'TESTUSER1A', 'upper case');

END;

//

DELIMITER ;

[...]

/* Should generate 5 dupkey errors (taken as successful tests) */
CALL populate_table;
It's a bit convoluted. To trap errors you have to use the DECLARE ... HANDLER statement. DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' means: whenever SQLSTATE is '23000', which corresponds to a duplicate key error, execute this block of code and then continue. All of that must be wrapped in a stored procedure, because handlers are not allowed outside of stored procedures.
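As a side note, SQLSTATE '23000' covers the whole class of integrity-constraint violations, not just duplicate keys; if you only want to trap duplicate keys, you can declare the handler for MySQL error code 1062 (ER_DUP_ENTRY) instead. A minimal sketch of that variant (not part of my suite, and the procedure name is made up):

DELIMITER //

CREATE PROCEDURE demo_dupkey_handler ()
BEGIN
    -- Fire only on error 1062 (ER_DUP_ENTRY), then continue with the next statement
    DECLARE CONTINUE HANDLER FOR 1062
        SELECT tap.ok(1, 'Got the expected duplicate key error');

    -- Collides with the 'testuser1a' row inserted earlier
    INSERT INTO t015 (id, username, note) VALUES (70, 'testuser1a', 'should collide');
END //

DELIMITER ;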
In this particular test the table uses the utf8_unicode_ci collation, so we expect a duplicate key error on username whenever we insert a string like 'testuser1à' or 'TESTUSER1A', because 'testuser1a' was already inserted at the beginning. Of all the INSERT statements, only the first one is bound to succeed, so I put a SELECT tap.ok(1) in the duplicate key HANDLER, and I expect 5 such tests to fire when I CALL populate_table.
This of course may seem trivial. And I guess it is, but for me it's a much better way of learning than scouring through the manuals or the many blog posts out there that may or may not reflect the environment I'm working with.
Routinely running this kind of test suite makes it easy to verify the database behaviour:
- instantly
- after upgrades (5.1 -> 5.5? -> 6?) or storage engine changes
- after MySQL configuration changes. For example, I discovered this way that adding default-charset=utf8 to my MySQL config breaks everything (see the sketch below).
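That kind of breakage is exactly what the __initdb__ checks at the top of the output are there to catch. A minimal sketch of such a check, asserting the server-side charset variables the rest of the suite relies on (my reconstruction, not the suite's actual __initdb__.my):

-- Sketch: assert the charsets the rest of the suite depends on.
SELECT tap.plan(2);

SELECT tap.eq(
    (SELECT @@character_set_server),
    'latin1',
    'Server charset is latin1'
);

SELECT tap.eq(
    (SELECT @@character_set_database),
    'utf8',
    'Test database charset is utf8'
);

CALL tap.finish();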
I consider this my live documentation on how MySQL works. I would really appreciate any feedback on this. Have fun!