Survival guide to UTF-8

Please, future me, and please, you cool programmer who, one way or another, one day or the other, has struggled to understand UTF-8 (in Perl or not), do yourself a really big favor and read the following links:

After reading these few articles, you will be a much better human being. I promise. In the meantime, Perl programmer, remember that:

  • use utf8; is only for the source code, not for the encoding of your data. Let's say you define a scalar variable like:
    
    my $username = 'ネオ';
    

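    Ok. Now, whether you have use utf8 inside your script or not, the bytes sitting in your source file are exactly the same. Exactly, no difference. Except there's one difference, and it matters: how Perl reads them. Without use utf8, the literal is taken as raw octets, so $username holds 6 bytes and length($username) is 6. With use utf8, Perl decodes the literal into Unicode characters, so $username holds 2 characters, length($username) is 2, and the variable itself (the $username box) is flagged as containing UTF-8 characters. Clear, right?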

  • For the rest, open your filehandles declaring the encoding (open my $fh, '<:utf8', $file; or, stricter because it also validates the input, open my $fh, '<:encoding(UTF-8)', $file;), or explicitly use Encode::(en|de)code_utf8($data).
  • You can make sure the strings you define in your source code are UTF-8 encoded by saving your source code file with an editor that supports UTF-8 encoding. For example, vim has a :set fileencoding=utf-8 command, which controls the encoding the file is written in (:set encoding only changes the encoding vim uses internally).
  • Also, make sure your terminal, if you're using one, is set to UTF-8 encoding, otherwise you will see gibberish instead of your beloved Unicode characters. You can do that with any terminal on this planet, bar the Windows cmd.exe shell… If anyone knows how to, please tell me.
  • And finally, use a font with Unicode characters in it, like Bitstream Vera Sans Mono (the default Linux font), Envy R, plain Courier, etc… or you will just see the broken-UTF8-character-of-doom. Yes, this one → :-)
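
A small script makes the points in the list above concrete. Take it as a sketch, not the one true way: it assumes the script file itself is saved as UTF-8, and the temporary file, the variable names, and the choice of the :encoding(UTF-8) layer (a stricter relative of :utf8) are picked just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use File::Temp qw(tempfile);

my ($chars, $octets);
{
    use utf8;            # lexically scoped: the literal is decoded into characters
    $chars = 'ネオ';
}
{
    no utf8;             # the same literal is kept as raw UTF-8 octets
    $octets = 'ネオ';
}

# Decoding the octets by hand gives the same string that "use utf8" produced.
my $decoded = decode_utf8($octets);

# Round trip through a file, declaring the encoding on both filehandles.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $chars;     # characters go in, UTF-8 octets land on disk
close $out;

open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $read_back = <$in>;   # octets come off disk, characters come out
close $in;

my $size_on_disk = -s $path;

# Only ASCII is printed, so the terminal encoding does not matter here.
printf "with use utf8:    length %d, UTF8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "without use utf8: length %d, UTF8 flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "decoded octets equal the characters: %s\n",
    $decoded eq $chars ? 'yes' : 'no';
printf "file size %d bytes, read back %d characters\n",
    $size_on_disk, length($read_back);
```

The two blocks contain the very same bytes in the source file; only the pragma in scope decides whether Perl hands you 2 characters or 6 octets.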

There's an additional problem, and that is when you need to feed some strings to a digest module like Digest::SHA1, to get a hash back. The SHA-1 algorithm, like MD5 and the others, doesn't work on Unicode characters; it works on bytes, or octets.

So, if you try something like:


use utf8;
use Digest::SHA1;

my $string = "ログインメールアドレス";
my $sha1 = Digest::SHA1->new();
$sha1->add($string);

print $sha1->hexdigest();

it will miserably fail (Wide character in subroutine entry at line 6) because $string contains "wide" characters, that is, characters with a code point above 255 that cannot fit in a single byte. So it must be turned into octets first, by doing:


use utf8;
use Encode;
use Digest::SHA1;

my $string = "ログインメールアドレス";
my $sha1 = Digest::SHA1->new();
$sha1->add( Encode::encode_utf8($string) );

print $sha1->hexdigest();
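
If you hash text in more than one place, the same fix can live in a tiny helper. This is only a sketch: sha1_hex_of_text is a made-up name, and it swaps in Digest::SHA (which ships with Perl) and its functional sha1_hex interface in place of Digest::SHA1, on the assumption that either module will do for this purpose. The direct call is wrapped in eval just to show the error without dying.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);
use Digest::SHA qw(sha1_hex);

# Hypothetical helper: hash the UTF-8 octets of a character string.
sub sha1_hex_of_text {
    my ($text) = @_;
    return sha1_hex( encode_utf8($text) );
}

my $string = "ログインメールアドレス";

# Feeding characters straight to the digest croaks; trap it to see the message.
my $raw_ok = eval { sha1_hex($string); 1 };
my $error  = $raw_ok ? '' : $@;

# Encoding first gives the digest something it can work with: plain octets.
my $digest = sha1_hex_of_text($string);

print "direct call failed: $error" unless $raw_ok;
print "digest of the UTF-8 octets: $digest\n";
```

Keeping the encode step inside the helper means the rest of the program can keep working with character strings and never has to think about octets.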

I need to remind myself all the time that:

  • Encode::encode_utf8($string) wants a string of Unicode characters and will give you back a string of UTF-8 octets, with the UTF8 flag *turned off*. Basically bytes. You can then do anything with them: print them, put them in a file, calculate a hash, etc…
  • Encode::decode_utf8($octets) wants a string of (possibly UTF-8) octets, and will give you a string of Unicode characters, with the UTF8 flag *turned on*, so character-aware functions work as you expect: for example, trying to lowercase (lc) a "Å" will result in a "å" character.
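
Here is the round trip from the two reminders above as a minimal sketch. The variable names are illustrative, and the octets are written as \x escapes so the example does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $octets = "\xC3\x85";             # the two UTF-8 octets of "Å" (U+00C5)
my $chars  = decode_utf8($octets);   # one character, UTF8 flag on

my $lower_octets = lc $octets;       # lc on raw octets leaves them untouched
my $lower_chars  = lc $chars;        # lc on characters gives "å" (U+00E5)

my $round_trip = encode_utf8($lower_chars);   # back to octets, UTF8 flag off

printf "octets: length %d, UTF8 flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';
printf "chars:  length %d, UTF8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "lc on octets changed something: %s\n",
    $lower_octets eq $octets ? 'no' : 'yes';
printf "lc on chars gives U+%04X\n", ord($lower_chars);
printf "re-encoded octets: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $round_trip;
```

Decode on the way in, work on characters, encode on the way out: the lc call only does the right thing in the middle step.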

So, there you go! Now you are a 1st level UTF-8 wizard. Go and do your UTF-8 magic!

Epilogue: now I'm sure: in a couple of weeks I will come back to this post, and think that I still don't understand how UTF-8 works in Perl… :-)