Monthly Archives: September 2010

Survival guide to UTF-8

Please, future me, and please, you cool programmer that one way or another, one day or the other, have struggled understanding UTF-8 (in Perl or not), do yourself a really big favor and read the following links:

Encodings and Unicode in Perl (http://perlgeek.de/en/article/encodings-and-unicode)

Unicode-processing issues in Perl and how to cope with it (http://www.ahinea.com/en/tech/perl-unicode-struggle.html)

and:
[Slides] UTF-8 survival guide (http://www.webtuesday.ch/_media/meetings/utf-8_survival.pdf),

and if you still have spare time to read:

Characters vs. Bytes (http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF).

After reading these few articles, you will be a much better human being. I promise. In the meantime, Perl programmer, remember that:

use utf8; is only for the source code, not for the encoding of your data. Let's say you define a scalar variable like:
```
my $username = 'ãƒã‚ª';
```
Ok. Now, if you happen to have use utf8 or not inside your script, there will be no whatsoever difference in the actual content of that scalar variable. Exactly, no difference. Except there's one difference. The variable itself (the $username box) will be flagged as containing UTF-8 characters (if you used utf8, of course). Clear, right?
For the rest, open your filehandles declaring the encoding (open my $fh, '<:utf8', $file;), or explicitly use Encode::(en|de)code_utf8($data).
You can make sure the strings you define in your source code are UTF-8 encoded by opening and then writing to your source code file with an editor that supports UTF-8 encoding, for example vim has a :set encoding=utf8 command.
Also, make sure your terminal, if you're using one, is set to UTF-8 encoding otherwise you will see gibberish instead of your beloved Unicode characters. You can do that with any terminal on this planet, bar the windows cmd.exe shell… If anyone knows how to, please tell me.
And finally, use a font with Unicode characters in it, like Bitstream Vera Sans Mono (the default Linux font), Envy R, plain Courier, etc… or you will just see the broken-UTF8-character-of-doom. Yes, this one → ï¿½ :-)

There's an additional problem, and that is when you need to feed some strings to a Digest module like Digest::SHA1, to obtain back a hash. In that case, I presume the SHA1 algorithm, as MD5 and others, they don't really work on Unicode characters, or UTF8-encoded characters, they just work on bytes, or octets.

So, if you try something like:


use utf8;
use Digest::SHA1;

my $string = "ãƒã‚°ã‚¤ãƒ³ãƒ¡ãƒ¼ãƒ«ã‚¢ãƒ‰ãƒ¬ã‚¹";
my $sha1 = Digest::SHA1->new();
$sha1->add($string);

print $sha1->hexdigest();

it will miserably fail (Wide character in subroutine entry at line 6) because $string is marked as containing "wide" characters, so it must be turned into octets, by doing:


use utf8;
use Encode;
use Digest::SHA1;

my $string = "ãƒã‚°ã‚¤ãƒ³ãƒ¡ãƒ¼ãƒ«ã‚¢ãƒ‰ãƒ¬ã‚¹";
my $sha1 = Digest::SHA1->new();
$sha1->add( Encode::encode_utf8($string) );

print $sha1->hexdigest();

I need to remind myself all the time that:

Encode::encode_utf8($string) wants a string with Unicode characters and will give you a string converted to UTF-8 octets, with the UTF8 flag *turned off*. Basically bytes. You can then do anything with them, print, put in a file, calculate a hash, etc…
Encode::decode_utf8($octets) wants a string of (possibly UTF-8) octets, and will give you a string of Unicode characters, with the UTF8 flag *turned on*, so for example trying to lowercase (lc) a "Ã…" will result in a "Ã¥" character.

So, there you go! Now you are a 1st level UTF-8 wizard. Go and do your UTF-8 magic!

Epilogue: now I'm sure: in a couple of weeks I will come back to this post, and think that I still don't understand how UTF-8 works in Perl… :-)

Another Ubiquity for Opera update, DuckDuckGo search

Leave a reply

My small Ubiquity for Opera experiment gets another quick update.

This time I added one of my favorite search engines, DuckDuckGo. Despite being a young project, I think it's really interesting, and its results are highly relevant and up-to-date. I like it! Plus, it's a Perl project.

So that's it, I just added the duckduckgo command.

This new version also fixes an annoying problem with a couple of Google-related search commands, that were showing just 1 result, instead of the default 20 search results. There's so much more that could be improved, but I rarely find the time to work on it…

As always, the updated code is available on the Ubiquity for Opera project page, where you will also find the minified version (~40 kb instead of ~70).

Enjoy!

Running Ubiquity on Google Chrome

Leave a reply

It's been a while since I started working on Ubiquity for Opera. It's my limited, but for me totally awesome, port of Mozilla's Ubiquity project, originally only for Firefox, to the Opera browser.

I had several people ask me through my blog or email to write a version for Google Chrome. And, by popular demand, here it is! To my surprise, it took much less than I had originally thought. I had a few small problems though, from the event handlers, different from Opera UserJS model, to style attributes for dynamically created elements, and other minor things as well.

It's still lacking Ajax/xmlHttpRequest support, but that shouldn't be a huge problem.

I uploaded it to userscripts.org too. You can see it here: http://userscripts.org/scripts/show/85550.

The code, as usual, is up on Github

http://github.com/cosimo/ubiquity-chrome/.

If you try it, I would be interested to know what you think. Have fun!

If Text::Hunspell never worked for you, now it’s time to try it again!

Leave a reply

If you don't know, Hunspell is the spell checker engine of OpenOffice.org, and it's also included in the Opera and Mozilla browsers.

We were trying to use it from Perl, using the old Text::Hunspell module, version 1.3, but we had problems with it. Big problems. Like segfaults and tests that wouldn't run.

A bright hacker from Italy :) was then called in to fix the problem, with the promise of a fantastic prize he hasn't seen yet… [ping?] :-)

During the process, I found out I know absolutely nothing about dictionary files and stuff, and my fixes were – I would say – definitely horrible.

But! There's a bright side, of course, and that is that the module works just fine now, at least on Debian/Ubuntu systems. Before using Text::Hunspell, you want to install the following packages:

hunspell
libhunspell-dev

The example in the POD documentation (and in the examples dir) uses the standard US english dictionary. If you don't have that, you will need to change the script slightly. But the code is tested and should work without a problem. If you try it out and you have feedback, by all means let me know. Thanks!

Source code available on GitHub at:
http://github.com/cosimo/perl5-text-hunspell/

The module, tagged as 2.00 because it's cool :), will be up on CPAN shortly at this address:
http://search.cpan.org/dist/Text-Hunspell/

Random hacking

Assume nothing. Code defensively. Keep it simple, stupid!

Monthly Archives: September 2010

Survival guide to UTF-8

Another Ubiquity for Opera update, DuckDuckGo search

Running Ubiquity on Google Chrome

If Text::Hunspell never worked for you, now it’s time to try it again!