Monthly Archives: October 2012

Using Perl and Google Chromium’s CLD to identify the language of a text

For a new project I'm working on, given a body of text, I need to identify which language it's written in (English, Russian, Chinese, etc…).

I'm not exactly the first person on Earth to do this, so it turns out there's Google's CLD library. Surprisingly, several people around here didn't know it. The library is open source and very good too, so I immediately looked for Perl bindings for it.

There is a great Perl module on CPAN called Lingua::Identify::CLD. This module bundles a copy of the CLD library, and fully automates build and link steps too. So I gave it a shot.

How to use Lingua::Identify::CLD

It's amazingly easy to use. Here's a sample of the code:


#!/usr/bin/perl

use strict;
use Lingua::Identify::CLD ();

my $text;
while (readline) { $text .= $_ }
chomp $text;

# In my case, the content is HTML
my $cld = Lingua::Identify::CLD->new(isPlainText => 0);

# Example: (ENGLISH, en, 64)
my @lang = $cld->identify($text);
say "Language: $lang[0]";

Failing tests

I decided to start using this module into my project. The build phase went fine (perl ./Build), while the tests were failing (./Build test). Here's the log of a failed test run:


$ ./Build test
cc -I/usr/lib/perl/5.14/CORE -fPIC -c -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -o /tmp/gAc_glZta2/library.o /tmp/gAc_glZta2/library.c
cc -I/usr/lib/perl/5.14/CORE -fPIC -c -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -o /tmp/gAc_glZta2/test.o /tmp/gAc_glZta2/test.c
cc -shared -L/usr/local/lib -fstack-protector -o /tmp/gAc_glZta2/libfoo.so /tmp/gAc_glZta2/library.o
cc -fstack-protector -L/usr/local/lib -o /tmp/gAc_glZta2/foo /tmp/gAc_glZta2/test.o -L/tmp/gAc_glZta2 -lfoo

** Preparing XS code
t/00-load.t ....... 1/1 Bailout called.  Further testing stopped:  

#   Failed test 'use Lingua::Identify::CLD;'
#   at t/00-load.t line 6.
#     Tried to use 'Lingua::Identify::CLD'.
#     Error:  Not a CODE reference at /usr/lib/perl/5.14/DynaLoader.pm line 207.
# END failed--call queue aborted at .../Lingua-Identify-CLD-0.05/blib/lib/Lingua/Identify/CLD.pm line 207.
# BEGIN failed--compilation aborted at .../Lingua-Identify-CLD-0.05/blib/lib/Lingua/Identify/CLD.pm line 24.
# Compilation failed in require at (eval 4) line 2.
# BEGIN failed--compilation aborted at (eval 4) line 2.
Use of uninitialized value $Lingua::Identify::CLD::VERSION in concatenation (.) or string at t/00-load.t line 9.
# Testing Lingua::Identify::CLD , Perl 5.014002, /usr/bin/perl
# Looks like you failed 1 test of 1.
FAILED--Further testing stopped.

Just the day before I had successfully compiled and run the tests for the same version of the module, but on Ubuntu 11.10, which I was using. Then I decided to upgrade to 12.10, and that's where I got this failed test run.

Contacting the author

Then I decided to contact the author of the module. Being Alberto quite a known author, with lots of CPAN contributions, I hoped he would answer my query within 2-3 days. That would give me some time to do other stuff, and hopefully would give him time to analyze the failure.

As usual with the best CPAN authors ;-) he answered in a couple of hours, which was fantastic for me. He had already identified a few failures like mine thanks to another awesome resource we have in the Perl community, the CPAN Testers service.

CPAN Testers

CPAN testers is a group of users that regularly (or not) report back the build/test status of everything that's released to CPAN in a multitude of platforms and versions of Perl. I think this is one of the most underestimated awesome features we have in the Perl community. The CPAN testers status of Lingua::Identify::CLD shows one report that looks exactly the same as the failure I experienced. This is on Ubuntu 12.10 with the stock perl 5.14.2.

The ugly patch

I tried to analyze the problem, apparently located in DynaLoader, and came up with a shotgun-debugging-driven patch that I copy/paste here for reference:

@@ -18,10 +18,23 @@ Version 0.05

 our $VERSION = '0.05';
 
-use XSLoader;
-BEGIN {
+eval {
+
+    require XSLoader;
     XSLoader::load('Lingua::Identify::CLD', $VERSION);
-}
+
+} or do {
+
+    # This warning triggers on Ubuntu 12.10 with the
+    # stock perl 5.14.2. Strangely enough, this doesn't
+    # seem to affect the tests at all.
+    #
+    # Not a CODE reference at /usr/lib/perl/5.14/DynaLoader.pm line 207.
+    # END failed--call queue aborted at .../blib/lib/Lingua/Identify/CLD.pm line 207.
+    # ) at .../blib/lib/Lingua/Identify/CLD.pm line 28."
+    #
+    #warn "Something's wrong with XSLoader? ($@)";
+};
 
 =head1 SYNOPSIS

It's shotgun debugging because I don't really know what's going on, I just came up with this patch because of the assumptions and information I gathered during the years on how DynaLoader/XSLoader and BEGIN {} blocks work or interact with the rest of the code :-)

Anyway, it makes the tests pass again, even with a weird warning. I agree with Alberto that it's not wise to incorporate this patch into Lingua::Identify::CLD, until we have understood why the original code fails, and why just for 2 people in the world.

All this blah-blah, to say: please do help! If you have seen the same problem, help us figure out what it is. My repository with the forked/patched code is on Github:

https://github.com/cosimo/Lingua-Identify-CLD

Have fun!