{"id":470,"date":"2010-09-24T09:57:11","date_gmt":"2010-09-24T08:57:11","guid":{"rendered":"http:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/"},"modified":"2010-09-24T09:57:11","modified_gmt":"2010-09-24T08:57:11","slug":"survival-guide-to-utf-8","status":"publish","type":"post","link":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/","title":{"rendered":"Survival guide to UTF-8"},"content":{"rendered":"<p>Please, future me, and please, you cool programmer that one way or another, one day or the other, have struggled understanding UTF-8 (in Perl or not), <b>do yourself a really big favor<\/b> and read the following links:<\/p>\n<ul>\n<li>\n<p><a href=\"http:\/\/perlgeek.de\/en\/article\/encodings-and-unicode\" rel=\"nofollow\">Encodings and Unicode in Perl<\/a> (<a href=\"http:\/\/perlgeek.de\/en\/article\/encodings-and-unicode\" rel=\"nofollow\" target=\"_blank\">http:\/\/perlgeek.de\/en\/article\/encodings-and-unicode<\/a>)<\/p>\n<p><a href=\"http:\/\/www.ahinea.com\/en\/tech\/perl-unicode-struggle.html\" rel=\"nofollow\">Unicode-processing issues in Perl and how to cope with it<\/a> (<a href=\"http:\/\/www.ahinea.com\/en\/tech\/perl-unicode-struggle.html\" rel=\"nofollow\" target=\"_blank\">http:\/\/www.ahinea.com\/en\/tech\/perl-unicode-struggle.html<\/a>)<\/p>\n<p>and:<\/p>\n<li>\n<p><a href=\"http:\/\/www.webtuesday.ch\/_media\/meetings\/utf-8_survival.pdf\" rel=\"nofollow\">[Slides] UTF-8 survival guide<\/a> (<a href=\"http:\/\/www.webtuesday.ch\/_media\/meetings\/utf-8_survival.pdf\" rel=\"nofollow\">http:\/\/www.webtuesday.ch\/_media\/meetings\/utf-8_survival.pdf<\/a>),<\/p>\n<\/li>\n<p>and if you still have spare time to read:<\/p>\n<li>\n<p><a href=\"http:\/\/www.tbray.org\/ongoing\/When\/200x\/2003\/04\/26\/UTF\" rel=\"nofollow\">Characters vs. 
Bytes<\/a> (<a href=\"http:\/\/www.tbray.org\/ongoing\/When\/200x\/2003\/04\/26\/UTF\" rel=\"nofollow\" target=\"_blank\">http:\/\/www.tbray.org\/ongoing\/When\/200x\/2003\/04\/26\/UTF<\/a>).<\/p>\n<\/li>\n<\/li>\n<\/ul>\n<p>After reading these few articles, you will be a much better human being. I promise. In the meantime, Perl programmer, remember that:<\/p>\n<ul>\n<li><code>use utf8;<\/code> only tells Perl that your source code is UTF-8 encoded; it says <b>nothing about the encoding of your data<\/b>. Let&#39;s say you define a scalar variable like:\n<pre><code>\r\nmy $username = &#39;\u30cd\u30aa&#39;;\r\n<\/code><\/pre>\n<p>Ok. Now, whether or not you have <code>use utf8<\/code> inside your script, <b>the bytes in your source file are exactly the same<\/b> (assuming the file is saved as UTF-8). What changes is how Perl reads them. Without <code>use utf8<\/code>, <code>$username<\/code> holds the six raw octets of the UTF-8 encoding, so <code>length($username)<\/code> is 6. With <code>use utf8<\/code>, Perl decodes the literal: the <code>$username<\/code> box holds two Unicode characters, <code>length($username)<\/code> is 2, and the scalar is flagged internally as UTF-8. Clear, right?<\/p>\n<\/li>\n<li>For the rest, open your filehandles declaring the encoding (<code>open my $fh, &#39;&lt;:encoding(UTF-8)&#39;, $file;<\/code>, which, unlike the laxer <code>:utf8<\/code> layer, also checks that the input is valid UTF-8), or explicitly use <code>Encode::(en|de)code_utf8($data)<\/code>.<\/li>\n<li>You can make sure the strings you define in your source code are UTF-8 encoded by saving your source file with <b>an editor that supports UTF-8 encoding<\/b>. For example, <a href=\"http:\/\/www.vim.org\/\" rel=\"nofollow\">vim<\/a> has <code>:set fileencoding=utf-8<\/code> for the encoding the file is written in (<code>:set encoding=utf-8<\/code> controls the encoding vim uses internally).<\/li>\n<li>Also, make sure your terminal, if you&#39;re using one, <b>is set to UTF-8 encoding<\/b>, otherwise you will see gibberish instead of your beloved Unicode characters. 
You can do that with any terminal on this planet, bar the Windows cmd.exe shell&#8230; If anyone knows how to, please tell me.<\/li>\n<li>And finally, use a <b>font with Unicode characters in it<\/b>, like Bitstream Vera Sans Mono (a common default on Linux), <a href=\"http:\/\/damieng.com\/blog\/2008\/05\/26\/envy-code-r-preview-7-coding-font-released\" rel=\"nofollow\">Envy Code R<\/a>, plain Courier, etc&#8230; or you will just see the broken-UTF8-character-of-doom. Yes, this one &#x2192; <strong>\ufffd<\/strong> :-) <\/li>\n<\/ul>\n<p>There&#39;s an additional problem when you need to feed some strings to a Digest module like <a href=\"http:\/\/search.cpan.org\/dist\/Digest-SHA1\/\" rel=\"nofollow\">Digest::SHA1<\/a> to get a hash back. SHA-1, like MD5 and the others, doesn&#39;t work on Unicode characters; it works on bytes, or octets.<\/p>\n<p>So, if you try something like:<\/p>\n<pre><code>\r\nuse utf8;\r\nuse Digest::SHA1;\r\n\r\nmy $string = &quot;\u30ed\u30b0\u30a4\u30f3\u30e1\u30fc\u30eb\u30a2\u30c9\u30ec\u30b9&quot;;\r\nmy $sha1 = Digest::SHA1-&gt;new();\r\n$sha1-&gt;add($string);\r\n\r\nprint $sha1-&gt;hexdigest();\r\n<\/code><\/pre>\n<p>it will miserably fail (<code>Wide character in subroutine entry at line 6<\/code>) because <code>$string<\/code> contains &quot;wide&quot; characters (code points above 255), so it must be turned into octets first, by doing:<\/p>\n<pre><code>\r\nuse utf8;\r\nuse Encode;\r\nuse Digest::SHA1;\r\n\r\nmy $string = &quot;\u30ed\u30b0\u30a4\u30f3\u30e1\u30fc\u30eb\u30a2\u30c9\u30ec\u30b9&quot;;\r\nmy $sha1 = Digest::SHA1-&gt;new();\r\n$sha1-&gt;add( Encode::encode_utf8($string) 
);\r\n\r\nprint $sha1-&gt;hexdigest();\r\n<\/code><\/pre>\n<p>I need to remind myself all the time that:<\/p>\n<ul>\n<li><code>Encode::encode_utf8($string)<\/code> wants a string with Unicode characters and will give you a string converted to UTF-8 octets, with the UTF8 flag <b>turned off<\/b>. Basically bytes. You can then do anything with them: print, put in a file, calculate a hash, etc&#8230;<\/li>\n<li><code>Encode::decode_utf8($octets)<\/code> wants a string of (possibly UTF-8) octets, and will give you a string of Unicode characters, with the UTF8 flag <b>turned on<\/b>, so for example trying to lowercase (<code>lc<\/code>) a &quot;\u00c5&quot; will result in a &quot;\u00e5&quot; character.<\/li>\n<\/ul>\n<p>So, there you go! Now you are a <b>1st level UTF-8 wizard<\/b>. Go and do your UTF-8 magic!<\/p>\n<p>Epilogue: now I&#39;m sure: in a couple of weeks I will come back to this post, and think that I still don&#39;t understand how UTF-8 works in Perl&#8230; :-)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Please, future me, and please, you cool programmer that one way or another, one day or the other, have struggled understanding UTF-8 (in Perl or not), do yourself a really big favor and read the following links: Encodings and Unicode in Perl (http:\/\/perlgeek.de\/en\/article\/encodings-and-unicode) Unicode-processing issues in Perl and how to cope with it (http:\/\/www.ahinea.com\/en\/tech\/perl-unicode-struggle.html) and: [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[69,50,71,57,70],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Survival guide to UTF-8 - Random hacking<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link 
rel=\"canonical\" href=\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Survival guide to UTF-8 - Random hacking\" \/>\n<meta property=\"og:description\" content=\"Please, future me, and please, you cool programmer that one way or another, one day or the other, have struggled understanding UTF-8 (in Perl or not), do yourself a really big favor and read the following links: Encodings and Unicode in Perl (http:\/\/perlgeek.de\/en\/article\/encodings-and-unicode) Unicode-processing issues in Perl and how to cope with it (http:\/\/www.ahinea.com\/en\/tech\/perl-unicode-struggle.html) and: [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/\" \/>\n<meta property=\"og:site_name\" content=\"Random hacking\" \/>\n<meta property=\"article:published_time\" content=\"2010-09-24T08:57:11+00:00\" \/>\n<meta name=\"author\" content=\"cosimo\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"cosimo\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/\"},\"author\":{\"name\":\"cosimo\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\"},\"headline\":\"Survival guide to UTF-8\",\"datePublished\":\"2010-09-24T08:57:11+00:00\",\"dateModified\":\"2010-09-24T08:57:11+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/\"},\"wordCount\":561,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\"},\"keywords\":[\"encoding\",\"perl\",\"unicode\",\"utf8\",\"vim\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/\",\"url\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/\",\"name\":\"Survival guide to UTF-8 - Random 
hacking\",\"isPartOf\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#website\"},\"datePublished\":\"2010-09-24T08:57:11+00:00\",\"dateModified\":\"2010-09-24T08:57:11+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.streppone.it\/cosimo\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Survival guide to UTF-8\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#website\",\"url\":\"https:\/\/www.streppone.it\/cosimo\/blog\/\",\"name\":\"Random hacking\",\"description\":\"Assume nothing. Code defensively. 
Keep it simple, stupid!\",\"publisher\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.streppone.it\/cosimo\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\",\"name\":\"cosimo\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g\",\"caption\":\"cosimo\"},\"logo\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/\"},\"url\":\"https:\/\/www.streppone.it\/cosimo\/blog\/author\/cosimo\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Survival guide to UTF-8 - Random hacking","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/","og_locale":"en_US","og_type":"article","og_title":"Survival guide to UTF-8 - Random hacking","og_description":"Please, future me, and please, you cool programmer that one way or another, one day or the other, have struggled understanding UTF-8 (in Perl or not), do yourself a really big favor and read the following links: Encodings and Unicode in Perl (http:\/\/perlgeek.de\/en\/article\/encodings-and-unicode) Unicode-processing issues in Perl and how to cope with it (http:\/\/www.ahinea.com\/en\/tech\/perl-unicode-struggle.html) and: [&hellip;]","og_url":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/","og_site_name":"Random hacking","article_published_time":"2010-09-24T08:57:11+00:00","author":"cosimo","twitter_card":"summary_large_image","twitter_misc":{"Written by":"cosimo","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#article","isPartOf":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/"},"author":{"name":"cosimo","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1"},"headline":"Survival guide to UTF-8","datePublished":"2010-09-24T08:57:11+00:00","dateModified":"2010-09-24T08:57:11+00:00","mainEntityOfPage":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/"},"wordCount":561,"commentCount":0,"publisher":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1"},"keywords":["encoding","perl","unicode","utf8","vim"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/","url":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/","name":"Survival guide to UTF-8 - Random hacking","isPartOf":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#website"},"datePublished":"2010-09-24T08:57:11+00:00","dateModified":"2010-09-24T08:57:11+00:00","breadcrumb":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/09\/survival-guide-to-utf-8\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.streppone.it\/cosimo\/blog\/"},{"@type":"ListItem","position":2,"name":"Survival 
guide to UTF-8"}]},{"@type":"WebSite","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#website","url":"https:\/\/www.streppone.it\/cosimo\/blog\/","name":"Random hacking","description":"Assume nothing. Code defensively. Keep it simple, stupid!","publisher":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.streppone.it\/cosimo\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1","name":"cosimo","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g","caption":"cosimo"},"logo":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/"},"url":"https:\/\/www.streppone.it\/cosimo\/blog\/author\/cosimo\/"}]}},"_links":{"self":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/posts\/470"}],"collection":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/comments?post=470"}],"version-history":[{"count":0,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/posts\/470\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/media?parent=470"}],"wp:term":[{"taxonomy":"cate
gory","embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/categories?post=470"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/tags?post=470"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}