Micro-ISV.asia

Thursday, 26 March 2009

Why Not Use UTF-8 for Everything?

Filed under: Cyberspace — Jan Goyvaerts @ 12:07

So why all this hassle of dealing with code pages instead of just using UTF-8, which, as a Unicode transformation format, supports all of the world’s modern languages (and some not so modern ones)?

Unicode is necessary when a single file mixes multiple scripts and two or more of those scripts require different code pages. You can mix Thai and English, because TIS-620 also includes the ASCII characters. But you cannot mix Thai and Greek without using Unicode, because Thai and Greek require different code pages.
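The code page limitation is easy to check. The Python sketch below is only an illustration: `can_encode` is a hypothetical helper, and it assumes Python's built-in `tis-620` and `iso-8859-7` codecs as stand-ins for the Thai and Greek code pages.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

thai_english = "Thai: \u0e44\u0e17\u0e22"             # "Thai: ไทย"
thai_greek = "\u0e44\u0e17\u0e22 \u03b1\u03b2\u03b3"  # "ไทย αβγ"

# Print, per encoding, whether each sample string can be represented.
for encoding in ("tis-620", "iso-8859-7", "utf-8"):
    print(encoding,
          can_encode(thai_english, encoding),
          can_encode(thai_greek, encoding))
```

Under these assumptions, only UTF-8 can represent the Thai and Greek sample together, while TIS-620 handles the Thai and English sample on its own.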

But Unicode wastes bandwidth when using only a single language. In TIS-620, the single-byte code page for Thai, every character takes up one byte. But in UTF-8, Thai characters take up three bytes each. Many people think UTF-8 is efficient because ASCII characters take up only one byte. In reality, UTF-8 is only efficient if most of your file consists of ASCII. If half the characters in your file are HTML tags (in ASCII), and the other half are Thai body text, then saving the file in UTF-8 makes it take up twice as much space as TIS-620 would. For HTML files, which provide a reliable way to specify the encoding via the meta tag, this is simply a waste of bandwidth.
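The arithmetic can be verified with a small Python sketch. The sample strings and the `encoded_size` helper are made up for illustration, and "half" is measured in characters: 42 ASCII characters of markup and 42 Thai characters of body text.

```python
def encoded_size(text, encoding):
    """Number of bytes that text occupies in the given encoding."""
    return len(text.encode(encoding))

thai_word = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"  # "สวัสดี", 6 code points
markup = "<p></p>" * 6   # 42 ASCII characters
body = thai_word * 7     # 42 Thai characters
page = markup + body

tis620_bytes = encoded_size(page, "tis-620")  # 42 + 42 = 84
utf8_bytes = encoded_size(page, "utf-8")      # 42 + 3 * 42 = 168
print(tis620_bytes, utf8_bytes, utf8_bytes / tis620_bytes)
```

The ASCII half costs the same in both encodings; the doubling comes entirely from the Thai half tripling in size.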

It gets worse as the non-ASCII content of your file increases. If your file is half Thai (3 bytes per character in UTF-8) and half Chinese (3 or 4 bytes per character in UTF-8), even UTF-16 is a better choice than UTF-8. While UTF-16 is often seen as inefficient because it uses 2 bytes for ASCII characters, it also uses only 2 bytes for Thai and 2 bytes for most Chinese characters (but 4 bytes for supplementary characters, for which UTF-8 also uses 4 bytes).
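A similar sketch compares UTF-8 with UTF-16 for a Thai and Chinese mix. Again the sample text and the `encoded_size` helper are illustrative assumptions; `utf-16-le` is used so that Python does not prepend a byte order mark to the result.

```python
def encoded_size(text, encoding):
    """Number of bytes that text occupies in the given encoding."""
    return len(text.encode(encoding))

thai = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"  # 6 Thai characters
chinese = "\u4e2d\u6587" * 3                   # 6 Chinese characters in the BMP
supplementary = "\U00020000"                   # 1 supplementary CJK character

mixed = thai + chinese
utf8_bytes = encoded_size(mixed, "utf-8")       # 12 * 3 = 36
utf16_bytes = encoded_size(mixed, "utf-16-le")  # 12 * 2 = 24
print(utf8_bytes, utf16_bytes)

# Supplementary characters cost 4 bytes in both encodings.
print(encoded_size(supplementary, "utf-8"),
      encoded_size(supplementary, "utf-16-le"))
```

For this sample, UTF-16 needs two thirds of the bytes UTF-8 does, and the two tie on the supplementary character.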

Unicode certainly offers a lot of flexibility and takes away a lot of problems. But Unicode isn’t free. In situations where bandwidth is at a premium, and only one script (plus ASCII) is used, using a legacy character set can make a lot of sense.

4 Comments »

  1. Nice article. I have always wondered why UTF-8 wasn’t just used for everything. Now I know! Thanks for the info.

    Comment by Lance — Thursday, 2 April 2009 @ 6:49

  2. It would be less confusing if only one type of encoding were used. I hope UTF-8 can evolve in the future so that everybody can use it. Thank you for the useful information about Unicode.

    Comment by Rosie @ Car Parts — Saturday, 11 April 2009 @ 14:09

  3. One size will never fit all. If it did, UTF-8 would have never been defined in the first place. UTF-8 exists because some people didn’t want to use two bytes for ASCII characters. (Unicode was initially conceived as a 16-bit character set.)

    Comment by Jan Goyvaerts — Sunday, 12 April 2009 @ 13:57

  4. TIS-620 is archaic and is ready to be retired. UTF-8 is the way forward. I don’t believe it is worth being concerned about the compactness of Unicode vs single byte codings; single byte codings bring too much confusion. Most webservers support on the fly compression and people “zip” large documents.

    We have a nation of Thai people who still can’t use the Thai language properly on computer systems, MP3 players, DVDs, etc. because those products aren’t made for the Thai market and therefore don’t understand TIS-620. Besides, how do you know the coding is TIS-620? With Unicode, there is no question about the language.

    Look at CJK languages and you’ll see another problem: many types of coding and no way to know which one to use. UTF-8 will solve this problem. Admittedly, the Unicode consortium messed up thinking that Chinese characters/Kanji/Hanja can be unified. Aside from that minor mistake, UTF-8 is best at the moment.

    Comment by koan — Tuesday, 21 April 2009 @ 14:30
