Micro-ISV.asia

Thursday, 26 March 2009

Why Not Use UTF-8 for Everything?

Filed under: Cyberspace — Jan Goyvaerts @ 12:07

So why all this hassle of dealing with code pages instead of just using UTF-8, which, as a Unicode transformation format, supports all of the world’s modern languages (and some not so modern ones)?

Unicode is necessary when you use multiple scripts in one file and those scripts require different code pages. You can mix Thai and English, because TIS-620 also includes the ASCII characters. But you cannot mix Thai and Greek without using Unicode, because Thai and Greek require different code pages.
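
As a small illustration of this point, the Python sketch below tries to encode two mixed-script strings with the TIS-620 codec that ships with Python. The sample strings and variable names are made up for this example.

```python
# Thai plus English fits in TIS-620, because the code page includes ASCII.
thai_english = "Hello สวัสดี"
tis620_bytes = thai_english.encode("tis_620")

# Every character, ASCII or Thai, is stored as a single byte.
one_byte_each = len(tis620_bytes) == len(thai_english)

# Thai plus Greek does not fit: TIS-620 has no Greek letters.
thai_greek = "สวัสดี αβγ"
try:
    thai_greek.encode("tis_620")
    fits_in_tis620 = True
except UnicodeEncodeError:
    fits_in_tis620 = False

# UTF-8 can represent both scripts in the same file.
utf8_bytes = thai_greek.encode("utf_8")

print(one_byte_each, fits_in_tis620, len(utf8_bytes))
```

The second string only becomes encodable once you switch to a Unicode encoding such as UTF-8.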

But Unicode wastes bandwidth when using only a single language. In TIS-620, the single-byte code page for Thai, every character takes up one byte. But in UTF-8, Thai characters take up three bytes each. Many people think UTF-8 is efficient because ASCII characters take up only one byte. In reality, UTF-8 is only efficient if most of your file consists of ASCII. If half the characters in your file are HTML tags (in ASCII), and the other half are Thai body text, then saving the file in UTF-8 makes it take up twice as much space as TIS-620 would. For HTML files, which provide a reliable way to specify the encoding via the meta tag, this is simply a waste of bandwidth.
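
The arithmetic can be checked with a short Python sketch. The sample document is invented for this example: 7 ASCII characters of markup around a 7-character Thai word, so exactly half the characters are ASCII.

```python
def encoded_size(text, codec):
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(codec))

thai_word = "ภาษาไทย"                   # 7 Thai characters
document = "<p>" + thai_word + "</p>"   # plus 7 ASCII characters of markup

tis620_size = encoded_size(document, "tis_620")  # 1 byte per character
utf8_size = encoded_size(document, "utf_8")      # 1 byte ASCII, 3 bytes Thai

print(tis620_size, utf8_size)  # 14 28
```

With 7 ASCII and 7 Thai characters, TIS-620 needs 14 bytes and UTF-8 needs 7 + 21 = 28 bytes, which is the factor of two described above.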

It gets worse as the non-ASCII content of your file increases. If your file is half Thai (3 bytes per character in UTF-8) and half Chinese (3 or 4 bytes per character in UTF-8), even UTF-16 is a better choice than UTF-8. While UTF-16 is often seen as inefficient because it uses 2 bytes for ASCII characters, it also uses only 2 bytes for Thai and 2 bytes for most Chinese characters (but 4 bytes for supplementary characters, for which UTF-8 also uses 4 bytes).
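
The same kind of check works for the UTF-8 versus UTF-16 comparison. In this Python sketch the sample strings are arbitrary, and the `utf_16_le` codec is used so that no byte order mark is counted in the total.

```python
thai = "ภาษาไทย"            # 7 Thai characters
chinese = "中文字符编码表"    # 7 Chinese characters, all in the Basic Multilingual Plane
mixed = thai + chinese

utf8_size = len(mixed.encode("utf_8"))       # 3 bytes per character here
utf16_size = len(mixed.encode("utf_16_le"))  # 2 bytes per character here

# A supplementary character (U+20000) takes 4 bytes in both encodings.
supplementary = "\U00020000"
supp_utf8 = len(supplementary.encode("utf_8"))
supp_utf16 = len(supplementary.encode("utf_16_le"))

print(utf8_size, utf16_size, supp_utf8, supp_utf16)  # 42 28 4 4
```

For this text UTF-8 needs 42 bytes against 28 for UTF-16. A file saved as plain UTF-16 would usually add a 2-byte byte order mark on top of that.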

Unicode certainly offers a lot of flexibility and takes away a lot of problems. But Unicode isn’t free. In situations where bandwidth is at a premium, and only one script (plus ASCII) is used, using a legacy character set can make a lot of sense.

5 Comments

  1. Nice article. I have always wondered why UTF-8 wasn’t just used for everything. Now I know! Thanks for the info.

    Comment by Lance — Thursday, 2 April 2009 @ 6:49

  2. One size will never fit all. If it did, UTF-8 would have never been defined in the first place. UTF-8 exists because some people didn’t want to use two bytes for ASCII characters. (Unicode was initially conceived as a 16-bit character set.)

    Comment by Jan Goyvaerts — Sunday, 12 April 2009 @ 13:57

  3. TIS-620 is archaic and is ready to be retired. UTF-8 is the way forward. I don’t believe it is worth being concerned about the compactness of Unicode vs single byte codings; single byte codings bring too much confusion. Most webservers support on the fly compression and people “zip” large documents.

    We have a nation of Thai people who still can’t use Thai language properly on computer systems, MP3 players, DVDs, etc. because those products aren’t made for the Thai market and therefore don’t understand TIS-620. Besides, how do you know the coding is TIS-620? With Unicode, there is no question about the language.

    Look at CJK languages and you’ll see another problem: many types of coding and no way to know which one to use. UTF-8 will solve this problem. Admittedly, the Unicode consortium messed up thinking that Chinese characters/Kanji/Hanja can be unified. Aside from that minor mistake UTF-8 is best at the moment.

    Comment by koan — Tuesday, 21 April 2009 @ 14:30

  4. So the bottom line is – if in doubt, use UTF-8. The arguments make sense but talking about bandwidth seems superficial for most cases, and it may become less relevant in the future as new hardware and standards such as fibre-to-the-node appear.

    When we talk about UTF-8 causing compatibility issues, we are probably talking about legacy systems or poorly developed software, and I don’t think the argument from compatibility should weigh too heavily against the adoption of UTF-8.

    The WWW is an international space, and I think that it makes very little sense nowadays not to move to a standard most people can agree on, and that would seem to be UTF-8.

    Comment by jack — Wednesday, 1 August 2012 @ 17:56

  5. “I hope UTF-8 could evolve in future so that everybody could use it”

    UTF-8 is as evolved as it’s ever going to be and everyone CAN use it.

    “Unicode was initially conceived as a 16-bit character set”

    But that is completely unrelated to character ENCODING.

    “TIS-620 is archaic”

    That is a bullshit appeal to emotion and not at all accurate.

    Comment by Craig — Friday, 1 February 2013 @ 9:05
