Micro-ISV.asia

Thursday, 26 March 2009

Why Not Use UTF-8 for Everything?

Filed under: Cyberspace — Jan Goyvaerts @ 12:07

So why all this hassle of dealing with code pages instead of just using UTF-8, which as a Unicode transformation supports all of the world’s modern languages (and some not so modern ones)?

Unicode is necessary when you use multiple scripts in one file and two or more of those scripts require different code pages. You can mix Thai and English, because TIS-620 also includes the ASCII characters. But you cannot mix Thai and Greek without using Unicode, because Thai and Greek require different code pages.
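This limitation is easy to check with any codec library. Here is a minimal Python sketch using the standard library codecs. The helper name `can_encode` and the choice of ISO-8859-7 as the Greek code page are my assumptions for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

thai_english = "Hello สวัสดี"
thai_greek = "สวัสดี Γειά σου"

print(can_encode(thai_english, "tis-620"))   # Thai plus ASCII fits in TIS-620
print(can_encode(thai_greek, "tis-620"))     # Greek letters are missing from TIS-620
print(can_encode(thai_greek, "iso8859-7"))   # Thai letters are missing from ISO-8859-7
print(can_encode(thai_greek, "utf-8"))       # Unicode covers both scripts
```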

But Unicode wastes bandwidth when using only a single language. In TIS-620, the single-byte code page for Thai, every character takes up one byte. But in UTF-8, Thai characters take up three bytes each. Many people think UTF-8 is efficient because ASCII characters take up only one byte. In reality, UTF-8 is only efficient if most of your file consists of ASCII. If half the characters in your file are HTML tags (in ASCII), and the other half are Thai body text, then saving the file in UTF-8 makes it take up twice as much space as TIS-620 would. For HTML files, which provide a reliable way to specify the encoding via the meta tag, this is simply a waste of bandwidth.
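The arithmetic can be verified by counting encoded bytes. This is a small Python sketch. The sample markup and the helper name `size` are made up for the example, with the markup chosen so that the ASCII and Thai halves each contain 10 characters.

```python
thai = "สวัสดีครับ"                 # 10 Thai characters
html = "<li>" + thai + "</li>\n"    # 10 ASCII characters plus 10 Thai characters

def size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

print(size(thai, "tis-620"), size(thai, "utf-8"))   # 10 30
print(size(html, "tis-620"), size(html, "utf-8"))   # 20 40
```

The half-ASCII, half-Thai snippet needs 20 bytes in TIS-620 and 40 bytes in UTF-8, which is the factor of two mentioned above.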

It gets worse as the non-ASCII content of your file increases. If your file is half Thai (3 bytes per character in UTF-8) and half Chinese (3 or 4 bytes per character in UTF-8), even UTF-16 is a better choice than UTF-8. While UTF-16 is often seen as inefficient because it uses 2 bytes for ASCII characters, it also uses only 2 bytes for Thai and 2 bytes for most Chinese characters (but 4 bytes for supplementary characters, for which UTF-8 also uses 4 bytes).
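The same kind of byte count illustrates the UTF-16 comparison. This is a sketch with sample strings of my own choosing. I use the `utf-16-le` codec so that Python does not add a byte order mark, which would otherwise add 2 bytes to each total.

```python
thai = "สวัสดีครับ"               # 10 Thai characters
chinese = "你好世界你好世界你好"    # 10 common Chinese characters
supplementary = "\U00020000"       # one character outside the Basic Multilingual Plane

def size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

mixed = thai + chinese
print(size(mixed, "utf-8"), size(mixed, "utf-16-le"))                   # 60 40
print(size(supplementary, "utf-8"), size(supplementary, "utf-16-le"))   # 4 4
```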

Unicode certainly offers a lot of flexibility and takes away a lot of problems. But Unicode isn’t free. In situations where bandwidth is at a premium, and only one script (plus ASCII) is used, using a legacy character set can make a lot of sense.

Wednesday, 25 March 2009

Make Sure Your Web Site Is Always Displayed With The Right Characters

Filed under: Cyberspace, Just Great Software — Jan Goyvaerts @ 12:07

A common problem with web pages that use characters beyond the basic Latin letters A to Z is that those characters aren’t always displayed correctly. They’re substituted with characters from other writing systems.

Suppose a Thai webmaster creates this web page. On his PC, which is set for Thai, the page will display the Thai greeting สวัสดีครับ just fine. But on my PC, which is set for Western European languages, the page will display ÊÇÑÊ´Õ¤ÃÑº which does not make sense. The reason that this happens is that this web page fails to specify the character set that it was written with. The browser has no way of guessing which character set to use. So the page is displayed with the default character set, and will only be displayed as intended if the default character set of the visitor’s browser is the same as that of the webmaster’s browser.
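You can reproduce the effect outside a browser. This is a minimal Python sketch, assuming the webmaster's file holds the greeting as TIS-620 bytes and the visitor's default is Windows 1252.

```python
greeting = "สวัสดีครับ"

raw = greeting.encode("tis-620")    # the bytes stored in the webmaster's file

as_thai = raw.decode("tis-620")     # Thai default: bytes read as intended
as_western = raw.decode("cp1252")   # Western default: same bytes, wrong table

print(as_thai)      # สวัสดีครับ
print(as_western)   # ÊÇÑÊ´Õ¤ÃÑº
```

The bytes never change. Only the table used to turn them into characters differs.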

The solution is to use a meta tag to specify the character set. This web page correctly specifies the TIS-620 character set. It will display correctly on any computer that has a Thai font installed, regardless of the computer’s or the browser’s default language settings. If the computer does not have any Thai fonts installed, squares will appear instead. On Windows, you can install Thai fonts by going into the Control Panel, opening Regional and Language Options, clicking on the Languages tab, ticking the checkbox “Install files for complex script and right-to-left languages (including Thai)”, and clicking OK.
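To give an idea of how software can honor the meta tag, here is a simplified Python sketch. It is not how any particular browser or EditPad implements detection. The regular expression, the 1024-byte search window, the function name `sniff_charset`, and the sample page are all my own assumptions.

```python
import re

# Look for charset=NAME inside a <meta ...> tag. Searching the raw bytes works
# because the tag itself is plain ASCII in ASCII-compatible encodings.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)

def sniff_charset(raw: bytes, default: str = "cp1252") -> str:
    """Return the charset named in a meta tag, or the default if none is found."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii") if match else default

page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=tis-620"></head><body>'
    + "สวัสดีครับ".encode("tis-620")
    + b"</body></html>"
)

encoding = sniff_charset(page)
print(encoding)                # tis-620
print(page.decode(encoding))   # the Thai greeting decodes correctly
```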

Our text editor EditPad detects the charset meta tag just like a web browser does. (What follows applies to EditPad Lite and Pro 6.4.5.) If you open both files in EditPad with the Western European code page Windows 1252 as the default, the page without the meta tag shows as:

The page with the meta tag shows as:

The status bar indicator in the bottom right corner shows that EditPad is using the Windows 874 code page, which is Microsoft’s extension of TIS-620. You can enable that status bar indicator via Options, Preferences, Statusbar in EditPad Pro (but not Lite).

To make the HTML file without the meta tag display correctly in EditPad, you can use the Text Encoding item in the Convert menu. Select the “interpret” option, and Windows 874 as the new encoding. The “interpret” option tells EditPad that the file’s contents are correct (and should remain unchanged), but that it isn’t being displayed with the right character set. This is the same as telling your web browser to display the file with a different encoding.
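The difference between interpreting and converting can be sketched in a few lines of Python. The function names `interpret` and `convert` are mine and only model the idea. They are not EditPad's internals.

```python
def interpret(raw: bytes, new_encoding: str):
    """Keep the bytes as they are; only change how they are decoded."""
    return raw, new_encoding

def convert(raw: bytes, old_encoding: str, new_encoding: str):
    """Keep the characters as they are; rewrite the bytes in a new encoding."""
    return raw.decode(old_encoding).encode(new_encoding), new_encoding

raw = "สวัสดีครับ".encode("cp874")   # file saved on a Thai computer

print(raw.decode("cp1252"))           # garbled when shown as Windows 1252

same_bytes, enc = interpret(raw, "cp874")
print(same_bytes == raw, same_bytes.decode(enc))   # True สวัสดีครับ

new_bytes, enc = convert(raw, "cp874", "utf-8")
print(len(raw), len(new_bytes))       # 10 30
```

Interpreting leaves the file on disk untouched, which is why it fixes a display problem without altering the content.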

Changing the encoding this way is only a temporary solution. If you close and reopen the file, it will be shown with the default encoding again. While you could change the default in Options, Configure File Types, Encoding, your file still won’t be displayed correctly on other people’s computers, and other files on your computer may now be displayed incorrectly.

The solution is to add the proper meta tag. In EditPad (Lite and Pro), you can paste it right in:

If you then try to save the file, EditPad will give you a warning:

The warning may seem a bit scary because it’s so verbose, but it’s really quite simple. We pasted a meta tag that says TIS-620 into a file that EditPad is showing with the Windows 1252 code page. EditPad knows that when you open the file again in EditPad or in a browser, it will be displayed with TIS-620 (or Windows 874) instead of Windows 1252. EditPad asks you to choose which is correct: the encoding specified by the meta tag, or the encoding EditPad is displaying the file with.

In this case, we pasted the correct meta tag into a file that EditPad is displaying with the wrong encoding. The proper response to the warning is to click the Keep Meta Tag button. EditPad then saves the file, and automatically updates the display to use Windows 874 as the meta tag specifies.

If you want to avoid the warning, you’d have to use Convert, Text Encoding first to interpret the file as Windows 874, then paste in the meta tag, and then save.

Now for a different scenario. Say you have created a proper English web page using the ISO-8859-1 code page, better known as Latin 1:

You send this file to your Thai translator, who uses Notepad to come up with this:

Your translator saves the file using the ANSI setting in Notepad, which means Windows 874 on a Thai computer. Notepad doesn’t care about the incorrect meta tag. It only supports UTF-8, UTF-16, and “ANSI”, which means whichever Windows code page happens to be your computer’s default. But when you open your translator’s file in your browser or in EditPad, you get:

That’s obviously not right. While you could fix this by pasting in the right meta tag like we did earlier, let’s assume you don’t know which meta tag to use for Thai. To get the file to display correctly in EditPad, you use the Text Encoding item in the Convert menu to interpret the file with Windows 874 for Thai. That’s easy, because the Text Encoding dialog box indicates the languages that each encoding is used for, and it shows a preview of what the file will look like. Now the text is readable:

But the meta tag is still wrong. When you save, you get the same warning as you did before:

This time, EditPad is showing the file with the Windows 874 code page, but EditPad knows that when you open the file again in EditPad or in a browser, it will be displayed with the ISO-8859-1 code page specified by the meta tag. Again, EditPad asks you to choose which is correct: the encoding specified by the meta tag, or the encoding EditPad is displaying the file with.

This time, the meta tag is wrong. Thus, you need to click the Change Meta Tag button. This tells EditPad to keep the file’s display as it is, but to change the charset=ISO-8859-1 bit in the HTML file into charset=tis-620.
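In code, changing the meta tag amounts to rewriting only the charset name and leaving every other byte alone. Here is a rough Python sketch of that idea. The regular expression and the function name `change_meta_charset` are hypothetical, and I am not claiming this is what EditPad does internally.

```python
import re

# Capture the "charset=" prefix (with optional quote) and the name separately.
CHARSET = re.compile(rb"""(charset\s*=\s*["']?)([A-Za-z0-9_.:-]+)""", re.IGNORECASE)

def change_meta_charset(raw: bytes, new_name: str) -> bytes:
    """Replace the first charset name in the file; leave all other bytes alone."""
    return CHARSET.sub(lambda m: m.group(1) + new_name.encode("ascii"), raw, count=1)

body = "สวัสดีครับ".encode("cp874")   # the translator's text, saved as "ANSI"
translated = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    + body
)

fixed = change_meta_charset(translated, "tis-620")
print(fixed[:72])            # the tag now ends in charset=tis-620">
print(fixed.endswith(body))  # True: the Thai bytes were not touched
```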

EditPad knows the code page names of all the encodings that it supports. If you don’t know the proper meta tag, EditPad will correct it for you as long as you specify a valid but incorrect meta tag. E.g. if your file specifies charset=utf-8 but you’re not using UTF-8, EditPad will offer to change it. If your file specifies charset=doesnotexist, EditPad ignores the meta tag because it doesn’t understand it.
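A similar distinction between known and unknown charset names can be sketched with Python's codec registry. This only illustrates the idea and does not describe EditPad's own logic. The helper names `known_charset` and `decodes_cleanly` are mine. Note that the second check can only catch mismatches for encodings that reject some byte sequences, such as UTF-8. A single-byte code page like ISO-8859-1 accepts any bytes.

```python
import codecs

def known_charset(name: str):
    """Return the canonical codec name, or None if the name is not recognized."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None

def decodes_cleanly(raw: bytes, name: str) -> bool:
    """Return True if the bytes are valid in the named encoding."""
    try:
        raw.decode(name)
        return True
    except UnicodeDecodeError:
        return False

thai_bytes = "สวัสดีครับ".encode("cp874")

print(known_charset("utf-8"))                # a recognized name
print(known_charset("doesnotexist"))         # None: ignore this meta tag
print(decodes_cleanly(thai_bytes, "utf-8"))  # False: the tag says UTF-8, the bytes disagree
print(decodes_cleanly(thai_bytes, "cp874"))  # True
```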