Micro-ISV.asia

Wednesday, 25 March 2009

Make Sure Your Web Site Is Always Displayed With The Right Characters

Filed under: Just Great Software,Cyberspace — Jan Goyvaerts @ 12:07

A common problem with web pages that use characters beyond the basic Latin letters A to Z is that those characters aren’t always displayed correctly. Instead, they are replaced with characters from other writing systems.

Suppose a Thai webmaster creates this web page. On his PC, which is set for Thai, the page displays the Thai greeting สวัสดีครับ just fine. But on my PC, which is set for Western European languages, the page displays ÊÇÑÊ´Õ¤ÃÑº, which makes no sense. This happens because the web page fails to specify the character set it was written with, and the browser has no way of guessing which one to use. So the page is displayed with the browser’s default character set, and will only look as intended if the visitor’s default is the same as the webmaster’s.
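The substitution is easy to reproduce outside a browser. The following Python sketch is not part of the original example pages; it relies on Python’s built-in tis_620 and cp1252 codecs to encode the greeting the way a Thai editor would save it, and then decodes the very same bytes as Western European text.

```python
# Encode the greeting as a Thai editor would save it (TIS-620), then
# decode the same bytes the way a Western European browser would guess.
thai_text = "สวัสดีครับ"
raw_bytes = thai_text.encode("tis_620")

as_western = raw_bytes.decode("cp1252")  # wrong guess: garbled Latin characters
as_thai = raw_bytes.decode("tis_620")    # right guess: the original greeting

print(as_western)
print(as_thai)
```

The bytes never change between the two print calls. Only the character set used to read them does.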

The solution is to use a meta tag to specify the character set. This web page correctly specifies the TIS-620 character set. It will display correctly on any computer that has a Thai font installed, regardless of the computer’s or the browser’s default language settings. If the computer does not have any Thai fonts installed, squares will appear instead. On Windows, you can install Thai fonts by going into the Control Panel, opening Regional and Language Options, clicking on the Languages tab, ticking the checkbox “Install files for complex script and right-to-left languages (including Thai)”, and clicking OK.
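As for the meta tag itself, a charset declaration is commonly written as a meta element in the page’s head. The Python sketch below illustrates the idea and is not how any particular browser is implemented: the helper name sniff_charset, the simplified regular expression, and the cp1252 fallback are all assumptions. It looks for a declared charset in the raw bytes and falls back to a default when there is none.

```python
import re

# Finds charset=NAME inside a meta tag, for example:
# <meta http-equiv="Content-Type" content="text/html; charset=tis-620">
META_CHARSET_RE = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_charset(raw: bytes, default: str = "cp1252") -> str:
    """Return the charset named in a meta tag, or the default if none is found."""
    match = META_CHARSET_RE.search(raw)
    if match is None:
        return default
    return match.group(1).decode("ascii")


body = "<p>สวัสดีครับ</p>".encode("tis_620")
without_meta = b"<html><head></head><body>" + body + b"</body></html>"
with_meta = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=tis-620"></head><body>'
    + body
    + b"</body></html>"
)

print(without_meta.decode(sniff_charset(without_meta)))  # garbled text
print(with_meta.decode(sniff_charset(with_meta)))        # readable Thai
```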

Our text editor EditPad detects the charset meta tag just like a web browser does. (What follows applies to EditPad Lite and Pro 6.4.5.) If you open both files in EditPad with the Western European code page Windows 1252 as the default, the page without the meta tag shows as:

The page with the meta tag shows as:

The status bar indicator in the bottom right corner shows that EditPad is using the Windows 874 code page, which is Microsoft’s extension of TIS-620. You can enable that status bar indicator via Options, Preferences, Statusbar in EditPad Pro (but not Lite).

To make the HTML file without the meta tag display correctly in EditPad, you can use the Text Encoding item in the Convert menu. Select the “interpret” option, and Windows 874 as the new encoding. The “interpret” option tells EditPad that the file’s contents are correct (and should remain unchanged), but that it isn’t being displayed with the right character set. This is the same as telling your web browser to display the file with a different encoding.
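In code terms, interpreting amounts to decoding the same bytes with a different codec while the bytes on disk stay untouched. A minimal Python sketch of that idea follows; the variable names are illustrative and this is not EditPad’s implementation.

```python
# Bytes as stored on disk by a Thai editor (no meta tag in the file).
on_disk = "สวัสดีครับ".encode("cp874")

shown_by_default = on_disk.decode("cp1252")  # what a Western default shows
reinterpreted = on_disk.decode("cp874")      # after choosing to "interpret"

# Re-encoding each view with the codec used to display it gives back the
# original bytes: only the display changed, not the file's contents.
assert shown_by_default.encode("cp1252") == on_disk
assert reinterpreted.encode("cp874") == on_disk
print(reinterpreted)
```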

Changing the encoding this way is only a temporary solution. If you close and reopen the file, it will be shown with the default encoding again. While you could change the default in Options, Configure File Types, Encoding, your file still won’t be displayed correctly on other people’s computers, and other files on your computer may now be displayed incorrectly.

The solution is to add the proper meta tag. In EditPad (Lite and Pro), you can paste it right in:

If you then try to save the file, EditPad will give you a warning:

The warning may seem a bit scary because it’s so verbose, but it’s really quite simple. We pasted a meta tag that says TIS-620 into a file that EditPad is showing with the Windows 1252 code page. EditPad knows that when you open the file again in EditPad or in a browser, it will be displayed with TIS-620 (or Windows 874) instead of Windows 1252. EditPad asks you to choose which is correct: the encoding specified by the meta tag, or the encoding EditPad is displaying the file with.
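The check behind such a warning can be pictured as a comparison between the charset the file declares and the encoding currently used to display it. The Python sketch below is an assumption about the general idea, not EditPad’s actual logic. The names canonical and needs_warning are made up for illustration, the regular expression only handles the simple charset=NAME form, and Windows 874 is folded into TIS-620 because it extends that character set.

```python
import codecs
import re

DECLARED_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def canonical(name: str) -> str:
    """Normalize a charset name through Python's codec registry."""
    codec_name = codecs.lookup(name).name
    # Assumption for this sketch: treat Windows 874 as equivalent to TIS-620.
    if codec_name == codecs.lookup("cp874").name:
        return codecs.lookup("tis-620").name
    return codec_name


def needs_warning(raw: bytes, displayed_as: str) -> bool:
    """True when the file declares a known charset other than the one in use."""
    match = DECLARED_RE.search(raw)
    if match is None:
        return False  # nothing declared, nothing to contradict
    try:
        declared = canonical(match.group(1).decode("ascii"))
    except LookupError:
        return False  # unknown charset names are ignored
    return declared != canonical(displayed_as)


page = b'<meta http-equiv="Content-Type" content="text/html; charset=tis-620">'
print(needs_warning(page, "cp1252"))  # declaration and display disagree
print(needs_warning(page, "cp874"))   # declaration and display agree
```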

In this case, we pasted the correct meta tag into a file that EditPad is displaying with the wrong encoding. The proper response to the warning is to click the Keep Meta Tag button. EditPad then saves the file, and automatically updates the display to use Windows 874 as the meta tag specifies.

If you want to avoid the warning, you’d have to use Convert, Text Encoding first to interpret the file as Windows 874, then paste in the meta tag, and then save.

Now for a different scenario. Say you have created a proper English web page using the ISO-8859-1 code page, better known as Latin 1:

You send this file to your Thai translator, who uses Notepad to come up with this:

Your translator saves the file using the ANSI setting in Notepad, which means Windows 874 on a Thai computer. Notepad doesn’t care about the incorrect meta tag. It only supports UTF-8, UTF-16, and “ANSI”, which means whichever Windows code page happens to be your computer’s default. But when you open your translator’s file in your browser or in EditPad, you get:

That’s obviously not right. While you could fix this by pasting in the right meta tag like we did earlier, let’s assume you don’t know which meta tag to use for Thai. To get the file to display correctly in EditPad, you use the Text Encoding item in the Convert menu to interpret the file with Windows 874 for Thai. That’s easy, because the Text Encoding dialog box indicates the languages that each encoding is used for, and it shows a preview of what the file will look like. Now the text is readable:

But the meta tag is still wrong. When you save, you get the same warning as you did before:

This time, EditPad is showing the file with the Windows 874 code page, but EditPad knows that when you open the file again in EditPad or in a browser, it will be displayed with the ISO-8859-1 code page specified by the meta tag. Again, EditPad asks you to choose which is correct: the encoding specified by the meta tag, or the encoding EditPad is displaying the file with.

This time, the meta tag is wrong. Thus, you need to click the Change Meta Tag button. This tells EditPad to keep the file’s display as it is, but to change the charset=ISO-8859-1 bit in the HTML file into charset=tis-620.
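What Change Meta Tag amounts to can be sketched in Python. This is only an illustration under assumptions: the function name replace_declared_charset is hypothetical, the regular expression handles only the simple charset=NAME form, and EditPad’s actual behavior may differ. The declaration is rewritten while every other byte of the file stays as it is.

```python
import re

CHARSET_VALUE_RE = re.compile(
    rb"(charset\s*=\s*[\"']?)([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def replace_declared_charset(raw: bytes, new_name: str) -> bytes:
    """Swap the first charset=NAME value for new_name; leave other bytes alone."""
    return CHARSET_VALUE_RE.sub(
        lambda m: m.group(1) + new_name.encode("ascii"), raw, count=1
    )


# The translator's file: Thai bytes, but a meta tag that still says Latin 1.
thai_body = "<p>สวัสดีครับ</p>".encode("cp874")
translated = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    + thai_body
)

fixed = replace_declared_charset(translated, "tis-620")
print(fixed.decode("tis_620"))
```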

EditPad knows the code page names of all the encodings that it supports. If you don’t know the proper charset name for your file, you can specify any valid but incorrect one, and EditPad will correct it for you. E.g. if your file specifies charset=utf-8 but you’re not using UTF-8, EditPad will offer to change it. If your file specifies charset=doesnotexist, EditPad ignores the meta tag that it doesn’t understand.
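Python’s codec registry behaves in a comparable way and makes a handy illustration: known names resolve, and unknown ones raise an error that a tool can choose to ignore. The helper name is_known_charset is an assumption for this sketch, and the set of names Python recognizes is not necessarily the same as EditPad’s.

```python
import codecs


def is_known_charset(name: str) -> bool:
    """True if Python has a codec registered under this charset name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for declared in ("tis-620", "ISO-8859-1", "utf-8", "doesnotexist"):
    print(declared, is_known_charset(declared))
```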

3 Comments

  1. Your example looks messed up — the page http://www.micro-isv.asia/download/sawatdee.html actually does include the charset so it looks okay when it shouldn’t

    Comment by Tom Crosley — Sunday, 5 July 2009 @ 13:58

  2. I’ve corrected the examples so one doesn’t have the charset while the other one does, as the text of the article indicates.

    Comment by Jan Goyvaerts — Friday, 7 August 2009 @ 16:15

  3. I think it is nice that the bummer dialog is changed.
    Most times when you write with foreign characters, those characters aren’t supported by your keyboard either, and you will have to use &#xx; for them.
    Then the characters in the source code are supported by all encodings, since it contains only standard characters.
    In this case it doesn’t matter which encoding the document is saved in.
    And then it is nice that the “keep meta” option enables saving of those kinds of documents, where it is the meta encoding that is important.

    I think that, if possible, a “convert non-standard characters to &#xx;” option would be nice, thus removing the problem of which encoding the document is saved in.

    It is highly recommended (by me too) to always write non-standard characters using &#xx;

    Comment by Lars Andersen — Friday, 6 November 2009 @ 18:25
