Micro-ISV.asia

Thursday, 26 March 2009

Why Not Use UTF-8 for Everything?

Filed under: Cyberspace — Jan Goyvaerts @ 12:07

So why all this hassle of dealing with code pages instead of just using UTF-8, which as a Unicode transformation supports all of the world’s modern languages (and some not so modern ones)?

Unicode is necessary when one file mixes multiple scripts and two or more of those scripts require different code pages. You can mix Thai and English, because TIS-620 also includes the ASCII symbols. But you cannot mix Thai and Greek without using Unicode, because Thai and Greek require different code pages.

But Unicode wastes bandwidth when using only a single language. In TIS-620, the single byte code page for Thai, every character takes up one byte. But in UTF-8, Thai characters take up three bytes each. Many people think UTF-8 is efficient because ASCII characters take up only one byte. In reality, UTF-8 is only efficient if most of your file consists of ASCII. If half the characters in your file are HTML tags (in ASCII), and the other half is Thai body text, then saving the file in UTF-8 makes it take up twice as much space as TIS-620 would. For HTML files, which provide a reliable way to specify the encoding via the meta tag, this is simply a waste of bandwidth.
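The byte counts are easy to check for yourself. The Python sketch below is only an illustration: the sample markup string is made up, and it relies on Python's built-in `tis_620` codec.

```python
# Compare the encoded size of Thai text in TIS-620 and UTF-8.
thai = "สวัสดีครับ"      # the Thai greeting, 10 code points
markup = "<p></p>   "   # hypothetical ASCII markup, also 10 characters


def size(text, encoding):
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


half_and_half = markup + thai  # half ASCII, half Thai by character count

for label, text in [("Thai only", thai), ("half ASCII, half Thai", half_and_half)]:
    print(label, "TIS-620:", size(text, "tis_620"), "UTF-8:", size(text, "utf-8"))
```

The Thai-only string takes 10 bytes in TIS-620 and 30 in UTF-8. The half-and-half string takes 20 bytes in TIS-620 and 40 in UTF-8, which is the doubling described above.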

It gets worse as the non-ASCII content of your file increases. If your file is half Thai (3 bytes per character in UTF-8) and half Chinese (3 or 4 bytes per character in UTF-8), even UTF-16 is a better choice than UTF-8. While UTF-16 is often seen as inefficient because it uses 2 bytes for ASCII characters, it also uses only 2 bytes for Thai and 2 bytes for most Chinese characters (but 4 bytes for supplementary characters, for which UTF-8 also uses 4 bytes).
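The same kind of check works for UTF-16. This is a small sketch with sample characters of my own choosing. It uses Python's `utf-16-le` codec so that the byte order mark is not counted.

```python
# Bytes per character in UTF-8 and UTF-16 for a few sample characters.
# "utf-16-le" is used so that no byte order mark is added to the count.
samples = {
    "ASCII letter": "A",
    "Thai letter": "ส",
    "common Chinese character": "中",
    "supplementary Chinese character": "\U00020000",
}


def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for name, ch in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{name}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```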

Unicode certainly offers a lot of flexibility and takes away a lot of problems. But Unicode isn’t free. In situations where bandwidth is at a premium, and only one script (plus ASCII) is used, using a legacy character set can make a lot of sense.

Wednesday, 25 March 2009

Make Sure Your Web Site Is Always Displayed With The Right Characters

Filed under: Cyberspace, Just Great Software — Jan Goyvaerts @ 12:07

A common problem with web pages that use characters beyond the basic Latin letters A to Z is that those characters aren’t always displayed correctly. They’re substituted with characters from other writing systems.

Suppose a Thai webmaster creates this web page. On his PC, which is set for Thai, the page will display the Thai greeting สวัสดีครับ just fine. But on my PC, which is set for Western European languages, the page will display ÊÇÑÊ´Õ¤ÃÑº which does not make sense. The reason that this happens is that this web page fails to specify the character set that it was written with. The browser has no way of guessing which character set to use. So the page is displayed with the default character set, and will only be displayed as intended if the default character set of the visitor’s browser is the same as that of the webmaster’s browser.
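You can reproduce the garbled greeting in a few lines of Python. This is a sketch that assumes the page was saved as TIS-620 and that the visitor's default is Windows 1252. The variable names are mine.

```python
# Reproduce the garbled greeting: bytes written as TIS-620, read as Windows 1252.
greeting = "สวัสดีครับ"
raw_bytes = greeting.encode("tis_620")   # what the Thai webmaster's file contains

as_western = raw_bytes.decode("cp1252")  # what a Western European default shows
as_thai = raw_bytes.decode("tis_620")    # what the webmaster intended

print(as_western)
print(as_thai)
```

The bytes in the file never change. Only the table used to turn them into characters differs.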

The solution is to use a meta tag to specify the character set. This web page correctly specifies the TIS-620 character set. It will display correctly on any computer that has a Thai font installed, regardless of the computer’s or the browser’s default language settings. If the computer does not have any Thai fonts installed, squares will appear instead. On Windows, you can install Thai fonts by going into the Control Panel, opening Regional and Language Options, clicking on the Languages tab, ticking the checkbox “Install files for complex script and right-to-left languages (including Thai)”, and clicking OK.
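To make the idea concrete, here is a rough Python sketch of how a program might read the charset from a meta tag before decoding a file. It is not how any particular browser or editor implements this. The function name, the regular expression, and the Windows 1252 fallback are all assumptions for the sake of illustration.

```python
import codecs
import re

# Looks for charset=VALUE inside a <meta ...> tag in the raw, undecoded bytes.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_encoding(raw_html, default="cp1252"):
    """Return the codec name declared by a charset meta tag, or default."""
    match = META_CHARSET_RE.search(raw_html)
    if match is None:
        return default  # no meta tag: fall back to the default code page
    label = match.group(1).decode("ascii")
    try:
        return codecs.lookup(label).name
    except LookupError:
        return default  # unknown charset label: ignore the meta tag


page = b'<meta http-equiv="Content-Type" content="text/html; charset=tis-620">'
print(declared_encoding(page))
```

A page without the tag, or with a label Python does not recognize, falls back to the default. That is exactly the situation in which the visitor sees the wrong characters.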

Our text editor EditPad detects the charset meta tag just like a web browser does. (What follows applies to EditPad Lite and Pro 6.4.5.) If you open both files in EditPad with the Western European code page Windows 1252 as the default, the page without the meta tag shows as:

The page with the meta tag shows as:

The status bar indicator in the bottom right corner shows that EditPad is using the Windows 874 code page, which is Microsoft’s extension of TIS-620. You can enable that status bar indicator via Options, Preferences, Statusbar in EditPad Pro (but not Lite).

To make the HTML file without the meta tag display correctly in EditPad, you can use the Text Encoding item in the Convert menu. Select the “interpret” option, and Windows 874 as the new encoding. The “interpret” option tells EditPad that the file’s contents are correct (and should remain unchanged), but that it isn’t being displayed with the right character set. This is the same as telling your web browser to display the file with a different encoding.
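The difference between interpreting and converting can be sketched in Python. This is my own illustration of the concept, not EditPad's code, and the variable names are made up.

```python
# "Interpret" keeps the bytes and changes how they are read.
# "Convert" keeps the characters and changes the bytes.
file_bytes = "สวัสดีครับ".encode("cp874")   # file saved on a Thai PC

shown_wrong = file_bytes.decode("cp1252")  # displayed with the wrong default

# Interpret as Windows 874: go back to the original bytes and read them again.
reinterpreted = shown_wrong.encode("cp1252").decode("cp874")

# Convert to UTF-8: the characters stay the same, the bytes on disk change.
converted_bytes = reinterpreted.encode("utf-8")

print(reinterpreted)
print(len(file_bytes), len(converted_bytes))
```

After interpreting, saving the text as Windows 874 again gives back exactly the same bytes. After converting, the file contents are different even though the text looks the same.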

Changing the encoding this way is only a temporary solution. If you close and reopen the file, it will be shown with the default encoding again. While you could change the default in Options, Configure File Types, Encoding, your file still won’t be displayed correctly on other people’s computers, and other files on your computer may now be displayed incorrectly.

The solution is to add the proper meta tag. In EditPad (Lite and Pro), you can paste it right in:

If you then try to save the file, EditPad will give you a warning:

The warning may seem a bit scary because it’s so verbose, but it’s really quite simple. We pasted a meta tag that says TIS-620 into a file that EditPad is showing with the Windows 1252 code page. EditPad knows that when you open the file again in EditPad or in a browser, it will be displayed with TIS-620 (or Windows 874) instead of Windows 1252. EditPad asks you to choose which is correct: the encoding specified by the meta tag, or the encoding EditPad is displaying the file with.

In this case, we pasted the correct meta tag into a file that EditPad is displaying with the wrong encoding. The proper response to the warning is to click the Keep Meta Tag button. EditPad then saves the file, and automatically updates the display to use Windows 874 as the meta tag specifies.

If you want to avoid the warning, you’d have to use Convert, Text Encoding first to interpret the file as Windows 874, then paste in the meta tag, and then save.

Now for a different scenario. Say you have created a proper English web page using the ISO-8859-1 code page, better known as Latin 1:

You send this file to your Thai translator, who uses Notepad to come up with this:

Your translator saves the file using the ANSI setting in Notepad, which means Windows 874 on a Thai computer. Notepad doesn’t care about the incorrect meta tag. It only supports UTF-8, UTF-16, and “ANSI”, which means whichever Windows code page happens to be your computer’s default. But when you open your translator’s file in your browser or in EditPad, you get:

That’s obviously not right. While you could fix this by pasting in the right meta tag like we did earlier, let’s assume you don’t know which meta tag to use for Thai. To get the file to display correctly in EditPad, you use the Text Encoding item in the Convert menu to interpret the file with Windows 874 for Thai. That’s easy, because the Text Encoding dialog box indicates the languages that each encoding is used for, and it shows a preview of what the file will look like. Now the text is readable:

But the meta tag is still wrong. When you save, you get the same warning as you did before:

This time, EditPad is showing the file with the Windows 874 code page, but EditPad knows that when you open the file again in EditPad or in a browser, it will be displayed with the ISO-8859-1 code page specified by the meta tag. Again, EditPad asks you to choose which is correct: the encoding specified by the meta tag, or the encoding EditPad is displaying the file with.

This time, the meta tag is wrong. Thus, you need to click the Change Meta Tag button. This tells EditPad to keep the file’s display as it is, but to change the charset=ISO-8859-1 bit in the HTML file into charset=tis-620.
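If you had to make that change by hand or in a script, it comes down to replacing the charset value and leaving everything else alone. Here is a minimal Python sketch. The function name and the regular expression are my own assumptions, and a real tool would restrict the match to meta tags.

```python
import re

# Matches the first charset=VALUE declaration, with an optional opening quote.
CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def change_meta_charset(html, new_charset):
    """Return html with the first charset declaration replaced by new_charset."""
    return CHARSET_RE.sub(lambda m: m.group(1) + new_charset, html, count=1)


page = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(change_meta_charset(page, "tis-620"))
```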

EditPad knows the code page names of all the encodings that it supports. If you don’t know the proper meta tag, specify any valid but incorrect one and EditPad will correct it for you. E.g. if your file specifies charset=utf-8 but you’re not using UTF-8, EditPad will offer to change it. If your file specifies charset=doesnotexist, EditPad ignores the meta tag that it doesn’t understand.

Friday, 21 November 2008

Who Owns Your Domain Name?

Filed under: Cyberspace — Jan Goyvaerts @ 18:24

You can check who owns your domain name simply by looking up your domain’s WHOIS record. The person or company indicated as the registrant in the WHOIS record is the owner of the domain. If the registrant data doesn’t show your name and your contact details, you don’t own your domain.
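If you'd rather script the lookup than use a web form, the WHOIS protocol itself is very simple: open a TCP connection to port 43, send the domain name followed by a line break, and read until the server closes the connection. The Python sketch below assumes you already know which WHOIS server to ask, which depends on the domain's registry and registrar. The helper that picks out registrant lines is a naive assumption too, since record formats differ between servers.

```python
import socket


def whois_query(domain, server, port=43, timeout=10.0):
    """Send one WHOIS query to server and return the raw response text."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def registrant_lines(record):
    """Return the lines of a WHOIS record that start with 'Registrant'."""
    return [
        line.strip()
        for line in record.splitlines()
        if line.strip().lower().startswith("registrant")
    ]
```

Calling `registrant_lines(whois_query("example.com", server))`, with `server` set to the WHOIS host for your domain, lists the registrant fields. If your own name isn't in them, that is your cue to investigate.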

Most registrars sell domain privacy services. The selling point is that WHOIS records are harvested by spammers, and surely you’d like to pay a small fee to receive less spam. But what you’re really paying for is for your registrar to own your domain name on your behalf. If there’s ever a problem with your registrar, they will be the owner of your domain, not you. Good luck suing them for breaking their domain privacy service contract. Quite a few registrars have gone bankrupt or had their registrar accreditations terminated in recent years. In such a situation, ICANN needs your contact info in the WHOIS records when they do a bulk transfer to another registrar. My domain names are very valuable to me. I’d rather get some more spam than take risks with my ownership of my domains. The article WHOIS Masking Considered Harmful on CircleID discusses this issue in depth.

Technical glitches are another issue. Recently, I looked up the WHOIS records of some domains I had just transferred to a different registrar. To my surprise, the new registrar was listed as the registrant of the domains I had transferred. When I logged onto the registrar’s control panel, the contact details for those domains were totally blank. It seems the contact information was lost during the transfer for reasons unknown, and the registrar had put their own information in the WHOIS for my domains because they have to put in something. Fortunately, all it took for me to fix this problem was to enter my contact information for those domains.

Monday, 5 May 2008

Book Publishing in a Digital World

Filed under: Cyberspace — Jan Goyvaerts @ 18:10

The software industry has long gone digital. The actual product has always been digital. But until the late 90s, most software was still sold in physical boxes. Today, it’s mostly only bargain bin stuff and major computer games (too big to download) that are still sold in boxes.

Most photographers have gone digital. A modern digital SLR takes better pictures than a 35mm film SLR, particularly at high sensitivity (ISO). Home video and TV are all digital. Only feature films are still shot on film. But the Red One camera may soon change that. The music industry tried to fight the trend. But I don’t believe that anyone selling music still believes that there’s a future in CDs.

The book industry doesn’t seem to have figured out the trend yet. Or maybe they’re just sticking their heads into the sand. But it’s inevitable that books too will soon be digital products. Only collector’s editions will be available on paper.

There is some ongoing discussion about the (programming) book industry in the blogosphere. I’m sure there will be a lot more talk before shipping printed books is no longer Amazon.com’s core business.

Talk of the average programmer reading less than a book per year is meaningless. People haven’t been reading books since… Well, people haven’t been reading books. Googling for average american reads books per year points me to One in Four [Americans] Read No Books [in 2006]. That’s not very different from 10 years ago. People who like to read still read. As for me, I read no books last month. I average about half a dozen per year. It used to be more before the internet. Reference material is better accessed online.

I’ve had a bit of a passion for writing ever since I inherited the typewriter my mother used in college. I was 7 or 8 and my fingers really hurt. :-)

Some sing the praises of lulu.com. I have actually used lulu.com as a reader and as a writer. It’s a mixed bag. I think it’s the swan song of the printed book. Just like the best steam trains were built when it was already obvious that diesel was the better technology.

In 2002, I wrote a detailed regular expressions tutorial for the documentation with PowerGREP. I didn’t want to release what I intended to be the world’s most powerful grep tool with a two-page syntax list like most other grep tools. I told you I like to write! In 2003, I published it online at regular-expressions.info to drive search engine traffic to PowerGREP and later RegexBuddy.

I regularly got requests for a printed or printable version of the web site. In 2006, I signed up with lulu.com to print my regex tutorial as a paperback. Creating the PDF with the proper dimensions and formatting, approving the printed proof (which I had to buy), setting up retail distribution, etc. took me about 4 days and $150. It was easy enough. Though what I made easily paid for 4 days of my time, sales were disappointing. I decided to kill the project when I delved deeper into the lulu.com stats: I was selling more PDF downloads than paperbacks! There is no demand for dead trees. Particularly not the kind that ships slowly and expensively. Lulu.com simply can’t compete with Amazon.com in terms of order fulfillment. People complain if they have to wait 15 minutes to download their licensed copy of RegexBuddy. I had to stop using my e-commerce provider that manually verified all orders and switch to one that processed orders in real time. But lulu.com takes many days just to print the book! There’s no way they can compete with Amazon Prime. And even that is already too slow in the digital world. If you sell something tangible, you cannot avoid working with retailers if you want to sell any kind of volume.

And on-demand printing is expensive. Traditional publishers pay very low royalties. But their printing costs are quite low. If you want to price your lulu.com book competitively, your royalty won’t be much higher. Far worse is that nobody will stock your book. Amazon.com won’t discount it. So you’ll end up spending your higher royalty on doing your own marketing.

As a reader, I purchased two books from lulu.com: The Tomes of Delphi: Algorithms and Data Structures and The ECO-III book.

The Tomes book was first published through a traditional publisher. When it went out of print, the author regained the copyright and made it available for print again on lulu.com. It was and is an excellent book. The author and original publisher did a great job. It’s quite timeless by computer book standards. This is what lulu.com does best: converting existing text into a paperback. That’s what I used it for as well, though with less success. Incidentally, Julian does not offer his book as a PDF download. I think he should. Lulu will pay him the same royalty for the download, but the reader won’t have to pay for and wait for shipping.

The ECO-III book is about the Enterprise Core Objects technology in Delphi for .NET (Architect edition). It should revolutionize developing database applications, but I can’t wrap my head around it. That’s why I got the book. I didn’t get beyond the first few chapters. I could look beyond the poor layout. But it simply does not explain the technology. It talks about classes and methods and where to click, but it doesn’t tell me why I should care about ECO or teach me the underlying methodology. I can read about classes and methods in ECO’s help files.

There’s no point in complaining about bad editors. There are a lot of bad editors, just like there are a lot of bad programmers. The fact that a bad editor can drive a book project into the ground only proves that you can’t publish a great book without a good editor who works well together with the author. I actually enjoyed the time I spent making a pixel-perfect PDF for lulu.com. But then I also enjoyed smashing my little fingers into a typewriter as a kid. Unfortunately, most authors really just want to write the book, not edit out all the little issues that keep a decent book from being great.

I’m sure Marco Cantu does well with his books on lulu.com. He made a name for himself with his traditionally published “Mastering Delphi” series. There will be room for lulu.com while printed books have the edge in readability. But I don’t believe that will last much longer.

If Amazon offered the Kindle in Thailand, I would buy it sight unseen. Unfortunately, its core feature of wirelessly downloading books won’t work here until Amazon signs on one of the local cellular networks. But there’s no doubt that something like the Kindle will soon eliminate the printed book. Instantaneous delivery, 200-book storage space and non-destructive yet persistent note-taking are only some features that can never happen with a printed book. And how long do you think it will be politically correct to make schoolchildren break their backs lugging dead trees? Give Moore’s law a few more years, and you’ll get a free Kindle with every newspaper or magazine subscription. Just like you get a cell phone with a calling plan.

PDFs and ebooks haven’t really caught on yet, because the transmissive low-resolution screens on PCs and laptops are uncomfortable for long reading sessions. The Kindle, however, has a reflective screen that’s just as nice as ink on paper, except that color is still on the to do list. But color is too expensive for the typical paperback too.

I don’t know if the Kindle will be successful in the long run or not. But I’m sure that something like it will be the only “book” people will carry physically in the future.

And that will really open up the long tail in book publishing. People go to lulu.com because they don’t want to deal with publishers, or publishers don’t want to deal with them. It’s only a matter of time before there’s a Kindle that uses an open format that allows anybody to publish their own books on their own web sites.

But publishers aren’t going to go away. In fact, I predict that in the end, publishers will come out stronger. At least the ones that survive. When everybody can publish their own drivel, a publisher’s cachet becomes even more valuable.

I was recently asked if I was interested in co-writing a book. I signed the contract last week. I didn’t sign for the money. I signed mainly for four reasons. I’ve wanted to write a real book since the days I hurt myself on an old typewriter. I’m only writing half the book, leaving enough time for my software business. The book is on a topic that I’m an expert on. It will be published by the #1 publisher of computer books.

Actually, the last reason is the only real reason. It’s the one that makes me believe that the book will be successful. It’ll actually be in real bookstores where people can browse through it. Amazon will have plenty of stock and a deep discount. People won’t recognize my or my co-author’s name. But they’ll recognize the publisher just from the style of the cover. People do judge books by their covers.