Micro-ISV.asia

Sunday, 28 December 2008

Choose The Right File Format for Your Delphi Source Code

Filed under: Programming — Jan Goyvaerts @ 12:39

I just got a bug report for the latest version of TPerlRegEx. A user from China was getting these errors trying to compile TPerlRegEx:

[DCC Warning] PerlRegEx.pas(265): W1063 Widening given AnsiChar constant (#$B7) to WideChar lost information
[DCC Error] PerlRegEx.pas(265): E2030 Duplicate case label

Needless to say, TPerlRegEx compiled just fine for me on my own system. I sometimes release code with embarrassing bugs (like the one fixed in the latest TPerlRegEx my Chinese user was trying to compile), but I’ve never released code that doesn’t actually compile. So what happened?

The fact that the bug report came from China is relevant. This is what the code looks like on my own system:

#0..'&', '(', '*', '+', ',', '-', '.', '?', '<', '[', '{', '·':

Notice the bullet point at the end. This character is #$B7 in the Windows 1252 code page, and #$00B7 in Unicode. It was stored in PerlRegEx.pas as a single byte $B7.

The problem is that the byte $B7 doesn't represent a bullet point in all Windows code pages. In both Chinese code pages, $B7 is a lead byte. It cannot occur on its own. The character that follows it is interpreted as a trail byte. $27 (ASCII code for the single quote) is not a valid trail byte. The conversion of $B7$27 in code page 936 to Unicode fails. This causes the Delphi compiler to substitute a question mark and emit warning W1063. It further causes the code to fail to compile, because the last item in the case statement is now an unclosed string that begins with a question mark.
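
The byte-level behavior described above can be reproduced outside Delphi. The following is a minimal Python 3 sketch, not a model of what the Delphi compiler does internally: it decodes the two bytes $B7 $27 with Python's own cp1252 and cp936 codecs. The variable names are illustrative only.

```python
# The end of the case label as stored on disk: $B7 followed by a single quote ($27).
raw = b"\xb7'"

# Code page 1252 (Western European): every byte maps to exactly one character.
western = raw.decode("cp1252")
print(repr(western))  # the character U+00B7 followed by the quote

# Code page 936 (Simplified Chinese): $B7 is a lead byte, and $27 is not a valid trail byte.
try:
    raw.decode("cp936")
    chinese_ok = True
except UnicodeDecodeError as exc:
    chinese_ok = False
    print("cp936 decode failed:", exc.reason)
```

Python raises an error where the Delphi compiler substitutes a question mark, but the cause is the same: the byte pair has no meaning in code page 936.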

To summarize, the problem is that I used a non-ASCII character in a Delphi source code file with the file format (text encoding) set to ANSI. This will always lead to trouble when your source code is used by developers who use a different system code page. The non-ASCII characters will be interpreted differently when those developers compile your code, at best causing your application to display the wrong characters, at worst causing it to fail to compile.
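
One way to catch this before releasing code is to scan ANSI source files for bytes above $7F. The sketch below is a hypothetical helper, not part of Delphi or TPerlRegEx; the function name `find_non_ascii` and the sample fragment are made up for illustration. It is only meaningful for files saved as ANSI, since UTF-8 files legitimately contain such bytes.

```python
def find_non_ascii(data: bytes):
    """Return (line, column, byte) for every byte above $7F; line and column are 1-based."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits


# Hypothetical two-line source fragment, encoded the way an ANSI (cp1252) file stores it.
sample = "case Ch of\n  '[', '{', '\u00b7': Inc(I);\n".encode("cp1252")
hits = find_non_ascii(sample)
for line_no, col, byte in hits:
    print(f"line {line_no}, column {col}: ${byte:02X}")
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function.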

If you have only a few non-ASCII characters in your source code, you can easily replace them with their numeric representation. I fixed PerlRegEx.pas by replacing the literal bullet character with #$00B7. The two zeros are important. In Delphi 2009, #$B7 is interpreted as an AnsiChar, according to the code page used to compile the source code. #$00B7 is interpreted as a WideChar. While these happen to represent the same character in all single-byte Windows code pages, except the one for Thai, this certainly isn't a rule. E.g. #$80 is the euro symbol in most single-byte Windows code pages (the Cyrillic code page 1251 is one exception), while #$0080 is a control character. #$20AC is the euro symbol in Unicode.
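
The difference between a byte value and a Unicode code point can be checked with a short Python 3 sketch. It relies on Python's built-in code page tables, which are assumed here to match the Windows ones for these bytes; the selection of code pages is only a sample.

```python
import unicodedata

# $B7 decoded with several single-byte Windows code pages.
for codepage in ("cp1250", "cp1251", "cp1252", "cp1253", "cp874"):
    ch = b"\xb7".decode(codepage)
    print(codepage, f"U+{ord(ch):04X}", unicodedata.name(ch))

# $80 as an ANSI byte in code page 1252 versus the Unicode code point U+0080.
euro_from_ansi = b"\x80".decode("cp1252")
control_char = chr(0x0080)
print(f"U+{ord(euro_from_ansi):04X}", unicodedata.name(euro_from_ansi))
print(f"U+{ord(control_char):04X}", "category", unicodedata.category(control_char))
```

Only cp874, the Thai code page, maps $B7 to something other than U+00B7 in this sample. The byte $80 in code page 1252 becomes U+20AC, while the code point U+0080 falls in the control character category (Cc).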

In Delphi 2007 and prior, #$0080 and #$80 are the same. They are both interpreted as the Ansi character #$80.

If you have a lot of non-ASCII text in your source code, typing in the text directly is the only workable solution. To avoid code page issues, right-click on your source code in the Delphi 2009 IDE, and select File Format. Then choose UTF-8 or Little Endian UCS-2. The latter is the encoding used by WideString and UnicodeString. (I'm glossing over the difference between UCS-2 and UTF-16. The Delphi IDE uses UCS-2, i.e. "surrogates" don't work.)
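
To see why a Unicode file format removes the ambiguity, it helps to look at the bytes each encoding produces. This is a minimal Python 3 sketch; it says nothing about how the Delphi IDE writes files internally, and the G clef is just an arbitrary example of a character outside the Basic Multilingual Plane.

```python
middle_dot = "\u00b7"

utf8_bytes = middle_dot.encode("utf-8")       # two bytes: C2 B7
ucs2_bytes = middle_dot.encode("utf-16-le")   # two bytes: B7 00
print(utf8_bytes.hex(" "), "|", ucs2_bytes.hex(" "))

# A character outside the Basic Multilingual Plane needs a surrogate pair in UTF-16.
# This is the case that plain UCS-2 cannot represent.
outside_bmp = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
pair = outside_bmp.encode("utf-16-le")
print(pair.hex(" "))
```

Neither byte sequence depends on the system code page, which is the point of choosing one of these formats.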

Besides saving your file as Unicode, selecting one of these options also makes Delphi add a byte order mark to the start of your file. For UCS-2/UTF-16 this appears as two bytes FF FE in a hex editor. For UTF-8, the byte order mark is three bytes EF BB BF. When another developer opens this file in the Delphi 2009 IDE or a text editor that supports Unicode, it will detect the byte order mark, and automatically interpret the file's encoding correctly. That way, there will be no surprises caused by different language settings in Windows.
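
Detecting the encoding from those leading bytes takes only a few lines. The following Python 3 sketch uses a hypothetical helper named `detect_bom` and handles only the two Unicode formats mentioned above; it is an illustration of the idea, not the detection logic of the Delphi IDE or of any particular editor.

```python
import codecs


def detect_bom(data: bytes) -> str:
    """Name the encoding implied by a leading byte order mark, or 'none' if there is none."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    return "none"


source = "'\u00b7':"
as_utf8 = codecs.BOM_UTF8 + source.encode("utf-8")
as_ucs2 = codecs.BOM_UTF16_LE + source.encode("utf-16-le")
as_ansi = source.encode("cp1252")  # ANSI files carry no marker at all

for label, data in (("UTF-8", as_utf8), ("UCS-2 LE", as_ucs2), ("ANSI", as_ansi)):
    print(label, data[:3].hex(" "), detect_bom(data))
```

The ANSI variant is the one a reader has to guess at, which is exactly the situation that caused the compile error.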

3 Comments

  1. Is there a disadvantage to saving my source code in UTF-8 format?

    Comment by Django Dunn — Monday, 29 December 2008 @ 1:16

  2. The only reason not to use UTF-8 is if you want your code to compile with Delphi 7 or earlier too. Then Ansi is your only choice. The File Format context menu item does not exist in Delphi 7.

    Comment by Jan Goyvaerts — Monday, 29 December 2008 @ 12:44

  3. Is there a way to change the default source encoding for Delphi source?
    I use Delphi 2009 and I would like UTF-8 encoding for all new Delphi source files.
    I know I can use right-click / File Format / UTF-8, but an option in the IDE would be welcome.

    Comment by Oliver — Friday, 25 February 2011 @ 16:20
