Saturday, 9 August 2008

I Like My Bytes Raw

Filed under: Programming — Jan Goyvaerts @ 19:14

In my previous post, I introduced the new typed AnsiStrings in Delphi 2009. These allow you to store text in any code page supported by Windows. String literals are automatically stored in the correct code page at compile time. For conversions that have to happen at runtime, the compiler injects calls to WideCharToMultiByte() and/or MultiByteToWideChar() to make sure your strings hold their data in the encoding that you specified.

But sometimes, the automatic conversions may get in your way. Try this code:

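procedure PrintByteLength(const S: AnsiString);
begin WriteLn(Length(S)); end;  // Length() counts bytes for AnsiString types

var F: UTF8String;
begin
  F := 'Tiburón';
  PrintByteLength(F);
end.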

What does this print? In Delphi 2007, it prints 7. That’s because in Delphi 2007, there’s no difference between UTF8String and AnsiString. It’s all code page 1252 (if your system is set for US or Western Europe). F is passed to PrintByteLength() unchanged.

In Delphi 2009, F stores “Tiburón” encoded as UTF-8, using eight bytes (two bytes for the ó). But the code still prints 7. That’s because PrintByteLength() takes an AnsiString parameter, which expects code page 1252. The Delphi compiler actually inserts code to convert the string from UTF-8 to UTF-16, and then to CP 1252, before passing it to PrintByteLength(). The round trip to UTF-16 is needed because there’s no Win32 API call to convert directly between two non-UTF-16 code pages. The API can only convert from and to UTF-16.
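The same round trip can be reproduced by hand with the two Win32 calls. This is a minimal sketch, not the code the compiler actually generates. Latin1String and Utf8ToLatin1 are hypothetical names introduced for illustration, error handling is omitted, and the source file is assumed to be saved in an encoding the compiler reads correctly.

```delphi
program RoundTripSketch;
{$APPTYPE CONSOLE}

uses
  Windows;

type
  // Hypothetical typed AnsiString for code page 1252, declared for this sketch.
  Latin1String = type AnsiString(1252);

// Converts UTF-8 bytes to code page 1252 by way of UTF-16.
function Utf8ToLatin1(const Source: UTF8String): Latin1String;
var
  Wide: UnicodeString;
  WideLen, ByteLen: Integer;
begin
  Result := '';
  if Length(Source) = 0 then
    Exit;
  // Step 1: UTF-8 to UTF-16. The first call only measures the output.
  WideLen := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(Source),
    Length(Source), nil, 0);
  SetLength(Wide, WideLen);
  MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(Source),
    Length(Source), PWideChar(Wide), WideLen);
  // Step 2: UTF-16 to code page 1252, again measuring first.
  ByteLen := WideCharToMultiByte(1252, 0, PWideChar(Wide), WideLen,
    nil, 0, nil, nil);
  SetLength(Result, ByteLen);
  WideCharToMultiByte(1252, 0, PWideChar(Wide), WideLen,
    PAnsiChar(Result), ByteLen, nil, nil);
end;

var
  F: UTF8String;
begin
  F := 'Tiburón';
  WriteLn(Length(F));                // byte length of the UTF-8 data
  WriteLn(Length(Utf8ToLatin1(F)));  // byte length after the round trip
end.
```

Under those assumptions the two lines should show the 8 and 7 bytes discussed above: the ó shrinks from two bytes to one on its way through UTF-16.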

Normally, these automatic conversions are a good thing. If you go through the trouble of declaring your own EBCDICString type, you want the compiler to give you your EBCDIC goodies. But PrintByteLength() wants to receive the string unchanged, no matter the code page. For this purpose, Delphi 2009 declares yet another variant of AnsiString:

type RawByteString = type AnsiString($FFFF);

65535 is not a valid Windows code page. This number is a special value that tells the Delphi compiler not to do any code page conversions when you assign a typed AnsiString to a RawByteString. If you replace AnsiString with RawByteString in the code sample at the top, it will print 8 when compiled with Delphi 2009.
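To see the difference side by side, here is a small sketch for Delphi 2009 that passes the same UTF8String to both kinds of parameter. The procedure names PrintAnsiLength and PrintRawLength are made up for this example.

```delphi
program RawVersusAnsi;
{$APPTYPE CONSOLE}

// The compiler converts the argument to the ANSI code page at the call site.
procedure PrintAnsiLength(const S: AnsiString);
begin
  WriteLn(Length(S));
end;

// The compiler passes the caller's bytes through without conversion.
procedure PrintRawLength(const S: RawByteString);
begin
  WriteLn(Length(S));
end;

var
  F: UTF8String;
begin
  F := 'Tiburón';
  PrintAnsiLength(F);  // length after conversion to the ANSI code page
  PrintRawLength(F);   // length of the original UTF-8 bytes
end.
```

On a system whose ANSI code page is 1252, the first call should print 7 and the second 8.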

Under the hood, all AnsiString types are the same. They’re an array of bytes, with a few extra fields before the array. In Delphi 2007, those fields are the length and the reference count. Delphi 2009 adds the element size, and the code page. Thus, if you assign a UTF8String to a RawByteString, and assign that to an EBCDICString, the UTF8->EBCDIC conversion will be done correctly upon the second assignment, even though RawByteString has no code page affinity, and no conversion was done upon the first assignment. The reason is that UTF8String stores 65001 in the code page field of the string, which is copied over upon assignment to the RawByteString. Upon conversion to EBCDICString, the source code page is determined by the value stored in the string, rather than the type being converted from.

Each string also stores its element size. This is 1 for all AnsiString types, because SizeOf(AnsiChar) = 1, and 2 for UnicodeString, because SizeOf(WideChar) = 2. Technically, UnicodeString is yet another variety of Delphi’s reference-counted string type.
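Delphi 2009 exposes these header fields through the System routines StringCodePage() and StringElementSize(), so you can inspect them yourself. A minimal sketch, assuming the source file is saved in an encoding the compiler reads correctly:

```delphi
program StringHeaderFields;
{$APPTYPE CONSOLE}

var
  U: UTF8String;
  R: RawByteString;
  W: UnicodeString;
begin
  U := 'Tiburón';
  R := U;  // no conversion: R refers to U's bytes and code page field
  W := 'Tiburón';

  WriteLn(StringCodePage(R));     // code page recorded in the string itself
  WriteLn(StringElementSize(R));  // bytes per element for an AnsiString type
  WriteLn(Length(R));             // number of elements, which here are bytes

  WriteLn(StringCodePage(W));     // code page recorded for the UnicodeString
  WriteLn(StringElementSize(W));  // bytes per element for UnicodeString
  WriteLn(Length(W));             // number of WideChar elements
end.
```

If the description above holds, the RawByteString should report code page 65001, element size 1 and length 8, while the UnicodeString reports element size 2 and length 7.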

The venerable AnsiString type has long been abused for something else: byte storage. I don’t blame anyone. I do it myself! The reference-counted AnsiString type makes it all just too easy. With RawByteString, you don’t need to feel guilty any more.
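As a small illustration of guilt-free byte storage, this sketch fills a RawByteString with arbitrary binary values and dumps them as hex. BytesToHex is a hypothetical helper written for this example, not part of the RTL.

```delphi
program ByteStorage;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: formats every byte in the buffer as two hex digits.
function BytesToHex(const Buffer: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(Buffer) do
    Result := Result + IntToHex(Ord(Buffer[I]), 2);
end;

var
  Buffer: RawByteString;
begin
  // Treat the string purely as a reference-counted byte buffer.
  SetLength(Buffer, 4);
  Buffer[1] := AnsiChar($DE);
  Buffer[2] := AnsiChar($AD);
  Buffer[3] := AnsiChar($BE);
  Buffer[4] := AnsiChar($EF);
  WriteLn(BytesToHex(Buffer));  // prints DEADBEEF
end.
```

Because the parameter is a RawByteString, the helper accepts any AnsiString variety and sees its bytes untouched.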
