Friday, 8 August 2008

Will The Real UTF8String Stand Up?

Filed under: Programming — Jan Goyvaerts @ 18:52

For some time, Delphi has had a little-known type called UTF8String. It was little-known because it didn’t really work as advertised. Try this in Delphi 2007:

var S: UTF8String;
S := 'Tiburón';

Though S is declared as UTF8String, it stores the string using the default Windows code page, instead of UTF-8, with a length of 7 bytes. That’s because in Delphi 2007, you’ll find this declaration in System.pas:

type UTF8String = type string;

This means that in Delphi 2007, there’s really no difference between UTF8String and AnsiString. In Delphi 2009, however, you’ll find this declaration:

type UTF8String = type AnsiString(65001);

65001 is the code page number for UTF-8 on the Windows platform. You can declare your own string types this way using any code page understood by the WideCharToMultiByte() and MultiByteToWideChar() API calls. For example, if you assign a UnicodeString to a UTF8String, WideCharToMultiByte() is called with code page 65001 to convert the string from UTF-16 to UTF-8. This is no different from Delphi 2007 (or 2009) calling WideCharToMultiByte() with code page 0 (the default ANSI code page) when you assign a WideString to an AnsiString.
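To see the conversion on assignment in action, here is a minimal console sketch for Delphi 2009. The type name Win1252String and the program name are made up for this illustration, and it assumes the source file is saved in an encoding the compiler reads correctly (so the ó arrives as a single precomposed character). The compiler may warn about implicit string casts; that is expected here.

```pascal
program CodePageAssignDemo;

{$APPTYPE CONSOLE}

type
  // Hypothetical type name for this sketch; 1252 is the Windows
  // Western European code page.
  Win1252String = type AnsiString(1252);

var
  U: UnicodeString;
  S8: UTF8String;
  S1252: Win1252String;
begin
  U := 'Tiburón';

  // Each assignment converts from UTF-16 to the code page that is
  // part of the destination string type.
  S8 := U;
  S1252 := U;

  Writeln('UTF-16 code units: ', Length(U));
  Writeln('UTF-8 bytes:       ', Length(S8));
  Writeln('CP 1252 bytes:     ', Length(S1252));
end.
```

Under those assumptions, the UTF-8 string should come out one byte longer than the other two, because ó needs two bytes in UTF-8 but only one in code page 1252.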

In Delphi 2009, the code snippet at the top of this post will convert “Tiburón” to UTF-8 at compile time. At runtime, 8 bytes are loaded directly into S. There will be no call to WideCharToMultiByte() at runtime for this literal assignment. The accented ó takes up two bytes when encoded as UTF-8. Length(S) will return 8.
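If you want to check this yourself, a small sketch like the following dumps the bytes that end up in S. It assumes Delphi 2009 and a source file saved in an encoding the compiler recognizes; the program name is arbitrary.

```pascal
program Utf8LiteralDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: UTF8String;
  I: Integer;
begin
  // The literal is stored as UTF-8 because of the declared type of S.
  S := 'Tiburón';

  Writeln('Length(S) = ', Length(S));

  // Indexing a UTF8String yields individual bytes (AnsiChar),
  // not whole characters.
  for I := 1 to Length(S) do
    Write(IntToHex(Ord(S[I]), 2), ' ');
  Writeln;
end.
```

The ó should show up as the two bytes C3 B3, which is the UTF-8 encoding of U+00F3.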

You can easily declare your own typed AnsiStrings in Delphi 2009. If UTF8String is too modern for you, try this:

type EBCDICString = type AnsiString(37);
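A quick way to convince yourself that such a type really stores EBCDIC is to dump its bytes. This is only a sketch: it assumes Delphi 2009 and that Windows has code page 37 installed, and it goes through a UnicodeString variable so that the conversion happens at runtime.

```pascal
program EbcdicDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  EBCDICString = type AnsiString(37);

var
  U: UnicodeString;
  E: EBCDICString;
  I: Integer;
begin
  U := 'ABC';

  // The explicit cast converts from UTF-16 to code page 37.
  E := EBCDICString(U);

  // Print each stored byte in hexadecimal.
  for I := 1 to Length(E) do
    Write(IntToHex(Ord(E[I]), 2), ' ');
  Writeln;
end.
```

In EBCDIC code page 37 the letters A, B and C are encoded as C1, C2 and C3, so those are the bytes to look for rather than the ASCII values 41, 42 and 43.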

