Micro-ISV.asia

Wednesday, 1 October 2008

Needless String Checks with EnsureUnicodeString

Filed under: Programming — Jan Goyvaerts @ 17:56

Suppose you have the following code in Delphi 2009:

procedure TForm1.AddLengthTwice(const S: UnicodeString);
begin
  Inc(Len, Length(S));
  Inc(Len, Length(S));
end;

What would you expect the disassembly to look like? Something quick and efficient like this?

Unit1.pas.28: Inc(Len, Length(S));
mov ebx,edx
mov ecx,ebx
test ecx,ecx
jz $00461976
sub ecx,$04
mov ecx,[ecx]
add [eax+$00000378],ecx
Unit1.pas.29: Inc(Len, Length(S));
mov edx,ebx
test edx,edx
jz $00461987
sub edx,$04
mov edx,[edx]
add [eax+$00000378],edx

Or something like this, with a bunch of extra checks and a function call thrown in for good measure:

Unit1.pas.28: Inc(Len, Length(S));
mov eax,[ebp-$04]
test eax,eax
jz $00461e3d
mov edx,eax
sub edx,$0a
cmp word ptr [edx],$02
jz $00461e3d
lea eax,[ebp-$04]
mov edx,[ebp-$04]
call @InternalUStrFromLStr
test eax,eax
jz $00461e46
sub eax,$04
mov eax,[eax]
add [ebx+$00000378],eax
Unit1.pas.29: Inc(Len, Length(S));
mov eax,[ebp-$04]
test eax,eax
jz $00461e69
mov edx,eax
sub edx,$0a
cmp word ptr [edx],$02
jz $00461e69
lea eax,[ebp-$04]
mov edx,[ebp-$04]
call @InternalUStrFromLStr
test eax,eax
jz $00461e72
sub eax,$04
mov eax,[eax]
add [ebx+$00000378],eax

I copied and pasted both the above disassemblies from the CPU view in Delphi 2009. The difference is one compiler setting. Unfortunately, the default option is the one that produces the bloated compiled code.

If you select Project|Options in Delphi 2009, you’ll immediately notice that the compiler options section has had quite a makeover since Delphi 2007. Instead of checkboxes, there’s now a list of options, with a drop-down list for each option. That makes it much easier to see how build configurations (new in Delphi 2007) impact your compiler settings.

The offending option is “String format checking”. It’s the last option in the first category, labeled “Code generation”. In your Delphi code, you can use {$STRINGCHECKS ON} or {$STRINGCHECKS OFF} to control this option. It’s on by default. If we set “Code inlining control” to “Off”, the disassembly shows the real difference in the code generated by “String format checking”. With both options off, we get:

Unit1.pas.28: Inc(Len, Length(S));
mov eax,[ebp-$08]
call @UStrLen
mov edx,[ebp-$04]
add [edx+$00000378],eax
Unit1.pas.29: Inc(Len, Length(S));
mov eax,[ebp-$08]
call @UStrLen
mov edx,[ebp-$04]
add [edx+$00000378],eax

Turning off code inlining, and turning on string format checking:

Unit1.pas.28: Inc(Len, Length(S));
lea eax,[ebp-$08]
call @EnsureUnicodeString
call @UStrLen
mov edx,[ebp-$04]
add [edx+$00000378],eax
Unit1.pas.29: Inc(Len, Length(S));
lea eax,[ebp-$08]
call @EnsureUnicodeString
call @UStrLen
mov edx,[ebp-$04]
add [edx+$00000378],eax

$STRINGCHECKS ON adds an extra call to EnsureUnicodeString each time your procedure does something with a string that it received as a parameter. This call is defined as _EnsureUnicodeString in System.pas. There’s also _EnsureAnsiString which is used in the same way for AnsiString.

String format checking is new in Delphi 2009. These two calls check whether a variable declared as UnicodeString or AnsiString really holds a UnicodeString or AnsiString. If a UnicodeString is found to have an Ansi payload, it is converted to Unicode. If an AnsiString is found to have a Unicode payload, it is converted to Ansi.

Technically, UnicodeString and AnsiString are the same type. UnicodeString has an element size of 2 (WideChar), and AnsiString has an element size of 1 (AnsiChar). UnicodeString supports only one encoding (UTF-16 LE), while AnsiString supports all encodings supported by WideCharToMultiByte(). While the bytes storing the text are different, the string record that is stored in memory immediately before the string has the same format for both UnicodeString and AnsiString. This implementation makes a lot of sense, because it allows low-level routines to work on both string types, with no or minimal changes between Unicode and Ansi versions, saving CodeGear’s compiler and RTL teams a lot of work.
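To make the shared string record a little more concrete, here is a minimal sketch that reads it for both string types. It is not taken from System.pas. TStrHeader and HeaderOf are hypothetical names of my own. The offsets of the element size (-10) and the length (-4) are the ones visible in the disassembly above; the code page and reference count fields are my assumption about what fills the rest of the record.

```delphi
program StringHeaderDemo;

{$APPTYPE CONSOLE}

type
  // Hypothetical mirror of the record stored just before the character data.
  // ElemSize (-10) and Len (-4) match the disassembly; CodePage and RefCount
  // are assumptions about the remaining bytes.
  TStrHeader = packed record
    CodePage: Word;     // assumed, offset -12
    ElemSize: Word;     // offset -10: 2 for UnicodeString, 1 for AnsiString
    RefCount: Longint;  // assumed, offset -8
    Len: Longint;       // offset -4: length in elements
  end;
  PStrHeader = ^TStrHeader;

// Returns the header of a non-empty string, given its data pointer.
function HeaderOf(Data: Pointer): PStrHeader;
begin
  Result := PStrHeader(PAnsiChar(Data) - SizeOf(TStrHeader));
end;

var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'Delphi';
  A := 'Delphi';
  // An empty string is a nil pointer and has no header, so check first.
  if Pointer(U) <> nil then
    Writeln('UnicodeString: element size ', HeaderOf(Pointer(U))^.ElemSize,
      ', length ', HeaderOf(Pointer(U))^.Len);
  if Pointer(A) <> nil then
    Writeln('AnsiString: element size ', HeaderOf(Pointer(A))^.ElemSize,
      ', length ', HeaderOf(Pointer(A))^.Len);
end.
```

If the assumed layout is right, both strings should report a length of 6, with an element size of 2 for the UnicodeString and 1 for the AnsiString. This is exactly the kind of pointer arithmetic you should not do in production code. It is here only to show that one record format serves both types.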

So why are these string checks needed? Delphi is a strongly typed language. The compiler should be able to keep track of two different string types, regardless of implementation details. In fact: it does. If you don’t deal in black hat pointer arithmetic or consort with bad assembly code your mother warned you about, the Delphi compiler will always keep your string payloads correct, no matter what mixture of UnicodeString or AnsiString you concoct. It even handles typed AnsiString with different code pages (such as UTF8String) perfectly well.

If you’re a Delphi developer, turn off “String format checking” and don’t look back. You don’t need it. It only bloats your code.
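If you would rather not depend on the project options, the directive can go in the source itself. The unit below is only a sketch: FastStrings and TotalLength are hypothetical names, and I have put the directive right after the unit header on the assumption that this placement is accepted regardless of the directive's scope.

```delphi
unit FastStrings;

// Turn off the EnsureUnicodeString/EnsureAnsiString checks for this unit.
{$STRINGCHECKS OFF}

interface

// Hypothetical helper, present only so the directive has code to act on.
function TotalLength(const A, B: UnicodeString): Integer;

implementation

function TotalLength(const A, B: UnicodeString): Integer;
begin
  // With string checks off, no extra call precedes each Length() here.
  Result := Length(A) + Length(B);
end;

end.
```

Checking the CPU view after a rebuild is the easiest way to confirm that the calls are really gone from your own units.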

So why does this option exist? It exists for C++Builder developers. Unlike Delphi, C++Builder does not always pass strings with the correct payload (Unicode vs. Ansi) to function calls. Specifically, C++Builder 2009 allows mismatched strings in event handlers.

In the VCL, there are a bunch of event handlers that have a parameter declared as string. In Delphi 2007, string is an alias for AnsiString. In Delphi 2009, it is an alias for UnicodeString. In Delphi, however, this change doesn’t matter. When you double-click on the event in the Object Inspector to create an event handler for the event, your event handler will declare the same parameter as string, whether you use Delphi 2007 or 2009. When you migrate this code from Delphi 2007 to Delphi 2009, your event handler automatically upgrades to UnicodeString, and it continues to work correctly. No changes to your code or DFM files are needed.

In C++Builder 2007, double-clicking this event handler in the Object Inspector results in an event handler that declares the parameter as AnsiString. In C++Builder 2009, the parameter is declared as UnicodeString. C++Builder does not have the equivalent of a “string” alias as Delphi does. This isn’t a problem if you use C++Builder 2007 or 2009 exclusively. The problem occurs when you migrate code from C++Builder 2007 to 2009.

When migrating code from C++Builder 2007 to 2009, AnsiString event handler parameters have to be changed to UnicodeString. Apparently, CodeGear thinks it would be too much to ask of C++Builder developers to make this change, even though the IDE could do the heavy lifting via some sort of wizard. Instead, CodeGear has decided to allow C++Builder to compile and link code that assigns event handlers using AnsiString parameters to events that expect a UnicodeString parameter (and vice versa, though that hardly matters as no old code could use UnicodeString). The burden is put on the Delphi code to coerce the string it received from the C++ side.

Instead of a one-time job for C++Builder developers to update their event handlers, all Delphi and C++Builder code is now stuck converting strings at runtime with each and every function call. Delphi programmers can turn it off in their own code, but the .dcu files supplied by CodeGear are compiled with the option turned on. So how bad is the runtime penalty?

If you’re using only Delphi, you’ll never get mismatched strings. With code inlining enabled, the string checks only add three assembly statements that are actually executed:

sub edx,$0a
cmp word ptr [edx],$02
jz $00461e3d

This essentially checks whether the string element size is 2 (WideChar). If it is, the jump skips the call to InternalUStrFromLStr. In pure Delphi code, the element size of a UnicodeString parameter is always 2, so the conversion is never called. Delphi developers shouldn’t lose sleep over this.
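In Delphi terms, the inlined test amounts to something like the sketch below. HasUnicodePayload is a hypothetical helper of my own, not an RTL routine. It uses StringElementSize from the System unit in place of reading the word at offset -10 directly, and it treats an empty (nil) string as fine, just as the disassembly does with its first test/jz pair.

```delphi
program ElementSizeCheck;

{$APPTYPE CONSOLE}

// Hypothetical helper: True when the inlined check would jump past the
// conversion call, that is, when S is empty or already holds 2-byte elements.
function HasUnicodePayload(const S: UnicodeString): Boolean;
begin
  Result := (Pointer(S) = nil) or (StringElementSize(S) = SizeOf(WideChar));
end;

begin
  Writeln(HasUnicodePayload('Delphi'));
  Writeln(HasUnicodePayload(''));
end.
```

In a program written entirely in Delphi, I would expect this to report True for every string you can throw at it, which is the whole point: the check never finds anything to fix.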

Ironically, while this “feature” is designed to make the lives of C++Builder developers easier, it’s the C++Builder developers who are stuck with the runtime penalty. If an event expects a UnicodeString but your event handler is declared with an AnsiString parameter, your code will end up doing lots of needless Unicode->Ansi and Ansi->Unicode conversions.

I don’t agree with the decision CodeGear has made in this respect, but I understand where it comes from. Upgrading from previous versions of Delphi or C++Builder has always been rather straightforward. If you have all the source code to your components, it usually all just works after a recompile. Delphi 2009 is the biggest change since Delphi 2. Delphi 1 was 16-bit, and Delphi 2 was 32-bit. C++Builder was 32-bit from the start (released after Delphi 2). Delphi 2009 and C++Builder 2009 are all Unicode, while previous releases were all Ansi. Upgrading to Delphi 2009 and C++Builder 2009 is more involved than upgrading to previous releases. By allowing mismatched event handlers, C++Builder 2009 developers have one less hurdle to get their old code to compile, resulting in a better out-of-the-box or trial download experience.

But I think that ultimately, it’s the wrong decision. While EnsureUnicodeString can convert from Ansi to Unicode, it cannot magically restore characters that couldn’t be represented in the AnsiString. If you want your C++Builder application to actually support Unicode, you have to update the event handlers anyway.
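The data loss is easy to reproduce in plain Delphi with an explicit conversion. This sketch does not go through a mismatched event handler; it only shows the same Unicode to Ansi to Unicode round trip. Win1252String is a hypothetical type I declare here so that the result does not depend on the system's default code page.

```delphi
program LossyRoundTrip;

{$APPTYPE CONSOLE}

type
  // AnsiString pinned to Windows-1252, a code page without Chinese characters.
  Win1252String = type AnsiString(1252);

var
  Original, RoundTripped: UnicodeString;
  Narrow: Win1252String;
begin
  Original := #$4E2D#$6587;               // two Chinese characters
  Narrow := Win1252String(Original);      // UTF-16 to Windows-1252
  RoundTripped := UnicodeString(Narrow);  // back to UTF-16
  if RoundTripped = Original then
    Writeln('Round trip preserved the text')
  else
    Writeln('Round trip lost characters');
end.
```

Since Windows-1252 cannot represent these characters, I would expect the second message. Converting back to Unicode cannot recover what the Ansi string never held.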

None of this should stop you from upgrading to Delphi 2009 or C++Builder 2009. Certainly not if you actually want the Unicode support. But it is important to be aware of the issues.

3 Comments

  1. I need to optimize my Delphi 2009 string handling as much as possible. Thank you for this superb article about the performance penalties of that undocumented String Format Checking option, and the assurance that there is no harm in turning it off.

    Comment by Louis Kessler — Monday, 20 October 2008 @ 0:31

  2. Thanks!! Just turned it off and forgotten as a nightmare!

    Comment by Net — Thursday, 7 January 2010 @ 17:31

  3. [...] the first Delphi version to produce Unicode applications. When it was released I ranted about the needless strings checks that it did. Those were a bit of a hack to allow C++Builder developers to port their applications [...]

    Pingback by $STRINGCHECKS Gone in Delphi XE - Micro-ISV.asia — Friday, 17 September 2010 @ 8:10
