Needless String Checks with EnsureUnicodeString
Suppose you have the following code in Delphi 2009:
    procedure TForm1.AddLengthTwice(const S: UnicodeString);
    begin
      Inc(Len, Length(S));
      Inc(Len, Length(S));
    end;
What would you expect the disassembly to look like? Something quick and efficient like this?
    Unit1.pas.28: Inc(Len, Length(S));
      mov ebx,edx
      mov ecx,ebx
      test ecx,ecx
      jz $00461976
      sub ecx,$04
      mov ecx,[ecx]
      add [eax+$00000378],ecx
    Unit1.pas.29: Inc(Len, Length(S));
      mov edx,ebx
      test edx,edx
      jz $00461987
      sub edx,$04
      mov edx,[edx]
      add [eax+$00000378],edx
Or something like this, with a bunch of extra checks and a function call thrown in for good measure:
    Unit1.pas.28: Inc(Len, Length(S));
      mov eax,[ebp-$04]
      test eax,eax
      jz $00461e3d
      mov edx,eax
      sub edx,$0a
      cmp word ptr [edx],$02
      jz $00461e3d
      lea eax,[ebp-$04]
      mov edx,[ebp-$04]
      call @InternalUStrFromLStr
      test eax,eax
      jz $00461e46
      sub eax,$04
      mov eax,[eax]
      add [ebx+$00000378],eax
    Unit1.pas.29: Inc(Len, Length(S));
      mov eax,[ebp-$04]
      test eax,eax
      jz $00461e69
      mov edx,eax
      sub edx,$0a
      cmp word ptr [edx],$02
      jz $00461e69
      lea eax,[ebp-$04]
      mov edx,[ebp-$04]
      call @InternalUStrFromLStr
      test eax,eax
      jz $00461e72
      sub eax,$04
      mov eax,[eax]
      add [ebx+$00000378],eax
I copied and pasted both the above disassemblies from the CPU view in Delphi 2009. The difference is one compiler setting. Unfortunately, the default option is the one that produces the bloated compiled code.
If you select Project|Options in Delphi 2009, you’ll immediately notice the compiler options section had quite a makeover since Delphi 2007. Instead of checkboxes, there’s now a list of options, with a drop-down list for each option. That makes it much easier to see how build configurations (new in Delphi 2007) impact your compiler settings.
The offending option is “String format checking”. It’s the last option in the first category, labeled “Code generation”. In your Delphi code, you can use {$STRINGCHECKS ON} or {$STRINGCHECKS OFF} to control this option. It’s on by default. If we set “Code inlining control” to “Off”, the disassembly shows the real difference in the code generated by “String format checking”. With both options off, we get:
    Unit1.pas.28: Inc(Len, Length(S));
      mov eax,[ebp-$08]
      call @UStrLen
      mov edx,[ebp-$04]
      add [edx+$00000378],eax
    Unit1.pas.29: Inc(Len, Length(S));
      mov eax,[ebp-$08]
      call @UStrLen
      mov edx,[ebp-$04]
      add [edx+$00000378],eax
Turning off code inlining, and turning on string format checking:
    Unit1.pas.28: Inc(Len, Length(S));
      lea eax,[ebp-$08]
      call @EnsureUnicodeString
      call @UStrLen
      mov edx,[ebp-$04]
      add [edx+$00000378],eax
    Unit1.pas.29: Inc(Len, Length(S));
      lea eax,[ebp-$08]
      call @EnsureUnicodeString
      call @UStrLen
      mov edx,[ebp-$04]
      add [edx+$00000378],eax
{$STRINGCHECKS ON} adds an extra call to EnsureUnicodeString each time your procedure does something with a string that it received as a parameter. This call is defined as _EnsureUnicodeString in System.pas. There’s also _EnsureAnsiString, which is used in the same way for AnsiString.
String format checking is new in Delphi 2009. These two calls check whether a variable declared as UnicodeString or AnsiString really holds a UnicodeString or AnsiString. If a UnicodeString is found to have an Ansi payload, it is converted to Unicode. If an AnsiString is found to have a Unicode payload, it is converted to Ansi.
Technically, UnicodeString and AnsiString are the same type. UnicodeString has an element size of 2 (WideChar), and AnsiString has an element size of 1 (AnsiChar). UnicodeString supports only one encoding (UTF-16 LE), while AnsiString supports all encodings supported by WideCharToMultiByte(). While the bytes storing the text are different, the string record that is stored in memory immediately before the string has the same format for both UnicodeString and AnsiString. This implementation makes a lot of sense, because it allows low-level routines to work on both string types, with no or minimal changes between Unicode and Ansi versions, saving CodeGear’s compiler and RTL teams a lot of work.
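To make the shared header a little more concrete, here is a minimal sketch that asks the RTL for the element size and code page of one string of each type. It assumes the System routines StringElementSize and StringCodePage, which as far as I know were added to the RTL in Delphi 2009. The program name and variable names are made up for the illustration.

```pascal
program StringHeaderSketch; // hypothetical name, illustration only

{$APPTYPE CONSOLE}

var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'abc';
  A := 'abc';

  // Element size comes from the string header: 2 for WideChar, 1 for AnsiChar.
  Writeln('UnicodeString element size: ', StringElementSize(U));
  Writeln('AnsiString element size:    ', StringElementSize(A));

  // The code page sits in the same header. For the AnsiString it depends
  // on the system's default Ansi code page, so no value is assumed here.
  Writeln('UnicodeString code page:    ', StringCodePage(U));
  Writeln('AnsiString code page:       ', StringCodePage(A));
end.
```

Both variables are queried through the same kind of call, which is the point: the header format is identical and only the values stored in it differ.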
So why are these string checks needed? Delphi is a strongly typed language. The compiler should be able to keep track of two different string types, regardless of implementation details. In fact, it does. If you don’t deal in black hat pointer arithmetic or consort with the bad assembly code your mother warned you about, the Delphi compiler will always keep your string payloads correct, no matter what mixture of UnicodeString or AnsiString you concoct. It even handles typed AnsiString with different code pages (such as UTF8String) perfectly well.
If you’re a Delphi developer, turn off “String format checking” and don’t look back. You don’t need it. It only bloats your code.
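If you would rather switch it off in source than in the project options, a sketch along these lines should do. The directive is the one named above. The program name and the global Len counter are stand-ins I made up for the form field in the original example.

```pascal
program StringChecksOffSketch; // hypothetical name, illustration only

{$APPTYPE CONSOLE}
{$STRINGCHECKS OFF} // ask the compiler not to emit string format checks in this file

var
  Len: Integer = 0; // stand-in for the TForm1.Len field used earlier

procedure AddLengthTwice(const S: UnicodeString);
begin
  Inc(Len, Length(S));
  Inc(Len, Length(S));
end;

begin
  AddLengthTwice('Delphi');
  Writeln(Len);
end.
```

Setting the option in Project|Options covers the whole project at once, so the directive is mostly useful when you only control a few units.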
So why does this option exist? It exists for C++Builder developers. Unlike Delphi, C++Builder does not always pass strings with the correct payload (Unicode vs. Ansi) to function calls. Specifically, C++Builder 2009 allows mismatched strings in event handlers.
In the VCL, there are a bunch of event handlers that have a parameter declared as string. In Delphi 2007, string is an alias for AnsiString. In Delphi 2009, it is an alias for UnicodeString. In Delphi, however, this change doesn’t matter. When you double-click on the event in the Object Inspector to create an event handler for the event, your event handler will declare the same parameter as string, whether you use Delphi 2007 or 2009. When you migrate this code from Delphi 2007 to Delphi 2009, your event handler automatically upgrades to UnicodeString, and it continues to work correctly. No changes to your code or DFM files are needed.
In C++Builder 2007, double-clicking this event handler in the Object Inspector results in an event handler that declares the parameter as AnsiString. In C++Builder 2009, the parameter is declared as UnicodeString. C++Builder does not have the equivalent of a “string” alias as Delphi does. This isn’t a problem if you use C++Builder 2007 or 2009 exclusively. The problem occurs when you migrate code from C++Builder 2007 to 2009.
When migrating code from C++Builder 2007 to 2009, AnsiString event handler parameters have to be changed to UnicodeString. Apparently, CodeGear thinks it would be too much to ask of C++Builder developers to make this change, even though the IDE could do the heavy lifting via some sort of wizard. Instead, CodeGear has decided to allow C++Builder to compile and link code that assigns event handlers using AnsiString parameters to events that expect a UnicodeString parameter (and vice versa, though that hardly matters as no old code could use UnicodeString). The burden is put on the Delphi code to coerce the string it received from the C++ side.
Instead of a one-time job for C++Builder developers to update their event handlers, all Delphi and C++Builder code is now stuck converting strings at runtime with each and every function call. Delphi programmers can turn it off in their own code, but the .dcu files supplied by CodeGear are compiled with the option turned on. So how bad is the runtime penalty?
If you’re using only Delphi, you’ll never get mismatched strings. With code inlining enabled, the string checks only add three assembly statements that are actually executed:
    sub edx,$0a
    cmp word ptr [edx],$02
    jz $00461e3d
This essentially checks if the string element size is 2 (WideChar), skipping the InternalUStrFromLStr call because in Delphi it always is for a UnicodeString parameter. Delphi developers shouldn’t lose sleep over this.
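Written out in Pascal, the inlined check amounts to roughly the following. This is only a sketch of what the disassembly above does, not the RTL’s own code: the function name is hypothetical, and the offset of $0A bytes before the character data is taken straight from the `sub edx,$0a` instruction in the 32-bit Delphi 2009 listing.

```pascal
program ElemSizeCheckSketch; // hypothetical name, illustration only

{$APPTYPE CONSOLE}

// Hypothetical helper mirroring the inlined test: a nil (empty) string
// passes, otherwise the Word stored $0A bytes before the first character
// must equal 2, the element size of a UnicodeString.
function HasUnicodePayload(const S: UnicodeString): Boolean;
var
  P: PAnsiChar;
begin
  P := Pointer(S);                 // address of the first character, or nil
  Result := (P = nil) or (PWord(P - $0A)^ = 2);
end;

var
  U: UnicodeString;
begin
  U := 'abc';
  Writeln(HasUnicodePayload(U));
end.
```

This is exactly the kind of header poking you should not need in real code. It is here only to show how little work the check does when the payload already matches.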
Ironically, while this “feature” is designed to make the lives of C++Builder developers easier, it’s the C++Builder developers who are stuck with the runtime penalty. If an event expects a UnicodeString but your event handler is declared with an AnsiString parameter, your code will end up doing lots of needless Unicode->Ansi and Ansi->Unicode conversions.
I don’t agree with the decision CodeGear has made in this respect, but I understand where it comes from. Upgrading from previous versions of Delphi or C++Builder has always been rather straightforward. If you have all the source code to your components, it usually all just works after a recompile. Delphi 2009 is the biggest change since Delphi 2. Delphi 1 was 16-bit, and Delphi 2 was 32-bit. C++Builder was 32-bit from the start (released after Delphi 2). Delphi 2009 and C++Builder 2009 are all Unicode, while previous releases were all Ansi. Upgrading to Delphi 2009 and C++Builder 2009 is more involved than upgrading to previous releases. By allowing mismatched event handlers, C++Builder 2009 developers have one less hurdle to get their old code to compile, resulting in a better out-of-the-box or trial download experience.
But I think that ultimately, it’s the wrong decision. While EnsureUnicodeString can convert from Ansi to Unicode, it cannot magically restore characters that couldn’t be represented in the AnsiString. If you want your C++Builder application to actually support Unicode, you have to update the event handlers anyway.
None of this should stop you from upgrading to Delphi 2009 or C++Builder 2009. Certainly not if you actually want the Unicode support. But it is important to be aware of the issues.
I need to optimize my Delphi 2009 string handling as much as possible. Thank you for this superb article about the performance penalties of that undocumented String Format Checking option, and the assurance that there is no harm in turning it off.
Comment by Louis Kessler — Monday, 20 October 2008 @ 0:31
Thanks!! Just turned it off and forgotten as a nightmare!
Comment by Net — Thursday, 7 January 2010 @ 17:31
[…] the first Delphi version to produce Unicode applications. When it was released I ranted about the needless strings checks that it did. Those were a bit of a hack to allow C++Builder developers to port their applications […]
Pingback by $STRINGCHECKS Gone in Delphi XE - Micro-ISV.asia — Friday, 17 September 2010 @ 8:10