Monday, 6 October 2008

Don’t Get Hung Up on Milliseconds

Filed under: Programming — Jan Goyvaerts @ 16:36

In my past few posts I’ve talked about string performance in Delphi 2009. You may have noticed that I did not include any benchmark code or timings, even though I could have easily called AnsiLowerCase() a million times in a loop on a variable declared as “string”, and declared Delphi 2009 the hands down winner. I didn’t and I won’t, because unless your application actually calls AnsiLowerCase() a million times in a loop, the result is meaningless.

Suppose you have two routines A and B in your application. You’ve benchmarked them by calling them a thousand times in a loop, and routine A takes 100 milliseconds per call, while routine B takes one millisecond. Which one should you optimize? You’re on a deadline, so you’re looking for real performance gains, not to merely gold-plate or show off your coding skills. Without real performance gains, your time is better spent adding new features.

The only correct answer is that you don’t know. You’ve run an isolated benchmark, not a real-world benchmark. Real-world benchmarks are run with a profiler on your actual application, not on some dummy test code in a tight loop. In a tight loop, your routine might fit into CPU cache, while your real application might be thrashing to disk, meaning you need to fix memory usage before worrying about CPU ticks.
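As a rough illustration of the difference, the sketch below profiles a whole (toy) workload instead of timing one routine in a tight loop. It is written in Python only because its standard library ships a profiler. The names routine_a, routine_b and handle_request are hypothetical stand-ins, not code from any real application. For a Delphi application you would run a Delphi profiler against your own executable.

```python
import cProfile
import pstats


def routine_b(text):
    # Cheap routine that gets called very often.
    return text.lower()


def routine_a(text):
    # More expensive routine that itself leans on routine_b.
    return "".join(routine_b(ch) for ch in text)


def handle_request(text):
    # Stand-in for real application code that mixes both routines.
    routine_a(text)
    for _ in range(5):
        routine_b(text)


def profile_workload():
    # Profile the workload as a whole, not one routine in isolation.
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(10):
        handle_request("Hello World")
    profiler.disable()
    return pstats.Stats(profiler)


def call_counts(stats):
    # Map function name -> total number of calls the profiler recorded.
    return {func[2]: entry[1] for func, entry in stats.stats.items()}


stats = profile_workload()
counts = call_counts(stats)
print(counts["routine_a"], counts["routine_b"])

# Sorting by cumulative time shows who is responsible for the time spent,
# including time spent in callees.
stats.sort_stats("cumulative").print_stats(5)
```

The call counts and the caller relationships are what the isolated loop cannot tell you: here most calls to routine_b come from routine_a, which matters for deciding what to optimize.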

Let’s say your real-world benchmark shows routine A does take 100 milliseconds, and is called 1000 times, for a total of 100 seconds. Routine B is called 50,000 times, for a total of 50 seconds. What does this tell us? Nothing!

Are routines A and B independent? If they are, you need to optimize the one that makes the user wait. If routine A runs in a background thread that never blocks the main thread for more than a few milliseconds, but routine C, which calls routine B 200 times, routinely causes the application to become unresponsive for 0.2 seconds during keyboard input or mouse action, it’s routine B that needs to be optimized. Though routine A is the CPU hog, it’s routine B that annoys the user by causing the application to respond in a jerky manner.
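The arithmetic behind this example can be spelled out in a few lines. This is only a worked calculation using the figures from the text, written in Python for convenience; the constant and function names are made up for the illustration.

```python
# Figures from the example in the text; per-call times in milliseconds.
MS_PER_CALL_A = 100
MS_PER_CALL_B = 1
CALLS_A = 1_000
CALLS_B = 50_000
B_CALLS_PER_C = 200  # routine C calls routine B this many times


def total_seconds(ms_per_call, calls):
    # Total time a profiler would attribute to a routine, in seconds.
    return ms_per_call * calls / 1000


total_a = total_seconds(MS_PER_CALL_A, CALLS_A)        # whole-run total for A
total_b = total_seconds(MS_PER_CALL_B, CALLS_B)        # whole-run total for B
stall_c = total_seconds(MS_PER_CALL_B, B_CALLS_PER_C)  # one UI stall caused by C

print(f"A total: {total_a} s, B total: {total_b} s, stall per call to C: {stall_c} s")
```

The totals rank A above B, but the 0.2 second stall is the only one of the three numbers the user actually feels.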

If A and B are not independent, you need to check how they call each other. If 25,000 of the 50,000 calls to routine B are made by routine A, making routine B twice as fast will cut 12.5 seconds off routine A’s execution time as well. But the final decision still depends on whether the user waits on routine A, or on the other 25,000 calls to routine B. And the solution may be to add another background thread to take advantage of multi-core CPUs, even if the added synchronization code makes your routines slower when called in a loop. Using 10% more CPU time is fine if it enables you to use two cores instead of one.
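The two trade-offs in this paragraph can be checked with a short calculation. The sketch is in Python for convenience and makes two assumptions that are not in the text: that the extra thread is applied to routine A’s 100 seconds of work, and that the work splits evenly across both cores, which real code rarely achieves.

```python
def seconds_saved(calls, ms_per_call, speedup):
    # Time saved, in seconds, when a routine becomes `speedup` times faster.
    old_ms = calls * ms_per_call
    new_ms = old_ms / speedup
    return (old_ms - new_ms) / 1000


def wall_clock_seconds(cpu_seconds, overhead_percent, cores):
    # Idealized: total CPU time grows by the overhead, then splits evenly.
    return cpu_seconds * (100 + overhead_percent) / 100 / cores


# 25,000 of the 50,000 calls to B come from A; B takes 1 ms per call.
saved_inside_a = seconds_saved(25_000, 1, 2)
saved_elsewhere = seconds_saved(25_000, 1, 2)

# Assumed workload: routine A's 100 seconds of CPU time.
one_core = wall_clock_seconds(100, 0, 1)
two_cores = wall_clock_seconds(100, 10, 2)

print(f"saved inside A: {saved_inside_a} s, saved elsewhere: {saved_elsewhere} s")
print(f"one core: {one_core} s, two cores with 10% overhead: {two_cores} s")
```

Under these assumptions the threaded version burns 110 seconds of CPU time yet finishes in 55 seconds of wall-clock time, which is why a slower result in a single-threaded loop benchmark can still be the right choice.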

But real-world benchmarking is still only part of the story. Think about code maintenance. Is it worth speeding up a routine, if doing so turns it into a complicated monster that even you’ll have a hard time understanding when you have to change it in version 2.0? Think about portability. Hand-coded assembler is cool, until you want to run your app on a different CPU architecture, or as managed code. Think about features. So what if Unicode did make your app twice as big and twice as slow? If the son of Japanese immigrants marries the daughter of Greek immigrants, wouldn’t it be cool if they could write both their names in their native scripts in your genealogy application?

Jolyon Smith posted a comment to one of my previous articles with a link to his own Delphi 2009 String Performance article. (By the way: I’m cool with people posting comments mentioning their own posts, if they’re related to mine.) I have several issues with his test results.

1. Artificial benchmarks are largely meaningless. The fact that he posted a follow-up article that gives different results on the same tests, because he made some changes to his benchmarking framework, only proves my point. Real-world applications don’t flush the CPU cache. The performance of any string test is irrelevant if the routine is only used a few times. It becomes relevant when the routine is called over and over, mixed with other code, by a larger routine. If a subroutine responds well to CPU caching, that will improve real-world performance.

2. A few of his tests are false. He explains the low speed of IntToStr() when using AnsiStrings in Delphi 2009 by the fact that IntToStr() returns a UnicodeString. There’s no AnsiString overload, because Delphi doesn’t support overloads that differ only in their return type. For StringReplace(), labeled Replace in the chart, he’s using the UnicodeString version from SysUtils for all three of the string, AnsiString and WideString tests. Delphi 2009 does have an AnsiString overload for StringReplace(), but it’s in a separate, new unit called AnsiStrings (plural).

3. He doesn’t compare UnicodeString in Delphi 2009 with WideString in previous versions. This is actually the most important test, because this is where you’re comparing the same functionality. WideString is the only Unicode string type in Delphi 2007 and earlier. UnicodeString is the preferred Unicode string type in Delphi 2009, which still has WideString for compatibility with COM.

4. Related to #1 and #3, his code uses short (46 characters) ASCII-only strings. What about longer strings? What about really long strings? What about strings that use all characters in the default code page? What about strings that use characters that can’t be represented in the default code page? Some routines may have heavy overhead that dominates on short strings, while other routines may be very inefficient dealing with long strings. E.g. UnicodeString is orders of magnitude (!) faster than WideString when dealing with very long strings (like a million characters). Code page conversions can be affected by whether there’s actually anything to convert. You should benchmark your actual code on actual data.
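One way to act on point 4 is to run the same routine over inputs that differ in length and content, rather than over one short ASCII string. The sketch below is in Python purely for illustration. The function routine_under_test, the sample strings and the choice of cp1252 as the “default code page” are all assumptions, and the timings it prints say nothing about Delphi.

```python
import timeit

CODE_PAGE = "cp1252"  # assumed default code page for this illustration


def fits_code_page(text, code_page=CODE_PAGE):
    # True if every character can be represented in the code page.
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def routine_under_test(text):
    # Hypothetical stand-in for the string routine you care about.
    return text.lower()


SAMPLES = {
    "short ASCII": "The quick brown fox jumps over the lazy dog",
    "long ASCII": "x" * 1_000_000,
    "inside code page": "Café Zoë naïve façade",
    "outside code page": "Ελένη 山田",
}


def time_samples(repeat=3):
    # Best-of-`repeat` timing of a single call for each kind of input.
    results = {}
    for label, text in SAMPLES.items():
        timings = timeit.repeat(lambda: routine_under_test(text),
                                number=1, repeat=repeat)
        results[label] = (len(text), fits_code_page(text), min(timings))
    return results


for label, (length, fits, seconds) in time_samples().items():
    print(f"{label:18} length={length:>9} fits={fits!s:5} {seconds:.6f} s")
```

Even a matrix like this is still an artificial benchmark. Its value is in exposing which input properties a routine is sensitive to, so you know what to look for when profiling your actual code on actual data.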

Read Jolyon’s article the way I intend my articles on performance to be read: they give you an idea of general issues. Don’t go by the numbers. For most Delphi applications, moving to Delphi 2009 and Unicode will have no significant performance impact. Jolyon’s numbers are all very close to 100%, even the ones comparing AnsiString to UnicodeString. And you don’t have to field complaints any more from customers who saw their foreign language text vanish into a forest of question marks!

In his comment, Jolyon also links to a video by respected Delphi author Marco Cantù. Marco observes that calling the GetUserName() Win32 API in a tight loop takes about as long in Delphi 2007 as in 2009, while SetWindowText() takes about 3 times as long. The idea is to test the speed difference in calling Ansi or Wide versions. I don’t know where to begin! There are just too many unknowns.

No application calls SetWindowText() 1000 times in a loop to set the form’s caption. And certainly not to set it to the same text over and over. Does Windows actually set the caption 1000 times, or does it set it once, and check 999 times that the caption hasn’t changed? The Delphi 2007 window is an Ansi window, while the Delphi 2009 window is Unicode. Does the Win32 API treat such windows differently, or does that only determine how messages such as WM_CHAR are sent? Are Unicode window captions displayed with Uniscribe? Uniscribe supports correct rendering of complex scripts. Do Ansi windows simply convert their text to Unicode, or do they call ExtTextOutA() directly, bypassing Uniscribe? Is Application.MainFormOnTaskBar true? If it is, what’s the additional impact of updating the taskbar button caption along with the window caption? Does SetWindowText() wait for the text to be displayed, meaning it is affected by graphics card performance or VMware performance, or does it return immediately after writing the string to a buffer and posting a message somewhere?

In my article discussing the potential speed benefits of using the native Win32 string type, I didn’t say every API call was going to be significantly faster, or that every application would run faster. Most calls will show no impact, because the extra string conversion done by the “A” version of the call takes hardly any time compared with the actual work of the call, particularly if that call involves the display, file I/O or network I/O. But the speed benefit is real when doing string manipulation with the Win32 API. And Delphi developers do that more than they think. Functions with “Ansi” in their name, such as AnsiLowerCase() and AnsiCompareText(), are all thin wrappers around Win32 API calls. If your application does serious string manipulation involving those calls, the gain in speed from the W versions of the calls will outweigh the extra effort of having to move twice as much data around when you migrate from AnsiString to UnicodeString.


  1. 1. Artificial benchmarks are largely meaningless.

    Yes they are, but it depends on what you are trying to achieve. If the intent is to benchmark Delphi 2009 string performance in absolute terms, it is meaningless. But if comparing on a like-for-like basis with other Delphi versions then RELATIVE benchmarking is perfectly valid.

    i.e. “all other things being equal”.

    The changes to the benchmarking mechanism were made to ensure that the playing field was properly leveled for all tests, i.e. specifically to ensure that all other things were as equal as possible. The only variable in the tests was the compiler involved in each case.

    This I think was borne out by the final results which largely confirmed the expected results (an overall slight performance penalty in string manipulation deriving from the move from ANSI to Unicode, with some unsurprising significant penalties in specific cases), whereas the initial test results contained some odd anomalies.

    2. A few of his tests are false.

    The lack of an IntToStr overload for ANSI is of course explained by the fact that you cannot overload by return type, but that does NOT explain why there is no IntToStrA(), or ANSIIntToStr(). i.e. if you have code that needs to perform an IntToStr() when working with ANSI strings you are required to either use a wholly different alternative to IntToStr() or accept the Unicode-ANSI conversion overhead.

    That’s not a false test, it’s a demonstrative test highlighting a potential performance penalty that you may experience in practice *IF* you have performance critical code involving IntToStr() that is required to continue working in ANSIString terms.

    3. He doesn’t compare UnicodeString in Delphi 2009 with WideString in previous versions.

    No, and I explained why not in reply to comments suggesting that I do so.

    We KNOW that WideString is and was slow. We KNOW that UnicodeString incorporates Wide Char support AND performance improvements (relative to WideString) deriving from reference counting among other things.

    Benchmarking those differences WOULD be FALSE, because it would tell us nothing that we don’t already know.

    It also is largely irrelevant to most people using Delphi, who will not be using WideString commonly in their code, especially not performance critical code (typically only in COM exposures where performance is not likely a concern in any event).

    My benchmarking exercise was intended to either confirm or lay to rest the fears of anyone currently working on ANSI code, considering the move to Delphi 2009 and the unilateral switch to Unicode that that would necessarily entail unless they take specific steps to circumvent the Unicode shift inherent in using that compiler.

    i.e. if I take Project “X” currently in Delphi 7/2007 and recompile in Delphi 2009, what performance penalty – if any – am I likely to suffer?

    If Unicode support is paramount and performance not a concern, then this is likely to be of mild curiosity only. If in any particular case you don’t want and don’t need Unicode, but performance is critical, then the RELATIVE performance of “String” handling is likely to be of significant interest.

    You yourself advocate testing real world data.

    I would go further and advocate real-world *usage*.

    Strings passing to/fro across the WinAPI boundary are RARELY cause for performance concern – it is internal string manipulations NOT involving the WinAPI that are primarily the focus of performance concerns and strategies.

    e.g. a visitor to my blog contacted me with a MASSIVE performance problem he was experiencing (in Delphi 2009) in processing ANSI text files. The code had largely been converted correctly, but was inadvertently using the UnicodeString version of PosEx(). I pointed out this oversight and with the fix in place a routine that was taking 2 seconds to simply read a 3MB file was improved to the point that the same file could now be read in less than 1 ms (in my correspondent’s real world case the files were many 100’s of MB in size).

    No amount of improvement in the WinAPI boundary was going to help offset the penalty arising from that inappropriate use of an RTL Unicode string routine.

    For PosEx() there is an ANSI string implementation. In the case of IntToStr() there is no immediate alternative.

    That’s a key oversight in the RTL and something that could prove hugely important to some developers.

    But back to WideString…..

    In *my* real world there is no WIDEString in performance critical code (by definition) – there is only ANSIString, therefore the only relevant “real world” test is to compare UnicodeString with ANSIString.

    Having said that, if I recall correctly, the results data DOES include results of WideString testing from D7 and D2007, so if that is important to you you could draw those comparisons yourself.

    I do not pretend to represent a universal view that applies to all. I credit visitors to my blog with sufficient intelligence to know what does or does not apply to their own situation.

    By providing the source to my tests and the results, anyone and everyone is able to examine the tests and apply their own judgements as to whether my tests are a) valid and/or b) even apply to their particular circumstances, and even setting aside the tests themselves, can decide for themselves whether the conclusions I draw from the results make sense to them in the light of the actual data.

    They are perfectly free to make those decisions themselves, and disagree with my own conclusions as they relate to me, if they wish.

    It’s one thing to find points of disagreement on individual conclusions, it’s quite another to say that because your individual circumstances are different from those being tested, that the tests themselves are inherently flawed.

    And it’s certainly not appropriate to apply general theory of benchmarking strategies and (very correct) concerns about obsession with milliseconds to specific tests intended and designed to compare and contrast RELATIVE performance under conditions designed specifically to test one specific variable – the code generated by specific versions of a compiler.

    Comment by Jolyon Smith — Tuesday, 7 October 2008 @ 3:24

  2. Jolyon,

    I said that “I have several issues with [your] test results”. I didn’t say your article is complete rubbish, or not worth reading. If I did think that, I wouldn’t write about it at all.

    Comment by Jan Goyvaerts — Tuesday, 7 October 2008 @ 13:56

  3. Jan, the specific comments of yours that I responded to gave the distinct impression that you considered my benchmarking exercise to be if not worthless then certainly of little/no value.

    You described it as an artificial benchmark and largely meaningless and proceeded to pick apart aspects of the methodology (aspects of which specifically do not focus on milliseconds – quite the contrary).

    Flushing the CPU cache will artificially reduce performance compared to that seen in the “real world” – given the title of your post I might be forgiven for being disappointed in not being PRAISED for the methodology.


    You stated that “some” of the tests are false, but then only mention one, which isn’t in fact “false” at all but specifically and deliberately relevant to real world scenarios.

    You stated that “the most important test” was not conducted, which again was actually an irrelevant test when considering the most common real world scenarios (and in any event *was* conducted, and test results provided, I just didn’t discuss the results of those specific comparisons because they are not generally going to be of interest).

    If that over-stated your view of my post then I can only say that your actual view didn’t come across very well and I felt I needed to explain why your criticisms on those specific points were misplaced.

    But I’m glad we managed to clarify all those points.

    Comment by Jolyon Smith — Wednesday, 8 October 2008 @ 2:24
