Micro-ISV.asia

Saturday, 23 August 2008

Using RawByteString Effectively

Filed under: Programming — Jan Goyvaerts @ 9:31

Two weeks ago I blogged about the new RawByteString type in Delphi 2009. The main purpose of RawByteString is to use it as a method parameter that preserves the string’s encoding. Without RawByteString, you’d have to define an overload for the method for each and every kind of AnsiString.

Suppose we have these variables:

var
  A: AnsiString;
  F: UTF8String;

A := 'Tiburón';
F := 'Tiburón';

And these procedures:

procedure SaveWithSystemCodePage(const S: AnsiString);
begin
  BlockWrite(SomeFile, S[1], Length(S));
end;

procedure SaveWithUTF8(const S: UTF8String);
begin
  BlockWrite(SomeFile, S[1], Length(S));
end;

procedure SaveWithAnyCodePage(const S: RawByteString);
begin
  BlockWrite(SomeFile, S[1], Length(S));
end;

Though the three procedures have the same implementation, their actual behavior is different. The declared parameter types tell the compiler which code page conversions to do at the call site. On a computer with the “default language for non-Unicode applications” set to English, the default code page is 1252 for Western European languages. The two variables and three procedures can be combined in six ways:

  • SaveWithSystemCodePage(A): Saves the text in A encoded in CP 1252. No implicit conversion.
  • SaveWithUTF8(A): Saves the text in A encoded in UTF-8. Compiler does implicit CP1252->UTF-8 conversion.
  • SaveWithAnyCodePage(A): Saves the text in A encoded in CP 1252. No implicit conversion.
  • SaveWithSystemCodePage(F): Saves the text in F encoded in CP 1252. Compiler does implicit UTF-8->CP1252 conversion, which may cause data loss. If CP1252 cannot represent certain characters in the string, they will be saved as question marks.
  • SaveWithUTF8(F): Saves the text in F encoded in UTF-8. No implicit conversion.
  • SaveWithAnyCodePage(F): Saves the text in F encoded in UTF-8. No implicit conversion.

Essentially, RawByteString disables implicit conversion. It simply takes over the payload of any typed AnsiString you assign to it.
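
To see this at the byte level, a small console program can dump what each variable actually holds. This is only a minimal sketch and not code from the procedures above: DumpBytes is a hypothetical helper, and the result depends on your default code page and on the source file being saved in an encoding the compiler recognizes.

```pascal
program DumpBytesDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the payload code page followed by the raw
// bytes in hex. Because the parameter is a RawByteString, no code page
// conversion happens at the call site.
function DumpBytes(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := IntToStr(StringCodePage(S)) + ':';
  for I := 1 to Length(S) do
    Result := Result + ' ' + IntToHex(Ord(S[I]), 2);
end;

var
  A: AnsiString;
  F: UTF8String;
begin
  A := 'Tiburón';
  F := 'Tiburón';
  Writeln(DumpBytes(A)); // 7 bytes when the default code page is 1252
  Writeln(DumpBytes(F)); // 8 bytes: the ó takes two bytes in UTF-8
end.
```

Declaring the parameter as AnsiString or UTF8String instead would make both calls print the same bytes, since one of the two arguments would be converted before DumpBytes ever sees it.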

RawByteString (and AnsiString too) stores the code page in the string’s payload. If you have this procedure:

procedure ShowCodePage(const S: RawByteString);
begin
  Form1.Caption := IntToStr(StringCodePage(S));
end;

Then ShowCodePage(A) will display 1252, and ShowCodePage(F) will display 65001. The StringCodePage function gets the code page information stored in the string at runtime, not the declared code page, which is 65535 for RawByteString. This means that StringCodePage is only meaningful when you call it on a string variable that actually holds a string. An empty string has no payload, so StringCodePage cannot retrieve a code page from it and always returns the default code page instead.
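
The empty-string case is easy to check for yourself. The following is a rough sketch rather than anything authoritative: EmptyReportsDefault is a hypothetical helper, and it assumes the RTL exposes the default code page through the DefaultSystemCodePage variable in the System unit.

```pascal
program EmptyCodePageDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: reports whether StringCodePage falls back to the
// default code page when asked about an empty UTF8String.
function EmptyReportsDefault: Boolean;
var
  F: UTF8String;
begin
  F := '';
  // The variable is declared as UTF-8, but an empty string carries no
  // payload, so there is no stored code page to read.
  Result := StringCodePage(F) = DefaultSystemCodePage;
end;

var
  F: UTF8String;
begin
  F := 'Tiburón';
  Writeln(StringCodePage(F));    // read from the payload of a non-empty string
  Writeln(EmptyReportsDefault);  // expected TRUE, per the behavior described above
end.
```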

Because each string remembers its own code page, the following procedures are mostly redundant:

procedure ShowWithSystemCodePage(const S: AnsiString);
begin
  Form1.Caption := S;
end;

procedure ShowWithUTF8(const S: UTF8String);
begin
  Form1.Caption := S;
end;

procedure ShowWithAnyCodePage(const S: RawByteString);
begin
  Form1.Caption := S;
end;

They’re redundant, because the implicit conversion to UnicodeString (TForm.Caption) uses the code page stored in the string’s payload, rather than the declared code page. ShowWithAnyCodePage(A) converts from CP1252 to UTF-16 inside the procedure, while ShowWithAnyCodePage(F) converts from UTF-8 to UTF-16. RawByteString disables implicit conversions from one code page to another, but not from or to Unicode.

I say “mostly” redundant, because ShowWithSystemCodePage still does the implicit conversion to CP1252 before the procedure is called. Data loss can still occur. If you call ShowWithSystemCodePage(F) and your UTF-8 string holds characters that can’t be represented in the default code page, those will be shown as question marks. The subsequent CP1252->UTF-16 conversion inside the procedure has no way to magically resurrect the original characters. This is what makes the RawByteString type so useful: it allows strings in any code page to pass through unchanged.
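
The data loss can be demonstrated without a form. In this sketch, ViaAnsiString and ViaRawByteString are hypothetical helpers with identical bodies, and the test character is an arbitrary choice that CP1252 cannot represent. The results assume a computer whose default code page is 1252.

```pascal
program PassThroughDemo;

{$APPTYPE CONSOLE}

// Hypothetical helpers: identical bodies, only the parameter type differs.
function ViaAnsiString(const S: AnsiString): UnicodeString;
begin
  Result := UnicodeString(S);
end;

function ViaRawByteString(const S: RawByteString): UnicodeString;
begin
  Result := UnicodeString(S);
end;

var
  U: UnicodeString;
  F: UTF8String;
begin
  U := #$0416;         // CYRILLIC CAPITAL LETTER ZHE, not part of CP1252
  F := UTF8String(U);  // UTF-16 to UTF-8, lossless
  // Expected FALSE on a CP1252 system: the character was replaced by a
  // question mark when F was converted for the AnsiString parameter.
  Writeln(ViaAnsiString(F) = U);
  // Expected TRUE: the UTF-8 payload reaches the function unchanged.
  Writeln(ViaRawByteString(F) = U);
end.
```

On a computer whose default code page does include the character (1251 in this case), both lines should print TRUE, which is exactly why this kind of bug tends to go unnoticed on the developer's own machine.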

Since 65535 is not a valid code page, assigning literals to RawByteString variables is tricky. If you do:

var
  R: RawByteString;
R := 'Tiburón';
Form1.Caption := IntToStr(StringCodePage(R));

The caption will indicate 1252, or whichever is the default code page on your computer. The code page for literal strings is determined at compile time. For RawByteString and AnsiString, this is the code page you set in the compiler options. This is set to zero by default, which tells the compiler to use your computer’s default Ansi code page. For typed AnsiStrings such as UTF8String, the compiler is smart enough to use the declared code page of the string (e.g. UTF-8).

A potential problem occurs with code such as this:

var
  F: UTF8String;
  R: RawByteString;
F := 'Tiburón';
R := F;
F := F + 'Líterâl';
R := R + 'Líterâl';
Label1.Caption := F;
Label2.Caption := R;

The first caption will be correct, but the second will not be. When appending a literal to UTF8String, the compiler stores the literal as UTF-8 at compile time. The concatenation is done without any conversions. The compiler assumes the payload of F matches its declared code page (UTF-8).

This fails with RawByteString, because the compiler can’t assume the correct payload. It will assume the code page for AnsiString set at compile time. This means you end up with a RawByteString in which ‘Tiburón’ is encoded as UTF-8 and ‘Líterâl’ as CP1252. Obviously, the implicit conversion to UTF-16 for the label’s caption can’t get this one right.

This code does proper literal RawByteString concatenation that assumes no prior knowledge of the actual code page used by the RawByteString variable:

var
  F: UTF8String;
  R, R2: RawByteString;
F := 'Tiburón';
R := F;
R2 := 'Líterâl';
SetCodePage(R2, StringCodePage(R), True);
R := R + R2;
Label2.Caption := R;

The result of this code is that R holds a string properly encoded as UTF-8, which properly converts to UTF-16 when assigned to the Caption property.
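
If you need this in more than one place, the technique can be wrapped in a small routine. This is a sketch under assumptions, not a tested library routine: AppendConverted is a hypothetical name, and it relies on the default code page being able to represent the characters in the literal in the first place.

```pascal
program AppendDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper wrapping the SetCodePage technique. Suffix is passed
// by value so that SetCodePage can modify the local copy.
procedure AppendConverted(var Dest: RawByteString; Suffix: RawByteString);
begin
  // An empty Dest has no payload code page to match, so leave Suffix as is.
  if Dest <> '' then
    SetCodePage(Suffix, StringCodePage(Dest), True);
  Dest := Dest + Suffix;
end;

var
  F: UTF8String;
  R: RawByteString;
begin
  F := 'Tiburón';
  R := F;
  AppendConverted(R, 'Líterâl');
  // 17 if both parts ended up as UTF-8 (8 bytes + 9 bytes).
  Writeln(Length(R));
end.
```

Passing the literal as a RawByteString argument means it is still encoded with the compile-time code page first, so characters outside that code page are lost before SetCodePage can do anything about it.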

5 Comments

  1. I just cannot believe this insanity. Instead of making things more simple Borland (sorry, Embarcadero) invents one more string type.
    There are now (only those I can remember)
    string
    string[100]
    AnsiString
    UTF8String
    WideString
    RawByteString

    Where can I get the stuff they are smoking?

    Comment by al — Tuesday, 16 December 2008 @ 17:05

  2. They have been listening to developers such as myself who lobbied for those string types, in particular the typed AnsiStrings which include UTF8String and RawByteString.

    A lot of Delphi developers will get by just using “string”, and can safely ignore everything else. But for those developing applications that are heavy on text, all the various string types are a blessing.

    Comment by Jan Goyvaerts — Tuesday, 16 December 2008 @ 18:48

  3. […] Using RawByteString Effectively – Jan Goyvaerts […]

    Pingback by Tiburon - Unicode e RawByteString | Cesar Romero — Wednesday, 15 April 2009 @ 23:49

  4. Look at this and tell me why???

    RAD2010:

    var
    rs1 :RawByteString;
    b :byte;
    begin
    b := 208;
    rs1 := char(b);
    now stop here and look for length(rs1) and value of rs1[1]

    We have:
    Length(rs1) == 1
    rs1[1] == 63!!! (not 208)
    //b=203 is for test purposes only. It’s equal Russian ‘к’ in win1251 (that set in the system by default on XP Prof SP3 eng)

    You can find the same thing inside Indy TIdURI.URLDecode, where the same happens on Result := Result + Char(CharCode);

    Comment by Lucefer — Monday, 16 August 2010 @ 21:32

  5. In Delphi 2010, Char is the same as WideChar. RawByteString is a string of AnsiChars without a predetermined code page. You cannot assign a WideChar to a RawByteString and expect the compiler to know what you mean. When you cast the byte 208 to WideChar you get #$00D0 (latin capital letter eth) regardless of your computer’s code page. When you try to assign this to an AnsiString, the compiler tries to convert the Unicode character to your computer’s default code page. #$00D0 cannot be represented in code page 1251 and thus becomes a question mark.

    Try this:

    SetLength(rs1, 1);
    SetCodePage(rs1, 1251, False);
    rs1[1] := AnsiChar(b);

    I did not check the Indy sources, but if they cast a Byte to WideChar and then concatenate to RawByteString, that is a bug.

    Comment by Jan Goyvaerts — Tuesday, 17 August 2010 @ 12:15
