Character encoding confusion

HxD will extend character encoding support, and I am looking for the best way to name character encodings. So far, you can only pick between the following four to affect the text display in the editor window:

  • Windows (ANSI)
  • DOS/IBM-PC (OEM)
  • Macintosh
  • EBCDIC

Additionally, in the Search window, Unicode (UCS-2LE) can be selected using a checkbox to override the current editor window encoding. I’d like the character encoding selection to be more uniform, flexible, and clear in the future.

Clearing up historical terms and names

There has been a long history of confusion regarding character encodings and related terms. I will only briefly cover some important points; the Wikipedia article linked above and W3C’s “Character encodings: Essential concepts” explain them in more detail.

Character sets vs. character encodings

The word character has two meanings: the abstract concept of a character and its concrete representation.

Humans commonly represent characters using handwriting, print, or speech, while computers use numbers (ultimately electrical signals) to represent them. Encoding is just another name for representation. All of these encodings differ, yet they all refer to the same abstract concept of a character. For example, whether you say A, write A, or encode A as the number 65 (its value in the US-ASCII encoding), you always refer to the abstract character A, regardless of whether the medium is sound, drawing, electricity, or anything else.

A character set is a set of characters in the abstract sense: its elements are neither ordered nor represented in any fixed way. In a computer, on the other hand, the elements of a set can only be stored in some order and encoded in a concrete, specific way.

Computers do not store abstract character sets, but concrete character encodings.
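As a small illustration of this distinction, the following Python sketch encodes the same abstract character in three different ways. The codec names (`ascii`, `cp500`, `utf-16-le`) are the ones Python's standard library uses, not names taken from HxD.

```python
# One abstract character, three concrete representations.
character = "A"

ascii_bytes = character.encode("ascii")      # US-ASCII: 0x41 (decimal 65)
ebcdic_bytes = character.encode("cp500")     # EBCDIC 500: 0xC1
utf16_bytes = character.encode("utf-16-le")  # UTF-16, little endian: 0x41 0x00

for name, data in [("US-ASCII", ascii_bytes),
                   ("EBCDIC 500", ebcdic_bytes),
                   ("UTF-16LE", utf16_bytes)]:
    # Print each representation as hex bytes, as a hex editor would show them.
    print(f"{name:<10} {data.hex(' ')}")

# Decoding with the matching encoding always leads back to the same character.
assert ebcdic_bytes.decode("cp500") == character
```

The bytes differ in every case, but each of them stands for the same abstract character.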

Misnamed character encodings

Another confusion is due to misnamed character encodings:

Originally, Windows code page 1252, the code page commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft. That draft eventually became ISO 8859-1, but Windows code page 1252 was implemented before the standard became final, and is not exactly the same as ISO 8859-1.

Windows code page 1252 and related code pages (code page is the term Windows uses for character encoding) are based on the ISO/IEC 8859 standard. Due to their history, various names refer to the same encodings: ANSI, Windows-<code page> such as Windows-1252, or simply cp-<code page> such as cp-1252. Even more confusingly, Windows calls Windows-1252 “ANSI – Latin I” in user interfaces. All of these names mean the same thing, but the ANSI and Latin I names are especially misleading because they suggest conformance to an ANSI standard or to the ISO 8859-1 encoding (which is also called Latin-1).
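To make the difference between Windows-1252 and ISO 8859-1 concrete, here is a minimal Python sketch. It relies on Python's codec names `cp1252` and `latin-1`; the two encodings agree on most bytes and disagree in the range 0x80 to 0x9F.

```python
# Byte 0x80 lies in the range where the two encodings disagree;
# byte 0xE9 lies in the range where they agree.
data = bytes([0x80, 0xE9])

as_windows_1252 = data.decode("cp1252")  # euro sign, then e with acute accent
as_iso_8859_1 = data.decode("latin-1")   # control character U+0080, then e with acute accent

print(repr(as_windows_1252))
print(repr(as_iso_8859_1))
```

A program that labels Windows-1252 data as “Latin-1” will therefore display most Western European text correctly and still get characters such as the euro sign wrong.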

IANA has a pretty complete overview of character encodings that clears up some of this confusion, and lists alias names but also defines standard names that are not ambiguous.

Character encoding categories

As hinted above, some character encoding names do not refer to a fixed encoding but to an encoding category.

For example, “ANSI” under Windows usually stands for the standard system code page. The actual code page/encoding varies depending on the Windows language and regional settings (which can be set in the control panel). On a Western Windows “ANSI” will mean “Windows-1252”, while on a Russian Windows “ANSI” stands for “Windows-1251”.
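The effect of this can be sketched in Python by decoding the same bytes once as Windows-1251 and once as Windows-1252, which is roughly what happens when a file written on a Russian Windows is opened as “ANSI” on a Western one. The sample word is only an illustration.

```python
# Bytes as a Russian Windows would write them when saving as "ANSI".
raw = "Привет".encode("cp1251")

# The same bytes interpreted under two different meanings of "ANSI".
on_russian_windows = raw.decode("cp1251")
on_western_windows = raw.decode("cp1252")

print(raw.hex(" "))
print(on_russian_windows)
print(on_western_windows)
```

The bytes never change; only the encoding that “ANSI” currently stands for does.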

It may add to the confusion that there are alternative names for “ANSI” such as “Windows” or “CP_ACP”.

Similar to the ANSI default encoding, Windows also defines default encodings for the categories of DOS/IBM-PC and Macintosh encodings.

The default “DOS/IBM-PC” encoding is also called “OEM” or “CP_OEMCP”, and the default “Macintosh” encoding has the alias “CP_MACCP”.
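As a rough sketch of how a program can find out what these defaults currently resolve to, the Python function below asks Windows for the active ANSI and OEM code pages through the Win32 functions GetACP and GetOEMCP. The function name `system_code_pages` is made up for this example, and on other operating systems, where no such setting exists, it simply returns None for both.

```python
import ctypes
import sys


def system_code_pages():
    """Return the code pages currently behind the "ANSI" and "OEM" defaults.

    Hypothetical helper for illustration. On non-Windows systems both
    values are None because the setting does not exist there.
    """
    if sys.platform != "win32":
        return {"ANSI": None, "OEM": None}
    kernel32 = ctypes.windll.kernel32
    # GetACP and GetOEMCP take no arguments and return the code page number.
    return {"ANSI": kernel32.GetACP(), "OEM": kernel32.GetOEMCP()}


pages = system_code_pages()
for category, code_page in pages.items():
    print(category, code_page)
```

The returned numbers depend on the regional settings of the machine, which is exactly why these names describe a category and not one fixed encoding.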

Terms and names in HxD

Following the reasoning so far, character encoding is the proper term and will also be adopted in HxD instead of character set.

Many programs, including HxD in the past (see the introduction), do not distinguish sufficiently between character encodings that have a fixed meaning and those that represent a temporary selection from a character encoding category.

The following labeling should clear this up:

  • Encodings
    • System variables
      • Windows/“ANSI”
      • DOS/IBM-PC/OEM
      • Macintosh

When hovering over any item of this list in HxD, a hint will show the actual encoding currently set in Windows for this encoding variable. So, on an unmodified Western Windows, hovering over Windows/“ANSI” will show this hint: “Current encoding: ANSI – Latin I (Windows-1252)”.

The EBCDIC encoding in HxD version 1.7.7.0 is not a generic name, just shorthand, and always stands for the encoding “EBCDIC 500” (called “IBM EBCDIC – International” by Microsoft). It will be renamed accordingly in the coming version.

Another consideration is finding names for the encodings that have a fixed meaning. They should be as user friendly as possible and would need to be translated. That might prove difficult for translators (which technical parts should remain unchanged and which should be translated?) and also tedious.

An overview of encoding naming conventions in user interfaces (UIs) would help, but there seems to be none. Based on the sampling I have done, I found several strategies for creating lists of encodings for UIs:

  • Raw list of encodings
    • List of encodings returned by Windows
    • Subset of the list of encodings returned by Windows
    • Application provided (library like libiconv or custom code)
  • Ordering
    • Order encodings by language and/or region
    • Order by technical encoding categories (like origin system or inventor)
  • Naming (in the user interface)
    • Technical names
      • MIME or IANA names
      • Code page names (Windows)
    • Simplified technical names
    • User friendly names (language or region name disregarding technical names)
    • A mix of the above
      • For example: names provided by Windows (intended for the user interface)

The degree of technicality of the names presented in the user interface varies. Firefox, for example, removes almost all technical information and keeps only the language name or region and a generic reference to a standard (“ISO” or “Windows”). This is not precise enough for users of a hex editor who want to pick an encoding with certainty. On the other hand, I want to use encoding names that are common and popular, so that people recognize something they saw in the past without having rigorously studied encodings. This means the names should include common aliases.

Windows does a pretty good job of naming encodings that way when they are more exotic, but for common ones it just removes this information. For example, “ANSI – Cyrillic” misses any clear indication that it is Windows specific. Given all the ambiguity there has been regarding encodings, I prefer to add to the encoding name provided by Windows the technical name as defined by IANA: “ANSI – Cyrillic (Windows-1251)”.
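The naming rule described here is simple enough to sketch in a few lines of Python. The function name `display_name` and the two sample entries are only illustrative; a real list would come from the operating system and from a table of IANA names.

```python
def display_name(windows_name: str, iana_name: str) -> str:
    """Append the IANA name in parentheses to the name reported by Windows."""
    return f"{windows_name} ({iana_name})"


# Sample data for illustration only.
encodings = [
    ("ANSI – Latin I", "Windows-1252"),
    ("ANSI – Cyrillic", "Windows-1251"),
]

labels = [display_name(windows, iana) for windows, iana in encodings]
for label in labels:
    print(label)
```

Keeping the Windows-provided part untouched and only appending to it means the rule does not depend on the exact wording or translation of that part.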

Sometimes the encoding names returned by Windows already include code page references, e.g., “IBM EBCDIC – Germany (20273 + Euro)”. If the reference to the Windows-specific code page 20273 were replaced by a reference to the proper EBCDIC encoding, the name would be clearer: “IBM EBCDIC – Germany (EBCDIC 273 + Euro)”. As the exact wording may change with translations or Windows versions, I cannot safely modify the strings to match the desired output.

Conclusion

Except for the referencing issue, this should give names that are recognizable to ordinary users (if they saw them in Windows in the past) and also to technical users, thanks to the additional information appended to the name provided by Windows.

On top of that, encodings will be ordered by language/region and by technical encoding category: Windows, DOS/IBM-PC, and Macintosh (the categories mentioned above), and possibly further categories such as ISO 8859, EBCDIC, Unicode, etc. That way, both types of interest can be met: finding a possible encoding for a known language, or selecting the exact technical encoding.
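One possible way to support both orderings from a single list is sketched below in Python. The `Encoding` record, the `group_by` helper, and the four sample entries are assumptions made for this example and do not describe how HxD stores its encodings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    label: str     # name shown in the user interface
    region: str    # language/region group
    category: str  # technical encoding category


# Sample entries for illustration only.
ENCODINGS = [
    Encoding("ANSI – Latin I (Windows-1252)", "Western European", "Windows"),
    Encoding("ANSI – Cyrillic (Windows-1251)", "Cyrillic", "Windows"),
    Encoding("ISO 8859-1", "Western European", "ISO 8859"),
    Encoding("ISO 8859-5", "Cyrillic", "ISO 8859"),
]


def group_by(encodings, key):
    """Group encoding labels by the given attribute ('region' or 'category')."""
    groups = defaultdict(list)
    for encoding in encodings:
        groups[getattr(encoding, key)].append(encoding.label)
    return dict(groups)


by_region = group_by(ENCODINGS, "region")
by_category = group_by(ENCODINGS, "category")
print(by_region)
print(by_category)
```

Each encoding appears once in the data and once in each view, so a user can start from a language or from a technical category and arrive at the same entry.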

7 responses on “Character encoding confusion”

  1. wall of wolf street

    Though HxD is quite mature already, it’s still great to hear that it’s being maintained. I’ve always loved the clean look of HxD and it remains my favorite hex editor. Thanks Maël!

  2. AzzaAzza69

    I am trying to draw the ASCII characters onto a canvas and I can’t get the characters 1-31 to appear like they do in HxD…
    I have tried both TextOut and DrawText (in case there was a difference) but to no avail. Can you offer some help?
    Thanks.

    Example code:
    var
      nLoop, nX: Integer;
      oRect: TRect;
      nCW: Integer;
      sStr: String;
    begin
      with Image1.Canvas do begin
        Font.Name := 'Courier New';
        Font.Size := 14;
        // Font.Charset := ANSI_CHARSET; // Windows
        Font.Charset := OEM_CHARSET; // DOS/IBM-PC (OEM)
        nCW := TextWidth('W'); // width of one character cell
        oRect := Rect(0, 0, Image1.Width, Image1.Height);
        for nLoop := 1 to 31 do begin
          sStr := Char(nLoop);
          DrawText(Handle, PChar(sStr), 1, oRect, DT_SINGLELINE or DT_NOPREFIX);
          TextOut(oRect.Left, 50, sStr);
          Inc(oRect.Left, nCW); // advance to the next character cell
        end;
      end;
    end;

    1. mael Post author

      Font.Charset is ignored when you use the Unicode WinAPI functions. If you have Delphi 2009 or newer, TextOut and DrawText map to TextOutW and DrawTextW.

      You could use TextOutA and DrawTextA instead. But I suggest you rather work with Unicode strings as much as possible to avoid unexpected conversion errors (because some parts of your program or WinAPIs assume codepage x while others assume codepage y).

      Using Unicode, when a font is missing some of the glyphs you want to display, Windows will automatically search for a font that contains them. The result will be a mix of glyphs from different fonts, but only where necessary. This is also what browsers do.

      So I suggest using MultiByteToWideChar with the option MB_USEGLYPHCHARS, and then using TextOutW, ExtTextOutW, DrawTextW, or any other Unicode WinAPI function to print the text.

      You can also use a wrapper class defined in SysUtils that makes the conversion a little easier by avoiding P*Chars:
      encoding := TMBCSEncoding.Create(CodePage, MBToWCharFlags, WCharToMBFlags); where all three parameters are Integers and MBToWCharFlags takes the flags mentioned in the link above.
      Then you can convert simply by calling unicodestring := encoding.GetString(bytes);

      1. AzzaAzza69

        Thank you ever so much. With your pointers, I have successfully managed to output the ASCII characters.
        Here’s the example modified (in case it helps someone else) for Delphi 2007:

        var
          nLoop: Integer;
          oRect: TRect;
          sStr: String;
          nSizeW: Integer;
          sStrW: WideString;
        begin
          with Image1.Canvas do begin
            Font.Name := 'Courier New';
            Font.Size := 14;
            oRect := Rect(0, 0, Image1.Width, Image1.Height);
            sStr := '';
            for nLoop := 1 to 31 do
              sStr := sStr + Chr(nLoop);

            // First call: calculate the required size of the wide string
            nSizeW := MultiByteToWideChar(CP_OEMCP, MB_USEGLYPHCHARS, PChar(sStr), Length(sStr), nil, 0);
            SetLength(sStrW, nSizeW);
            // Second call: get the Unicode representation of the OEM characters
            MultiByteToWideChar(CP_OEMCP, MB_USEGLYPHCHARS, PChar(sStr), Length(sStr), PWideChar(sStrW), nSizeW);

            DrawTextW(Handle, PWideChar(sStrW), -1, oRect, DT_SINGLELINE or DT_NOPREFIX); // don't translate CR/LF/&
          end;
        end;

  3. Rob

    So when can we expect a new version of HxD? Looking forward eagerly to an update of one of my favorite programs!

  4. Ficus strangulensis

    Hello, I noticed you mentioned art might be in your blog. I use y’r HxD and one of my hobbies is Mail Art [although I can’t draw]. Intro/background for mail art at IUOMA online. It’s like penpals with art trading.

    ANYway…

    I would be glad to send you an example of mail art if you’ll share with me a snail address. [Mine is posted on IUOMA in the name of my artistic _Nom De Plume_ Ficus strangulensis.]

    Y’r [new] ol’ Bud,

    Fike
