HxD will extend character encoding support, and I am looking for the best way to name character encodings. So far, you can only pick between the following four to affect the text display in the editor window:
- Windows (ANSI)
- DOS/IBM-PC (OEM)
- Macintosh
- EBCDIC
Additionally, in the Search window, Unicode (UCS-2LE) can be selected using a checkbox to override the current editor window encoding. I’d like the character encoding selection to be more uniform, flexible, and clear in future.
Clearing up historical terms and names
There has been a long history of confusion regarding character encodings and related terms. I will only briefly cover some important points; the Wikipedia link above and W3C’s Character encodings: Essential concepts explain this in more detail.
Character sets vs. character encodings
The word character has two meanings: the abstract concept of a character and its concrete representation.
Humans commonly represent characters using handwriting, print, or speech, while computers use numbers (ultimately electricity) to represent characters. Encoding is just another name for representation. All of these encodings differ, yet they all refer to the same abstract concept of characters. For example, whether you say A, write A, or encode A as the number 65 (A is 65 in the US-ASCII encoding), you always refer to the abstract character A, be it through sound, drawing, electricity, or any other encoding.
A character set is a set of characters in the abstract sense: its elements are neither ordered nor represented in a fixed way. Computers, on the other hand, can only store set elements in some order and encoded in a concrete, specific way.
Computers do not store abstract character sets, but concrete character encodings.
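To make the distinction concrete, here is a minimal sketch in Python (my choice for illustration; it has nothing to do with how HxD is implemented). It encodes the same abstract character A with three of the encodings Python ships with and shows that the stored bytes differ each time.

```python
# One abstract character, several concrete representations.
character = "A"

representations = {
    encoding: character.encode(encoding)
    for encoding in ("ascii", "utf-16-le", "cp500")
}

for encoding, encoded in representations.items():
    print(f"{encoding:>10}: {encoded.hex(' ')}")

# Decoding reverses the mapping and yields the same abstract character.
decoded_again = bytes([65]).decode("ascii")
```

The byte values differ, but decoding each of them with the matching encoding gives back the same character.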
Misnamed character encodings
Another confusion is due to misnamed character encodings:
Originally, Windows code page 1252, the code page commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft. That draft eventually became ISO 8859-1, but Windows code page 1252 was implemented before the standard became final, and is not exactly the same as ISO 8859-1.
Windows code page 1252 and related code pages (code page is the term Windows uses for character encoding) are based on the ISO/IEC 8859 standards. Due to their history, various names are in use for the same encodings: ANSI, Windows-<code page> such as Windows-1252, or simply cp-<code page> such as cp-1252. Even more confusingly, Windows calls Windows-1252 “ANSI – Latin I” in user interfaces. All of these names mean the same, but the ANSI and Latin I names are especially misleading, as they suggest conformance to an ANSI standard or to the ISO 8859-1 encoding (which is also called Latin-1).
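The difference between the two encodings is easy to see in practice. The following sketch uses Python's built-in codecs "cp1252" and "latin-1" (Python's names for Windows-1252 and ISO 8859-1) to decode the same bytes. The sample bytes are my own choice; three of them lie in the range 0x80 to 0x9F, where the two encodings disagree.

```python
# Bytes 0x80-0x9F are where Windows-1252 and ISO 8859-1 disagree.
sample = bytes([0x80, 0x93, 0x94, 0xE9])

as_windows_1252 = sample.decode("cp1252")   # printable characters
as_iso_8859_1 = sample.decode("latin-1")    # C1 control codes, then "é"

for byte, win, iso in zip(sample, as_windows_1252, as_iso_8859_1):
    print(f"0x{byte:02X}: Windows-1252 U+{ord(win):04X}, ISO 8859-1 U+{ord(iso):04X}")
```

Only the last byte (0xE9) decodes to the same character in both encodings.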
IANA has a pretty complete overview of character encodings that clears up some of this confusion, and lists alias names but also defines standard names that are not ambiguous.
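Many libraries keep a similar alias table. As a rough illustration, Python's codecs module resolves several alias spellings to one canonical codec. Note that Python's alias list is its own and not a copy of the IANA registry, and the helper same_encoding is a name I made up for this sketch.

```python
import codecs


def same_encoding(name_a: str, name_b: str) -> bool:
    """Return True if both names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


print(same_encoding("windows-1252", "cp1252"))
print(same_encoding("latin1", "iso-8859-1"))
print(same_encoding("cp1252", "latin1"))
```

The first two pairs are aliases of each other, while the last pair names two different encodings.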
Character encoding categories
As hinted above, some character encoding names do not refer to a fixed encoding but to an encoding category.
For example, “ANSI” under Windows usually stands for the standard system code page. The actual code page/encoding varies depending on the Windows language and regional settings (which can be set in the control panel). On a Western Windows “ANSI” will mean “Windows-1252”, while on a Russian Windows “ANSI” stands for “Windows-1251”.
It may add to the confusion that there are alternative names for “ANSI” such as “Windows” or “CP_ACP”.
Similar to the ANSI default encoding, Windows also defines default encodings for the categories of DOS/IBM-PC and Macintosh encodings.
The default “DOS/IBM-PC” encoding is also called “OEM” or “CP_OEMCP”, and the default “Macintosh” encoding has the alias “CP_MACCP”.
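The effect of such a category name can be sketched as a lookup that depends on the system configuration. In the Python sketch below, the two dictionaries are hypothetical snapshots of what the categories resolve to on a Western and a Russian Windows; the names WESTERN, RUSSIAN, and decode_with are made up for illustration.

```python
# Hypothetical snapshots of what each category resolves to on two
# differently configured systems (illustrative values, Python codec names).
WESTERN = {"ANSI": "cp1252", "OEM": "cp850", "MAC": "mac_roman"}
RUSSIAN = {"ANSI": "cp1251", "OEM": "cp866", "MAC": "mac_cyrillic"}


def decode_with(category: str, settings: dict, data: bytes) -> str:
    """Decode data using whatever encoding the category currently resolves to."""
    return data.decode(settings[category])


data = b"\xe0"
print(decode_with("ANSI", WESTERN, data))  # Latin small letter a with grave
print(decode_with("ANSI", RUSSIAN, data))  # Cyrillic small letter a
```

The same byte and the same category name yield different characters, which is why “ANSI” alone does not identify an encoding. On a real Windows system, the current values can be queried with the GetACP and GetOEMCP API functions instead of a hard-coded table.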
Terms and names in HxD
Following the reasoning so far, character encoding is the proper term and will also be adopted in HxD instead of character set.
Many programs, including HxD in the past (see the introduction), do not distinguish sufficiently between character encodings that have a fixed meaning and those that represent a temporary selection from a character encoding category.
The following labeling should clear this up:
- System variables
When hovering over any item of this list in HxD, a hint will show the actual encoding currently set in Windows for this encoding variable. So, on an unmodified Western Windows, hovering over “Windows/"ANSI"” will show this hint: “Current encoding: ANSI – Latin1 (Windows-1252)”.
The EBCDIC encoding in HxD version 18.104.22.168 is not a generic name, just a shorthand, and always stands for the encoding “EBCDIC 500” (called “IBM EBCDIC – International” by Microsoft). It will be renamed accordingly in the coming version.
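Why the precise name matters can be shown with two EBCDIC variants. The sketch below uses Python's "cp500" codec (EBCDIC 500) and, purely for comparison, "cp037" (another EBCDIC variant that Python ships); the choice of sample characters is mine.

```python
# "cp500" is EBCDIC 500; "cp037" is a different EBCDIC variant,
# used here only to show that "EBCDIC" alone is ambiguous.
letter_500 = "A".encode("cp500")
letter_037 = "A".encode("cp037")

bracket_500 = "[".encode("cp500")
bracket_037 = "[".encode("cp037")

print("A:", letter_500.hex(), letter_037.hex())    # letters agree
print("[:", bracket_500.hex(), bracket_037.hex())  # brackets differ
```

Letters are encoded identically in both variants, but some punctuation is not, so a bare “EBCDIC” label would leave the user guessing.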
Another consideration is finding names for the fixed-meaning encodings. They should be as user friendly as possible and would need to be translated, which may prove both difficult and tedious for translators: which technical parts should remain unchanged, and which should be translated?
An overview of encoding naming conventions in user interfaces (UIs) would help, but there seems to be none. Based on the sampling I have done, I found several strategies for creating lists of encodings for UIs:
- Raw list of encodings
  - List of encodings returned by Windows
  - Subset of the list of encodings returned by Windows
  - Application provided (library like libiconv or custom code)
- Order encodings by language and/or region
- Order by technical encoding categories (like origin system or inventor)
- Naming (in the user interface)
  - Technical names
    - MIME or IANA names
    - Code page names (Windows)
  - Simplified technical names
  - User friendly names (language or region name disregarding technical names)
  - A mix of the above
    - For example: names provided by Windows (intended for the user interface)
    - Technical names
The degree of technicality of names presented in the user interface varies. Firefox, for example, removes almost all technical information and keeps only the language name/region and a generic reference to a standard (“ISO” or “Windows”). This is not precise enough for users of a hex editor who want to pick an encoding with certainty. On the other hand, I want to use encoding names that are common and popular, and that help people recognize something they saw in the past without having rigorously studied encodings. This means the names should include common aliases.
Windows does a pretty good job of naming encodings that way when they are more exotic, but for common ones it just removes this information. For example, “ANSI – Cyrillic” misses any clear indication that it is Windows specific. Given all the ambiguity there has been regarding encodings, I prefer to add to the encoding name provided by Windows the technical name as defined by IANA: “ANSI – Cyrillic (Windows-1251)”.
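A minimal sketch of this naming rule in Python follows. The function display_name and its parameters are hypothetical, and skipping the suffix when the technical name is already part of the Windows name is my own assumption, not a decided HxD behavior.

```python
def display_name(windows_name: str, iana_name: str) -> str:
    """Append the technical name unless the Windows name already contains it."""
    if iana_name.lower() in windows_name.lower():
        return windows_name
    return f"{windows_name} ({iana_name})"


print(display_name("ANSI – Cyrillic", "Windows-1251"))
print(display_name("Unicode (UTF-8)", "UTF-8"))
```

The first call appends the IANA name in parentheses; the second leaves the name unchanged because it already carries the technical name.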
Sometimes encoding names returned by Windows already include some code page references, e.g., “IBM EBCDIC – Germany (20273 + Euro)”. If the reference to the Windows-specific code page 20273 were replaced by a reference to the proper EBCDIC encoding, it would be clearer: “IBM EBCDIC – Germany (EBCDIC 273 + Euro)”. As the exact wording may change with translations or Windows versions, I cannot safely modify the strings to match the desired output.
Except for the referencing issue, this should give names that are recognizable to the user (if they saw them in Windows in the past) but also to the technical user, because of the additional information added after the Windows-provided name.
On top of that, encodings will be ordered by language/region and by technical encoding category: Windows, DOS/IBM-PC, and Macintosh (the categories mentioned above), and possibly further categories such as ISO 8859, EBCDIC, and Unicode. That way, both kinds of interest can be met: finding a possible encoding for a known language or selecting the exact technical encoding.
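The two orderings can be thought of as two groupings over the same list of entries. The Python sketch below assumes a hypothetical EncodingEntry record and a handful of sample labels; none of this reflects HxD's actual data structures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodingEntry:
    label: str     # name shown in the user interface
    category: str  # technical category, e.g. "Windows" or "ISO 8859"
    language: str  # language or region the encoding is meant for


# Illustrative sample entries only.
ENTRIES = [
    EncodingEntry("ANSI – Latin I (Windows-1252)", "Windows", "Western European"),
    EncodingEntry("ANSI – Cyrillic (Windows-1251)", "Windows", "Cyrillic"),
    EncodingEntry("ISO 8859-1", "ISO 8859", "Western European"),
    EncodingEntry("ISO 8859-5", "ISO 8859", "Cyrillic"),
]


def group_by(entries, key: str) -> dict:
    """Group entry labels by the given attribute ("category" or "language")."""
    groups = defaultdict(list)
    for entry in entries:
        groups[getattr(entry, key)].append(entry.label)
    return dict(groups)


by_category = group_by(ENTRIES, "category")
by_language = group_by(ENTRIES, "language")
print(by_category["Windows"])
print(by_language["Cyrillic"])
```

Both views list the same entries, so a user can start either from a known language or from a known technical category and arrive at the same encoding.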