Charset vs. encoding

I have always been bothered by the fact that the words "charset" and "encoding" seem to be used interchangeably: if they mean the same thing, why use both words?

  • The HTML specification uses both words interchangeably: "The charset attribute specifies the character encoding used by the document."
  • The XML specification does something similar, except that it specifies a declaration named encoding instead of charset, while recommending the names of IANA charsets as its values: "It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority, other than those just listed, be referred to using their registered names".

My patience finally ran out, and I decided to dive into the matter. This post is essentially a shortened and creatively retold version of the Unicode Character Encoding Model, and I invite you to read the original instead, if you are so inclined. I also recommend the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

Contents
  1. Abstract character repertoire (ACR)
  2. Character map (CM)
    1. Coded character set (CCS)
    2. Character encoding form (CEF)
    3. Character encoding scheme (CES)
  3. Examples
    1. Coded character set
    2. Character map

Abstract character repertoire (ACR)

An unordered set of abstract characters is called an abstract character repertoire (ACR). Abstract characters are often referred to as just characters.

The word "abstract" emphasizes that these objects are defined by convention. For example, the capital letter "A" in the Latin alphabet is an abstract character named LATIN CAPITAL LETTER A in the Unicode standard. Regardless of the glyph used to represent this character, e.g., the various renderings of "A" in different fonts, we mean the same abstract character.
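
This is easy to poke at from code. Here is a minimal Java sketch (the class name is mine; Character.getName has been in the standard library since Java 7) that looks up the Unicode name of an abstract character:

    public class AbstractCharacterDemo {
        public static void main(String[] args) {
            // 'A' denotes the same abstract character no matter which
            // glyph a font draws for it; the Unicode standard names it:
            System.out.println(Character.getName('A')); // LATIN CAPITAL LETTER A
        }
    }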

Character map (CM)

Character map (CM), a.k.a. charset — a mapping from sequences of abstract characters to serialized sequences of bytes. It works by composing the three mappings described below: a coded character set (CCS), a character encoding form (CEF), and a character encoding scheme (CES).

So a charset is not actually a set of characters, as one might have anticipated based on the word choice.

The CES in the above definition may be compound, which means there may be multiple CCS/CEF pairs behind a given CM; such a CM is then also called compound. This definition is to an extent similar to the one given in RFC 2978, though the two do not seem to be identical, and the definition in the RFC makes much less sense to me than the one in the Unicode standard.

Coded character set (CCS)

Coded character set (CCS), a.k.a. code page — a mapping from an ACR to the set of non-negative integers, which are called code points. If a CCS assigns a code point to an abstract character, then such a character is called an encoded character.
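
Both directions of this mapping can be observed in Java; here is a small sketch (the class name is mine; Character.codePointOf requires Java 9 or later):

    public class CodedCharacterSetDemo {
        public static void main(String[] args) {
            // The CCS maps the abstract character LATIN CAPITAL LETTER A
            // to the code point U+0041, i.e. 65 in decimal.
            System.out.println("A".codePointAt(0)); // 65
            // And back from the character name to the code point:
            System.out.println(Character.codePointOf("LATIN CAPITAL LETTER A")); // 65
        }
    }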

Character encoding form (CEF)

Character encoding form (CEF) — a mapping from code points used in a CCS to the set of sequences of code units. While a code unit is an integer with a bit width fixed for a given CEF, the sequences of code units representing code points do not necessarily have the same length.

This concept arises from the way numbers are represented in computers, namely as sequences of bytes; thus a CEF enables character representation as actual data in a computer. For example, the UTF-8 CEF in the Unicode standard is a variable-width encoding form that represents code points as sequences of one to four 8-bit code units.
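
Java strings happen to use the UTF-16 CEF internally, with char as the 16-bit code unit, which makes the CEF level easy to observe. A sketch (the class name is mine):

    public class EncodingFormDemo {
        public static void main(String[] args) {
            int microbe = 0x1F9A0; // the single code point U+1F9A0 MICROBE
            // The UTF-16 CEF maps code points outside the Basic Multilingual
            // Plane to a sequence of two 16-bit code units (a surrogate pair).
            char[] codeUnits = Character.toChars(microbe);
            System.out.println(codeUnits.length); // 2
            System.out.printf("0x%04X 0x%04X%n",
                    (int) codeUnits[0], (int) codeUnits[1]); // 0xD83E 0xDDA0
        }
    }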

Character encoding scheme (CES)

Character encoding scheme (CES) — a reversible transformation of sequences of code units to sequences of bytes.

Applying a CES is the last step in the process of representing an abstract character as binary data in a computer. It may introduce compression or deal with byte order, as the UTF-16 CES does, hence the little-endian (LE) / big-endian (BE) variants and the byte order mark (BOM).
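
The byte-order concern is easy to demonstrate in Java, because the standard library ships the UTF-16BE, UTF-16LE, and BOM-prefixed UTF-16 schemes as distinct charsets. A sketch (java.util.HexFormat requires Java 17 or later; the class name is mine):

    import java.nio.charset.StandardCharsets;
    import java.util.HexFormat;

    public class EncodingSchemeDemo {
        public static void main(String[] args) {
            String s = "A"; // a single 16-bit code unit, 0x0041
            HexFormat hex = HexFormat.of();
            // The same code unit serialized by three different CESes:
            System.out.println(hex.formatHex(s.getBytes(StandardCharsets.UTF_16BE))); // 0041
            System.out.println(hex.formatHex(s.getBytes(StandardCharsets.UTF_16LE))); // 4100
            // Java's UTF-16 charset writes a big-endian byte order mark:
            System.out.println(hex.formatHex(s.getBytes(StandardCharsets.UTF_16)));   // feff0041
        }
    }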

Examples

Coded character set

ISO/IEC 10646 defines a CCS called the Universal Coded Character Set (UCS), and the Unicode standard uses this CCS. UCS includes many interesting characters, e.g., ⑧ 🦠 ∬, but not everything you might want: for example, it does not include the Apple logo. The complete CCS used by the Unicode standard is available at https://www.unicode.org/charts/.

Unicode code points are written in the format U+HHHH or U+HHHHHH, where H is a hexadecimal digit, and have values from U+0000 (0) to U+10FFFF (1_114_111). Note that not every value has an assigned character, and 66 code points are permanently reserved in the Unicode standard for internal use; these reserved code points are called noncharacters.
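
A quick Java sketch of the notation and the bounds (the class name is mine); %04X pads to at least four uppercase hexadecimal digits, matching the U+HHHH convention:

    public class CodePointNotationDemo {
        public static void main(String[] args) {
            System.out.printf("U+%04X%n", "A".codePointAt(0));  // U+0041
            System.out.printf("U+%04X%n", "🦠".codePointAt(0)); // U+1F9A0
            // The code space ends at U+10FFFF, i.e. 1_114_111 in decimal:
            System.out.printf("U+%04X = %d%n",
                    Character.MAX_CODE_POINT, Character.MAX_CODE_POINT); // U+10FFFF = 1114111
        }
    }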

Character map

We often refer to something called "UTF-8" as an "encoding", but the Java SE API specification refers to it as a Charset. So what is it exactly? According to ISO/IEC 10646 and the Unicode standard (they are kept synchronized), there are a UTF-8 CEF and a UTF-8 CES. RFC 3629 defines the UTF-8 charset, which is registered as an IANA character set. So we may say that

UTF-8 charset = UCS CCS + UTF-8 CEF + UTF-8 CES.
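
The whole pipeline can be observed in a single round trip in Java, where java.nio.charset.Charset models exactly this composition (HexFormat requires Java 17 or later; the class name is mine). Note the variable-width CEF at work: one, three, and four bytes per character:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.HexFormat;

    public class CharsetRoundTripDemo {
        public static void main(String[] args) {
            Charset utf8 = StandardCharsets.UTF_8; // the IANA "UTF-8" charset
            String s = "A∬🦠";
            // Encoding applies CCS (character -> code point),
            // CEF (code point -> code units), and CES (code units -> bytes):
            byte[] bytes = s.getBytes(utf8);
            System.out.println(HexFormat.of().formatHex(bytes)); // 41e288acf09fa6a0
            // Decoding applies the same three mappings in reverse:
            System.out.println(new String(bytes, utf8).equals(s)); // true
        }
    }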