Code Unit

Encoding & Standards

The minimal bit combination that can represent a unit of encoded text: 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32.

A code unit is the fundamental building block of a Unicode encoding form. It's important to distinguish code units from code points — a single code point may require multiple code units depending on the encoding.

In UTF-8, a code unit is 8 bits (1 byte). The emoji 😀 requires 4 code units. In UTF-16, a code unit is 16 bits (2 bytes). The same emoji requires 2 code units (a surrogate pair). In UTF-32, a code unit is 32 bits (4 bytes), so the emoji is always exactly 1 code unit.
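These counts can be checked directly. A minimal Python sketch (Python is used here purely for illustration; the byte strings produced by `str.encode` are divided by the code-unit size of each encoding):

```python
# U+1F600 GRINNING FACE, a code point above U+FFFF
emoji = "\U0001F600"

# Number of code units = encoded byte length / code-unit size in bytes
utf8_units = len(emoji.encode("utf-8"))            # 8-bit code units
utf16_units = len(emoji.encode("utf-16-le")) // 2  # 16-bit code units
utf32_units = len(emoji.encode("utf-32-le")) // 4  # 32-bit code units

print(utf8_units, utf16_units, utf32_units)  # 4 2 1
```

The little-endian variants (`utf-16-le`, `utf-32-le`) are used so no byte-order mark is prepended to the output.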

Many programming language string APIs operate on code units rather than code points, which is why string length calculations can be confusing with emoji.

Related Terms

Surrogate Pair
Two UTF-16 code units (a high surrogate U+D800-U+DBFF followed by a low surrogate U+DC00-U+DFFF) that together represent a character above U+FFFF.
UTF-16
A variable-width Unicode encoding that uses 2 or 4 bytes per character, used internally by JavaScript, Java, and Windows.
UTF-32
A fixed-width Unicode encoding that uses exactly 4 bytes per character, providing direct code point mapping at the cost of space.
UTF-8
A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites).

Related Tools

🔢 Unicode Lookup
Enter a codepoint like U+1F600 and get the emoji, encoding details, UTF-8/16 bytes, and HTML entities.