UTF-32

Encoding & Standards

A fixed-width Unicode encoding that uses exactly 4 bytes per code point, providing direct code point mapping at the cost of space.

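UTF-32 is the simplest Unicode encoding: every code point occupies exactly 4 bytes, and the 32-bit value is the code point itself. This makes indexing and counting code points trivial, although a user-perceived character (such as an emoji with a skin-tone modifier) can still span several code points.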
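The direct mapping can be seen by encoding a short string and reading the bytes back four at a time. This is a minimal sketch using Python's built-in `utf-32-be` codec; the helper name `utf32_be_to_code_points` is hypothetical and exists only for illustration.

```python
import struct

def utf32_be_to_code_points(data):
    """Decode big-endian UTF-32 bytes (no BOM) into a list of code points."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data length must be a multiple of 4")
    count = len(data) // 4
    # Each code unit is one unsigned 32-bit big-endian integer.
    return list(struct.unpack(f">{count}I", data))

text = "A\u20ac\U0001F600"  # 'A', EURO SIGN, GRINNING FACE
encoded = text.encode("utf-32-be")  # explicit byte order, so no BOM is written

print(encoded.hex(" ", 4))  # 00000041 000020ac 0001f600
print([hex(cp) for cp in utf32_be_to_code_points(encoded)])  # ['0x41', '0x20ac', '0x1f600']

# Random access: the code point at index i starts at byte offset 4 * i.
i = 2
code_point = int.from_bytes(encoded[4 * i : 4 * i + 4], "big")
print(hex(code_point))  # 0x1f600
```

Each 4-byte group is simply the code point written as a 32-bit integer, which is why no lookup tables or continuation-byte logic are needed.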

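However, UTF-32 takes four times the space of ASCII (or UTF-8) for ASCII-only text, and twice the space of UTF-16 for characters in the Basic Multilingual Plane, which covers most text in everyday use. It is rarely used for storage or transmission but can be convenient for internal string processing.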
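The space trade-off is easy to measure. The sketch below compares encoded sizes for three sample strings; the sample texts and the helper name `encoded_sizes` are illustrative choices, and the byte-order-specific codecs are used so that no BOM is counted.

```python
def encoded_sizes(text):
    """Return byte counts for text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "ASCII": "hello",                 # 5 code points, all below U+0080
    "BMP": "\u65e5\u672c\u8a9e",      # 3 code points (Japanese for "Japanese")
    "non-BMP": "\U0001F600",          # 1 code point above U+FFFF
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
# ASCII {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
# BMP {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
# non-BMP {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

Only for code points outside the Basic Multilingual Plane does UTF-32 match the other encodings in size.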

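CPython (3.3 and later) stores each string in a fixed-width form chosen per string: 1, 2, or 4 bytes per code point (Latin-1, UCS-2, or UCS-4), depending on the highest code point the string contains. Because Python strings are indexed by code point regardless of the storage width, `len('😀')` returns 1.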
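The per-string storage width can be observed indirectly with `sys.getsizeof`. This is a rough sketch that assumes CPython; the exact byte counts are an implementation detail that varies between versions, so none are shown here.

```python
import sys

n = 1000
strings = {
    "ASCII only": "a" * n,
    "Latin-1 max": "a" * (n - 1) + "\u00e9",      # highest code point below U+0100
    "BMP max": "a" * (n - 1) + "\u20ac",          # highest code point below U+10000
    "non-BMP max": "a" * (n - 1) + "\U0001F600",  # highest code point above U+FFFF
}

# Every string has the same length in code points, but the memory
# footprint grows as the widest code point forces a wider storage form.
for label, s in strings.items():
    print(label, len(s), sys.getsizeof(s))

emoji = "\U0001F600"
print(len(emoji))                           # 1 code point
print(len(emoji.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
```

The last two lines contrast Python's code point count with the UTF-16 code unit count that languages such as JavaScript and Java report for the same character.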

Related Terms

BOM
The Byte Order Mark (U+FEFF) placed at the start of a text file to indicate byte order (endianness) in UTF-16/UTF-32 encodings.
Code Unit
The minimum bit combination used for encoding a character: 8-bit for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32.
UTF-16
A variable-width Unicode encoding that uses 2 or 4 bytes per character, used internally by JavaScript, Java, and Windows.
UTF-8
A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites).

Related Tools

🔢 Unicode Lookup
Enter a codepoint like U+1F600 and get the emoji, encoding details, UTF-8/16 bytes, and HTML entities.