Code Unit

Encoding & Standards

The minimal bit combination used to encode a character: 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32.

A code unit is the fundamental building block of a Unicode encoding form: the fixed-size unit from which encoded text is assembled. Code units are not the same as code points. A single code point may require one or more code units, depending on the encoding.

In UTF-8, a code unit is 8 bits (1 byte). The emoji 😀 requires 4 code units. In UTF-16, a code unit is 16 bits (2 bytes). The same emoji requires 2 code units (a surrogate pair). In UTF-32, it's 1 code unit (4 bytes).
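The counts above can be checked with a short Python sketch. The helper names code_unit_counts and surrogate_pair are hypothetical, chosen for illustration. The little-endian codecs are used only so that Python does not prepend a byte order mark, which would inflate the counts.

```python
def code_unit_counts(text: str) -> dict[str, int]:
    """Return how many code units `text` needs in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 1 byte per code unit
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per code unit
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per code unit
    }


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


print(code_unit_counts("\U0001F600"))  # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
print(code_unit_counts("A"))           # {'UTF-8': 1, 'UTF-16': 1, 'UTF-32': 1}
print([hex(u) for u in surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

An ASCII letter fits in one code unit in every encoding form, while the emoji (U+1F600) needs 4, 2, and 1 code units respectively.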

Many programming language string APIs operate on code units rather than code points. JavaScript's string.length, for example, counts UTF-16 code units, so the single emoji 😀 reports a length of 2 rather than 1.
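As a rough illustration of the mismatch, the sketch below compares Python's len(), which counts code points, with the UTF-16 code unit count that a UTF-16-based string API would report. The helper name utf16_length is hypothetical, and the family emoji is written with escapes so its code points are explicit.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, the unit that UTF-16-based string APIs measure."""
    return len(text.encode("utf-16-le")) // 2


grinning = "\U0001F600"  # one code point, outside the Basic Multilingual Plane
# Man + ZWJ + woman + ZWJ + girl: five code points rendered as one family emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(len(grinning), utf16_length(grinning))  # 1 2
print(len(family), utf16_length(family))      # 5 8
```

Neither number matches what a reader sees as one character for the family emoji. Counting user-perceived characters requires grapheme cluster segmentation, which is a separate step from counting code units or code points.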

Related terms

Related tools

Related articles

Unicode Lookup: Find Any Emoji by Code Point

Enter a Unicode code point like U+1F600 or paste an emoji to see its full encoding breakdown — UTF-8, UTF-16, HTML entities, and more.

How to Handle Emojis in JavaScript: Strings, Length, and Rendering

Learn how to work with emojis in JavaScript: string length pitfalls, Unicode code points, regex, grapheme segmentation, and React rendering tips.

Why string.length Lies: Grapheme Clusters and Emoji Length

Why string.length fails for emoji, what grapheme clusters are, and how to correctly count characters in JavaScript, Python, and more.

Emoji String Length Gotchas: Surrogate Pairs, Grapheme Clusters, and Byte Counts

Why emoji cause surprising string length results in Python, JavaScript, and Go — surrogate pairs, grapheme clusters, and how to count correctly.

Emoji Security: Homoglyphs, Spoofing, Invisible Characters, and Filtering

Security risks from emoji in user input: homoglyph spoofing, invisible Unicode characters, emoji in SQL/code injection, and how to filter and sanitize safely.

Emoji Regex Patterns: Matching Emojis in JavaScript and Python

Learn how to write reliable regex patterns to match emojis in JavaScript and Python, including ZWJ sequences, skin tones, and flags.