Emoji Encoding Guide: UTF-8, UTF-16 & Surrogate Pairs

Why Emoji Break Your String.length

If you've ever been surprised that '😀'.length === 2 in JavaScript, you've encountered one of the most common emoji encoding pitfalls. This guide explains why it happens and how to handle emoji correctly in code.

The Three Unicode Encodings

Unicode defines three encoding forms. Each uses a different strategy to convert code points into bytes:

UTF-8: The Web Standard

UTF-8 uses 1-4 bytes per character and is backward-compatible with ASCII:

Code Point Range      Bytes     Example
U+0000 - U+007F       1 byte    A → 0x41
U+0080 - U+07FF       2 bytes   é → 0xC3 0xA9
U+0800 - U+FFFF       3 bytes   한 → 0xED 0x95 0x9C
U+10000 - U+10FFFF    4 bytes   😀 → 0xF0 0x9F 0x98 0x80

Most emoji require 4 bytes in UTF-8 because their code points are above U+FFFF; a few older symbols in the Basic Multilingual Plane, such as ❤ (U+2764), need only 3. A ZWJ sequence like 👩‍💻 uses 11 bytes (4 + 3 + 4 for woman + ZWJ + laptop).
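
The byte counts in the table can be checked directly. The following is a minimal sketch, assuming a runtime that provides the standard TextEncoder API (modern browsers and Node.js); the helper names utf8ByteLength and utf8Hex are illustrative and not part of any library.

```javascript
// Count the UTF-8 bytes a string occupies, using the standard TextEncoder API.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Render the encoded bytes as hex, e.g. "0xF0 0x9F 0x98 0x80".
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), (byte) =>
    '0x' + byte.toString(16).toUpperCase().padStart(2, '0')
  ).join(' ');
}

// A, é, 한, 😀, and the woman + ZWJ + laptop sequence, written as escapes
// so the example does not depend on the file's own encoding.
const samples = ['A', '\u00E9', '\uD55C', '\u{1F600}', '\u{1F469}\u200D\u{1F4BB}'];
for (const sample of samples) {
  console.log(sample, utf8ByteLength(sample), utf8Hex(sample));
}
```

The last sample reports 11 bytes, matching the 4 + 3 + 4 breakdown above.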

UTF-16: JavaScript and Java's Native Encoding

UTF-16 uses 2 or 4 bytes per character:

Code Point Range                            Code Units          Example
U+0000 - U+FFFF (BMP)                       1 unit (2 bytes)    A → 0x0041
U+10000 - U+10FFFF (supplementary planes)   2 units (4 bytes)   😀 → 0xD83D 0xDE00

Characters above U+FFFF, including most emoji, require a surrogate pair: two 16-bit code units that encode one code point.

UTF-32: Simple but Wasteful

UTF-32 uses exactly 4 bytes per code point. This makes indexing simple (string[i] always gives you one code point), but ASCII text takes four times the memory it would in UTF-8.
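
JavaScript has no UTF-32 string type, but the idea can be approximated by copying code points into a Uint32Array. This is only a sketch of the fixed-width layout; the function name toCodePoints is an illustrative assumption.

```javascript
// Build a UTF-32-like view of a string: one 32-bit slot per code point.
function toCodePoints(text) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Uint32Array.from(Array.from(text, (ch) => ch.codePointAt(0)));
}

const utf32 = toCodePoints('Hi \u{1F600}'); // "Hi " followed by U+1F600
console.log(utf32.length);           // 4 (code points)
console.log(utf32.byteLength);       // 16 (4 bytes each)
console.log(utf32[3].toString(16));  // "1f600": direct indexing by code point
```

The same text is 7 bytes in UTF-8, which is the space cost the heading refers to.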

Surrogate Pairs Explained

Surrogate pairs are the key to understanding JavaScript emoji behavior. Here's the math:

Code Point: U+1F600 (😀)
Offset: 0x1F600 - 0x10000 = 0xF600

High Surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
Low Surrogate:  0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00

Result: 0xD83D 0xDE00

This is why '😀'.charCodeAt(0) returns 55357 (0xD83D), the high surrogate, and '😀'.charCodeAt(1) returns 56832 (0xDE00), the low surrogate.
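
The arithmetic above translates almost line for line into code. The sketch below uses two hypothetical helpers, toSurrogatePair and fromSurrogatePair, to encode a code point and decode it again; in practice String.fromCodePoint and codePointAt do this for you.

```javascript
// Split a supplementary code point into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Only U+10000 to U+10FFFF are encoded as surrogate pairs');
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Reverse the calculation: recombine two surrogates into one code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));             // d83d de00
console.log(String.fromCharCode(high, low) === '\u{1F600}');  // true
console.log(fromSurrogatePair(high, low).toString(16));       // 1f600
```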

Common Pitfalls

1. String Length

// WRONG: counts UTF-16 code units
'😀'.length              // 2
'👨‍👩‍👧'.length           // 8

// CORRECT: counts grapheme clusters
[...new Intl.Segmenter().segment('👨‍👩‍👧')].length  // 1
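
A string really has three different "lengths", and it can help to see them side by side. The sketch below assumes a runtime with Intl.Segmenter; the function name measure is illustrative.

```javascript
// Report the three common notions of length for a string.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    codeUnits: text.length,                                 // UTF-16 code units
    codePoints: Array.from(text).length,                    // Unicode code points
    graphemes: Array.from(segmenter.segment(text)).length,  // user-perceived characters
  };
}

// Family emoji: man + ZWJ + woman + ZWJ + girl, written as escapes.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(measure(family)); // { codeUnits: 8, codePoints: 5, graphemes: 1 }
```

Which count is the right one depends on the task: storage limits usually care about bytes or code units, while character counters shown to users should count graphemes.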

2. String Slicing

// WRONG: splits surrogate pair
'Hello 😀'.slice(0, 7)  // 'Hello \uD83D' (broken!)

// BETTER: spread or Array.from iterates by code point (ZWJ sequences can still be split)
[...'Hello 😀'].slice(0, 7).join('')  // 'Hello 😀'
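
Slicing by code point keeps surrogate pairs intact, but it can still cut a ZWJ sequence in half. One possible approach, sketched here with a hypothetical sliceGraphemes helper and assuming Intl.Segmenter is available, is to slice by grapheme cluster instead.

```javascript
// Slice a string by grapheme cluster rather than by code unit or code point.
function sliceGraphemes(text, start, end) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const clusters = Array.from(segmenter.segment(text), (part) => part.segment);
  return clusters.slice(start, end).join('');
}

const coder = '\u{1F469}\u200D\u{1F4BB}'; // woman + ZWJ + laptop
const text = 'Hi ' + coder + '!';

// Grapheme slicing keeps the whole sequence together.
console.log(sliceGraphemes(text, 0, 4) === 'Hi ' + coder);        // true
// Code point slicing stops after the woman and drops ZWJ + laptop.
console.log([...text].slice(0, 4).join('') === 'Hi \u{1F469}');   // true
```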

3. Regular Expressions

// WRONG: without the u flag, . matches a single UTF-16 code unit
/^.$/.test('😀')  // false

// CORRECT: use u flag for Unicode awareness
/^.$/u.test('😀')  // true
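
The u flag makes . match a whole code point, not a whole grapheme, and it also enables Unicode property escapes. The sketch below illustrates both points; countPictographs is an illustrative helper, and Extended_Pictographic is one of several emoji-related properties you might choose depending on what you want to match.

```javascript
const grin = '\u{1F600}';
const coder = '\u{1F469}\u200D\u{1F4BB}'; // woman + ZWJ + laptop

// Without the u flag, "." matches one code unit, so a surrogate pair needs two dots.
console.log(/^..$/.test(grin));     // true
// With the u flag, "." matches one code point.
console.log(/^.$/u.test(grin));     // true
// A ZWJ sequence is three code points, so one "." is still not enough.
console.log(/^.$/u.test(coder));    // false
console.log(/^...$/u.test(coder));  // true

// Count code points with the Extended_Pictographic property (requires the u flag).
function countPictographs(text) {
  return (text.match(/\p{Extended_Pictographic}/gu) || []).length;
}
console.log(countPictographs('a\u{1F600}b\u{1F4BB}')); // 2
```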

4. Database Storage

When using MySQL, ensure your column uses utf8mb4, not utf8, which is an alias for utf8mb3 and stores at most 3 bytes per character. PostgreSQL stores emoji correctly in TEXT columns as long as the database encoding is UTF8.

-- MySQL: must use utf8mb4 for emoji
ALTER TABLE posts MODIFY content TEXT CHARACTER SET utf8mb4;
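
If you cannot migrate a legacy column right away, application code can at least detect text that would not fit. The following is a hypothetical pre-insert check, not a MySQL API: it flags any code point above U+FFFF, which is exactly the set of characters that need 4 bytes in UTF-8.

```javascript
// True when the text contains a code point above U+FFFF,
// i.e. a character that a 3-byte-only column cannot store.
function needsFourByteUtf8(text) {
  for (const ch of text) {               // for...of iterates by code point
    if (ch.codePointAt(0) > 0xFFFF) return true;
  }
  return false;
}

console.log(needsFourByteUtf8('caf\u00E9'));   // false: é is 2 bytes
console.log(needsFourByteUtf8('\u2764'));      // false: U+2764 is 3 bytes
console.log(needsFourByteUtf8('\u{1F600}'));   // true: U+1F600 is 4 bytes
```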

Language-Specific Tips

Python 3

Python 3 handles emoji gracefully: len('😀') returns 1 because Python strings are sequences of code points.

emoji = '👩‍💻'
len(emoji)           # 3 (code points: woman + ZWJ + laptop)
emoji.encode('utf-8') # b'\xf0\x9f\x91\xa9\xe2\x80\x8d\xf0\x9f\x92\xbb'

Java

Java strings are UTF-16, like JavaScript:

"😀".length()          // 2 (surrogate pair)
"😀".codePointCount(0, "😀".length())  // 1 (actual code point count)

Rust

Rust strings are UTF-8, and the standard library distinguishes bytes from chars (Unicode scalar values); counting grapheme clusters requires an external crate such as unicode-segmentation:

"😀".len()            // 4 (bytes)
"😀".chars().count()  // 1 (code points)

Analyze Any Emoji's Encoding

Use our Sequence Analyzer to see the complete encoding breakdown of any emoji: UTF-8 bytes, UTF-16 surrogates, code points, and component roles.

Related Tools

Sequence Analyzer
Decode ZWJ sequences, skin tone modifiers, keycap sequences, and flag pairs into individual components.

Glossary Terms

Code Point
A unique numerical value assigned to each character in the Unicode standard, written in the format U+XXXX (e.g., U+1F600 for 😀).
Emoji
A Japanese word (絵文字) meaning 'picture character': small graphical symbols used in digital communication to express ideas, emotions, and objects.
Supplementary Multilingual Plane (SMP)
Unicode Plane 1 (U+10000 to U+1FFFF), where the majority of emoji code points are allocated.
Surrogate Pair
Two UTF-16 code units (a high surrogate U+D800-U+DBFF followed by a low surrogate U+DC00-U+DFFF) that together represent a character above U+FFFF.
UTF-16
A variable-width Unicode encoding that uses 2 or 4 bytes per character, used internally by JavaScript, Java, and Windows.
UTF-32
A fixed-width Unicode encoding that uses exactly 4 bytes per character, providing direct code point mapping at the cost of space.
UTF-8
A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites).
Unicode
Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji.
Zero Width Joiner (ZWJ)
An invisible Unicode character (U+200D) used to join multiple emoji into a single composite emoji, such as combining people and objects into profession emoji.
