🛠️ Technical & Developer · Feb 22, 2026

Emoji Encoding Guide: UTF-8, UTF-16 & Surrogate Pairs

Why Emoji Break Your String.length

If you've ever been surprised that '😀'.length === 2 in JavaScript, you've encountered one of the most common emoji encoding pitfalls. This guide explains why it happens and how to handle emoji correctly in code.

The Three Unicode Encodings

Unicode defines three encoding forms. Each uses a different strategy to convert code points into bytes:

UTF-8: The Web Standard

UTF-8 is a variable-width encoding that uses 1 to 4 bytes per code point, is backward-compatible with ASCII, and is the dominant encoding on the web (used by over 98% of websites):

Code Point Range	Bytes	Example
U+0000 - U+007F	1 byte	A → `0x41`
U+0080 - U+07FF	2 bytes	é → `0xC3 0xA9`
U+0800 - U+FFFF	3 bytes	한 → `0xED 0x95 0x9C`
U+10000 - U+10FFFF	4 bytes	😀 → `0xF0 0x9F 0x98 0x80`

Most emoji require 4 bytes in UTF-8 because their code points are above U+FFFF; the few that sit in the BMP, such as ❤ (U+2764), take 3. A ZWJ (zero width joiner, U+200D) sequence like 👩‍💻 uses 11 bytes (4 + 3 + 4 for woman + ZWJ + laptop).
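
The byte counts in the table follow from UTF-8's bit layout. As a minimal sketch (the helper names `encodeCodePointUtf8`, `utf8Bytes`, and `hex` are illustrative and not part of any library), the four ranges could be encoded by hand like this:

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes a valid scalar value (0..0x10FFFF, not a lone surrogate).
function encodeCodePointUtf8(cp) {
  if (cp <= 0x7f) return [cp]; // 1 byte: 0xxxxxxx
  if (cp <= 0x7ff) {
    // 2 bytes: 110xxxxx 10xxxxxx
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  }
  if (cp <= 0xffff) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Spreading a string iterates by code point, so surrogate pairs stay together.
function utf8Bytes(str) {
  return [...str].flatMap((ch) => encodeCodePointUtf8(ch.codePointAt(0)));
}

const hex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, '0')).join(' ');

console.log(hex(utf8Bytes('\u{1F600}')));                  // f0 9f 98 80
console.log(utf8Bytes('\u{1F469}\u200D\u{1F4BB}').length); // 11
```

In real code the built-in `new TextEncoder().encode(str)` does the same job; the manual version is only meant to make the table concrete.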

UTF-16: JavaScript and Java's Native Encoding

UTF-16 is a variable-width encoding that uses 2 or 4 bytes per code point. It is the internal string encoding of JavaScript, Java, and Windows:

Code Point Range	Code Units	Example
U+0000 - U+FFFF (BMP)	1 unit (2 bytes)	A → `0x0041`
U+10000 - U+10FFFF (supplementary planes)	2 units (4 bytes)	😀 → `0xD83D 0xDE00`

Characters above U+FFFF require a surrogate pair: two 16-bit code units (a high surrogate in U+D800 - U+DBFF followed by a low surrogate in U+DC00 - U+DFFF) that together encode one code point. This includes most emoji, which are allocated in the Supplementary Multilingual Plane (SMP, U+10000 - U+1FFFF).

UTF-32: Simple but Wasteful

UTF-32 is a fixed-width encoding that uses exactly 4 bytes per code point. That makes processing simple (string[i] always gives you one code point), but ASCII-only text takes 4x the space it would in UTF-8.
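
To compare the three encoding forms side by side, a rough sketch like the following could report the encoded size of a string in each one. The helper name `encodedSizes` is made up for illustration, and the numbers assume well-formed text with no byte-order mark:

```javascript
// Hypothetical helper: byte size of a string in each Unicode encoding form.
// Assumes well-formed text (no lone surrogates) and no byte-order mark.
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // 1 to 4 bytes per code point
    utf16: str.length * 2,                      // 2 bytes per UTF-16 code unit
    utf32: [...str].length * 4,                 // 4 bytes per code point
  };
}

console.log(encodedSizes('Hello'));     // { utf8: 5, utf16: 10, utf32: 20 }
console.log(encodedSizes('\u{1F600}')); // { utf8: 4, utf16: 4, utf32: 4 }
```

For a single emoji above U+FFFF all three forms cost the same 4 bytes; the difference shows up in the surrounding ASCII text.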

Surrogate Pairs Explained

Surrogate pairs are the key to understanding JavaScript emoji behavior. Here's the math:

Code Point: U+1F600 (😀)
Offset: 0x1F600 - 0x10000 = 0xF600

High Surrogate: 0xD800 + (0xF600 >> 10) = 0xD800 + 0x3D = 0xD83D
Low Surrogate:  0xDC00 + (0xF600 & 0x3FF) = 0xDC00 + 0x200 = 0xDE00

Result: 0xD83D 0xDE00

This is why '😀'.charCodeAt(0) returns 55357 (0xD83D) — the high surrogate — and '😀'.charCodeAt(1) returns 56832 (0xDE00) — the low surrogate.
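
The arithmetic above translates almost line for line into code. This is a small sketch, with `toSurrogatePair` and `fromSurrogatePair` as illustrative names rather than built-ins:

```javascript
// Hypothetical helper: split a supplementary code point into its UTF-16 surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('Only code points above U+FFFF need a surrogate pair');
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// Hypothetical inverse: recombine a high and low surrogate into one code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === '\u{1F600}'); // true
console.log(fromSurrogatePair(high, low).toString(16));      // 1f600
```

In everyday code `String.fromCodePoint` and `codePointAt` do this conversion for you; the manual version is only useful for seeing where the two code units come from.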

Common Pitfalls

1. String Length

// WRONG: counts UTF-16 code units
'😀'.length              // 2
'👨‍👩‍👧'.length           // 8

// CORRECT: counts grapheme clusters
[...new Intl.Segmenter().segment('👨‍👩‍👧')].length  // 1
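
It can help to see all three notions of "length" next to each other. The sketch below uses a hypothetical helper, `countUnits`, and assumes a runtime that ships `Intl.Segmenter` (recent browsers and Node.js versions do):

```javascript
// Hypothetical helper: report the three common ways to measure a string.
// Assumes Intl.Segmenter is available in the runtime.
function countUnits(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    codeUnits: str.length,                         // UTF-16 code units
    codePoints: [...str].length,                   // Unicode code points
    graphemes: [...segmenter.segment(str)].length, // user-perceived characters
  };
}

// Family emoji: man + ZWJ + woman + ZWJ + girl
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
console.log(countUnits(family)); // { codeUnits: 8, codePoints: 5, graphemes: 1 }
```

Which count is right depends on the job: code units for storage offsets in JavaScript, graphemes for anything a user sees, such as character limits or cursor movement.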

2. String Slicing

// WRONG: splits surrogate pair
'Hello 😀'.slice(0, 7)  // 'Hello \uD83D' (broken!)

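// BETTER: spread or Array.from iterates by code point, so surrogate pairs stay intact
// (ZWJ sequences can still be split; use Intl.Segmenter to cut on graphemes)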
[...'Hello 😀'].slice(0, 7).join('')  // 'Hello 😀'
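
Slicing by code point keeps surrogate pairs together but can still cut a ZWJ sequence in half. One possible grapheme-aware approach, sketched here with a hypothetical `truncateGraphemes` helper and assuming `Intl.Segmenter` is available, is to cut on grapheme boundaries instead:

```javascript
// Hypothetical helper: keep at most maxGraphemes user-perceived characters.
// Assumes Intl.Segmenter is available in the runtime.
function truncateGraphemes(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let result = '';
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === maxGraphemes) break;
    result += segment;
    count += 1;
  }
  return result;
}

// 'Hi ' followed by woman + ZWJ + laptop, then '!'
const text = 'Hi \u{1F469}\u200D\u{1F4BB}!';

// Grapheme-aware: the whole ZWJ sequence survives as the 4th character.
console.log(truncateGraphemes(text, 4) === 'Hi \u{1F469}\u200D\u{1F4BB}'); // true

// Code-point slicing: only the woman emoji survives; ZWJ and laptop are cut off.
console.log([...text].slice(0, 4).join('') === 'Hi \u{1F469}'); // true
```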

3. Regular Expressions

// WRONG: without the u flag, . matches a single UTF-16 code unit, not a whole emoji
/^.$/.test('😀')  // false

// CORRECT: use u flag for Unicode awareness
/^.$/u.test('😀')  // true
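
The u flag also enables Unicode property escapes, which are one way to locate emoji-like characters without hard-coding ranges. The sketch below uses a hypothetical `findPictographs` helper. Note that `Extended_Pictographic` is a Unicode property that is broader than "emoji", and that it matches the individual components of a ZWJ sequence rather than the sequence as a whole:

```javascript
// The u flag is required for \p{...} property escapes.
const pictographic = /\p{Extended_Pictographic}/gu;

// Hypothetical helper: return every Extended_Pictographic code point in a string.
function findPictographs(str) {
  return str.match(pictographic) || [];
}

console.log(findPictographs('a\u{1F600}b\u2764').length);         // 2
console.log(findPictographs('plain text').length);                // 0
// A ZWJ sequence is matched component by component (woman, laptop).
console.log(findPictographs('\u{1F469}\u200D\u{1F4BB}').length);  // 2
```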

4. Database Storage

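When using MySQL, make sure the column uses utf8mb4, not utf8: MySQL's utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and therefore cannot hold 4-byte emoji. PostgreSQL stores emoji correctly in TEXT columns as long as the database encoding is UTF8.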

-- MySQL: must use utf8mb4 for emoji
ALTER TABLE posts MODIFY content TEXT CHARACTER SET utf8mb4;

Language-Specific Tips

Python 3

Python 3 handles emoji gracefully — len('😀') returns 1 because Python uses code points internally.

emoji = '👩‍💻'
len(emoji)           # 3 (code points: woman + ZWJ + laptop)
emoji.encode('utf-8') # b'\xf0\x9f\x91\xa9\xe2\x80\x8d\xf0\x9f\x92\xbb'

Java

Java strings are UTF-16, like JavaScript:

"😀".length()          // 2 (surrogate pair)
"😀".codePointCount(0, "😀".length())  // 1 (actual code point count)

Rust

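Rust strings are always UTF-8. The standard library distinguishes bytes from chars (Unicode scalar values); counting grapheme clusters requires an external crate such as unicode-segmentation: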

"😀".len()            // 4 (bytes)
"😀".chars().count()  // 1 (code points)

Analyze Any Emoji's Encoding

Use our Sequence Analyzer to see the complete encoding breakdown of any emoji — UTF-8 bytes, UTF-16 surrogates, code points, and component roles.
