Why string.length Lies: Grapheme Clusters and Emoji Length

The Lie Your Runtime Tells You

'👨‍👩‍👧‍👦'.length  // 11

That is not 11 characters. It is one family emoji. The .length property in JavaScript counts UTF-16 code units, not visible characters. For ASCII text the two counts are identical, so the distinction is invisible, but emoji expose the gap brutally.

Understanding why this happens — and how to fix it — requires knowing what a grapheme cluster is.

Three Levels of "Character"

Unicode defines multiple levels of text unit, each useful for different purposes:

Level	Unit	JavaScript	Python
Byte	Raw storage unit	`new TextEncoder().encode(s).length`	`len(s.encode('utf-8'))`
Code unit	Encoding unit	`s.length`	`len(s.encode('utf-16-le')) // 2`
Code point	Unicode scalar value	`[...s].length`	`len(s)`
Grapheme cluster	User-perceived character	`Intl.Segmenter`	`grapheme` library

A grapheme cluster is what a human thinks of as "one character" — one visible glyph on screen. For emoji, this is always at least one code point, but often several combined.
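As a rough illustration of the four levels in the table, the sketch below measures one string at each level. The helper name `measure` is made up for this example; the calls it uses (`TextEncoder`, spread iteration, `Intl.Segmenter`) are standard in current browsers and Node.js.

```javascript
// Measure one string at each of the four levels described above.
function measure(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    bytes: new TextEncoder().encode(str).length, // UTF-8 bytes
    codeUnits: str.length,                       // UTF-16 code units
    codePoints: [...str].length,                 // Unicode code points
    graphemes: [...segmenter.segment(str)].length, // user-perceived characters
  };
}

const thumbsUp = '\u{1F44D}\u{1F3FD}'; // thumbs up + medium skin tone modifier
console.log(measure(thumbsUp));
// { bytes: 8, codeUnits: 4, codePoints: 2, graphemes: 1 }
```

The same string yields four different "lengths", which is why it helps to name the unit whenever a length is stored or compared.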

What Makes an Emoji "Long"?

Basic Emoji: 1 Code Point, 2 Code Units

😀 = U+1F600
UTF-16: 0xD83D 0xDE00 (surrogate pair)
.length in JS: 2
Python len(): 1
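
The surrogate pair values follow from the standard UTF-16 arithmetic. A minimal sketch, with `toSurrogatePair` as a hypothetical helper name:

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === '\u{1F600}'); // true
```

Every code point above U+FFFF costs two code units this way, which is where the .length of 2 comes from.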

Emoji + Variation Selector: 2 Code Points

❤️ = U+2764 + U+FE0F
.length in JS: 2
Python len(): 2
Grapheme clusters: 1

Emoji + Skin Tone: 2 Code Points

👍🏽 = U+1F44D + U+1F3FD
.length in JS: 4 (two surrogate pairs)
Python len(): 2
Grapheme clusters: 1

ZWJ Sequence: 3+ Code Points

👩‍💻 = U+1F469 + U+200D + U+1F4BB
.length in JS: 5
Python len(): 3
Grapheme clusters: 1

Family Emoji: 7 Code Points

👨‍👩‍👧‍👦 = 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦
.length in JS: 11
Python len(): 7
Grapheme clusters: 1
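
To inspect what a given emoji is made of, one option is to list its code points. The helper below is an illustrative sketch (`codePoints` is not a built-in):

```javascript
// List each code point of a string in U+XXXX notation.
function codePoints(str) {
  return [...str].map(ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Family emoji written with escapes so the ZWJs are visible in source.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(codePoints(family).join(' '));
// U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
console.log(family.length, codePoints(family).length); // 11 7
```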

Correct Counting in JavaScript

Using `Intl.Segmenter` (Modern, Built-In)

Intl.Segmenter is available in all modern browsers and Node.js 16+. It segments text by grapheme clusters:

function countGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)].length;
}

countGraphemes('Hello')     // 5 — correct
countGraphemes('Hello 😀') // 7 — correct
countGraphemes('👨‍👩‍👧‍👦')   // 1 — correct!
countGraphemes('👩🏽‍💻')    // 1 — correct!
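
For code that may still run on an older runtime, a feature check with a fallback is one possible approach. This sketch assumes that falling back to a code point count (which can over-count) is acceptable; `countGraphemesSafe` and `graphemeSegmenter` are names invented here.

```javascript
// Create the segmenter once and reuse it across calls.
const graphemeSegmenter =
  typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function'
    ? new Intl.Segmenter(undefined, { granularity: 'grapheme' })
    : null;

function countGraphemesSafe(str) {
  if (graphemeSegmenter) {
    let count = 0;
    for (const _ of graphemeSegmenter.segment(str)) count++;
    return count;
  }
  // Fallback (assumption: an over-count is tolerable here).
  // Code points are still closer to the visible length than code units.
  return [...str].length;
}

console.log(countGraphemesSafe('Hello \u{1F600}')); // 7
```

Counting inside the loop also avoids building an intermediate array for long strings.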

Slicing Text Correctly

.slice() indexes by UTF-16 code unit, so a cut that lands inside an emoji leaves half a surrogate pair behind:

const s = 'Hello 👋';

// WRONG: may cut through a surrogate pair
s.slice(0, 7)  // 'Hello \uD83D' — broken!

// CORRECT: spread to code points first
[...s].slice(0, 7).join('')  // 'Hello 👋' — correct

// EVEN BETTER: slice by grapheme clusters
function sliceByGrapheme(str, start, end) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  return segments.slice(start, end).map(s => s.segment).join('');
}

sliceByGrapheme('Hello 👋 World', 0, 7)  // 'Hello 👋'
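
A common use of grapheme-aware slicing is truncating text for display. The following is a sketch rather than a definitive implementation; `truncateGraphemes` and its default ellipsis are assumptions made for the example.

```javascript
// Keep at most `maxGraphemes` user-perceived characters,
// appending an ellipsis only when something was cut off.
function truncateGraphemes(str, maxGraphemes, ellipsis = '…') {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const kept = [];
  for (const { segment } of segmenter.segment(str)) {
    if (kept.length === maxGraphemes) {
      // At least one more cluster exists, so the text is being cut.
      return kept.join('') + ellipsis;
    }
    kept.push(segment);
  }
  return str; // already short enough
}

// Woman technologist (woman + ZWJ + laptop) survives as a single unit.
console.log(truncateGraphemes('Hi \u{1F469}\u200D\u{1F4BB} there', 4)); // Hi 👩‍💻…
console.log(truncateGraphemes('Hi', 4));                                // Hi
```

Stopping at the first extra cluster means long inputs are not segmented to the end.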

String Reversal

Reversing an emoji string with .split('').reverse().join('') produces garbage:

// WRONG: reverses code units, so each surrogate pair comes out in the wrong order
'Hello 😀'.split('').reverse().join('')  // '\uDE00\uD83D olleH' (two lone surrogates)

// BETTER: reverse by code points (surrogate pairs survive, multi-code-point emoji do not)
[...'Hello 😀'].reverse().join('')  // '😀 olleH'

// BEST: reverse by grapheme clusters
function reverseGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)]
    .map(s => s.segment)
    .reverse()
    .join('');
}

reverseGraphemes('Hello 👨‍👩‍👧‍👦')  // '👨‍👩‍👧‍👦 olleH'
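
The difference between the code point and grapheme approaches shows up as soon as a ZWJ sequence is involved. A small sketch comparing the two on a single emoji:

```javascript
const coder = '\u{1F469}\u200D\u{1F4BB}'; // woman + ZWJ + laptop (woman technologist)

// Code point reversal keeps surrogate pairs intact but reorders the sequence
// to laptop + ZWJ + woman, which typically renders as two separate emoji.
const byCodePoint = [...coder].reverse().join('');
console.log(byCodePoint === '\u{1F4BB}\u200D\u{1F469}'); // true

// Grapheme reversal treats the whole cluster as one unit and leaves it unchanged.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const byGrapheme = [...segmenter.segment(coder)]
  .map(s => s.segment)
  .reverse()
  .join('');
console.log(byGrapheme === coder); // true
```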

Correct Counting in Python

Python 3's len() counts code points, not grapheme clusters. For ASCII text the two counts are identical, but for emoji they are not.

text = '👨‍👩‍👧‍👦'
len(text)  # 7 — code points (wrong for "visible characters")

Using the `grapheme` Library

import grapheme

grapheme.length('👨‍👩‍👧‍👦')   # 1 — correct
grapheme.length('Hello 😀')  # 7 — correct

# Slice by grapheme clusters
grapheme.slice('Hello 👨‍👩‍👧‍👦', 0, 7)  # 'Hello 👨‍👩‍👧‍👦'

# Iterate graphemes
list(grapheme.graphemes('Hi 👋'))
# ['H', 'i', ' ', '👋']

Using the `regex` Module with `\X`

The \X pattern in the regex module matches grapheme clusters:

import regex

def count_graphemes(text):
    return len(regex.findall(r'\X', text))

count_graphemes('👨‍👩‍👧‍👦')   # 1
count_graphemes('Hello 👨‍👩‍👧‍👦')  # 7

Other Languages

Java

// Java uses UTF-16 internally like JavaScript
"👨‍👩‍👧‍👦".length()  // 11 (code units)

// Count code points
"👨‍👩‍👧‍👦".codePointCount(0, "👨‍👩‍👧‍👦".length())  // 7

// Count grapheme clusters (requires ICU4J)
import com.ibm.icu.text.BreakIterator;  // ICU's BreakIterator, not java.text
BreakIterator bi = BreakIterator.getCharacterInstance();
bi.setText("👨‍👩‍👧‍👦");
int count = 0;
while (bi.next() != BreakIterator.DONE) count++;
// count = 1

Swift

Swift's String is designed around grapheme clusters — count returns grapheme count by default:

"👨‍👩‍👧‍👦".count  // 1 — correct out of the box!

This is one area where Swift's string model is superior to most other languages.

Rust

// Rust len() returns bytes
"👨‍👩‍👧‍👦".len()  // 25 (bytes)

// chars() iterates code points
"👨‍👩‍👧‍👦".chars().count()  // 7

// For grapheme clusters, use the unicode-segmentation crate
use unicode_segmentation::UnicodeSegmentation;
"👨‍👩‍👧‍👦".graphemes(true).count()  // 1

Practical Implications

Input Validation / Character Limits

If you show users "max 280 characters" and count by .length in JavaScript, users can enter far fewer emoji than expected (or you silently truncate their input). Count by grapheme clusters for user-facing limits.
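
One way to implement such a limit on the client is sketched below. `checkLimit` is a hypothetical helper, and the limit of 10 is an arbitrary example value.

```javascript
// Reuse one segmenter for every keystroke-driven check.
const limitSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Report how much of a user-facing limit has been used, in grapheme clusters.
function checkLimit(text, maxGraphemes) {
  let used = 0;
  for (const _ of limitSegmenter.segment(text)) used++;
  return { used, remaining: maxGraphemes - used, ok: used <= maxGraphemes };
}

// Three family emoji: 33 code units, but only 3 visible characters.
const bio = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'.repeat(3);
console.log(bio.length);          // 33
console.log(checkLimit(bio, 10)); // { used: 3, remaining: 7, ok: true }
```

Whatever unit the client counts in, the server should apply the same rule, otherwise input that passed in the browser can still be rejected or truncated later.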

Database Column Sizing

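Size columns for code points, not graphemes. PostgreSQL and MySQL (with utf8mb4) both measure VARCHAR(n) in characters, meaning code points, so a VARCHAR(10) holds one family emoji (7 code points) plus only three more characters. In MySQL, byte counts still matter for index prefix and row size limits, where a single code point can take up to 4 bytes.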
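
Before writing a value, an application can measure it in the units the database cares about. This is a sketch with a made-up helper name (`columnUsage`); the limit of 10 mirrors the VARCHAR(10) example and is not read from any real schema.

```javascript
// Measure a value in code points and in UTF-8 bytes.
function columnUsage(value) {
  return {
    codePoints: [...value].length,                     // what VARCHAR(n) counts in PostgreSQL
    utf8Bytes: new TextEncoder().encode(value).length, // storage size in a UTF-8 / utf8mb4 column
  };
}

// "Hi " followed by one family emoji: 4 visible characters.
const value = 'Hi ' + '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
const usage = columnUsage(value);
console.log(usage);                  // { codePoints: 10, utf8Bytes: 28 }
console.log(usage.codePoints <= 10); // true, but the column is now full
```

Four visible characters are enough to fill a ten-code-point column, so limits shown to users and limits enforced by the schema need separate checks.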

Text Rendering

Layout engines that use pixel widths handle emoji correctly by nature — they measure bounding boxes, not code units. But manual text layout (canvas, PDF generation, terminal output) needs grapheme-aware iteration.
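
For manual layout, the minimum requirement is never to break inside a cluster. The sketch below chunks text into rows by grapheme count. It assumes every cluster occupies one cell, which is a simplification: terminals usually draw emoji two cells wide, so real code also needs a display-width lookup. `chunkByGraphemes` is a name chosen for this example.

```javascript
// Break text into rows of at most `width` grapheme clusters
// without ever splitting a cluster across rows.
function chunkByGraphemes(str, width) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  const rows = [];
  let row = [];
  for (const { segment } of segmenter.segment(str)) {
    row.push(segment);
    if (row.length === width) {
      rows.push(row.join(''));
      row = [];
    }
  }
  if (row.length > 0) rows.push(row.join(''));
  return rows;
}

// a, b, thumbs-up with skin tone, c, d, grinning face, e
console.log(chunkByGraphemes('ab\u{1F44D}\u{1F3FD}cd\u{1F600}e', 3));
// [ 'ab👍🏽', 'cd😀', 'e' ]
```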

Use our Sequence Analyzer to see exactly how many code points, code units, and grapheme clusters any emoji string contains — making these abstract concepts concrete.
