The Lie Your Runtime Tells You
'👨👩👧👦'.length // 11
That is not 11 characters. It is one family emoji. The .length property in JavaScript counts UTF-16 code units, not visible characters. For most ASCII text this distinction is invisible, but emoji expose the gap brutally.
Understanding why this happens, and how to fix it, requires knowing what a grapheme cluster is.
Three Levels of "Character"
Unicode defines multiple levels of text unit, each useful for different purposes:
| Level | Unit | JavaScript | Python |
|---|---|---|---|
| Byte | Raw storage unit | n/a | len(s.encode('utf-8')) |
| Code unit | Encoding unit | s.length | n/a |
| Code point | Unicode scalar value | [...s].length | len(s) |
| Grapheme cluster | User-perceived character | Intl.Segmenter | grapheme library |
A grapheme cluster is what a human thinks of as "one character" — one visible glyph on screen. For emoji, this is always at least one code point, but often several combined.
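The four levels in the table above can be measured side by side. The following is a minimal sketch, assuming a runtime that provides TextEncoder and Intl.Segmenter (for example Node.js 16+); the helper name measure is hypothetical, and the emoji is written with escapes so the ZWJ stays visible in the source.

```javascript
// Measure one string at all four levels: bytes, code units, code points, graphemes.
function measure(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return {
    utf8Bytes: new TextEncoder().encode(str).length, // raw storage size in UTF-8
    codeUnits: str.length,                           // UTF-16 code units
    codePoints: [...str].length,                     // string iteration yields code points
    graphemes: [...segmenter.segment(str)].length,   // user-perceived characters
  };
}

// Woman technologist: U+1F469 + U+200D (ZWJ) + U+1F4BB
const womanTechnologist = '\u{1F469}\u200D\u{1F4BB}';
console.log(measure(womanTechnologist));
// { utf8Bytes: 11, codeUnits: 5, codePoints: 3, graphemes: 1 }
```

The same string gives four different answers, which is why "length" always needs a unit attached.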
What Makes an Emoji "Long"?
Basic Emoji: 1 Code Point, 2 Code Units
😀 = U+1F600
UTF-16: 0xD83D 0xDE00 (surrogate pair)
.length in JS: 2
Python len(): 1
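The two code units come from simple arithmetic on the code point. As a sketch of the standard UTF-16 encoding rule (the function name toSurrogatePair is hypothetical), subtract 0x10000 and split the remaining 20 bits into two 10-bit halves:

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('only U+10000..U+10FFFF are encoded as a surrogate pair');
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits go into the high surrogate
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits go into the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === '\u{1F600}'); // true
```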
Emoji + Variation Selector: 2 Code Points
❤️ = U+2764 + U+FE0F
.length in JS: 2
Python len(): 2
Grapheme clusters: 1
Emoji + Skin Tone: 2 Code Points
👍🏽 = U+1F44D + U+1F3FD
.length in JS: 4 (two surrogate pairs)
Python len(): 2
Grapheme clusters: 1
ZWJ Sequence: 3+ Code Points
👩💻 = U+1F469 + U+200D + U+1F4BB
.length in JS: 5
Python len(): 3
Grapheme clusters: 1
Family Emoji: 7 Code Points
👨👩👧👦 = 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦
.length in JS: 11
Python len(): 7
Grapheme clusters: 1
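To see what a sequence is made of, it can help to list its code points explicitly. This is a small sketch (codePointsOf is a hypothetical helper name), with the family emoji written as escapes so the invisible joiners are not lost:

```javascript
// List the code points inside a string as U+XXXX labels.
function codePointsOf(str) {
  return [...str].map(
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Man + ZWJ + woman + ZWJ + girl + ZWJ + boy
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(codePointsOf(family));
// [ 'U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467', 'U+200D', 'U+1F466' ]
console.log(family.length); // 11 (four surrogate pairs plus three ZWJs)
```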
Correct Counting in JavaScript
Using Intl.Segmenter (Modern, Built-In)
Intl.Segmenter is available in all modern browsers and Node.js 16+. It segments text by grapheme clusters:
function countGraphemes(str) {
  // The default granularity of Intl.Segmenter is 'grapheme'
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)].length;
}
countGraphemes('Hello') // 5 — correct
countGraphemes('Hello 😀') // 7 — correct
countGraphemes('👨👩👧👦') // 1 — correct!
countGraphemes('👩🏽💻') // 1 — correct!
Slicing Text Correctly
.slice() indexes by UTF-16 code unit, so on emoji strings it can cut a surrogate pair in half:
const s = 'Hello 👋';
// WRONG: may cut through a surrogate pair
s.slice(0, 7) // 'Hello \uD83D' — broken!
// CORRECT: spread to code points first
[...s].slice(0, 7).join('') // 'Hello 👋' — correct
// EVEN BETTER: slice by grapheme clusters
function sliceByGrapheme(str, start, end) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  return segments.slice(start, end).map(s => s.segment).join('');
}
sliceByGrapheme('Hello 👋 World', 0, 7) // 'Hello 👋'
String Reversal
Reversing an emoji string with .split('').reverse().join('') produces garbage:
// WRONG: reverses code units, breaks surrogates
'Hello 😀'.split('').reverse().join('') // '\uDE00\uD83D olleH' (surrogates swapped, renders as garbage)
// CORRECT: reverse by code points
[...'Hello 😀'].reverse().join('') // '😀 olleH'
// BEST: reverse by grapheme clusters
function reverseGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)]
    .map(s => s.segment)
    .reverse()
    .join('');
}
reverseGraphemes('Hello 👨👩👧👦') // '👨👩👧👦 olleH'
Correct Counting in Python
Python 3's len() counts code points, not grapheme clusters. For most ASCII text this is identical to grapheme count, but for emoji it is not.
text = '👨👩👧👦'
len(text) # 7 — code points (wrong for "visible characters")
Using the grapheme Library
import grapheme
grapheme.length('👨👩👧👦') # 1 — correct
grapheme.length('Hello 😀') # 7 — correct
# Slice by grapheme clusters
grapheme.slice('Hello 👨👩👧👦', 0, 7) # 'Hello 👨👩👧👦'
# Iterate graphemes
list(grapheme.graphemes('Hi 👋'))
# ['H', 'i', ' ', '👋']
Using the regex Module with \X
The \X pattern in the regex module matches grapheme clusters:
import regex
def count_graphemes(text):
    return len(regex.findall(r'\X', text))
count_graphemes('👨👩👧👦') # 1
count_graphemes('Hello 👨👩👧👦') # 7
Other Languages
Java
// Java uses UTF-16 internally like JavaScript
"👨👩👧👦".length() // 11 (code units)
// Count code points
"👨👩👧👦".codePointCount(0, "👨👩👧👦".length()) // 7
// Count grapheme clusters (requires ICU: com.ibm.icu.text.BreakIterator)
BreakIterator bi = BreakIterator.getCharacterInstance();
bi.setText("👨👩👧👦");
int count = 0;
while (bi.next() != BreakIterator.DONE) count++;
// count = 1
Swift
Swift's String is designed around grapheme clusters — count returns grapheme count by default:
"👨👩👧👦".count // 1 — correct out of the box!
This is one area where Swift's string model is superior to most other languages.
Rust
// Rust len() returns bytes
"👨👩👧👦".len() // 25 (bytes)
// chars() iterates code points
"👨👩👧👦".chars().count() // 7
// For grapheme clusters, use the unicode-segmentation crate
use unicode_segmentation::UnicodeSegmentation;
"👨👩👧👦".graphemes(true).count() // 1
Practical Implications
Input Validation / Character Limits
If you show users "max 280 characters" and count by .length in JavaScript, users can enter far fewer emoji than expected (or you silently truncate their input). Count by grapheme clusters for user-facing limits.
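One way to enforce a user-facing limit is to cut at a grapheme boundary instead of a code unit index. The following is a sketch only, assuming Intl.Segmenter is available; truncateGraphemes and graphemeSegmenter are hypothetical names, not part of any library.

```javascript
// Reuse one segmenter; constructing it is the expensive part.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Keep at most `max` grapheme clusters, never splitting one.
function truncateGraphemes(str, max) {
  let count = 0;
  for (const { index } of graphemeSegmenter.segment(str)) {
    // `index` is the code unit offset where this cluster starts,
    // so slicing there always lands on a cluster boundary.
    if (count === max) return str.slice(0, index);
    count++;
  }
  return str; // already within the limit
}

// Waving hand + medium skin tone counts as a single grapheme cluster
const greeting = 'Hi \u{1F44B}\u{1F3FD}!';
console.log(truncateGraphemes(greeting, 4) === 'Hi \u{1F44B}\u{1F3FD}'); // true
console.log(greeting.slice(0, 4) === 'Hi \uD83D');                       // true: lone surrogate
```

Iterating the segments directly avoids building an array for long inputs, which matters if the check runs on every keystroke.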
Database Column Sizing
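Ensure column size limits account for multi-code-point emoji. Databases do not count graphemes: in PostgreSQL, and in MySQL with utf8mb4, VARCHAR(10) allows 10 code points, so a single family emoji uses 7 of them. Byte-based limits, such as MySQL index key lengths, additionally need up to 4 bytes per code point.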
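Because the unit that matters depends on the database and on which limit applies, an application-side check can measure both code points and UTF-8 bytes before writing. This is a hedged sketch: columnUsage, fitsColumn, and the option names maxCodePoints and maxBytes are all hypothetical, and the real limits must come from your own schema.

```javascript
// Report how much of a column a string would consume in each unit.
function columnUsage(str) {
  return {
    codePoints: [...str].length,
    utf8Bytes: new TextEncoder().encode(str).length,
  };
}

// Check a string against whichever limits apply; omitted limits are not enforced.
function fitsColumn(str, { maxCodePoints = Infinity, maxBytes = Infinity } = {}) {
  const { codePoints, utf8Bytes } = columnUsage(str);
  return codePoints <= maxCodePoints && utf8Bytes <= maxBytes;
}

const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(columnUsage(family));                             // { codePoints: 7, utf8Bytes: 25 }
console.log(fitsColumn(family, { maxCodePoints: 10 }));       // true
console.log(fitsColumn(family.repeat(2), { maxCodePoints: 10 })); // false: 14 code points
```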
Text Rendering
Layout engines that use pixel widths handle emoji correctly by nature — they measure bounding boxes, not code units. But manual text layout (canvas, PDF generation, terminal output) needs grapheme-aware iteration.
Use our Sequence Analyzer to see exactly how many code points, code units, and grapheme clusters any emoji string contains — making these abstract concepts concrete.