Emoji String Length Gotchas
Ask any language how long the string "hello 👋" is, and you might get 7, 8, or 10 (7 code points or grapheme clusters, 8 UTF-16 code units, 10 UTF-8 bytes). It depends on what the language is counting: bytes, code units, code points, or grapheme clusters. With emoji, all four can differ. Getting this wrong causes truncated strings, database overflow errors, and off-by-one bugs in text rendering.
The Four Levels of "Length"
| Level | Unit | Example: "👋" | Example: "👋🏽" |
|---|---|---|---|
| Bytes (UTF-8) | 8-bit bytes | 4 | 8 |
| Code units (UTF-16) | 16-bit units | 2 | 4 |
| Code points | Unicode scalar values | 1 | 2 |
| Grapheme clusters | User-perceived characters | 1 | 1 |
The grapheme cluster is almost always what users and product requirements mean by "one character."
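As a rough illustration, the first three rows of the table can be reproduced with nothing but the Python standard library. The helper name `length_levels` is made up for this sketch, and grapheme clusters are left out because counting them needs a segmentation library.

```python
def length_levels(s: str) -> dict[str, int]:
    """Return the length of s at three of the four levels."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; an explicit byte order avoids a BOM.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
    }


wave = "\U0001F44B"                    # waving hand
wave_medium = "\U0001F44B\U0001F3FD"   # waving hand + medium skin tone

print(length_levels(wave))
# {'utf8_bytes': 4, 'utf16_code_units': 2, 'code_points': 1}
print(length_levels(wave_medium))
# {'utf8_bytes': 8, 'utf16_code_units': 4, 'code_points': 2}
print(length_levels("hello " + wave))
# {'utf8_bytes': 10, 'utf16_code_units': 8, 'code_points': 7}
```

The strings are written as escapes so that the counts do not depend on how an editor or copy-paste step handles the emoji.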
JavaScript: The UTF-16 Trap
JavaScript strings are sequences of UTF-16 code units. Characters above U+FFFF (like most modern emoji) are encoded as surrogate pairs: two code units for one code point.
const wave = "👋"; // U+1F44B
console.log(wave.length); // 2 (UTF-16 code units, NOT 1!)
console.log([...wave].length); // 1 (code points via spread)
console.log(wave.codePointAt(0)); // 128075 (0x1F44B — correct)
console.log(wave.charCodeAt(0)); // 55357 (0xD83D — high surrogate, wrong)
// ZWJ sequence: woman technologist 👩‍💻
const coder = "\u{1F469}\u200D\u{1F4BB}"; // U+1F469 + U+200D + U+1F4BB
console.log(coder.length); // 5 (two surrogates + ZWJ + two surrogates)
console.log([...coder].length); // 3 (three code points)
// Grapheme cluster count = 1
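To see where the surrogate values come from, here is a minimal Python sketch of the UTF-16 arithmetic. The function name `to_surrogate_pair` is an assumption made for illustration. It subtracts 0x10000 from the code point and splits the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F44B)  # waving hand
print(hex(high), hex(low))              # 0xd83d 0xdc4b
print(high)                             # 55357
```

The high surrogate, 55357, is the same number that `charCodeAt(0)` reports for the waving hand.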
Counting Grapheme Clusters in JavaScript
Use the Intl.Segmenter API (part of ECMA-402, the ECMAScript Internationalization API; available in Node.js 16+ and current browsers):
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)].length;
}
const examples = [
  "hello", // 5
  "hello 👋", // 7
  "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}", // 1 (👨‍👩‍👧‍👦 family ZWJ sequence)
  "\u{1F3F3}\uFE0F\u200D\u{1F308}", // 1 (🏳️‍🌈 rainbow flag)
  "👋🏽", // 1 (waving hand + skin tone)
  "1️⃣", // 1 (keycap sequence)
];
for (const s of examples) {
  console.log(JSON.stringify(s), "→", countGraphemes(s));
}
For older environments, use a library such as grapheme-splitter or unicode-segmenter:
import GraphemeSplitter from 'grapheme-splitter';
const splitter = new GraphemeSplitter();
console.log(splitter.countGraphemes("\u{1F468}\u200D\u{1F4BB}")); // 1 (👨‍💻 man technologist)
console.log(splitter.splitGraphemes("hello 👋")); // ['h','e','l','l','o',' ','👋']
Safe String Truncation in JavaScript
function truncate(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  if (segments.length <= maxGraphemes) return str;
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}
// Truncating naively by .length breaks surrogate pairs:
const text = "hi 👋 there";
console.log(text.slice(0, 4)); // "hi \uD83D": cut mid-pair, leaving a lone high surrogate
console.log(truncate(text, 4)); // "hi 👋" (grapheme-safe)
Python: Code Points vs Grapheme Clusters
Python 3 strings are sequences of Unicode code points (not UTF-16 code units). This is better than JavaScript for basic emoji, but still wrong for multi-code-point sequences.
wave = "👋"
print(len(wave)) # 1 — correct! Python counts code points
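coder = "\U0001F469\u200D\U0001F4BB"  # 👩‍💻: U+1F469 + U+200D + U+1F4BB
print(len(coder)) # 3 (three code points, but one visual character)
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # 👨‍👩‍👧‍👦: 4 people + 3 ZWJs = 7 code points
print(len(family)) # 7
flag = "\U0001F3F3\uFE0F\u200D\U0001F308"  # 🏳️‍🌈: white flag + VS16 + ZWJ + rainbow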
print(len(flag)) # 4
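When a count looks surprising, it can help to list the individual code points. The sketch below uses the standard-library `unicodedata` module; the helper name `describe` is an assumption made for illustration.

```python
import unicodedata


def describe(s: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, Unicode name) for every code point in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in s]


coder = "\U0001F469\u200D\U0001F4BB"  # woman technologist
for code, name in describe(coder):
    print(code, name)
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F4BB PERSONAL COMPUTER
```

Seeing the ZERO WIDTH JOINER listed explicitly makes it clear why `len()` reports 3 for what renders as one emoji.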
Grapheme Cluster Counting in Python
Use the grapheme package or regex with \X (grapheme cluster match):
# Option 1: grapheme package
import grapheme

texts = [
    "hello 👋",
    "\U0001F468\u200D\U0001F4BB",  # 👨‍💻
    "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",  # 👨‍👩‍👧‍👦
    "\U0001F3F3\uFE0F\u200D\U0001F308",  # 🏳️‍🌈
    "👋🏽",
]
for t in texts:
    print(repr(t), "→ graphemes:", grapheme.length(t))

# Option 2: regex \X matches one grapheme cluster
import regex

def count_graphemes(s: str) -> int:
    return len(regex.findall(r'\X', s))

def split_graphemes(s: str) -> list[str]:
    return regex.findall(r'\X', s)

family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # 👨‍👩‍👧‍👦
print(count_graphemes(family)) # 1
print(count_graphemes("hello 👋")) # 7

# Safe truncation
def truncate(s: str, max_graphemes: int) -> str:
    clusters = regex.findall(r'\X', s)
    return ''.join(clusters[:max_graphemes])

print(truncate("Hello 👋🌍🚀 World", 8)) # "Hello 👋🌍"
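Where no third-party package can be installed, a rough approximation is possible with the standard library alone. The sketch below is not a full implementation of the Unicode segmentation rules (UAX #29). It is a simplified rule set, with the made-up names `_attaches` and `approx_graphemes`, that covers the emoji cases discussed here: ZWJ sequences, variation selectors, skin tone modifiers, keycaps, combining marks, and flag pairs. It ignores Hangul jamo, CRLF, prepended marks, and Indic conjuncts, and it keeps any character that follows a ZWJ in the same cluster.

```python
import unicodedata

ZWJ = "\u200d"


def _attaches(ch: str) -> bool:
    """True if ch extends the previous cluster (simplified rule set)."""
    cp = ord(ch)
    return (
        ch == ZWJ
        # Combining marks, including VS-16 (U+FE0F) and the keycap U+20E3.
        or unicodedata.category(ch) in ("Mn", "Me", "Mc")
        or 0x1F3FB <= cp <= 0x1F3FF   # skin tone modifiers
        or 0xE0020 <= cp <= 0xE007F   # tag characters (subdivision flags)
    )


def approx_graphemes(s: str) -> list[str]:
    """Approximate grapheme clusters; NOT a full UAX #29 implementation."""
    clusters: list[str] = []
    ri_run = 0  # length of the current run of regional indicators
    for ch in s:
        is_ri = 0x1F1E6 <= ord(ch) <= 0x1F1FF
        join = bool(clusters) and (
            _attaches(ch)
            or clusters[-1].endswith(ZWJ)   # whatever follows a ZWJ stays attached
            or (is_ri and ri_run % 2 == 1)  # flags are pairs of regional indicators
        )
        if join:
            clusters[-1] += ch
        else:
            clusters.append(ch)
        ri_run = ri_run + 1 if is_ri else 0
    return clusters


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(approx_graphemes(family)))               # 1
print(len(approx_graphemes("hello \U0001F44B")))   # 7
```

For production input, a maintained library that tracks the current Unicode version remains the safer choice; this sketch only shows why the rules are more involved than counting code points.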
Byte Length for Database Storage
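MySQL's utf8mb4 and PostgreSQL's UTF8 encoding both store each code point as its UTF-8 byte sequence: 4 bytes for emoji above U+FFFF such as 👋, and 3 bytes for older symbols such as ☀ (U+2600). Multi-code-point emoji cost the sum of their parts, so 👋🏽 takes 8 bytes.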
text = "Hello 👋"
print(len(text)) # 7 (code points)
print(len(text.encode('utf-8'))) # 10 (bytes: 5 ASCII + 1 space + 4 emoji)
print(len(text.encode('utf-16-le')) // 2) # 8 (UTF-16 code units)
# Check whether text fits a byte budget (e.g., a byte-limited field or index key).
# Note: MySQL and PostgreSQL VARCHAR(n) limits count characters, not bytes.
def fits_in_bytes(text: str, max_bytes: int = 255) -> bool:
    return len(text.encode('utf-8')) <= max_bytes
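When a byte budget has to be enforced rather than just checked, the cut must not land inside a multi-byte sequence. The following is a minimal sketch, with the hypothetical name `truncate_utf8`, that slices the encoded bytes and lets the decoder discard the incomplete tail. It is safe at the code point level only: it can still split a grapheme cluster, for example by leaving a dangling ZWJ.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a code point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops the incomplete multi-byte sequence left at the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(repr(truncate_utf8("Hello \U0001F44B", 10)))  # 'Hello 👋' (fits exactly)
print(repr(truncate_utf8("Hello \U0001F44B", 8)))   # 'Hello ' (partial emoji dropped)
```

If grapheme-level safety is needed as well, one option is to segment first and then add whole clusters until the byte budget is reached.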
Go: Runes vs Bytes vs Grapheme Clusters
Go strings are byte slices. The built-in len() returns byte count. Use utf8.RuneCountInString for code points.
package main

import (
	"fmt"
	"unicode/utf8"

	"github.com/rivo/uniseg"
)

func main() {
	wave := "👋"
	coder := "\U0001F469\u200D\U0001F4BB" // 👩‍💻
	family := "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466" // 👨‍👩‍👧‍👦

	// Byte count
	fmt.Println(len(wave))   // 4
	fmt.Println(len(coder))  // 11 (4 + 3 + 4 bytes: woman, ZWJ, laptop)
	fmt.Println(len(family)) // 25 (4 people × 4 bytes + 3 ZWJs × 3 bytes)

	// Rune (code point) count
	fmt.Println(utf8.RuneCountInString(wave))   // 1
	fmt.Println(utf8.RuneCountInString(coder))  // 3
	fmt.Println(utf8.RuneCountInString(family)) // 7

	// Grapheme cluster count: use uniseg
	fmt.Println(uniseg.GraphemeClusterCount(wave))   // 1
	fmt.Println(uniseg.GraphemeClusterCount(coder))  // 1
	fmt.Println(uniseg.GraphemeClusterCount(family)) // 1
}
Safe String Truncation in Go
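package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

// TruncateGraphemes returns at most maxClusters grapheme clusters of s.
func TruncateGraphemes(s string, maxClusters int) string {
	gr := uniseg.NewGraphemes(s)
	end := 0 // byte offset just past the last cluster kept
	for count := 0; count < maxClusters && gr.Next(); count++ {
		_, end = gr.Positions()
	}
	// Slicing the original string avoids repeated string concatenation.
	return s[:end]
}

func main() {
	fmt.Println(TruncateGraphemes("Hello 👋🌍🚀 World", 8)) // "Hello 👋🌍"
}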
Database Column Sizing
When designing VARCHAR columns that accept emoji:
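| DB | Charset | Bytes per emoji code point | VARCHAR(100) holds |
|---|---|---|---|
| MySQL | utf8 (utf8mb3) | 3 max | Cannot store 4-byte emoji |
| MySQL | utf8mb4 | 4 | 100 characters (up to 400 bytes) |
| PostgreSQL | UTF8 | 4 | 100 characters (up to 400 bytes) |
| SQLite | UTF-8 | 4 | No enforced limit (declared length is ignored) |
MySQL and PostgreSQL both measure VARCHAR(n) in characters (code points), not bytes, so a family ZWJ sequence uses 7 of the 100.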
-- MySQL: ensure utf8mb4 for emoji support
ALTER TABLE posts MODIFY content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Check actual byte length in MySQL
SELECT CHAR_LENGTH(content), LENGTH(content), content
FROM posts
WHERE content LIKE '%🚀%';
-- CHAR_LENGTH = character (code point) count, LENGTH = byte count
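Before writing to a legacy MySQL `utf8` column, an application can check whether the text contains anything that column cannot hold. The sketch below, with the hypothetical helper name `needs_utf8mb4`, flags any code point above U+FFFF, since those are the ones that need 4 bytes in UTF-8.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if text has a code point above U+FFFF (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("caf\u00e9"))    # False: é is 2 bytes
print(needs_utf8mb4("\u2600"))       # False: U+2600 (sun) is 3 bytes
print(needs_utf8mb4("\U0001F680"))   # True: U+1F680 (rocket) is 4 bytes
```

This is only a pre-flight check; migrating the column to utf8mb4 is the actual fix.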
The Validation Pattern
For user-facing input validation (e.g., a username limited to 20 characters), always validate by grapheme cluster count:
function validateUsername(name) {
  const segmenter = new Intl.Segmenter();
  const graphemeCount = [...segmenter.segment(name)].length;
  if (graphemeCount < 3) return "Too short (minimum 3 characters)";
  if (graphemeCount > 20) return "Too long (maximum 20 characters)";
  return null; // valid
}
console.log(validateUsername("Al 👋")); // null (4 graphemes, valid)
console.log(validateUsername("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}AlexJohnsonSmith")); // null (👨‍👩‍👧‍👦 + 16 letters = 17 graphemes, valid)