Emoji String Length Gotchas: Surrogate Pairs, Grapheme Clusters, and Byte Counts

Ask any language how long the string "hello 👋" is, and you might get 7, 8, or 10: 7 code points (or grapheme clusters), 8 UTF-16 code units, or 10 UTF-8 bytes. It depends on what the language is counting: bytes, code units, code points, or grapheme clusters. With emoji, all four can differ. Getting this wrong causes truncated strings, database overflow errors, and off-by-one bugs in text rendering.

The Four Levels of "Length"

Level	Unit	Example: "👋"	Example: "👋🏽"
Bytes (UTF-8)	8-bit bytes	4	8
Code units (UTF-16)	16-bit units	2	4
Code points	Unicode scalar values	1	2
Grapheme clusters	User-perceived characters	1	1

The grapheme cluster is almost always what users and product requirements mean by "one character."
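
As a rough check of the table, the first three levels can be measured with nothing but the Python standard library. This is a minimal sketch: the helper name length_report is made up for illustration, and grapheme clusters are left out because counting them needs a segmentation library.

```python
def length_report(text: str) -> dict[str, int]:
    """Measure `text` at three of the four levels (no grapheme clusters)."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        # Every UTF-16 code unit is 2 bytes, so halve the encoded byte count.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Python's len() counts code points.
        "code_points": len(text),
    }


wave = "\U0001F44B"                   # 👋
wave_medium = "\U0001F44B\U0001F3FD"  # 👋🏽 (waving hand + skin tone modifier)

print(length_report(wave))
print(length_report(wave_medium))
```

The two reports should match the last two columns of the table: 4, 2, 1 for the plain wave and 8, 4, 2 for the skin-toned one.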

JavaScript: The UTF-16 Trap

JavaScript strings are sequences of UTF-16 code units. Characters above U+FFFF (like most modern emoji) are encoded as surrogate pairs — two code units for one code point.

const wave = "👋"; // U+1F44B

console.log(wave.length);          // 2 (UTF-16 code units, NOT 1!)
console.log([...wave].length);     // 1 (code points via spread)
console.log(wave.codePointAt(0)); // 128075 (0x1F44B — correct)
console.log(wave.charCodeAt(0));  // 55357 (0xD83D — high surrogate, wrong)

// ZWJ sequence: woman technologist 👩‍💻
const coder = "👩‍💻"; // U+1F469 + U+200D + U+1F4BB

console.log(coder.length);         // 5 (two surrogates + ZWJ + two surrogates)
console.log([...coder].length);    // 3 (three code points)
// Grapheme cluster count = 1
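
The surrogate values above are not arbitrary; they follow from the UTF-16 encoding rule. The sketch below reproduces that arithmetic in Python so the numbers can be checked independently of a JavaScript engine. The function name to_surrogate_pair is hypothetical, chosen only for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F44B)  # 👋
print(hex(high), hex(low))  # 0xd83d 0xdc4b
```

The high surrogate 0xD83D is the 55357 that charCodeAt(0) returns for the waving hand.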

Counting Grapheme Clusters in JavaScript

Use the Intl.Segmenter API (part of the ECMAScript Internationalization API; available in Node.js 16+ and current browsers):

function countGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)].length;
}

const examples = [
  "hello",              // 5
  "hello 👋",          // 7
  "👨‍👩‍👧‍👦",               // 1 (family ZWJ sequence)
  "🏳️‍🌈",             // 1 (rainbow flag)
  "👋🏽",              // 1 (waving hand + skin tone)
  "1️⃣",              // 1 (keycap sequence)
];

for (const s of examples) {
  console.log(JSON.stringify(s), "→", countGraphemes(s));
}

For older environments, use the grapheme-splitter or @unicode-segmenter/core package:

import GraphemeSplitter from 'grapheme-splitter';

const splitter = new GraphemeSplitter();
console.log(splitter.countGraphemes("👨‍💻")); // 1
console.log(splitter.splitGraphemes("hello 👋")); // ['h','e','l','l','o',' ','👋']

Safe String Truncation in JavaScript

function truncate(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  if (segments.length <= maxGraphemes) return str;
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}

// Truncating naively by .length breaks surrogate pairs:
const text = "hi 👋 there";
console.log(text.slice(0, 4));    // "hi \ud83d" (lone high surrogate: the emoji is cut in half)
console.log(truncate(text, 4));   // "hi 👋" — grapheme-safe

Python: Code Points vs Grapheme Clusters

Python 3 strings are sequences of Unicode code points (not UTF-16 code units). This is better than JavaScript for basic emoji, but still wrong for multi-code-point sequences.

wave = "👋"
print(len(wave))   # 1 — correct! Python counts code points

coder = "👩‍💻"   # U+1F469 + U+200D + U+1F4BB
print(len(coder))  # 3 — three code points, but one visual character

family = "👨‍👩‍👧‍👦"  # 4 code points + 3 ZWJs = 7 code points
print(len(family)) # 7

flag = "🏳️‍🌈"    # white flag + VS16 + ZWJ + rainbow
print(len(flag))   # 4
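
To see why len() reports more than one, it can help to list the code points inside a sequence. The sketch below uses the standard unicodedata module; the helper name describe is invented for this example, and the names printed depend on the Unicode database bundled with your Python version.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' line per code point in `text`."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


flag = "\U0001F3F3\uFE0F\u200D\U0001F308"  # 🏳️‍🌈
for line in describe(flag):
    print(line)
```

For the rainbow flag this lists the white flag, the variation selector, the zero width joiner, and the rainbow: four code points rendered as one visual character.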

Grapheme Cluster Counting in Python

Use the third-party grapheme package, or the third-party regex module with \X (grapheme cluster match). The built-in re module does not support \X.

# Option 1: grapheme package
import grapheme

texts = ["hello 👋", "👨‍💻", "👨‍👩‍👧‍👦", "🏳️‍🌈", "👋🏽"]
for t in texts:
    print(repr(t), "→ graphemes:", grapheme.length(t))

# Option 2: regex \X — matches one grapheme cluster
import regex

def count_graphemes(s: str) -> int:
    return len(regex.findall(r'\X', s))

def split_graphemes(s: str) -> list[str]:
    return regex.findall(r'\X', s)

print(count_graphemes("👨‍👩‍👧‍👦"))  # 1
print(count_graphemes("hello 👋"))   # 7

# Safe truncation
def truncate(s: str, max_graphemes: int) -> str:
    clusters = regex.findall(r'\X', s)
    return ''.join(clusters[:max_graphemes])

print(truncate("Hello 👋🌍🚀 World", 8))  # "Hello 👋🌍"

Byte Length for Database Storage

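MySQL's utf8mb4 stores each code point in 1 to 4 bytes, and PostgreSQL with a UTF8 database encoding does the same. Most emoji code points lie above U+FFFF and take 4 bytes; older BMP symbols such as ☺ (U+263A) take 3. A multi-code-point sequence costs the sum of its parts.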

text = "Hello 👋"
print(len(text))                          # 7 (code points)
print(len(text.encode('utf-8')))          # 10 (bytes: 5 ASCII + 1 space + 4 emoji)
print(len(text.encode('utf-16-le')) // 2) # 8 (UTF-16 code units)

# Check whether text fits a byte budget. Note: in MySQL utf8mb4, VARCHAR(255)
# itself limits characters (code points), not bytes; use this for byte-based limits.
def fits_in_varchar(text: str, max_bytes: int = 255) -> bool:
    return len(text.encode('utf-8')) <= max_bytes
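
When a limit really is expressed in bytes, rejecting the input is not the only option; the text can also be cut to fit. The following is a minimal sketch, assuming a hypothetical helper named truncate_utf8. It never splits a code point, but it can still split a grapheme cluster (for example, leaving half of a ZWJ sequence), so combine it with grapheme-aware truncation where that matters.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut `text` so its UTF-8 encoding is at most `max_bytes` long."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end in the middle of a multi-byte sequence;
    # errors="ignore" drops that incomplete tail instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


text = "Hello \U0001F44B"  # "Hello 👋", 10 bytes in UTF-8
print(repr(truncate_utf8(text, 8)))   # 'Hello ' (the 4-byte emoji does not fit)
print(repr(truncate_utf8(text, 10)))  # the full string
```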

Go: Runes vs Bytes vs Grapheme Clusters

Go strings are byte slices. The built-in len() returns byte count. Use utf8.RuneCountInString for code points.

package main

import (
    "fmt"
    "unicode/utf8"
    "github.com/rivo/uniseg"
)

func main() {
    wave := "👋"
    coder := "👩‍💻"
    family := "👨‍👩‍👧‍👦"

    // Byte count
    fmt.Println(len(wave))   // 4
    fmt.Println(len(coder))  // 11 (two 4-byte emoji + one 3-byte ZWJ)
    fmt.Println(len(family)) // 25

    // Rune (code point) count
    fmt.Println(utf8.RuneCountInString(wave))   // 1
    fmt.Println(utf8.RuneCountInString(coder))  // 3
    fmt.Println(utf8.RuneCountInString(family)) // 7

    // Grapheme cluster count — use uniseg
    fmt.Println(uniseg.GraphemeClusterCount(wave))   // 1
    fmt.Println(uniseg.GraphemeClusterCount(coder))  // 1
    fmt.Println(uniseg.GraphemeClusterCount(family)) // 1
}

Safe String Truncation in Go

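package main

import (
    "fmt"
    "strings"

    "github.com/rivo/uniseg"
)

// TruncateGraphemes returns at most maxClusters grapheme clusters from s.
func TruncateGraphemes(s string, maxClusters int) string {
    var b strings.Builder // avoids reallocating on every += in the loop
    gr := uniseg.NewGraphemes(s)
    for count := 0; count < maxClusters && gr.Next(); count++ {
        b.WriteString(gr.Str()) // append one whole cluster
    }
    return b.String()
}

func main() {
    fmt.Println(TruncateGraphemes("Hello 👋🌍🚀 World", 8)) // "Hello 👋🌍"
}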

Database Column Sizing

When designing VARCHAR columns that accept emoji:

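DB	Charset	Bytes per code point	VARCHAR(100) holds
MySQL	utf8 (utf8mb3)	3 max	100 characters, but no 4-byte emoji
MySQL	utf8mb4	4 max	100 code points (up to 400 bytes)
PostgreSQL	UTF8	4 max	100 code points (up to 400 bytes)
SQLite	UTF-8	4 max	No enforced limit (declared length is ignored)

In MySQL and PostgreSQL the VARCHAR limit counts code points, not bytes or grapheme clusters, so a single family emoji 👨‍👩‍👧‍👦 uses 7 of the 100.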

-- MySQL: ensure utf8mb4 for emoji support
ALTER TABLE posts MODIFY content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check actual byte length in MySQL
SELECT CHAR_LENGTH(content), LENGTH(content), content
FROM posts
WHERE content LIKE '%🚀%';
-- CHAR_LENGTH = character (code point) count, LENGTH = byte count

The Validation Pattern

For user-facing input validation (e.g., a username limited to 20 characters), always validate by grapheme cluster count:

function validateUsername(name) {
  const segmenter = new Intl.Segmenter();
  const graphemeCount = [...segmenter.segment(name)].length;

  if (graphemeCount < 3) return "Too short (minimum 3 characters)";
  if (graphemeCount > 20) return "Too long (maximum 20 characters)";
  return null; // valid
}

console.log(validateUsername("Al 👋"));    // null (4 graphemes, valid)
console.log(validateUsername("👨‍👩‍👧‍👦AlexJohnsonSmith")); // null (17 graphemes, valid)
