Emoji String Length Gotchas: Surrogate Pairs, Grapheme Clusters, and Byte Counts


Ask any language how long the string "hello 👋" is, and you might get 7 (code points or grapheme clusters), 8 (UTF-16 code units), or 10 (UTF-8 bytes). It depends on what the language is counting: bytes, code units, code points, or grapheme clusters. With emoji, all four can be different. Getting this wrong causes truncated strings, database overflow errors, and off-by-one bugs in text rendering.

The Four Levels of "Length"

| Level | Unit | Example: "👋" | Example: "👋🏽" |
| --- | --- | --- | --- |
| Bytes (UTF-8) | 8-bit bytes | 4 | 8 |
| Code units (UTF-16) | 16-bit units | 2 | 4 |
| Code points | Unicode scalar values | 1 | 2 |
| Grapheme clusters | User-perceived characters | 1 | 1 |

The grapheme cluster is almost always what users and product requirements mean by "one character."
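As a rough illustration, the first three rows of the table can be reproduced with the Python standard library alone. The helper name `length_report` is made up for this sketch, and grapheme clusters are left out because counting them needs a third-party library.

```python
def length_report(s: str) -> dict[str, int]:
    """Return the length of s at three levels (hypothetical helper)."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; "-le" keeps a BOM from being prepended.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
    }


print(length_report("\U0001F44B"))            # waving hand
print(length_report("\U0001F44B\U0001F3FD"))  # waving hand + medium skin tone
```

The escapes stand in for the literal emoji so the sketch survives copy and paste through tools that mangle non-ASCII text.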

JavaScript: The UTF-16 Trap

JavaScript strings are sequences of UTF-16 code units. Characters above U+FFFF (like most modern emoji) are encoded as surrogate pairs: two code units for one code point.

const wave = "👋"; // U+1F44B

console.log(wave.length);          // 2 (UTF-16 code units, NOT 1!)
console.log([...wave].length);     // 1 (code points via spread)
console.log(wave.codePointAt(0)); // 128075 (0x1F44B, the full code point)
console.log(wave.charCodeAt(0));  // 55357 (0xD83D, only the high surrogate)

// ZWJ sequence: woman technologist 👩‍💻
const coder = "👩‍💻"; // U+1F469 + U+200D + U+1F4BB
console.log(coder.length);      // 5 (two surrogates + ZWJ + two surrogates)
console.log([...coder].length); // 3 (three code points)
// Grapheme cluster count = 1
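The surrogate values in the example follow from fixed arithmetic defined by UTF-16. A minimal Python sketch of that split, where the function name `to_surrogate_pair` is an assumption of this example rather than a library API:

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F44B)  # waving hand
print(hex(high), hex(low))  # 0xd83d 0xdc4b
```

The high surrogate 0xD83D is the 55357 that `charCodeAt(0)` reports for the waving hand.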

Counting Grapheme Clusters in JavaScript

Use the Intl.Segmenter API (standardized in ECMA-402, the ECMAScript Internationalization API; available in Node.js 16+):

function countGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)].length;
}

const examples = [
  "hello",    // 5
  "hello 👋", // 7
  "👨‍👩‍👧‍👦",       // 1 (family ZWJ sequence)
  "🏳️‍🌈",       // 1 (rainbow flag)
  "👋🏽",       // 1 (waving hand + skin tone)
  "1️⃣",       // 1 (keycap sequence)
];

for (const s of examples) {
  console.log(JSON.stringify(s), "→", countGraphemes(s));
}

For older environments, use the grapheme-splitter or @unicode-segmenter/core package:

import GraphemeSplitter from 'grapheme-splitter';

const splitter = new GraphemeSplitter();
console.log(splitter.countGraphemes("👨‍💻")); // 1
console.log(splitter.splitGraphemes("hello 👋")); // ['h','e','l','l','o',' ','👋']

Safe String Truncation in JavaScript

function truncate(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  if (segments.length <= maxGraphemes) return str;
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}

// Truncating naively by .length breaks surrogate pairs:
const text = "hi 👋 there";
console.log(text.slice(0, 4));    // "hi \ud83d": a lone high surrogate, the emoji is cut in half
console.log(truncate(text, 4));   // "hi 👋": grapheme-safe

Python: Code Points vs Grapheme Clusters

Python 3 strings are sequences of Unicode code points (not UTF-16 code units). This is better than JavaScript for basic emoji, but still wrong for multi-code-point sequences.

wave = "👋"
print(len(wave))   # 1: correct, Python counts code points

coder = "👩‍💻"   # U+1F469 + U+200D + U+1F4BB
print(len(coder))  # 3: three code points, but one visual character

family = "👨‍👩‍👧‍👦"  # 4 code points + 3 ZWJs = 7 code points
print(len(family)) # 7

flag = "🏳️‍🌈"    # white flag + VS16 + ZWJ + rainbow
print(len(flag))   # 4
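To see where those counts come from, it can help to list the code points inside a sequence. The sketch below uses only the standard `unicodedata` module. The helper name `describe` and the `<unnamed>` fallback are choices made for this example.

```python
import unicodedata


def describe(s: str) -> list[str]:
    """List each code point in s as 'U+XXXX NAME' (hypothetical helper)."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in s
    ]


coder = "\U0001F469\u200D\U0001F4BB"  # woman + ZWJ + laptop
for line in describe(coder):
    print(line)
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F4BB PERSONAL COMPUTER
```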

Grapheme Cluster Counting in Python

Use the grapheme package or regex with \X (grapheme cluster match):

# Option 1: grapheme package
import grapheme

texts = ["hello 👋", "👨‍💻", "👨‍👩‍👧‍👦", "🏳️‍🌈", "👋🏽"]
for t in texts:
    print(repr(t), "→ graphemes:", grapheme.length(t))

# Option 2: regex \X matches one grapheme cluster
import regex

def count_graphemes(s: str) -> int:
    return len(regex.findall(r'\X', s))

def split_graphemes(s: str) -> list[str]:
    return regex.findall(r'\X', s)

print(count_graphemes("👨‍👩‍👧‍👦"))  # 1
print(count_graphemes("hello 👋"))   # 7

# Safe truncation
def truncate(s: str, max_graphemes: int) -> str:
    clusters = regex.findall(r'\X', s)
    return ''.join(clusters[:max_graphemes])

print(truncate("Hello 👋🌍🚀 World", 8))  # "Hello 👋🌍"

Byte Length for Database Storage

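MySQL's utf8mb4 stores each code point in up to 4 bytes. PostgreSQL with the UTF8 encoding stores the same UTF-8 bytes: 4 for most emoji code points, 3 for some older symbols such as ❤ (U+2764). A multi-code-point emoji costs the sum of its parts.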

text = "Hello 👋"
print(len(text))                          # 7 (code points)
print(len(text.encode('utf-8')))          # 10 (bytes: 5 ASCII + 1 space + 4 emoji)
print(len(text.encode('utf-16-le')) // 2) # 8 (UTF-16 code units)

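# Check whether text fits a byte budget, e.g. a byte-limited field or index key.
# Note: VARCHAR(255) in MySQL and PostgreSQL limits characters, not bytes,
# so a utf8mb4 VARCHAR(255) value can occupy up to 1020 bytes.
def fits_in_varchar(text: str, max_bytes: int = 255) -> bool:
    return len(text.encode('utf-8')) <= max_bytes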
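When a value does not fit a byte budget, cutting the encoded bytes at an arbitrary offset can leave half of a UTF-8 sequence behind. One possible approach, sketched here with a made-up helper name `truncate_utf8`, is to slice the bytes and let the decoder discard the incomplete tail. This is safe at the code point level only; it can still split a grapheme cluster such as a ZWJ sequence.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without leaving a partial code point."""
    if max_bytes <= 0:
        return ""
    data = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing, incomplete multi-byte sequence.
    return data.decode("utf-8", errors="ignore")


print(truncate_utf8("Hello \U0001F44B", 8))   # "Hello " (the 4-byte emoji does not fit)
print(truncate_utf8("Hello \U0001F44B", 10))  # the full string, emoji included
```

For user-visible text, truncating by grapheme cluster first and then checking the byte length is usually the safer order.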

Go: Runes vs Bytes vs Grapheme Clusters

Go strings are byte slices. The built-in len() returns byte count. Use utf8.RuneCountInString for code points.

package main

import (
    "fmt"
    "unicode/utf8"
    "github.com/rivo/uniseg"
)

func main() {
    wave := "👋"
    coder := "👩‍💻"
    family := "👨‍👩‍👧‍👦"

    // Byte count
    fmt.Println(len(wave))   // 4
    fmt.Println(len(coder))  // 11 (4 + 3 + 4 bytes: woman, ZWJ, laptop)
    fmt.Println(len(family)) // 25

    // Rune (code point) count
    fmt.Println(utf8.RuneCountInString(wave))   // 1
    fmt.Println(utf8.RuneCountInString(coder))  // 3
    fmt.Println(utf8.RuneCountInString(family)) // 7

    // Grapheme cluster count: use uniseg
    fmt.Println(uniseg.GraphemeClusterCount(wave))   // 1
    fmt.Println(uniseg.GraphemeClusterCount(coder))  // 1
    fmt.Println(uniseg.GraphemeClusterCount(family)) // 1
}

Safe String Truncation in Go

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func TruncateGraphemes(s string, maxClusters int) string {
    gr := uniseg.NewGraphemes(s)
    count := 0
    result := ""
    for gr.Next() {
        if count >= maxClusters {
            break
        }
        result += gr.Str()
        count++
    }
    return result
}

func main() {
    fmt.Println(TruncateGraphemes("Hello 👋🌍🚀 World", 8)) // "Hello 👋🌍"
}

Database Column Sizing

When designing VARCHAR columns that accept emoji:

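| DB | Charset | Bytes per emoji code point | VARCHAR(100) holds |
| --- | --- | --- | --- |
| MySQL | utf8 (utf8mb3) | 3 max | No 4-byte emoji |
| MySQL | utf8mb4 | up to 4 | 100 code points (up to 400 bytes) |
| PostgreSQL | UTF8 | up to 4 | 100 code points (up to 400 bytes) |
| SQLite | UTF-8 | up to 4 | No fixed limit (TEXT) |

VARCHAR(n) in MySQL and PostgreSQL limits characters (code points), not bytes, so a multi-code-point emoji such as a 7-code-point family sequence uses 7 of the 100.

-- MySQL: ensure utf8mb4 for emoji support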
ALTER TABLE posts MODIFY content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Check actual byte length in MySQL
SELECT CHAR_LENGTH(content), LENGTH(content), content
FROM posts
WHERE content LIKE '%🚀%';
-- CHAR_LENGTH = character (code point) count, LENGTH = byte count
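The sizing arithmetic can be sketched in a few lines of Python. This assumes the column limit counts code points, as MySQL and PostgreSQL document for VARCHAR(n), and a worst case of 4 bytes per code point. Both function names are hypothetical.

```python
def worst_case_bytes(max_chars: int, bytes_per_char: int = 4) -> int:
    """Upper bound on storage for a column limited to max_chars code points."""
    return max_chars * bytes_per_char


def sequences_that_fit(sequence: str, max_chars: int) -> int:
    """How many copies of an emoji sequence fit in max_chars code points."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return max_chars // len(sequence)  # len() counts code points in Python


rocket = "\U0001F680"
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"  # 7 code points

print(worst_case_bytes(100))            # 400
print(sequences_that_fit(rocket, 100))  # 100
print(sequences_that_fit(family, 100))  # 14
```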

The Validation Pattern

For user-facing input validation (e.g., a username limited to 20 characters), always validate by grapheme cluster count:

function validateUsername(name) {
  const segmenter = new Intl.Segmenter();
  const graphemeCount = [...segmenter.segment(name)].length;

  if (graphemeCount < 3) return "Too short (minimum 3 characters)";
  if (graphemeCount > 20) return "Too long (maximum 20 characters)";
  return null; // valid
}

console.log(validateUsername("Al 👋"));    // null (4 graphemes, valid)
console.log(validateUsername("👨‍👩‍👧‍👦AlexJohnsonSmith")); // null (17 graphemes, valid)

Glossary Terms

Code Point: A unique numerical value assigned to each character in the Unicode standard, written in the format U+XXXX (e.g., U+1F600 for 😀).
Code Unit: The minimum bit combination used for encoding a character: 8-bit for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32.
Emoji: A Japanese word (絵文字) meaning 'picture character': small graphical symbols used in digital communication to express ideas, emotions, and objects.
Grapheme Cluster: A user-perceived character that may be composed of multiple Unicode code points displayed as a single visual unit.
Keycap Sequence: An emoji sequence formed by a digit or symbol, followed by VS-16 (U+FE0F) and the combining enclosing keycap character (U+20E3).
UTF-16: A variable-width Unicode encoding that uses 2 or 4 bytes per character, used internally by JavaScript, Java, and Windows.
UTF-8: A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites).
Unicode: Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji.
Zero Width Joiner (ZWJ): An invisible Unicode character (U+200D) used to join multiple emoji into a single composite emoji, such as combining people and objects into profession emoji.