Emoji String Length Gotchas
Ask any language how long the string "hello 👋" is, and you might get 7, 8, or 10: seven code points, eight UTF-16 code units, or ten UTF-8 bytes. It depends on what the language is counting: bytes, code units, code points, or grapheme clusters. With emoji, all four can differ. Getting this wrong causes truncated strings, database overflow errors, and off-by-one bugs in text rendering.
The Four Levels of "Length"
| Level | Unit | Example: "👋" | Example: "👋🏽" |
|---|---|---|---|
| Bytes (UTF-8) | 8-bit bytes | 4 | 8 |
| Code units (UTF-16) | 16-bit units | 2 | 4 |
| Code points | Unicode scalar values | 1 | 2 |
| Grapheme clusters | User-perceived characters | 1 | 1 |
The grapheme cluster is almost always what users and product requirements mean by "one character."
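As a minimal sketch, the four levels in the table can be measured side by side in JavaScript. The helper name `measureLengths` is hypothetical, and the code assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (for example Node.js 16+).

```javascript
// Hypothetical helper: report a string's length at all four levels.
function measureLengths(str) {
  return {
    bytes: new TextEncoder().encode(str).length, // UTF-8 bytes
    codeUnits: str.length, // UTF-16 code units
    codePoints: [...str].length, // string iteration yields code points
    graphemes: [...new Intl.Segmenter().segment(str)].length, // user-perceived characters
  };
}

// Escapes are used so the exact code points are visible in the source.
console.log(measureLengths("\u{1F44B}")); // waving hand
console.log(measureLengths("\u{1F44B}\u{1F3FD}")); // waving hand + medium skin tone
```

The two calls should report 4/2/1/1 and 8/4/2/1 respectively, matching the table rows.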
JavaScript: The UTF-16 Trap
JavaScript strings are sequences of UTF-16 code units. Characters above U+FFFF (like most modern emoji) are encoded as surrogate pairs — two code units for one code point.
const wave = "👋"; // U+1F44B
console.log(wave.length); // 2 (UTF-16 code units, NOT 1!)
console.log([...wave].length); // 1 (code points via spread)
console.log(wave.codePointAt(0)); // 128075 (0x1F44B — correct)
console.log(wave.charCodeAt(0)); // 55357 (0xD83D — high surrogate, wrong)
// ZWJ sequence: woman technologist 👩‍💻
// Written with escapes so the invisible joiner (U+200D) cannot be lost in copy-paste.
const coder = "\u{1F469}\u200D\u{1F4BB}"; // U+1F469 + U+200D + U+1F4BB
console.log(coder.length); // 5 (two surrogates + ZWJ + two surrogates)
console.log([...coder].length); // 3 (three code points)
// Grapheme cluster count = 1
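The surrogate values above follow from simple arithmetic, sketched here for illustration. The function name `toSurrogatePair` is hypothetical; it assumes its input is a code point above U+FFFF.

```javascript
// Split a supplementary code point (above U+FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits go in the high surrogate
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits go in the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f44b);
console.log(high.toString(16), low.toString(16)); // d83d dc4b
console.log(String.fromCharCode(high, low) === "\u{1F44B}"); // true
```

This is why `charCodeAt(0)` on the waving hand returns 0xD83D (55357): it reads only the first half of the pair.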
Counting Grapheme Clusters in JavaScript
Use the Intl.Segmenter API (ES2022, Node.js 16+):
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter();
  return [...segmenter.segment(str)].length;
}
const examples = [
  "hello", // 5
  "hello 👋", // 7
  "👨‍👩‍👧‍👦", // 1 (family ZWJ sequence)
  "🏳️‍🌈", // 1 (rainbow flag)
  "👋🏽", // 1 (waving hand + skin tone)
  "1️⃣", // 1 (keycap sequence)
];
for (const s of examples) {
  console.log(JSON.stringify(s), "→", countGraphemes(s));
}
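When a count looks wrong, it helps to see which code points ended up in each cluster. The following is a debugging sketch; `describeClusters` is a hypothetical name, and it relies only on `Intl.Segmenter` and string iteration.

```javascript
// Hypothetical debugging helper: list the code points inside each grapheme cluster.
function describeClusters(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].map(({ segment }) => ({
    cluster: segment,
    codePoints: [...segment].map(
      (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ),
  }));
}

// Woman technologist, written with escapes so the ZWJ is visible in the source.
console.log(describeClusters("\u{1F469}\u200D\u{1F4BB}"));
```

If a ZWJ sequence unexpectedly reports more than one cluster, the output usually shows that the U+200D joiner was stripped somewhere upstream.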
For older environments, use the grapheme-splitter or @unicode-segmenter/core package:
import GraphemeSplitter from 'grapheme-splitter';
const splitter = new GraphemeSplitter();
console.log(splitter.countGraphemes("👨💻")); // 1
console.log(splitter.splitGraphemes("hello 👋")); // ['h','e','l','l','o',' ','👋']
Safe String Truncation in JavaScript
function truncate(str, maxGraphemes) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  if (segments.length <= maxGraphemes) return str;
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}
// Truncating naively by .length breaks surrogate pairs:
const text = "hi 👋 there";
console.log(text.slice(0, 4)); // "hi \ud83d": the pair is cut in half, leaving a lone high surrogate
console.log(truncate(text, 4)); // "hi 👋" (grapheme-safe)
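To find strings that were already damaged by a naive slice, one option is to scan for unpaired surrogates. This is a sketch under the assumption that a regular expression without the `u` flag is acceptable; `hasLoneSurrogate` and `LONE_SURROGATE` are hypothetical names.

```javascript
// Matches a high surrogate not followed by a low one, or a low surrogate
// not preceded by a high one. Without the "u" flag the regex sees raw code units.
const LONE_SURROGATE =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

function hasLoneSurrogate(str) {
  return LONE_SURROGATE.test(str);
}

const sample = "hi \u{1F44B} there";
console.log(hasLoneSurrogate(sample.slice(0, 4))); // true (slice ends mid-pair)
console.log(hasLoneSurrogate(sample)); // false
```

A check like this catches only broken surrogate pairs. It does not detect a ZWJ sequence or skin-tone modifier that was split between code points, which still requires grapheme-aware truncation.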
Python: Code Points vs Grapheme Clusters
Python 3 strings are sequences of Unicode code points (not UTF-16 code units). This is better than JavaScript for basic emoji, but still wrong for multi-code-point sequences.
wave = "👋"
print(len(wave)) # 1 — correct! Python counts code points
coder = "👩💻" # U+1F469 + U+200D + U+1F4BB
print(len(coder)) # 3 — three code points, but one visual character
family = "👨👩👧👦" # 4 code points + 3 ZWJs = 7 code points
print(len(family)) # 7
flag = "🏳️🌈" # white flag + VS16 + ZWJ + rainbow
print(len(flag)) # 4
Grapheme Cluster Counting in Python
Use the grapheme package or regex with \X (grapheme cluster match):
# Option 1: grapheme package
import grapheme
texts = ["hello 👋", "👨‍💻", "👨‍👩‍👧‍👦", "🏳️‍🌈", "👋🏽"]
for t in texts:
    print(repr(t), "→ graphemes:", grapheme.length(t))
# Option 2: regex \X, which matches one grapheme cluster
import regex
def count_graphemes(s: str) -> int:
    return len(regex.findall(r'\X', s))
def split_graphemes(s: str) -> list[str]:
    return regex.findall(r'\X', s)
print(count_graphemes("👨‍👩‍👧‍👦")) # 1
print(count_graphemes("hello 👋")) # 7
# Safe truncation
def truncate(s: str, max_graphemes: int) -> str:
    clusters = regex.findall(r'\X', s)
    return ''.join(clusters[:max_graphemes])
print(truncate("Hello 👋🌍🚀 World", 8)) # "Hello 👋🌍"
Byte Length for Database Storage
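MySQL's utf8mb4 stores each code point in up to 4 bytes, and PostgreSQL likewise stores text at its UTF-8 byte length. Most emoji code points take 4 bytes; a few in the Basic Multilingual Plane, such as ☺ (U+263A), take 3.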
text = "Hello 👋"
print(len(text)) # 7 (code points)
print(len(text.encode('utf-8'))) # 10 (bytes: 5 ASCII + 1 space + 4 emoji)
print(len(text.encode('utf-16-le')) // 2) # 8 (UTF-16 code units)
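# Check whether text fits a byte-limited field (for example, a 255-byte cap).
# Note: MySQL VARCHAR(255) limits characters, not bytes, so use this for byte budgets only.
def fits_in_varchar(text: str, max_bytes: int = 255) -> bool:
    return len(text.encode('utf-8')) <= max_bytes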
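When the text does not fit, cutting at the byte limit can split an emoji. One possible approach, sketched in JavaScript, is to add whole grapheme clusters until the byte budget is reached. `truncateToBytes` is a hypothetical helper that assumes `TextEncoder` and `Intl.Segmenter` are available.

```javascript
// Hypothetical helper: keep whole grapheme clusters while staying within maxBytes of UTF-8.
function truncateToBytes(str, maxBytes) {
  const encoder = new TextEncoder();
  const segmenter = new Intl.Segmenter();
  let result = "";
  let used = 0;
  for (const { segment } of segmenter.segment(str)) {
    const size = encoder.encode(segment).length; // UTF-8 size of this cluster
    if (used + size > maxBytes) break; // the next cluster would overflow the budget
    result += segment;
    used += size;
  }
  return result;
}

console.log(truncateToBytes("Hello \u{1F44B}", 10)); // "Hello 👋" (exactly 10 bytes)
console.log(truncateToBytes("Hello \u{1F44B}", 9)); // "Hello " (the 4-byte emoji does not fit)
```

The result may use fewer bytes than the budget allows, which is the trade-off for never emitting a partial cluster.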
Go: Runes vs Bytes vs Grapheme Clusters
Go strings are byte slices. The built-in len() returns byte count. Use utf8.RuneCountInString for code points.
package main

import (
	"fmt"
	"unicode/utf8"

	"github.com/rivo/uniseg"
)

func main() {
	wave := "👋"
	coder := "👩‍💻"
	family := "👨‍👩‍👧‍👦"
	// Byte count
	fmt.Println(len(wave))   // 4
	fmt.Println(len(coder))  // 11 (4 + 3 for the ZWJ + 4)
	fmt.Println(len(family)) // 25 (4 emoji × 4 bytes + 3 ZWJs × 3 bytes)
	// Rune (code point) count
	fmt.Println(utf8.RuneCountInString(wave))   // 1
	fmt.Println(utf8.RuneCountInString(coder))  // 3
	fmt.Println(utf8.RuneCountInString(family)) // 7
	// Grapheme cluster count: use uniseg
	fmt.Println(uniseg.GraphemeClusterCount(wave))   // 1
	fmt.Println(uniseg.GraphemeClusterCount(coder))  // 1
	fmt.Println(uniseg.GraphemeClusterCount(family)) // 1
}
Safe String Truncation in Go
import (
	"fmt"
	"strings"

	"github.com/rivo/uniseg"
)

// TruncateGraphemes returns at most maxClusters grapheme clusters from s.
func TruncateGraphemes(s string, maxClusters int) string {
	var b strings.Builder // avoids repeated string reallocation in the loop
	gr := uniseg.NewGraphemes(s)
	for count := 0; count < maxClusters && gr.Next(); count++ {
		b.WriteString(gr.Str())
	}
	return b.String()
}

func main() {
	fmt.Println(TruncateGraphemes("Hello 👋🌍🚀 World", 8)) // "Hello 👋🌍"
}
Database Column Sizing
When designing VARCHAR columns that accept emoji:
| DB | Charset | Bytes per emoji | VARCHAR(100) holds |
|---|---|---|---|
| MySQL | utf8 (utf8mb3) | 3 max | No 4-byte emoji |
| MySQL | utf8mb4 | Up to 4 | 100 code points (up to 400 bytes) |
| PostgreSQL | UTF8 | Up to 4 | 100 code points (up to 400 bytes) |
| SQLite | UTF-8 | Up to 4 | No enforced limit (declared length is ignored) |
-- MySQL: ensure utf8mb4 for emoji support
ALTER TABLE posts MODIFY content TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
-- Check actual byte length in MySQL
SELECT CHAR_LENGTH(content), LENGTH(content), content
FROM posts
WHERE content LIKE '%🚀%';
-- CHAR_LENGTH = character (code point) count, LENGTH = byte count
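Because a MySQL `utf8` column cannot hold 4-byte characters, an application can check input before writing to such a column. The sketch below is one way to do that in JavaScript. `needsUtf8mb4` is a hypothetical name, and the check relies on the fact that every code point above U+FFFF takes 4 bytes in UTF-8.

```javascript
// Hypothetical pre-insert check: does the string contain any code point above U+FFFF?
// Those code points need 4 bytes in UTF-8, and therefore a utf8mb4 column in MySQL.
function needsUtf8mb4(str) {
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

console.log(needsUtf8mb4("hello")); // false
console.log(needsUtf8mb4("\u{1F680}")); // true (rocket, U+1F680)
console.log(needsUtf8mb4("\u263A")); // false (U+263A fits in 3 bytes)
```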
The Validation Pattern
For user-facing input validation (e.g., a username limited to 20 characters), always validate by grapheme cluster count:
function validateUsername(name) {
  const segmenter = new Intl.Segmenter();
  const graphemeCount = [...segmenter.segment(name)].length;
  if (graphemeCount < 3) return "Too short (minimum 3 characters)";
  if (graphemeCount > 20) return "Too long (maximum 20 characters)";
  return null; // valid
}
console.log(validateUsername("Al 👋")); // null (4 graphemes, valid)
console.log(validateUsername("👨‍👩‍👧‍👦AlexJohnsonSmith")); // null (17 graphemes, valid)