Unicode Normalization Forms: NFC, NFD, NFKC, NFKD Explained

Why the Same Text Can Have Different Representations

Unicode has a fascinating problem: the same visible character can be encoded as more than one sequence of code points, and therefore as different byte sequences. The letter é can be stored as a single precomposed code point (U+00E9) or as the letter e followed by a combining acute accent (U+0065 + U+0301). Both look identical, both are valid Unicode, but they are byte-for-byte different.

This ambiguity matters for string comparison, database storage, full-text search, and, as we will see, emoji processing. Unicode defines four normalization forms to solve it.

The Four Normalization Forms

NFD: Canonical Decomposition

NFD decomposes precomposed characters into their canonical component parts. é (U+00E9) becomes e + ◌́ (U+0065 + U+0301).

import unicodedata

text = 'café'
nfd = unicodedata.normalize('NFD', text)
len(text)  # 4
len(nfd)   # 5 — é is split into e + combining accent

Each character is replaced by its full canonical decomposition, and any run of combining marks is then sorted into a canonical order based on each mark's combining class.
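The reordering step is easier to see with two combining marks on one base letter. The sketch below is illustrative only: the helper name `code_points` is made up for this example, and it relies on the fact that COMBINING DOT BELOW (U+0323) has a lower combining class than COMBINING DOT ABOVE (U+0307).

```python
import unicodedata

def code_points(s: str) -> list[str]:
    """Hypothetical helper: render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in s]

# The same visible text typed with the two marks in opposite orders.
dot_above_first = "q\u0307\u0323"
dot_below_first = "q\u0323\u0307"

nfd_a = unicodedata.normalize("NFD", dot_above_first)
nfd_b = unicodedata.normalize("NFD", dot_below_first)

print(unicodedata.combining("\u0307"))  # 230 (dot above)
print(unicodedata.combining("\u0323"))  # 220 (dot below)
print(code_points(nfd_a))               # ['U+0071', 'U+0323', 'U+0307']
print(nfd_a == nfd_b)                   # True
```

Because the mark with the lower combining class is always moved first, both spellings end up as the same code point sequence.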

NFC: Canonical Decomposition + Canonical Composition

NFC first decomposes (like NFD), then re-composes combining sequences back into precomposed forms wherever possible. The result is usually the most compact canonical representation, although a small set of characters (the composition exclusions) stay decomposed.

nfc = unicodedata.normalize('NFC', text)
len(nfc)   # 4 — é is a single code point again

NFC is the recommended form for most text storage and interchange, including the web and most databases.
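As a rough sketch of how an application might apply this, the helper below checks whether text is already in NFC and normalizes only when needed. The name `ensure_nfc` is an assumption made for illustration, and `unicodedata.is_normalized` requires Python 3.8 or later. The example also shows that NFC can replace one single code point with another: the ANGSTROM SIGN (U+212B) maps to Å (U+00C5).

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    """Hypothetical helper: return text in NFC, skipping work if it already is."""
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

angstrom_sign = "\u212B"  # ANGSTROM SIGN, canonically equivalent to U+00C5
decomposed_cafe = "cafe\u0301"

print(unicodedata.is_normalized("NFC", angstrom_sign))  # False
print(ensure_nfc(angstrom_sign) == "\u00C5")            # True
print(len(ensure_nfc(decomposed_cafe)))                 # 4
```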

NFKD: Compatibility Decomposition

NFKD applies a broader decomposition that also breaks apart characters that are "compatible" but not strictly equivalent. For example, the ligature ﬁ (fi, U+FB01) decomposes to f + i. The circled number ① becomes 1.

nfkd = unicodedata.normalize('NFKD', '①ﬁ')
# '1fi' — compatibility decomposition loses formatting distinctions

Compatibility decomposition loses some visual information (like whether something was a superscript or a ligature), but produces simpler, more searchable text.
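One way to tell canonical from compatibility mappings is to look at the raw decomposition field, which Python exposes through `unicodedata.decomposition`. Compatibility mappings carry a tag in angle brackets such as `<compat>` or `<super>`. The function name `decomposition_kind` below is made up for this sketch.

```python
import unicodedata

def decomposition_kind(ch: str) -> str:
    """Hypothetical helper: classify one character's decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings start with a tag such as <compat> or <circle>.
    return "compatibility" if mapping.startswith("<") else "canonical"

# é, the fi ligature, circled digit one, superscript two
for ch in ["\u00E9", "\uFB01", "\u2460", "\u00B2"]:
    nfkd = unicodedata.normalize("NFKD", ch)
    print(f"U+{ord(ch):04X}", decomposition_kind(ch), ascii(nfkd))
```

Only the first character has a canonical mapping, so it is the only one of the four that NFD would also change.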

NFKC: Compatibility Decomposition + Canonical Composition

NFKC applies compatibility decomposition then re-composes. It is the normalization form used in many identifier systems, including Python 3 variable names.

nfkc = unicodedata.normalize('NFKC', '①ﬁ')
# '1fi'
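The effect on Python identifiers can be observed directly. In this small sketch (the variable name and value are arbitrary), a source string assigns to a name spelled with the ﬁ ligature, and the name that actually lands in the namespace is the NFKC-normalized spelling.

```python
import unicodedata

# Source code whose identifier starts with the fi ligature (U+FB01).
source = "\uFB01le = 42"

namespace: dict = {}
exec(source, namespace)

# exec() also adds __builtins__, so skip dunder entries.
stored_name = next(name for name in namespace if not name.startswith("__"))

print(stored_name)                                             # file
print(stored_name == unicodedata.normalize("NFKC", "\uFB01le"))  # True
print("\uFB01le" in namespace)                                 # False
```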

How Normalization Affects Emoji

Emoji normalization is more subtle than with composed letters, but it matters.

Variation Selectors Survive Normalization

Variation selectors (U+FE0E for text, U+FE0F for emoji) are preserved by all normalization forms. The heart ❤️ (U+2764 + U+FE0F) remains two code points after NFC normalization.

heart_emoji = '❤️'  # U+2764 + U+FE0F
nfc_heart = unicodedata.normalize('NFC', heart_emoji)
len(heart_emoji)  # 2
len(nfc_heart)    # 2 (variation selector preserved)
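One consequence is that a bare ❤ (U+2764) and ❤️ (U+2764 + U+FE0F) never compare equal, whichever form is applied. If an application wants to treat them as the same symbol, it has to drop the selectors itself. The following is only a sketch: `strip_presentation_selectors` is a hypothetical helper, and stripping is assumed to be acceptable for comparison keys only, since it changes how the text may render.

```python
import unicodedata

VARIATION_SELECTORS = {"\uFE0E", "\uFE0F"}  # VS15 (text), VS16 (emoji)
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def strip_presentation_selectors(text: str) -> str:
    """Hypothetical helper: drop VS15/VS16 to build a comparison key."""
    return "".join(ch for ch in text if ch not in VARIATION_SELECTORS)

bare_heart = "\u2764"
emoji_heart = "\u2764\uFE0F"

# No normalization form makes the two spellings equal.
still_different = all(
    unicodedata.normalize(form, bare_heart) != unicodedata.normalize(form, emoji_heart)
    for form in FORMS
)
print(still_different)  # True
print(strip_presentation_selectors(emoji_heart) == bare_heart)  # True
```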

ZWJ Sequences Are Not Decomposed

Zero Width Joiner (ZWJ) sequences like 👩‍💻 (U+1F469 + U+200D + U+1F4BB) are not affected by normalization. The two emoji have general category So (Symbol, Other) and the ZWJ (U+200D) has category Cf (Format); none of them has a canonical or compatibility decomposition, so every normalization form leaves the sequence unchanged.

woman_technologist = '👩‍💻'
nfc = unicodedata.normalize('NFC', woman_technologist)
nfd = unicodedata.normalize('NFD', woman_technologist)
woman_technologist == nfc == nfd  # True — unchanged
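To see why nothing changes, it can help to list the general category and decomposition mapping of each code point in the sequence. This is a small inspection sketch; the `rows` list is just a convenient structure for the example.

```python
import unicodedata

woman_technologist = "\U0001F469\u200D\U0001F4BB"  # woman + ZWJ + laptop

# (code point, general category, decomposition mapping) per code point
rows = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.decomposition(ch))
    for ch in woman_technologist
]

for row in rows:
    print(row)
# ('U+1F469', 'So', '')
# ('U+200D', 'Cf', '')
# ('U+1F4BB', 'So', '')
```

An empty decomposition string means the character maps to itself under every normalization form.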

Skin Tone Modifiers

Emoji modifier characters (U+1F3FB through U+1F3FF, the Fitzpatrick scale) are also preserved by normalization:

thumbs_up = '👍🏽'  # U+1F44D + U+1F3FD (medium skin tone)
nfc = unicodedata.normalize('NFC', thumbs_up)
thumbs_up == nfc  # True
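The same check can be run across all five modifiers and all four forms. In this sketch, `is_stable` is a hypothetical helper that reports whether a string is unchanged by every normalization form.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_stable(text: str) -> bool:
    """Hypothetical helper: True if no normalization form alters the text."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)

# The five Fitzpatrick modifiers, U+1F3FB through U+1F3FF
modifiers = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]
thumbs = ["\U0001F44D" + modifier for modifier in modifiers]

print(all(is_stable(t) for t in thumbs))              # True
print({unicodedata.category(m) for m in modifiers})   # {'Sk'}
print(is_stable("cafe\u0301"))                        # False (NFC recomposes it)
```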

Normalization and String Comparison

This is where normalization becomes a real-world bug source. Without normalization, two identical-looking strings compare as unequal:

s1 = 'café'      # precomposed é (U+00E9)
s2 = 'cafe\u0301'  # decomposed é (e + combining accent)

s1 == s2  # False! Different code point sequences
unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)  # True

For emoji-heavy applications, normalize user input to NFC before storage and comparison. Most modern web browsers and operating systems produce NFC text, but user input from legacy systems or programmatically generated strings may vary.
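A minimal sketch of that advice, assuming a list of user-supplied names as input: normalize each value once, then compare or deduplicate the normalized values. The helper names `nfc` and `canonically_equal` are made up for this example.

```python
import unicodedata

def nfc(text: str) -> str:
    """Normalize text to NFC."""
    return unicodedata.normalize("NFC", text)

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return nfc(a) == nfc(b)

# Two spellings of the same name plus one emoji with a skin tone modifier.
usernames = ["caf\u00E9", "cafe\u0301", "\U0001F44D\U0001F3FD"]

print(len(set(usernames)))                     # 3 (raw strings all differ)
print(len({nfc(name) for name in usernames}))  # 2 (the two cafés collapse)
print(canonically_equal(usernames[0], usernames[1]))  # True
```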

JavaScript Normalization

JavaScript strings expose .normalize() natively:

const s1 = 'café';           // precomposed
const s2 = 'cafe\u0301';     // decomposed

s1 === s2                    // false
s1.normalize('NFC') === s2.normalize('NFC')  // true

// Normalization for emoji (no-op for pure emoji sequences)
'👩‍💻'.normalize('NFC') === '👩‍💻'  // true
'❤️'.normalize('NFC').length        // 2 — variation selector preserved

Choosing the Right Form

Form	Use When
NFC	Default for web, APIs, databases, user-visible text
NFD	Low-level text processing where you want components separated
NFKC	Search normalization, identifier comparison, removing formatting distinctions
NFKD	Same as NFKC but in decomposed form; rare in practice

For emoji applications specifically:

Store in NFC: clean, compact, web-standard
Search with NFKC: broader matching, collapses compatible forms
Avoid NFD for display: some renderers draw decomposed combining sequences differently from their precomposed equivalents
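The split between a storage form and a search form could be sketched as follows. Both function names are hypothetical, and the added `casefold()` step is an assumption that goes beyond normalization; it is included only because search keys are usually case-insensitive as well.

```python
import unicodedata

def storage_form(text: str) -> str:
    """Hypothetical helper: the value to persist and display (NFC)."""
    return unicodedata.normalize("NFC", text)

def search_key(text: str) -> str:
    """Hypothetical helper: a lossy key for matching, never shown to users."""
    return unicodedata.normalize("NFKC", text).casefold()

fullwidth_query = "\uFF23\uFF21\uFF26\uFF25\u0301"  # fullwidth CAFE + combining acute
stored_value = storage_form("caf\u00E9")

print(search_key(fullwidth_query) == search_key(stored_value))  # True
print(storage_form(fullwidth_query) == stored_value)            # False
print(search_key("\u2764\uFE0F") == "\u2764\uFE0F")            # True (emoji untouched)
```

Keeping the two forms separate means the lossy NFKC key never overwrites what the user actually typed.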

Normalization in Databases

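Many databases store strings exactly as received and compare them code point by code point, so it is safest to normalize to NFC before writing, either in the application or with the database's own functions: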

-- PostgreSQL: normalize() is available in v13+ (UTF8 server encoding only)
SELECT normalize('café', NFC);
SELECT normalize('café', NFD);

-- Check if a string is already in NFC
SELECT 'café' IS NFC NORMALIZED;  -- true if already NFC

# Always normalize before storing user input
import unicodedata

def store_safe(text: str) -> str:
    return unicodedata.normalize('NFC', text)
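To make the failure mode concrete, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. SQLite is chosen only because it needs no setup; the table name `places` and the helper `find_place` are hypothetical. The same normalization has to be applied to query parameters as to stored values, otherwise a decomposed search term will miss an NFC row.

```python
import sqlite3
import unicodedata

def store_safe(text: str) -> str:
    """Normalize to NFC before the value reaches the database."""
    return unicodedata.normalize("NFC", text)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT PRIMARY KEY)")
conn.execute("INSERT INTO places (name) VALUES (?)", (store_safe("cafe\u0301"),))

def find_place(name: str):
    """Hypothetical helper: exact-match lookup, returns the name or None."""
    row = conn.execute("SELECT name FROM places WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None

raw_lookup = find_place("cafe\u0301")                     # decomposed query
normalized_lookup = find_place(store_safe("cafe\u0301"))  # NFC query

print(raw_lookup)         # None
print(normalized_lookup)  # café
```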

Use our Sequence Analyzer to inspect the exact code points in any emoji or text string, including whether combining characters or variation selectors are present — which makes normalization behavior visible and concrete.
