Why the Same Text Can Have Different Representations
Unicode has a fascinating problem: the same visible character can be encoded in multiple distinct byte sequences. The letter é can be stored as a single precomposed code point (U+00E9) or as the letter e followed by a combining accent (U+0065 + U+0301). Both look identical, both are valid Unicode, but they are byte-for-byte different.
This ambiguity matters for string comparison, database storage, full-text search, and — as we will see — emoji processing. Unicode defines four normalization forms to solve it.
The Four Normalization Forms
NFD: Canonical Decomposition
NFD decomposes precomposed characters into their canonical component parts. é (U+00E9) becomes e + ◌́ (U+0065 + U+0301).
import unicodedata
text = 'café'
nfd = unicodedata.normalize('NFD', text)
len(text) # 4
len(nfd) # 5 — é is split into e + combining accent
Characters are decomposed into their canonical equivalents, and any combining marks are then reordered into canonical order by combining class.
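The reordering step can be made concrete. In this sketch, two strings attach a dot above (U+0307) and a dot below (U+0323) to the letter q in opposite orders; NFD sorts the marks by canonical combining class, so both normalize to the same sequence:

```python
import unicodedata

# Same marks, attached in opposite orders:
# U+0307 COMBINING DOT ABOVE (combining class 230)
# U+0323 COMBINING DOT BELOW (combining class 220)
a = 'q\u0307\u0323'
b = 'q\u0323\u0307'

# NFD reorders combining marks by combining class (lower first),
# so both inputs end up as q + U+0323 + U+0307
unicodedata.normalize('NFD', a) == unicodedata.normalize('NFD', b)  # True
```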
NFC: Canonical Decomposition + Canonical Composition
NFC first decomposes (like NFD), then re-composes combining sequences back into precomposed forms wherever possible. The result is the shortest canonical representation.
nfc = unicodedata.normalize('NFC', text)
len(nfc) # 4 — é is a single code point again
NFC is the recommended form for most text storage and interchange, including the web and most databases.
NFKD: Compatibility Decomposition
NFKD applies a broader decomposition that also breaks apart characters that are "compatible" but not strictly equivalent. For example, the ligature ﬁ (U+FB01) decomposes to f + i, and the circled number ① becomes 1.
nfkd = unicodedata.normalize('NFKD', '①\uFB01')  # ﬁ ligature written as an escape
# '1fi' — compatibility decomposition loses formatting distinctions
Compatibility decomposition loses some visual information (like whether something was a superscript or a ligature), but produces simpler, more searchable text.
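A few more NFKD examples (the characters here are chosen for illustration) show the kind of formatting information that disappears:

```python
import unicodedata

# Compatibility decomposition flattens presentation variants
unicodedata.normalize('NFKD', 'x²')  # 'x2' — superscript becomes a plain digit
unicodedata.normalize('NFKD', '№')   # 'No' — the numero sign becomes two letters
unicodedata.normalize('NFKD', '½')   # digits joined by U+2044 FRACTION SLASH
```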
NFKC: Compatibility Decomposition + Canonical Composition
NFKC applies compatibility decomposition then re-composes. It is the normalization form used in many identifier systems, including Python 3 variable names.
nfkc = unicodedata.normalize('NFKC', '①\uFB01')
# '1fi'
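Because Python normalizes identifiers to NFKC at parse time, two visually different spellings of a name refer to the same variable. A small demonstration (the fullwidth letter ｎ, U+FF4E, is just an arbitrary example):

```python
import unicodedata

ns = {}
exec('ｎ = 5', ns)  # identifier written with fullwidth ｎ (U+FF4E)
ns['n']             # 5 — the name was NFKC-normalized to plain 'n' while parsing

unicodedata.normalize('NFKC', 'ｎ')  # 'n'
```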
How Normalization Affects Emoji
Normalization affects emoji more subtly than it affects accented letters, but it still matters.
Variation Selectors Survive Normalization
Variation selectors (U+FE0E for text, U+FE0F for emoji) are preserved by all normalization forms. The heart ❤️ (U+2764 + U+FE0F) remains two code points after NFC normalization.
heart_emoji = '❤️' # U+2764 + U+FE0F
nfc_heart = unicodedata.normalize('NFC', heart_emoji)
len(heart_emoji) # 2
len(nfc_heart) # 2 — variation selector preserved
ZWJ Sequences Are Not Decomposed
Zero Width Joiner (ZWJ) sequences like 👩‍💻 (U+1F469 + U+200D + U+1F4BB) are not affected by normalization. The ZWJ (U+200D) is a format character (category Cf), the emoji around it are symbols (category So), and none of them have canonical or compatibility decompositions.
woman_technologist = '\U0001F469\u200D\U0001F4BB'  # 👩‍💻 as explicit escapes
nfc = unicodedata.normalize('NFC', woman_technologist)
nfd = unicodedata.normalize('NFD', woman_technologist)
woman_technologist == nfc == nfd # True — unchanged
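You can verify this by listing the code points in the sequence; note that the ZWJ itself is a format character (category Cf), not a symbol:

```python
import unicodedata

# Build the sequence explicitly: woman + ZWJ + personal computer
woman_technologist = '\U0001F469\u200D\U0001F4BB'

for ch in woman_technologist:
    print(f'U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})')
# U+1F469 WOMAN (So)
# U+200D ZERO WIDTH JOINER (Cf)
# U+1F4BB PERSONAL COMPUTER (So)
```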
Skin Tone Modifiers
Emoji modifier characters (U+1F3FB through U+1F3FF, the Fitzpatrick scale) are also preserved by normalization:
thumbs_up = '👍🏽' # U+1F44D + U+1F3FD (medium skin tone)
nfc = unicodedata.normalize('NFC', thumbs_up)
thumbs_up == nfc # True
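Because normalization will not strip modifiers for you, a search feature that should treat all skin tones as the same base emoji has to remove them explicitly. A minimal sketch (strip_skin_tones is a hypothetical helper, not a standard function):

```python
# Hypothetical helper: drop Fitzpatrick modifiers (U+1F3FB..U+1F3FF)
# so '👍🏽' and '👍' compare equal for search purposes
def strip_skin_tones(text: str) -> str:
    return ''.join(ch for ch in text if not 0x1F3FB <= ord(ch) <= 0x1F3FF)

strip_skin_tones('\U0001F44D\U0001F3FD') == '\U0001F44D'  # True — 👍🏽 matches 👍
```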
Normalization and String Comparison
This is where normalization becomes a real-world bug source. Without normalization, two identical-looking strings compare as unequal:
s1 = 'café' # precomposed é (U+00E9)
s2 = 'cafe\u0301' # decomposed é (e + combining accent)
s1 == s2 # False! Different byte sequences
unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2) # True
For emoji-heavy applications, normalize user input to NFC before storage and comparison. Most modern web browsers and operating systems produce NFC text, but user input from legacy systems or programmatically generated strings may vary.
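When inputs are usually NFC already, Python 3.8+ offers unicodedata.is_normalized, which checks a string without allocating a new one:

```python
import unicodedata

# Quick check before (or instead of) re-normalizing
unicodedata.is_normalized('NFC', 'caf\u00e9')   # True — precomposed é
unicodedata.is_normalized('NFC', 'cafe\u0301')  # False — needs composition
```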
JavaScript Normalization
JavaScript strings expose .normalize() natively:
const s1 = 'café'; // precomposed
const s2 = 'cafe\u0301'; // decomposed
s1 === s2 // false
s1.normalize('NFC') === s2.normalize('NFC') // true
// Normalization for emoji (no-op for pure emoji sequences)
'👩💻'.normalize('NFC') === '👩💻' // true
'❤️'.normalize('NFC').length // 2 — variation selector preserved
Choosing the Right Form
| Form | Use When |
|---|---|
| NFC | Default for web, APIs, databases, user-visible text |
| NFD | Low-level text processing where you want components separated |
| NFKC | Search normalization, identifier comparison, removing formatting distinctions |
| NFKD | Same as NFKC but in decomposed form; rare in practice |
For emoji applications specifically:
- Store in NFC: clean, compact, web-standard
- Search with NFKC: broader matching, collapses compatible forms
- Never normalize to NFD for display: the output may look different on some renderers if combining sequences are not re-composed
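The "store NFC, search NFKC" split above can be sketched as a pair of helpers. Note that the casefold call is an extra assumption of this sketch (for case-insensitive matching), not part of Unicode normalization itself:

```python
import unicodedata

def storage_form(text: str) -> str:
    # NFC: compact, canonical, what the web expects
    return unicodedata.normalize('NFC', text)

def search_key(text: str) -> str:
    # NFKC collapses compatibility forms (ligatures, circled digits, fullwidth);
    # casefold adds case-insensitive matching on top (sketch assumption)
    return unicodedata.normalize('NFKC', text).casefold()

search_key('\uFB01le') == search_key('FILE')  # True — ligature ﬁ matches plain text
```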
Normalization in Databases
Most databases store text exactly as received and do not normalize for you, so normalize to NFC in the application before insert. PostgreSQL 15+ ships a built-in normalize function:
-- PostgreSQL: normalize function available in v15+
SELECT normalize('café', NFC);
SELECT normalize('café', NFD);
-- Check if a string is already in NFC
SELECT 'café' = normalize('café', NFC); -- true if already NFC
# Always normalize before storing user input
import unicodedata
def store_safe(text: str) -> str:
    return unicodedata.normalize('NFC', text)
Use our Sequence Analyzer to inspect the exact code points in any emoji or text string, including whether combining characters or variation selectors are present — which makes normalization behavior visible and concrete.