Emoji Security: Homoglyphs, Spoofing, Invisible Characters, and Filtering

Emoji Security Considerations

Emoji introduce security concerns that go beyond typical text sanitization. Invisible characters, visual spoofing with look-alike symbols, bidirectional text overrides, and unexpected behavior in SQL queries are real attack vectors. This guide covers the threat model and practical defenses.

Homoglyph Attacks

A homoglyph attack uses characters that look visually identical (or very similar) to legitimate characters to deceive users or bypass filters.

Emoji Lookalikes

Several emoji and emoji-like symbols resemble standard ASCII letters:

Emoji   Code Point   Looks Like   Risk
🅰️      U+1F170      Letter A     Username impersonation
🅱️      U+1F171      Letter B     Brand spoofing
Ⓜ️      U+24C2       Letter M     Domain spoofing
🄰       U+1F130      Letter A     Content filter bypass
𝕒       U+1D552      Letter a     Script injection bypass
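
To see which of the look-alikes in the table a plain compatibility normalization already folds back to ASCII, the following sketch prints each character's code point, Unicode name, and NFKC form. The helper name `describe_lookalike` is an assumption made for illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

def describe_lookalike(char: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, NFKC-normalized form) for one character."""
    code_point = f"U+{ord(char):04X}"
    name = unicodedata.name(char, "<unnamed>")
    folded = unicodedata.normalize("NFKC", char)
    return code_point, name, folded

# The five characters from the table above, written as escapes so no
# variation selector sneaks in
for ch in ["\U0001F170", "\U0001F171", "\u24C2", "\U0001F130", "\U0001D552"]:
    print(describe_lookalike(ch))
```

NFKC folds the squared, circled, and double-struck letters (U+1F130, U+24C2, U+1D552) to plain ASCII, but it leaves the negative squared letters (U+1F170, U+1F171) untouched, which is why a confusables check is still needed on top of normalization.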

Confusable Detection

Unicode publishes a confusables dataset (confusables.txt, part of Unicode Technical Standard #39) that maps look-alike characters. Use it to detect potentially deceptive inputs:

# Install: pip install confusable-homoglyphs
from confusable_homoglyphs import confusables

def is_confusable_with_ascii(text: str) -> bool:
    """
    Return True if the text contains non-ASCII characters that the
    confusables dataset lists as look-alikes of Latin letters.
    """
    for char in text:
        if ord(char) > 127:  # Non-ASCII
            # is_dangerous() only flags mixed-script strings, which a single
            # character can never be, so check each character with
            # is_confusable() against the Latin script instead
            if confusables.is_confusable(char, preferred_aliases=['latin']):
                return True
    return False

# Examples
print(is_confusable_with_ascii("admin"))          # False
print(is_confusable_with_ascii("аdmin"))          # True (Cyrillic а, U+0430)
print(is_confusable_with_ascii("🅰dmin"))         # Depends on whether the dataset lists U+1F170

# Normalize confusable characters to ASCII equivalents
from confusable_homoglyphs import confusables

def skeleton(text: str) -> str:
    """Return the Unicode 'skeleton' of text for confusable comparison."""
    import unicodedata
    # Normalize to NFC first
    text = unicodedata.normalize('NFC', text)
    result = []
    for char in text:
        maps = confusables.confusable_characters(char)
        # Use the first ASCII confusable if available
        mapped = next((m for m in (maps or []) if ord(m) < 128), char)
        result.append(mapped)
    return ''.join(result)

# Two strings that look the same to a user may have the same skeleton
print(skeleton("pаypal.com") == skeleton("paypal.com"))  # True (both → paypal.com)

Username Squatting Defense

When registering usernames, normalize and check for confusable collisions:

import unicodedata

def normalize_username(username: str) -> str:
    """
    Normalize a username to detect confusable duplicates.
    1. Unicode NFC normalization
    2. Case-folding
    3. Strip variation selectors and combining marks
    """
    # NFC normalization
    normalized = unicodedata.normalize('NFC', username)
    # Case-fold (Unicode-aware lowercase)
    normalized = normalized.casefold()
    # Remove variation selectors
    normalized = normalized.replace('\uFE0E', '').replace('\uFE0F', '')
    # Remove combining characters (accents, etc.)
    normalized = ''.join(
        c for c in unicodedata.normalize('NFD', normalized)
        if unicodedata.category(c) != 'Mn'
    )
    return normalized

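# Check for collisions before registration
existing_skeletons = {"admin", "root", "administrator"}
new_username = "Ádmin"  # casefolds and loses its accent, giving "admin"

# Note: this normalization does not map emoji letters such as 🅰 (U+1F170)
# to ASCII; pair it with a confusables skeleton check to catch those
if normalize_username(new_username) in existing_skeletons:
    raise ValueError("Username too similar to a reserved name")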
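
One way to apply the collision check at registration time is to store a comparison key for every accepted name and reject any new name whose key already exists. The sketch below is an assumption-laden illustration: `username_key`, `UsernameRegistry`, and the four-entry `ILLUSTRATIVE_CONFUSABLES` map are hypothetical names, the map stands in for the full Unicode confusables data, and NFKC is used here (rather than NFC) so that compatibility forms such as fullwidth letters fold as well.

```python
import unicodedata

# Tiny hand-written map for illustration only; a real deployment would load
# the full Unicode confusables data instead
ILLUSTRATIVE_CONFUSABLES = {
    "\u0430": "a",      # CYRILLIC SMALL LETTER A
    "\u043E": "o",      # CYRILLIC SMALL LETTER O
    "\U0001F170": "a",  # NEGATIVE SQUARED LATIN CAPITAL LETTER A
    "\U0001F171": "b",  # NEGATIVE SQUARED LATIN CAPITAL LETTER B
}

def username_key(username: str) -> str:
    """Build a comparison key: NFKC, casefold, drop variation selectors, map look-alikes."""
    folded = unicodedata.normalize("NFKC", username).casefold()
    folded = folded.replace("\uFE0E", "").replace("\uFE0F", "")
    return "".join(ILLUSTRATIVE_CONFUSABLES.get(char, char) for char in folded)

class UsernameRegistry:
    """Keep the keys of all taken names and refuse look-alike registrations."""

    def __init__(self, reserved: list[str]) -> None:
        self._keys = {username_key(name) for name in reserved}

    def register(self, username: str) -> bool:
        """Return True and record the name if its key is free, else False."""
        key = username_key(username)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True

registry = UsernameRegistry(["admin", "root", "administrator"])
print(registry.register("\U0001F170dmin"))  # False: collides with "admin"
print(registry.register("alice"))           # True: key is free
print(registry.register("ALICE"))           # False: same key as "alice"
```

Storing the key in a uniquely indexed column alongside the display name would let the database enforce the same rule, though that design is a suggestion rather than something the article prescribes.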

Invisible Characters

Several Unicode characters are invisible but occupy space in strings. They can be used to bypass keyword filters, create hidden content, or confuse parsers.

Common Invisible Characters

Code Point        Name                              Risk
U+200B            Zero-Width Space                  Split "badword" into "bad" + U+200B + "word" to bypass filters
U+200C            Zero-Width Non-Joiner             Alter emoji rendering
U+200D            Zero-Width Joiner                 Join emoji into sequences
U+FEFF            Zero-Width No-Break Space (BOM)   Bypass string comparisons
U+2060            Word Joiner                       Invisible word boundary manipulation
U+00AD            Soft Hyphen                       Invisible hyphen in text
U+180E            Mongolian Vowel Separator         Bypass filters
U+E0020–U+E007F   Tags block                        Subdivision flag components; invisible as standalone

def find_invisible_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, name) for each known invisible character in text."""

    INVISIBLE = {
        '\u200B': 'Zero-Width Space',
        '\u200C': 'Zero-Width Non-Joiner',
        '\u200D': 'Zero-Width Joiner',
        '\u200E': 'Left-to-Right Mark',
        '\u200F': 'Right-to-Left Mark',
        '\u202A': 'Left-to-Right Embedding',
        '\u202B': 'Right-to-Left Embedding',
        '\u202C': 'Pop Directional Formatting',
        '\u202D': 'Left-to-Right Override',
        '\u202E': 'Right-to-Left Override',
        '\u2060': 'Word Joiner',
        '\u2061': 'Function Application',
        '\uFEFF': 'Zero-Width No-Break Space',
        '\u00AD': 'Soft Hyphen',
    }

    results = []
    for i, char in enumerate(text):
        if char in INVISIBLE:
            results.append((i, char, INVISIBLE[char]))
    return results

text = "Hello\u200Bworld"  # "Hello" + ZWS + "world"
print(find_invisible_chars(text))
# [(5, '\u200b', 'Zero-Width Space')]

# Strip invisible characters for security-sensitive comparisons
def strip_invisible(text: str) -> str:
    INVISIBLE_SET = {'\u200B', '\u200C', '\u200D', '\u200E', '\u200F',
                     '\u202A', '\u202B', '\u202C', '\u202D', '\u202E',
                     '\u2060', '\u2061', '\uFEFF', '\u00AD'}
    return ''.join(c for c in text if c not in INVISIBLE_SET)
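
A hand-maintained set like the one above can miss characters from the table, such as U+180E or the Tags block. As a complementary sketch, the code below flags every character in Unicode general category Cf (format), which covers all the code points listed in the table. The function names are hypothetical, and relying on category Cf is an assumption about what counts as "invisible", not an exhaustive definition.

```python
import unicodedata

def find_format_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for every general-category Cf character."""
    found = []
    for index, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "<unnamed>")
            found.append((index, f"U+{ord(char):04X}", name))
    return found

def strip_format_chars(text: str) -> str:
    """Drop Cf characters; use the result for comparisons, not for display."""
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

message = "bad\u200Bword"
print(find_format_chars(message))                 # [(3, 'U+200B', 'ZERO WIDTH SPACE')]
print("badword" in message)                       # False: the filter is bypassed
print("badword" in strip_format_chars(message))   # True: the filter matches again
```

Because the Zero-Width Joiner is also category Cf, stripping it breaks multi-person and profession emoji into their parts, so the stripped string is best kept as a comparison copy while the original is stored for display.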

Bidirectional Text Attacks (BiDi)

Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) control the visual order of text. These can be used to make malicious code or content appear benign in a display while the underlying string contains harmful content.

The "Trojan Source" attack (CVE-2021-42574) demonstrated how BiDi characters can hide malicious code in source files:

# Dangerous: comment appears to end before it does in an editor
# using BiDi override characters
code_with_bidi = '/* \u202E } if (isAdmin) { \u202D access = true; */'

def strip_bidi_controls(text: str) -> str:
    """Remove Unicode bidirectional control characters."""
    BIDI_CONTROLS = set('\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069\u200E\u200F')
    return ''.join(c for c in text if c not in BIDI_CONTROLS)

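# For user-generated content, strip BiDi controls before storing or displaying
user_input = "invoice\u202Efdp.exe"  # the override makes this render like "invoiceexe.pdf"
sanitized = strip_bidi_controls(user_input)  # "invoicefdp.exe"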
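
For source files, reporting where BiDi controls occur is often more useful than silently stripping them, since a reviewer needs to inspect the affected lines. The following is a minimal sketch of such a scanner; `find_bidi_controls` is a hypothetical helper, and the set covers only the embedding, override, and isolate controls named in this section.

```python
import unicodedata

# Embedding/override controls (U+202A-U+202E) and isolates (U+2066-U+2069)
BIDI_CONTROLS = frozenset("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, character name) for each BiDi control, 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if char in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(char)))
    return hits

sample = "access = false\n/* \u202E } if (isAdmin) { \u202D access = true; */"
for hit in find_bidi_controls(sample):
    print(hit)
# (2, 4, 'RIGHT-TO-LEFT OVERRIDE')
# (2, 23, 'LEFT-TO-RIGHT OVERRIDE')
```

A check like this could run as a pre-commit hook or CI step that fails when the list is non-empty, which is one possible policy rather than a requirement.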

SQL and Code Injection via Emoji

Emoji themselves do not cause SQL injection — the risk comes from improperly encoded input or databases with incorrect character set configuration.

MySQL utf8 vs utf8mb4 Truncation

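On MySQL with the utf8 charset (an alias for utf8mb3, at most 3 bytes per character), a 4-byte emoji cannot be stored. In non-strict SQL mode the server truncates the value at the emoji and raises only a warning; in strict mode the INSERT fails with an "Incorrect string value" error: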

-- MySQL utf8 (NOT utf8mb4): this query may truncate silently
INSERT INTO users (name) VALUES ('Hello 🔥 World');
-- Stored: 'Hello ' (truncated at emoji)

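This truncation can be exploited, because the value the application validated is no longer the value the database stored:

username = "safe_user 🔥' OR '1'='1"
# If truncated at the emoji, MySQL stores only 'safe_user '
# Code paths that never round-trip through the database still see and compare
# the full string, so validation and storage disagree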

Fix: Always use utf8mb4 in MySQL for any column that accepts user input.

ALTER TABLE users
  MODIFY COLUMN name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
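
Until every column is migrated, the application can detect input that a legacy 3-byte column cannot hold and reject it up front. This sketch checks for code points above U+FFFF, which are exactly the ones that take 4 bytes in UTF-8; both function names are hypothetical, and `simulate_legacy_truncation` only mimics the non-strict behaviour described above rather than calling MySQL.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if text contains a code point above U+FFFF (4 bytes in UTF-8)."""
    return any(ord(char) > 0xFFFF for char in text)

def simulate_legacy_truncation(text: str) -> str:
    """Mimic non-strict storage in a 3-byte column: keep text up to the first 4-byte character."""
    for index, char in enumerate(text):
        if ord(char) > 0xFFFF:
            return text[:index]
    return text

name = "Hello \U0001F525 World"  # "Hello 🔥 World"
print(len("\U0001F525".encode("utf-8")))        # 4
print(needs_utf8mb4(name))                      # True
print(repr(simulate_legacy_truncation(name)))   # 'Hello '
```

Rejecting or flagging such input at the application boundary keeps validation and storage in agreement even if a legacy column is overlooked during migration.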

Python Parameterized Queries (Safe Pattern)

import psycopg2

DSN = "dbname=app"  # placeholder connection string; adjust for your environment

# SAFE: parameterized query handles emoji and special chars correctly
def create_user(name: str) -> None:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO users (name) VALUES (%s)",
                (name,)  # emoji handled safely by driver
            )
        conn.commit()
    finally:
        conn.close()  # release the connection even if the insert fails

# UNSAFE: string formatting (never do this)
def create_user_unsafe(name: str) -> None:
    query = f"INSERT INTO users (name) VALUES ('{name}')"  # DO NOT DO THIS
    # Any single quote in name ends the string literal early, with or without
    # an emoji in front of it, and the rest of the input is parsed as SQL
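
The same pattern can be tried without a database server. The sketch below swaps psycopg2 for the standard library's `sqlite3` module (which uses `?` placeholders instead of `%s`) purely so that it runs anywhere; `demo_parameterized_insert` is a hypothetical helper name.

```python
import sqlite3

def demo_parameterized_insert(name: str) -> str:
    """Insert name with a bound parameter into an in-memory table and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (name TEXT)")
        # The value is bound as data, so quotes and emoji are never parsed as SQL
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        conn.commit()
        (stored,) = conn.execute("SELECT name FROM users").fetchone()
        return stored
    finally:
        conn.close()

tricky = "safe_user \U0001F525' OR '1'='1"
print(demo_parameterized_insert(tricky) == tricky)  # True: stored verbatim
```

The round trip returning the exact input shows that neither the quote nor the emoji altered the statement.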

Emoji in Filenames and Path Traversal

Emoji in user-supplied filenames are valid on most modern filesystems (APFS, ext4, NTFS) but can cause issues:

import os
import unicodedata

def sanitize_filename(filename: str) -> str:
    """
    Sanitize a filename: keep letters, digits, spaces, dots, hyphens, and emoji.
    Remove path components and control/invisible characters.
    """
    # Treat backslashes as separators too, then keep only the final component.
    # Doing this first prevents path traversal such as "../../etc/passwd".
    filename = os.path.basename(filename.replace('\\', '/'))

    # Remove null bytes and other control/format characters (categories C*),
    # including tabs and newlines, which do not belong in filenames.
    # Note: this also drops the Zero-Width Joiner, so ZWJ emoji sequences split.
    filename = ''.join(c for c in filename if unicodedata.category(c)[0] != 'C')

    # Avoid hidden files and the special names "." and ".."
    if filename.startswith('.'):
        filename = '_' + filename

    return filename

# Test
print(sanitize_filename("../../../etc/passwd"))    # "passwd"
print(sanitize_filename("report 🚀 2025.pdf"))     # "report 🚀 2025.pdf"
print(sanitize_filename("file\x00name.txt"))       # "filename.txt"

Content Moderation and Filtering

Emoji can be used to convey harmful intent while bypassing keyword filters. 🔪 (kitchen knife), 💊 (pill), and combinations can represent drug dealing or violence. Content moderation systems must treat emoji as meaningful content, not noise.

# Emoji-aware profanity/content filter
BLOCKED_SEQUENCES = {
    "🔪💊",     # weapon + drug
    "💉🩸",     # injection + blood (context-dependent)
    "👊🤜",     # fighting combo
}

def contains_blocked_sequence(text: str, blocks: set[str]) -> bool:
    # Strip variation selectors for normalized comparison
    normalized = text.replace('\uFE0F', '').replace('\uFE0E', '')
    return any(seq in normalized for seq in blocks)

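# Rate limiting emoji usage (spam detection)
def emoji_density(text: str) -> float:
    """Return ratio of emoji grapheme clusters to total grapheme clusters."""
    import regex as rx
    clusters = rx.findall(r'\X', text)
    # Count clusters rather than code points, so a ZWJ sequence counts once
    emoji_count = sum(
        1 for cluster in clusters
        if rx.search(r'\p{Extended_Pictographic}', cluster)
    )
    return emoji_count / len(clusters) if clusters else 0.0

# Flag high-emoji-density messages as potential spam
user_message = "🎰🎰🎰 WIN BIG 🤑🤑🤑 CLICK HERE 💰💰💰"
# This message has 9 emoji among 30 grapheme clusters (density about 0.3),
# so the threshold must sit below that; tune it against real traffic
if emoji_density(user_message) > 0.25:
    print("Flag for review:", user_message)  # replace with your moderation hook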
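
Blocked-sequence matching is easy to evade by inserting a space, a zero-width character, or a skin-tone modifier between the emoji. The sketch below normalizes text before matching; `normalize_for_matching` and `contains_blocked` are hypothetical names, and removing all whitespace is an assumption that trades more false positives for fewer bypasses.

```python
import unicodedata

BLOCKED = {
    "\U0001F52A\U0001F48A",  # knife + pill
    "\U0001F44A\U0001F91C",  # fist + right-facing fist
}
# Fitzpatrick skin-tone modifiers U+1F3FB..U+1F3FF
SKIN_TONES = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}
VARIATION_SELECTORS = {"\uFE0E", "\uFE0F"}

def normalize_for_matching(text: str) -> str:
    """Drop characters commonly inserted to split an emoji sequence."""
    kept = []
    for char in text:
        if char in SKIN_TONES or char in VARIATION_SELECTORS:
            continue
        if unicodedata.category(char) == "Cf":  # zero-width and other format chars
            continue
        if char.isspace():
            continue
        kept.append(char)
    return "".join(kept)

def contains_blocked(text: str, blocks: set[str] = BLOCKED) -> bool:
    """True if any blocked sequence appears after normalization."""
    normalized = normalize_for_matching(text)
    return any(seq in normalized for seq in blocks)

print(contains_blocked("\U0001F52A\u200B\U0001F48A"))          # True: zero-width space removed
print(contains_blocked("\U0001F44A\U0001F3FD \U0001F91C"))     # True: skin tone and space removed
print(contains_blocked("\U0001F48A after dinner"))             # False
```

Entries in the blocked set should themselves be stored in normalized form, otherwise a variation selector inside an entry would prevent it from ever matching.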

Input Validation Checklist

For any user input field that accepts emoji:

  • [ ] Use utf8mb4 in MySQL; UTF-8 in PostgreSQL
  • [ ] Parameterize all database queries
  • [ ] Strip or reject BiDi control characters
  • [ ] Strip invisible characters (U+200B, U+FEFF, etc.) before comparison
  • [ ] Normalize emoji (strip variation selectors) for identity comparisons
  • [ ] Validate string length by grapheme cluster count, not byte or code unit count
  • [ ] Check filenames for path traversal even when emoji are present
  • [ ] Log and monitor unusual Unicode character categories in inputs
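
To illustrate why the checklist asks for length validation by grapheme cluster, the sketch below reports three other ways of measuring the same string, using only the standard library. `length_report` is a hypothetical helper; counting grapheme clusters themselves requires a segmentation library (for example the third-party `regex` module's `\X` pattern) and is not shown here.

```python
def length_report(text: str) -> dict[str, int]:
    """Measure text in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

# Family emoji: man + ZWJ + woman + ZWJ + girl, shown as one symbol
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(length_report(family))
# {'code_points': 5, 'utf16_code_units': 8, 'utf8_bytes': 18}
```

A user perceives this string as a single character, yet a limit of 10 enforced in bytes would reject it while a limit enforced in code points would count it as 5, which is the mismatch the checklist item guards against.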
