Emoji Security: Homoglyphs, Spoofing, Invisible Characters, and Filtering

Emoji Security Considerations

Emoji introduce security concerns that go beyond typical text sanitization. Invisible characters, visual spoofing with look-alike symbols, bidirectional text overrides, and unexpected behavior in SQL queries are real attack vectors. This guide covers the threat model and practical defenses.

Homoglyph Attacks

A homoglyph attack uses characters that look visually identical (or very similar) to legitimate characters to deceive users or bypass filters.

Emoji Lookalikes

Several emoji and emoji-like symbols resemble standard ASCII characters:

Emoji | Code Point | Looks Like | Risk
🅰️ | U+1F170 | Letter A | Username impersonation
🅱️ | U+1F171 | Letter B | Brand spoofing
Ⓜ️ | U+24C2 | Letter M | Domain spoofing
🄰 | U+1F130 | Letter A | Content filter bypass
𝕒 | U+1D552 | Letter a | Script injection bypass
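
One way to see why these characters need different defenses is to check how each behaves under NFKC compatibility normalization. The sketch below uses only Python's standard `unicodedata` module; the helper name `describe` is illustrative and not part of any library.

```python
import unicodedata

def describe(char: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, NFKC-normalized form) for one character."""
    code_point = f"U+{ord(char):04X}"
    name = unicodedata.name(char, "<unnamed>")
    folded = unicodedata.normalize("NFKC", char)
    return code_point, name, folded

# Base characters from the table above (variation selectors omitted)
for ch in ["\U0001F170", "\U0001F171", "\u24C2", "\U0001F130", "\U0001D552"]:
    code_point, name, folded = describe(ch)
    status = "folds to " + repr(folded) if folded != ch else "unchanged by NFKC"
    print(f"{code_point}  {name}: {status}")
```

Characters with a compatibility decomposition (Ⓜ, 🄰, 𝕒) fold to plain letters under NFKC, while 🅰 and 🅱 do not, so normalization alone will not catch them and a confusables lookup is still needed.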

Confusable Detection

Unicode publishes a confusables dataset (confusables.txt, part of Unicode Technical Standard #39) that maps look-alike characters to one another. Use it to detect potentially deceptive inputs:

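# Install: pip install confusable-homoglyphs
from confusable_homoglyphs import confusables

def is_confusable_with_ascii(text: str) -> bool:
    """
    Return True if the text contains non-ASCII characters that the
    confusables dataset lists as look-alikes of Latin-script characters.
    """
    # Only non-ASCII characters can impersonate ASCII ones
    non_ascii = ''.join(c for c in text if ord(c) > 127)
    # Check the characters together with is_confusable(). is_dangerous() only
    # flags mixed-script strings, so calling it on a single character never
    # returns True. is_confusable() returns a list of findings, or False.
    return bool(confusables.is_confusable(non_ascii, preferred_aliases=['latin']))

# Examples
print(is_confusable_with_ascii("admin"))          # False
print(is_confusable_with_ascii("аdmin"))          # True (Cyrillic а, U+0430)
print(is_confusable_with_ascii("🅰dmin"))         # Depends on whether the dataset lists U+1F170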

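# Normalize confusable characters to ASCII equivalents
import unicodedata
from confusable_homoglyphs import confusables

def skeleton(text: str) -> str:
    """
    Approximate a confusable 'skeleton' by mapping each non-ASCII character
    to an ASCII look-alike when the dataset lists one. This simplifies the
    skeleton algorithm in UTS #39 and is not a full implementation of it.
    """
    # Normalize to NFC first
    text = unicodedata.normalize('NFC', text)
    result = []
    for char in text:
        mapped = char
        if ord(char) > 127:
            # is_confusable() returns False or a list of findings, each with
            # a 'homoglyphs' list of {'c': look-alike, 'n': name} entries
            found = confusables.is_confusable(char) or []
            homoglyphs = [h['c'] for f in found for h in f['homoglyphs']]
            # Use the first ASCII confusable if available
            mapped = next((h for h in homoglyphs if h.isascii()), char)
        result.append(mapped)
    return ''.join(result)

# Two strings that look the same to a user should share a skeleton
print(skeleton("pаypal.com") == skeleton("paypal.com"))  # True if the dataset maps Cyrillic а to "a"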

Username Squatting Defense

When registering usernames, normalize and check for confusable collisions:

import unicodedata

def normalize_username(username: str) -> str:
    """
    Normalize a username to detect confusable duplicates.
    1. Unicode NFKC normalization (folds compatibility forms such as 𝕒 → a)
    2. Case-folding
    3. Strip variation selectors and combining marks
    """
    # NFKC normalization: NFC alone leaves styled letters like 𝕒 or Ⓜ unchanged
    normalized = unicodedata.normalize('NFKC', username)
    # Case-fold (Unicode-aware lowercase)
    normalized = normalized.casefold()
    # Remove variation selectors
    normalized = normalized.replace('\uFE0E', '').replace('\uFE0F', '')
    # Remove combining characters (accents, etc.)
    normalized = ''.join(
        c for c in unicodedata.normalize('NFD', normalized)
        if unicodedata.category(c) != 'Mn'
    )
    return normalized

# Check for collisions before registration
reserved_names = {"admin", "root", "administrator"}
new_username = "𝕒dmin"  # U+1D552 folds to "a" under NFKC

if normalize_username(new_username) in reserved_names:
    raise ValueError("Username too similar to a reserved name")

# Note: 🅰 (U+1F170) has no compatibility mapping, so "🅰dmin" passes this check
# unchanged. Catching it requires a confusables lookup in addition to normalization.

Invisible Characters

Several Unicode characters are invisible but occupy space in strings. They can be used to bypass keyword filters, create hidden content, or confuse parsers.

Common Invisible Characters

Code Point | Name | Risk
U+200B | Zero-Width Space | Split "badword" → "bad<U+200B>word" to bypass filters
U+200C | Zero-Width Non-Joiner | Alter emoji rendering
U+200D | Zero-Width Joiner | Join emoji into sequences
U+FEFF | Zero-Width No-Break Space (BOM) | Bypass string comparisons
U+2060 | Word Joiner | Invisible word boundary manipulation
U+00AD | Soft Hyphen | Invisible hyphen in text
U+180E | Mongolian Vowel Separator | Bypass filters
U+E0020–U+E007F | Tags block | Subdivision flag components; invisible as standalone

def find_invisible_chars(text: str) -> list[tuple[int, str, str]]:
    """Find invisible Unicode characters in text."""

    INVISIBLE = {
        '\u200B': 'Zero-Width Space',
        '\u200C': 'Zero-Width Non-Joiner',
        '\u200D': 'Zero-Width Joiner',
        '\u200E': 'Left-to-Right Mark',
        '\u200F': 'Right-to-Left Mark',
        '\u202A': 'Left-to-Right Embedding',
        '\u202B': 'Right-to-Left Embedding',
        '\u202C': 'Pop Directional Formatting',
        '\u202D': 'Left-to-Right Override',
        '\u202E': 'Right-to-Left Override',
        '\u2060': 'Word Joiner',
        '\u2061': 'Function Application',
        '\uFEFF': 'Zero-Width No-Break Space',
        '\u00AD': 'Soft Hyphen',
    }

    results = []
    for i, char in enumerate(text):
        if char in INVISIBLE:
            results.append((i, char, INVISIBLE[char]))
    return results

text = "Hello\u200Bworld"  # "Hello" + ZWS + "world"
print(find_invisible_chars(text))
# [(5, '\u200b', 'Zero-Width Space')]

# Strip invisible characters for security-sensitive comparisons
def strip_invisible(text: str) -> str:
    INVISIBLE_SET = {'\u200B', '\u200C', '\u200D', '\u200E', '\u200F',
                     '\u202A', '\u202B', '\u202C', '\u202D', '\u202E',
                     '\u2060', '\u2061', '\uFEFF', '\u00AD'}
    return ''.join(c for c in text if c not in INVISIBLE_SET)
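
The hard-coded sets above do not cover every entry in the table (U+180E and the Tags block are missing, for example). A category-based filter is one possible alternative: the sketch below removes every character in Unicode general category Cf (format) using the standard `unicodedata` module. The function name `strip_format_chars` and the `keep_zwj` flag are illustrative assumptions, not an established API.

```python
import unicodedata

ZWJ = "\u200D"  # needed inside multi-part emoji sequences such as family emoji

def strip_format_chars(text: str, keep_zwj: bool = False) -> str:
    """Remove every character in Unicode category Cf, optionally keeping ZWJ."""
    return "".join(
        c for c in text
        if unicodedata.category(c) != "Cf" or (keep_zwj and c == ZWJ)
    )

print(strip_format_chars("bad\u200Bword"))        # zero-width space removed
print(strip_format_chars("tag\U000E0067ged"))     # tag character removed

# Keep ZWJ when the result is displayed rather than only compared
couple = "\U0001F468\u200D\U0001F469"
print(strip_format_chars(couple, keep_zwj=True) == couple)
```

Category Cf also contains some characters with legitimate uses in certain scripts, so this kind of filter is better suited to producing a comparison key than to rewriting the text that gets stored or displayed.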

Bidirectional Text Attacks (BiDi)

Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) control the visual order of text. These can be used to make malicious code or content appear benign in a display while the underlying string contains harmful content.

The "Trojan Source" attack (CVE-2021-42574) demonstrated how BiDi characters can hide malicious code in source files:

# Dangerous: comment appears to end before it does in an editor
# using BiDi override characters
code_with_bidi = '/* \u202E } if (isAdmin) { \u202D access = true; */'

def strip_bidi_controls(text: str) -> str:
    """Remove Unicode bidirectional control characters."""
    BIDI_CONTROLS = set('\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069\u200E\u200F')
    return ''.join(c for c in text if c not in BIDI_CONTROLS)

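# For user-generated content, always strip BiDi controls
user_input = "user\u202Etxt.exe"  # example input containing a Right-to-Left Override
sanitized = strip_bidi_controls(user_input)
print(sanitized)  # usertxt.exe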
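
For source files, stripping is usually less appropriate than reporting, since a reviewer needs to know where the controls were. The following is a minimal sketch of such a scanner, assuming the same set of control characters as `strip_bidi_controls` above; the function name `find_bidi_controls` is hypothetical.

```python
import unicodedata

BIDI_CONTROLS = set(
    "\u202A\u202B\u202C\u202D\u202E"   # embeddings, overrides, pop
    "\u2066\u2067\u2068\u2069"         # isolates
    "\u200E\u200F"                     # directional marks
)

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, character name) for each BiDi control found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if char in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(char)))
    return findings

sample = "ok = True\n/* \u202E } if (isAdmin) { \u202D access = true; */"
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, col {col}: {name}")
```

A check like this could run in a pre-commit hook or CI step and fail the build when any finding is returned.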

SQL and Code Injection via Emoji

Emoji themselves do not cause SQL injection — the risk comes from improperly encoded input or databases with incorrect character set configuration.

MySQL utf8 vs utf8mb4 Truncation

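On MySQL with the utf8 charset (an alias for utf8mb3, at most 3 bytes per character), a 4-byte emoji cannot be stored. With strict SQL mode disabled, the server truncates the string at the emoji and raises only a warning; with strict mode enabled, the statement fails with an "Incorrect string value" error instead: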

-- MySQL utf8 (NOT utf8mb4): this query may truncate silently
INSERT INTO users (name) VALUES ('Hello 🔥 World');
-- Stored: 'Hello ' (truncated at emoji)

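This truncation can be exploited because the value the application validated is no longer the value that was stored:

username = "safe_user 🔥' OR '1'='1"
# Stored after truncation at the emoji: 'safe_user '
# Application code that never round-trips through the database still sees
# and compares the full, untruncated string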

Fix: Always use utf8mb4 in MySQL for any column that accepts user input.

ALTER TABLE users
  MODIFY COLUMN name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
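
Before or during such a migration, it can help to know which inputs are affected. Characters above U+FFFF are the ones that need four bytes in UTF-8, so a simple check is possible in application code. The sketch below is an illustration only: `needs_utf8mb4` and `simulate_utf8mb3_truncation` are hypothetical helper names, and the second function merely models the truncation described above rather than reproducing MySQL's implementation.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF and so needs 4 bytes in UTF-8."""
    return any(ord(c) > 0xFFFF for c in text)

def simulate_utf8mb3_truncation(text: str) -> str:
    """Model of the truncation: keep everything before the first 4-byte character."""
    for i, c in enumerate(text):
        if ord(c) > 0xFFFF:
            return text[:i]
    return text

print(len("🔥".encode("utf-8")))                            # 4
print(needs_utf8mb4("Hello 🔥 World"))                      # True
print(needs_utf8mb4("héllo ✓"))                             # False
print(repr(simulate_utf8mb3_truncation("Hello 🔥 World")))  # 'Hello '
```

Rejecting or logging inputs for which `needs_utf8mb4` returns True is a stopgap; converting the column and the connection charset to utf8mb4 remains the actual fix.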

Python Parameterized Queries (Safe Pattern)

import psycopg2

DSN = "dbname=app"  # placeholder connection string; replace with your own

# SAFE: parameterized query handles emoji and special chars correctly
def create_user(name: str) -> None:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO users (name) VALUES (%s)",
                (name,)  # the driver quotes and escapes the value, emoji included
            )
        conn.commit()
    finally:
        conn.close()

# UNSAFE: string formatting, never do this
def create_user_unsafe(name: str) -> str:
    query = f"INSERT INTO users (name) VALUES ('{name}')"  # DO NOT DO THIS
    # Any single quote in name ends the string literal early, with or without emoji
    return query

Emoji in Filenames and Path Traversal

Emoji in user-supplied filenames are valid on most modern filesystems (APFS, ext4, NTFS), but the names still need sanitizing for path separators, control characters, and hidden-file prefixes:

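import os
import unicodedata

def sanitize_filename(filename: str) -> str:
    """
    Sanitize a filename: keep only the final path component, drop control
    and format characters, and neutralize leading dots. Letters, digits,
    spaces, dots, hyphens, and emoji pass through unchanged.
    """
    # Prevent path traversal: treat backslashes as separators too, then keep
    # only the last component. This must happen before separators are removed,
    # otherwise "../../x" collapses into a single name full of dots.
    filename = os.path.basename(filename.replace('\\', '/'))

    # Remove invisible/control characters (all Unicode categories starting
    # with C), which includes null bytes, tabs, and newlines.
    # Note: category Cf includes ZWJ, so multi-part emoji sequences are
    # split into their component emoji.
    filename = ''.join(c for c in filename if unicodedata.category(c)[0] != 'C')

    # Avoid hidden files and the special names "." and ".."
    if filename.startswith('.'):
        filename = '_' + filename

    return filename

# Test
print(sanitize_filename("../../../etc/passwd"))    # "passwd"
print(sanitize_filename("report 🚀 2025.pdf"))     # "report 🚀 2025.pdf"
print(sanitize_filename("file\x00name.txt"))       # "filename.txt"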

Content Moderation and Filtering

Emoji can be used to convey harmful intent while bypassing keyword filters. 🔪 (kitchen knife), 💊 (pill), and combinations can represent drug dealing or violence. Content moderation systems must treat emoji as meaningful content, not noise.

import regex

# Emoji-aware profanity/content filter
BLOCKED_SEQUENCES = {
    "🔪💊",     # weapon + drug
    "💉🩸",     # injection + blood (context-dependent)
    "👊🤜",     # fighting combo
}

def contains_blocked_sequence(text: str, blocks: set[str]) -> bool:
    # Strip variation selectors for normalized comparison
    normalized = text.replace('\uFE0F', '').replace('\uFE0E', '')
    return any(seq in normalized for seq in blocks)

# Rate limiting emoji usage (spam detection)
def emoji_density(text: str) -> float:
    """Return ratio of emoji grapheme clusters to total grapheme clusters."""
    clusters = regex.findall(r'\X', text)
    # Count clusters, not code points, so a ZWJ sequence counts as one emoji
    emoji_count = sum(
        1 for cluster in clusters
        if regex.search(r'\p{Extended_Pictographic}', cluster)
    )
    return emoji_count / len(clusters) if clusters else 0.0

def flag_for_review(message: str) -> None:
    """Placeholder hook; replace with a call into your moderation queue."""
    print(f"Flagged for review: {message!r}")

# Flag high-emoji-density messages as potential spam
EMOJI_DENSITY_THRESHOLD = 0.25  # illustrative value; tune against real traffic
user_message = "🎰🎰🎰 WIN BIG 🤑🤑🤑 CLICK HERE 💰💰💰"
if emoji_density(user_message) > EMOJI_DENSITY_THRESHOLD:  # 9 emoji / 30 clusters = 0.3
    flag_for_review(user_message)
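
The blocked-sequence check above strips variation selectors, but skin tone modifiers (U+1F3FB to U+1F3FF) would also break a plain substring match, because they sit between the emoji. The sketch below extends the normalization step to drop them as well; `normalize_for_matching` is an illustrative name, and whether skin tone should be ignored is a policy assumption for this example.

```python
# Fitzpatrick skin tone modifiers, U+1F3FB through U+1F3FF
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}
VARIATION_SELECTORS = {"\uFE0E", "\uFE0F"}

def normalize_for_matching(text: str) -> str:
    """Drop skin tone modifiers and variation selectors before substring matching."""
    skip = SKIN_TONE_MODIFIERS | VARIATION_SELECTORS
    return "".join(c for c in text if c not in skip)

blocked = {"\U0001F44A\U0001F91C"}                  # 👊🤜
message = "\U0001F44A\U0001F3FD\U0001F91C\U0001F3FD"  # same pair with medium skin tone

print(any(seq in message for seq in blocked))                          # False
print(any(seq in normalize_for_matching(message) for seq in blocked))  # True
```

The normalized string is only a matching key; the original message should be stored and displayed unchanged.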

Input Validation Checklist

For any user input field that accepts emoji:

  • [ ] Use utf8mb4 in MySQL; UTF-8 in PostgreSQL
  • [ ] Parameterize all database queries
  • [ ] Strip or reject BiDi control characters
  • [ ] Strip invisible characters (U+200B, U+FEFF, etc.) before comparison
  • [ ] Normalize emoji (strip variation selectors) for identity comparisons
  • [ ] Validate string length by grapheme cluster count, not byte or code unit count
  • [ ] Check filenames for path traversal even when emoji are present
  • [ ] Log and monitor unusual Unicode character categories in inputs
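
To illustrate why the length item matters, the sketch below compares the three machine-level measures for a single visible emoji using only the standard library. The helper name `length_measures` is illustrative. Counting grapheme clusters themselves requires a segmentation library, such as the `regex` module's `\X` pattern used in the density example earlier in this guide.

```python
def length_measures(text: str) -> dict[str, int]:
    """Return the UTF-8 byte, UTF-16 code unit, and code point counts of text."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
    }

# 👨‍👩‍👧 is one grapheme cluster: three emoji joined by two zero-width joiners
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(length_measures(family))
# {'utf8_bytes': 18, 'utf16_code_units': 8, 'code_points': 5}
```

A field limited to 10 "characters" would accept or reject this one-emoji input depending on which of these counts the validation code happens to use.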


Glossary Terms

BOM
The Byte Order Mark (U+FEFF) placed at the start of a text file to indicate byte order (endianness) in UTF-16/UTF-32 encodings.
Code Point
A unique numerical value assigned to each character in the Unicode standard, written in the format U+XXXX (e.g., U+1F600 for 😀).
Code Unit
The minimum bit combination used for encoding a character: 8-bit for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32.
Emoji
A Japanese word (絵文字) meaning 'picture character': small graphical symbols used in digital communication to express ideas, emotions, and objects.
Grapheme Cluster
A user-perceived character that may be composed of multiple Unicode code points displayed as a single visual unit.
UTF-8
A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites).
Unicode
Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji.
Variation Selector (VS)
Unicode characters (VS-15 U+FE0E and VS-16 U+FE0F) that modify whether a character renders in text (monochrome) or emoji (colorful) presentation.
