Emoji Security: Homoglyphs, Spoofing, Invisible Characters, and Filtering

Emoji Security Considerations

Emoji introduce security concerns that go beyond typical text sanitization. Invisible characters, visual spoofing with look-alike symbols, bidirectional text overrides, and unexpected behavior in SQL queries are real attack vectors. This guide covers the threat model and practical defenses.

Homoglyph Attacks

A homoglyph attack uses characters that look visually identical (or very similar) to legitimate characters to deceive users or bypass filters.
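To see why visual inspection fails, a minimal sketch using only Python's standard library can print the code point and Unicode name of every character in a string. The helper name `describe` is illustrative and not part of any library.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

latin = "admin"
spoofed = "\u0430dmin"  # first letter is CYRILLIC SMALL LETTER A

print(latin == spoofed)        # False
print(describe(latin)[0])      # ('U+0061', 'LATIN SMALL LETTER A')
print(describe(spoofed)[0])    # ('U+0430', 'CYRILLIC SMALL LETTER A')
```

Two strings that render identically compare unequal, and only the per-character names reveal the difference.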

Emoji Lookalikes

Several emoji and emoji-like symbols resemble standard ASCII letters:

Character	Code Point	Looks Like	Risk
🅰️	U+1F170	Letter A	Username impersonation
🅱️	U+1F171	Letter B	Brand spoofing
Ⓜ️	U+24C2	Letter M	Domain spoofing
🄰	U+1F130	Letter A	Content filter bypass
𝕒	U+1D552	Letter a	Script injection bypass
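Some of these look-alikes carry a Unicode compatibility decomposition, so NFKC normalization folds them to plain letters, while others pass through unchanged. The following sketch, using only the standard library, reports which is which for the base code points in the table; the helper name `fold` is hypothetical.

```python
import unicodedata

def fold(ch: str) -> str:
    """Apply NFKC compatibility normalization to a single character."""
    return unicodedata.normalize("NFKC", ch)

samples = {
    "U+24C2": "\u24C2",        # CIRCLED LATIN CAPITAL LETTER M
    "U+1F130": "\U0001F130",   # SQUARED LATIN CAPITAL LETTER A
    "U+1D552": "\U0001D552",   # MATHEMATICAL DOUBLE-STRUCK SMALL A
    "U+1F170": "\U0001F170",   # NEGATIVE SQUARED LATIN CAPITAL LETTER A
    "U+1F171": "\U0001F171",   # NEGATIVE SQUARED LATIN CAPITAL LETTER B
}

for label, ch in samples.items():
    folded = fold(ch)
    if folded != ch:
        print(label, "folds to", repr(folded))
    else:
        print(label, "is unchanged by NFKC")
```

Characters reported as unchanged cannot be caught by normalization alone and need a confusables-based mapping instead.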

Confusable Detection

Unicode publishes a Confusables dataset (confusables.txt, part of Unicode Technical Standard #39) that maps look-alike characters. Use it to detect potentially deceptive inputs:

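# Install: pip install confusable-homoglyphs
from confusable_homoglyphs import confusables

def is_confusable_with_ascii(text: str) -> bool:
    """
    Return True if the text contains non-ASCII characters that the
    Confusables dataset lists as look-alikes of Latin characters.
    """
    # Only non-ASCII characters are candidates for spoofing ASCII
    non_ascii = ''.join(char for char in text if ord(char) > 127)
    if not non_ascii:
        return False
    # is_dangerous() also requires mixed scripts, so it never flags a
    # single character on its own; is_confusable() returns False or a
    # list of findings for characters confusable with the preferred script
    return bool(confusables.is_confusable(non_ascii, preferred_aliases=['latin']))

# Examples
print(is_confusable_with_ascii("admin"))          # False
print(is_confusable_with_ascii("\u0430dmin"))     # True (Cyrillic а, U+0430)
print(is_confusable_with_ascii("🅰dmin"))         # Result depends on the dataset version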

# Normalize confusable characters to ASCII equivalents
# Install: pip install confusables
# (a separate package from confusable-homoglyphs; it provides confusable_characters)
import unicodedata
from confusables import confusable_characters

def skeleton(text: str) -> str:
    """
    Return a simplified 'skeleton' of text for confusable comparison.
    This approximates, but does not implement, the UTS #39 skeleton algorithm.
    """
    # Normalize to NFC first
    text = unicodedata.normalize('NFC', text)
    result = []
    for char in text:
        if char.isascii():
            result.append(char)  # ASCII is already the target form
            continue
        maps = confusable_characters(char)
        # Use the first single-character ASCII confusable if available
        mapped = next((m for m in (maps or []) if len(m) == 1 and m.isascii()), char)
        result.append(mapped)
    return ''.join(result)

# Two strings that look the same to a user should share a skeleton
# (expected True when the dataset maps U+0430 to "a")
print(skeleton("p\u0430ypal.com") == skeleton("paypal.com"))
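When a confusables dataset is not available, a rough mixed-script check can still catch many spoofed identifiers. The sketch below is a heuristic, not a standard algorithm: it assumes the first word of a letter's Unicode name (LATIN, CYRILLIC, GREEK, and so on) is a usable stand-in for its script, and the function names are illustrative.

```python
import unicodedata

def scripts_used(text: str) -> set[str]:
    """Approximate the scripts in text from the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def is_mixed_script(text: str) -> bool:
    """True if letters from more than one apparent script are present."""
    return len(scripts_used(text)) > 1

print(scripts_used("paypal"))            # {'LATIN'}
print(is_mixed_script("p\u0430ypal"))    # True (Latin plus Cyrillic)
```

Legitimate multilingual names also mix scripts, so a hit is better treated as a signal for review than as an automatic rejection.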

Username Squatting Defense

When registering usernames, normalize and check for confusable collisions:

import unicodedata

def normalize_username(username: str) -> str:
    """
    Normalize a username to detect confusable duplicates.
    1. Unicode NFKC normalization (folds compatibility look-alikes)
    2. Case-folding
    3. Strip variation selectors and combining marks
    """
    # NFKC folds compatibility characters, e.g. 🄰 (U+1F130) -> "A"; NFC would not
    normalized = unicodedata.normalize('NFKC', username)
    # Case-fold (Unicode-aware lowercase)
    normalized = normalized.casefold()
    # Remove variation selectors
    normalized = normalized.replace('\uFE0E', '').replace('\uFE0F', '')
    # Remove combining characters (accents, etc.)
    normalized = ''.join(
        c for c in unicodedata.normalize('NFD', normalized)
        if unicodedata.category(c) != 'Mn'
    )
    return normalized

# Check for collisions before registration
existing_skeletons = {"admin", "root", "administrator"}
new_username = "\U0001F130dmin"  # 🄰dmin

# Symbols without a compatibility mapping are not folded here;
# combine this check with a confusables lookup to cover them
if normalize_username(new_username) in existing_skeletons:
    raise ValueError("Username too similar to a reserved name")

Invisible Characters

Several Unicode characters are invisible but occupy space in strings. They can be used to bypass keyword filters, create hidden content, or confuse parsers.

Common Invisible Characters

Code Point	Name	Risk
U+200B	Zero-Width Space	Split "badword" into "bad" + U+200B + "word" to bypass filters
U+200C	Zero-Width Non-Joiner	Alter emoji rendering
U+200D	Zero-Width Joiner	Join emoji into sequences
U+FEFF	Zero-Width No-Break Space (BOM)	Bypass string comparisons
U+2060	Word Joiner	Invisible word boundary manipulation
U+00AD	Soft Hyphen	Invisible hyphen in text
U+180E	Mongolian Vowel Separator	Bypass filters
U+E0020–U+E007F	Tags block	Subdivision flag components; invisible as standalone

def find_invisible_chars(text: str) -> list[tuple[int, str, str]]:
    """Find invisible Unicode characters in text."""
    import unicodedata

    INVISIBLE = {
        '\u200B': 'Zero-Width Space',
        '\u200C': 'Zero-Width Non-Joiner',
        '\u200D': 'Zero-Width Joiner',
        '\u200E': 'Left-to-Right Mark',
        '\u200F': 'Right-to-Left Mark',
        '\u202A': 'Left-to-Right Embedding',
        '\u202B': 'Right-to-Left Embedding',
        '\u202C': 'Pop Directional Formatting',
        '\u202D': 'Left-to-Right Override',
        '\u202E': 'Right-to-Left Override',
        '\u2060': 'Word Joiner',
        '\u2061': 'Function Application',
        '\uFEFF': 'Zero-Width No-Break Space',
        '\u00AD': 'Soft Hyphen',
    }

    results = []
    for i, char in enumerate(text):
        if char in INVISIBLE:
            results.append((i, char, INVISIBLE[char]))
    return results

text = "Hello\u200Bworld"  # "Hello" + ZWS + "world"
print(find_invisible_chars(text))
# [(5, '\u200b', 'Zero-Width Space')]

# Strip invisible characters for security-sensitive comparisons
def strip_invisible(text: str) -> str:
    INVISIBLE_SET = {'\u200B', '\u200C', '\u200D', '\u200E', '\u200F',
                     '\u202A', '\u202B', '\u202C', '\u202D', '\u202E',
                     '\u2060', '\u2061', '\uFEFF', '\u00AD'}
    return ''.join(c for c in text if c not in INVISIBLE_SET)
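The explicit sets above only catch the characters they list. A complementary sketch is to flag every character in Unicode general category Cf (format), which also covers the Mongolian Vowel Separator and the tag characters from the table. The function name is illustrative; note that ZWJ and ZWNJ are also Cf, so legitimate emoji sequences will be reported too.

```python
import unicodedata

def find_format_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every character in general category Cf."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# Zero-width space in the middle, TAG LATIN SMALL LETTER A at the end
sample = "pay\u200Bpal\U000E0061"
print(find_format_chars(sample))
# [(3, 'U+200B'), (7, 'U+E0061')]
```

This works well for logging and review; for stripping, an allowlist of the format characters you intend to keep (for example U+200D inside emoji sequences) is usually needed.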

Bidirectional Text Attacks (BiDi)

Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) control the visual order of text. These can be used to make malicious code or content appear benign in a display while the underlying string contains harmful content.

The "Trojan Source" attack (CVE-2021-42574) demonstrated how BiDi characters can hide malicious code in source files:

# Dangerous: comment appears to end before it does in an editor
# using BiDi override characters
code_with_bidi = '/* \u202E } if (isAdmin) { \u202D access = true; */'

def strip_bidi_controls(text: str) -> str:
    """Remove Unicode bidirectional control characters."""
    BIDI_CONTROLS = set('\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069\u200E\u200F')
    return ''.join(c for c in text if c not in BIDI_CONTROLS)

# For user-generated content, always strip BiDi controls
user_input = "invoice\u202Egpj.exe"  # example input: renders as "invoiceexe.jpg"
sanitized = strip_bidi_controls(user_input)
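For source files, silently stripping controls is often less useful than reporting where they occur. A possible sketch, assuming a review or CI step, classifies characters by their bidirectional class using `unicodedata.bidirectional` and returns the location of each explicit embedding, override, or isolate control. The function and constant names are illustrative.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls
EXPLICIT_BIDI_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, character name) for each explicit BiDi control."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in EXPLICIT_BIDI_CLASSES:
                hits.append((line_no, col, unicodedata.name(ch, "<unnamed>")))
    return hits

snippet = "access = False\n/* \u202E } if (isAdmin) { \u202D access = true; */"
for hit in find_bidi_controls(snippet):
    print(hit)
```

A build can then fail when the list is non-empty, which surfaces Trojan Source style edits during review rather than hiding them.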

SQL and Code Injection via Emoji

Emoji themselves do not cause SQL injection — the risk comes from improperly encoded input or databases with incorrect character set configuration.

MySQL utf8 vs utf8mb4 Truncation

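On MySQL with the utf8 charset (an alias for utf8mb3, at most 3 bytes per character), inserting a 4-byte emoji fails with an "Incorrect string value" error when strict SQL mode is enabled. With strict mode disabled, MySQL stores the string truncated at the emoji and raises only a warning: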

-- MySQL utf8 (NOT utf8mb4): this query may truncate silently
INSERT INTO users (name) VALUES ('Hello 🔥 World');
-- Stored: 'Hello ' (truncated at emoji)

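This truncation can be exploited when the application validates the full string but the database stores only the prefix:

username = "safe_user 🔥' OR '1'='1"
-- If truncated at the emoji, the database stores 'safe_user '
-- Application-side checks ran against the full string, so the stored value differs from the validated one
-- With a PAD SPACE collation, 'safe_user ' also compares equal to an existing 'safe_user'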

Fix: Always use utf8mb4 in MySQL for any column that accepts user input, and set the connection character set to utf8mb4 as well.

ALTER TABLE users
  MODIFY COLUMN name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
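Until every column is migrated, the application can detect input that a 3-byte utf8 column cannot store. The sketch below relies on the fact that exactly the code points above U+FFFF need 4 bytes in UTF-8; the helper names are illustrative.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)

def first_supplementary_index(text: str) -> int:
    """Index of the first 4-byte character, or -1 if there is none."""
    return next((i for i, ch in enumerate(text) if ord(ch) > 0xFFFF), -1)

name = "Hello 🔥 World"
print(needs_utf8mb4(name))               # True
print(first_supplementary_index(name))   # 6
print(len("🔥".encode("utf-8")))         # 4
```

Rejecting or logging such input before the INSERT makes the failure explicit instead of depending on the server's SQL mode.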

Python Parameterized Queries (Safe Pattern)

import psycopg2

DSN = "dbname=app user=app"  # placeholder connection string; adjust for your environment

# SAFE: parameterized query handles emoji and special chars correctly
def create_user(name: str) -> None:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO users (name) VALUES (%s)",
                (name,)  # emoji handled safely by driver
            )
        conn.commit()
    finally:
        conn.close()  # release the connection even if the insert fails

# UNSAFE: string formatting, never do this
def create_user_unsafe(name: str) -> None:
    query = f"INSERT INTO users (name) VALUES ('{name}')"  # DO NOT DO THIS
    # Any quote in name, with or without emoji, ends the string literal early

Emoji in Filenames and Path Traversal

Emoji in user-supplied filenames are valid on most modern filesystems (APFS, ext4, NTFS), but the names still need sanitizing for path separators, control characters, and invisible characters:

import os
import unicodedata

def sanitize_filename(filename: str) -> str:
    """
    Sanitize a filename: keep only the final path component and drop
    control and invisible characters. Emoji are preserved.
    """
    # Treat backslashes as separators, then keep only the last path component
    # (this is what prevents path traversal)
    filename = os.path.basename(filename.replace('\\', '/'))

    # Remove every category-C character: null bytes, other controls, and
    # format characters such as zero-width spaces. This also removes the
    # ZWJ inside emoji sequences, which then render as separate emoji.
    filename = ''.join(c for c in filename if unicodedata.category(c)[0] != 'C')

    # Neutralize hidden files and the special names '.' and '..'
    if filename.startswith('.'):
        filename = '_' + filename

    # Never return an empty name
    return filename or '_'

# Test
print(sanitize_filename("../../../etc/passwd"))    # "passwd"
print(sanitize_filename("report 🚀 2025.pdf"))     # "report 🚀 2025.pdf"
print(sanitize_filename("file\x00name.txt"))       # "filename.txt"
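Sanitizing the name is one layer; a second is to verify that the final path really stays inside the intended directory. The following sketch assumes uploads are written under a single base directory and uses only the standard library. The name `safe_join` is hypothetical.

```python
import os
import tempfile

def safe_join(base_dir: str, filename: str) -> str:
    """Join filename onto base_dir and refuse any result that escapes base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, filename))
    # commonpath equals base only when target is base itself or lies beneath it
    if os.path.commonpath([base, target]) != base:
        raise ValueError("Resolved path escapes the upload directory")
    return target

with tempfile.TemporaryDirectory() as uploads:
    print(safe_join(uploads, "report 🚀 2025.pdf"))
    try:
        safe_join(uploads, "../../etc/passwd")
    except ValueError as exc:
        print("rejected:", exc)
```

Because the check runs on the resolved path, it holds regardless of which Unicode characters the filename contains.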

Content Moderation and Filtering

Emoji can be used to convey harmful intent while bypassing keyword filters. 🔪 (kitchen knife), 💊 (pill), and combinations can represent drug dealing or violence. Content moderation systems must treat emoji as meaningful content, not noise.

import regex  # third-party: pip install regex

# Emoji-aware profanity/content filter
BLOCKED_SEQUENCES = {
    "🔪💊",     # weapon + drug
    "💉🩸",     # injection + blood (context-dependent)
    "👊🤜",     # fighting combo
}

def contains_blocked_sequence(text: str, blocks: set[str]) -> bool:
    # Strip variation selectors for normalized comparison
    normalized = text.replace('\uFE0F', '').replace('\uFE0E', '')
    return any(seq in normalized for seq in blocks)

# Rate limiting emoji usage (spam detection)
def emoji_density(text: str) -> float:
    """Return ratio of emoji grapheme clusters to total grapheme clusters."""
    clusters = regex.findall(r'\X', text)
    # Count clusters, not code points, so a ZWJ sequence counts once
    emoji_count = sum(
        1 for cluster in clusters
        if regex.search(r'\p{Extended_Pictographic}', cluster)
    )
    return emoji_count / len(clusters) if clusters else 0.0

def flag_for_review(message: str) -> None:
    """Placeholder hook; replace with your moderation queue."""
    print(f"Flagged for review: {message}")

# Flag high-emoji-density messages as potential spam
user_message = "🎰🎰🎰 WIN BIG 🤑🤑🤑 CLICK HERE 💰💰💰"
# 9 emoji out of 30 grapheme clusters gives 0.30; the threshold is a tunable example
if emoji_density(user_message) > 0.25:
    flag_for_review(user_message)
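Substring matching on blocked sequences can also be defeated by skin-tone modifiers inserted between the emoji. As a sketch using only the standard library, the modifiers (U+1F3FB through U+1F3FF) can be dropped along with variation selectors before matching. The names below are illustrative, and the blocked pair is one of the entries from the blocked set above.

```python
# Emoji modifiers for skin tone occupy U+1F3FB through U+1F3FF
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}
VARIATION_SELECTORS = {"\uFE0E", "\uFE0F"}

def normalize_for_matching(text: str) -> str:
    """Drop skin-tone modifiers and variation selectors before sequence matching."""
    drop = SKIN_TONE_MODIFIERS | VARIATION_SELECTORS
    return "".join(ch for ch in text if ch not in drop)

blocked = {"\U0001F44A\U0001F91C"}  # 👊🤜
message = "\U0001F44A\U0001F3FD\U0001F91C\U0001F3FF"  # same pair with skin tones

print(any(seq in message for seq in blocked))                          # False
print(any(seq in normalize_for_matching(message) for seq in blocked))  # True
```

The normalized form should be used only for matching; the original message is what gets stored and displayed.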

Input Validation Checklist

For any user input field that accepts emoji:

[ ] Use utf8mb4 in MySQL; UTF-8 in PostgreSQL
[ ] Parameterize all database queries
[ ] Strip or reject BiDi control characters
[ ] Strip invisible characters (U+200B, U+FEFF, etc.) before comparison
[ ] Normalize emoji (strip variation selectors) for identity comparisons
[ ] Validate string length by grapheme cluster count, not byte or code unit count
[ ] Check filenames for path traversal even when emoji are present
[ ] Log and monitor unusual Unicode character categories in inputs
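The length item in the checklist is easiest to appreciate with numbers. This short sketch measures a single family emoji, which a user perceives as one character, in code points, UTF-8 bytes, and UTF-16 code units using only the standard library. Counting grapheme clusters themselves needs a segmentation library, such as the `regex` module's `\X` used earlier.

```python
# 👨‍👩‍👧 is MAN + ZWJ + WOMAN + ZWJ + GIRL and renders as one glyph
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

code_points = len(family)
utf8_bytes = len(family.encode("utf-8"))
utf16_units = len(family.encode("utf-16-le")) // 2

print(code_points)   # 5
print(utf8_bytes)    # 18
print(utf16_units)   # 8
```

A 10-character username limit enforced in bytes or code units would therefore reject input that a user sees as a single symbol, while a limit enforced only in grapheme clusters needs a separate byte cap to bound storage.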

Explore More on EmojiFYI

Inspect emoji code points for security analysis: Sequence Analyzer
Compare how emoji render across platforms: Compare Tool
Unicode character property glossary: Glossary
Programmatic emoji data for validation rules: API Reference
