Emoji Security Considerations
Emoji introduce security concerns that go beyond typical text sanitization. Invisible characters, visual spoofing with look-alike symbols, bidirectional text overrides, and unexpected behavior in SQL queries are real attack vectors. This guide covers the threat model and practical defenses.
Homoglyph Attacks
A homoglyph attack uses characters that look visually identical (or very similar) to legitimate characters to deceive users or bypass filters.
Emoji Lookalikes
Several emoji resemble standard ASCII characters:
| Emoji | Code Point | Looks Like | Risk |
|---|---|---|---|
| 🅰️ | U+1F170 | Letter A | Username impersonation |
| 🅱️ | U+1F171 | Letter B | Brand spoofing |
| Ⓜ️ | U+24C2 | Letter M | Domain spoofing |
| 🄰 | U+1F130 | Letter A | Content filter bypass |
| 𝕒 | U+1D552 | Letter a | Script injection bypass |
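To see why these characters pass for ASCII at a glance but not to a program, it can help to print their code points and Unicode names. The helper below is a minimal sketch using only the standard library; `describe` is a hypothetical name chosen for illustration, not part of any API mentioned in this guide.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (char, f"U+{ord(char):04X}", unicodedata.name(char, "<unnamed>"))
        for char in text
    ]

latin = "admin"
spoofed = "\u0430dmin"  # first letter is Cyrillic а (U+0430), not Latin a

print(latin == spoofed)  # False: the strings differ despite looking alike
for row in describe(spoofed):
    print(row)
```

Logging this kind of breakdown for rejected inputs makes it easier to see which script or symbol block an impersonation attempt relied on.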
Confusable Detection
Unicode publishes a Confusables dataset that maps look-alike characters. Use it to detect potentially deceptive inputs:
# Install: pip install confusable-homoglyphs
from confusable_homoglyphs import confusables
def is_confusable_with_ascii(text: str) -> bool:
"""
Return True if the text contains characters that are confusable
with ASCII letters or digits.
"""
for char in text:
if ord(char) > 127: # Non-ASCII
result = confusables.is_dangerous(char)
if result:
return True
return False
# Examples
print(is_confusable_with_ascii("admin")) # False
print(is_confusable_with_ascii("ะฐdmin")) # True (Cyrillic ะฐ, U+0430)
print(is_confusable_with_ascii("๐
ฐdmin")) # True (depends on dataset)
# Normalize confusable characters to ASCII equivalents
from confusable_homoglyphs import confusables
def skeleton(text: str) -> str:
"""Return the Unicode 'skeleton' of text for confusable comparison."""
import unicodedata
# Normalize to NFC first
text = unicodedata.normalize('NFC', text)
result = []
for char in text:
maps = confusables.confusable_characters(char)
# Use the first ASCII confusable if available
mapped = next((m for m in (maps or []) if ord(m) < 128), char)
result.append(mapped)
return ''.join(result)
# Two strings that look the same to a user may have the same skeleton
print(skeleton("pะฐypal.com") == skeleton("paypal.com")) # True (both โ paypal.com)
Username Squatting Defense
When registering usernames, normalize and check for confusable collisions:
import unicodedata

def normalize_username(username: str) -> str:
    """
    Normalize a username to detect confusable duplicates.
    1. Unicode NFKC normalization (folds compatibility forms such as 🄰 and 𝕒)
    2. Case-folding
    3. Strip variation selectors and combining marks
    """
    # NFKC normalization; plain NFC would leave 🄰 and 𝕒 untouched
    normalized = unicodedata.normalize('NFKC', username)
    # Case-fold (Unicode-aware lowercase)
    normalized = normalized.casefold()
    # Remove variation selectors
    normalized = normalized.replace('\uFE0E', '').replace('\uFE0F', '')
    # Remove combining characters (accents, etc.)
    normalized = ''.join(
        c for c in unicodedata.normalize('NFD', normalized)
        if unicodedata.category(c) != 'Mn'
    )
    return normalized

# Check for collisions before registration
existing_skeletons = {"admin", "root", "administrator"}
new_username = "🄰dmin"  # U+1F130; NFKC folds it to "A"
# Negative squared letters such as 🅰 (U+1F170) have no compatibility mapping,
# so combine this with the confusables check for fuller coverage.
if normalize_username(new_username) in existing_skeletons:
    raise ValueError("Username too similar to a reserved name")
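Compatibility normalization (NFKC) is one common way to fold some of the look-alikes from the table earlier in this guide, but it does not cover all of them. The sketch below checks a handful of sample characters; `nfkc_fold` and `SAMPLES` are illustrative names, and the results depend on the Unicode data shipped with your Python version.

```python
import unicodedata

# Look-alikes from the table earlier in this guide, plus Cyrillic а for comparison
SAMPLES = {
    "U+1F130 squared A": "\U0001F130",
    "U+1D552 double-struck a": "\U0001D552",
    "U+24C2 circled M": "\u24C2",
    "U+1F170 negative squared A": "\U0001F170",
    "U+0430 Cyrillic a": "\u0430",
}

def nfkc_fold(text: str) -> str:
    """Apply compatibility normalization followed by case folding."""
    return unicodedata.normalize("NFKC", text).casefold()

for label, char in SAMPLES.items():
    folded = nfkc_fold(char)
    status = "folded to ASCII" if folded.isascii() else "unchanged"
    print(f"{label}: {status}")
```

The characters that come back unchanged are the ones that still need a confusables lookup, which is why normalization and confusable detection are usually applied together rather than as alternatives.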
Invisible Characters
Several Unicode characters are invisible but occupy space in strings. They can be used to bypass keyword filters, create hidden content, or confuse parsers.
Common Invisible Characters
| Code Point | Name | Risk |
|---|---|---|
| U+200B | Zero-Width Space | Split "badword" into "bad" + U+200B + "word" to bypass filters |
| U+200C | Zero-Width Non-Joiner | Alter emoji rendering |
| U+200D | Zero-Width Joiner | Join emoji into sequences |
| U+FEFF | Zero-Width No-Break Space (BOM) | Bypass string comparisons |
| U+2060 | Word Joiner | Invisible word boundary manipulation |
| U+00AD | Soft Hyphen | Invisible hyphen in text |
| U+180E | Mongolian Vowel Separator | Bypass filters |
| U+E0020–U+E007F | Tags block | Subdivision flag components; invisible as standalone |
INVISIBLE = {
    '\u200B': 'Zero-Width Space',
    '\u200C': 'Zero-Width Non-Joiner',
    '\u200D': 'Zero-Width Joiner',
    '\u200E': 'Left-to-Right Mark',
    '\u200F': 'Right-to-Left Mark',
    '\u202A': 'Left-to-Right Embedding',
    '\u202B': 'Right-to-Left Embedding',
    '\u202C': 'Pop Directional Formatting',
    '\u202D': 'Left-to-Right Override',
    '\u202E': 'Right-to-Left Override',
    '\u2060': 'Word Joiner',
    '\u2061': 'Function Application',
    '\uFEFF': 'Zero-Width No-Break Space',
    '\u00AD': 'Soft Hyphen',
}

def find_invisible_chars(text: str) -> list[tuple[int, str, str]]:
    """Find invisible Unicode characters in text as (index, char, name) tuples."""
    results = []
    for i, char in enumerate(text):
        if char in INVISIBLE:
            results.append((i, char, INVISIBLE[char]))
    return results

text = "Hello\u200Bworld"  # "Hello" + ZWS + "world"
print(find_invisible_chars(text))
# [(5, '\u200b', 'Zero-Width Space')]

# Strip invisible characters for security-sensitive comparisons
def strip_invisible(text: str) -> str:
    return ''.join(c for c in text if c not in INVISIBLE)
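A fixed list of code points can miss invisible characters that are not on it, such as the Tags block from the table. One possible complement, sketched here with the standard library only, is to flag every character whose Unicode general category is `Cf` (format). The function name `find_format_chars` is an illustrative choice.

```python
import unicodedata

def find_format_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) for every character in general category Cf."""
    return [
        (i, f"U+{ord(char):04X}")
        for i, char in enumerate(text)
        if unicodedata.category(char) == "Cf"
    ]

sample = "pay\u200Bpal\U000E0067"  # Zero-Width Space and a Tags-block character
print(find_format_chars(sample))
# [(3, 'U+200B'), (7, 'U+E0067')]
```

Note that the Zero-Width Joiner (U+200D) is also category `Cf`, so a category-based check is better suited to detection and logging than to blind stripping: removing every `Cf` character would break legitimate emoji ZWJ sequences.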
Bidirectional Text Attacks (BiDi)
Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) control the visual order of text. They can make malicious code or content appear benign on screen while the underlying string contains something harmful.
The "Trojan Source" attack (CVE-2021-42574) demonstrated how BiDi characters can hide malicious code in source files:
# Dangerous: in an editor, the comment appears to end before it actually does
# because of the BiDi override characters
code_with_bidi = '/* \u202E } if (isAdmin) { \u202D access = true; */'

def strip_bidi_controls(text: str) -> str:
    """Remove Unicode bidirectional control characters."""
    BIDI_CONTROLS = set('\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069\u200E\u200F')
    return ''.join(c for c in text if c not in BIDI_CONTROLS)

# For user-generated content, always strip BiDi controls
user_input = "invoice\u202Efdp.exe"  # illustrative input; displays as "invoiceexe.pdf"
sanitized = strip_bidi_controls(user_input)  # "invoicefdp.exe"
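For source files, stripping is often less useful than reporting, since a reviewer needs to know where the controls were. The following is a minimal sketch of a scanner that returns line and column positions; `find_bidi_controls` is a hypothetical helper, not an existing tool, and the snippet it scans is made up for the example.

```python
import unicodedata

BIDI_CONTROLS = frozenset(
    "\u202A\u202B\u202C\u202D\u202E"  # embeddings, overrides, pop
    "\u2066\u2067\u2068\u2069"        # isolates
    "\u200E\u200F"                    # directional marks
)

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for each BiDi control; both are 1-based."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if char in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(char)))
    return findings

snippet = "access = False\n/* \u202E } if (isAdmin) { \u202D access = true; */"
for finding in find_bidi_controls(snippet):
    print(finding)
```

A check like this could run in CI or a pre-commit hook so that any BiDi control in a code file fails the build and has to be justified explicitly.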
SQL and Code Injection via Emoji
Emoji themselves do not cause SQL injection. The risk comes from improperly encoded input or databases with an incorrect character set configuration.
MySQL utf8 vs utf8mb4 Truncation
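On MySQL, the utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character, so a 4-byte emoji cannot be represented. With strict SQL mode disabled, the server truncates the string at the emoji and raises only a warning; with strict mode enabled, the statement fails with an error instead: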
-- MySQL utf8 (NOT utf8mb4): this statement may truncate silently
INSERT INTO users (name) VALUES ('Hello 🔥 World');
-- Stored: 'Hello ' (truncated at emoji)
This truncation can be exploited when the application and the database disagree about the value:
username = "safe_user 🔥' OR '1'='1"
-- If truncated at the emoji, the stored value is 'safe_user '
-- The application still validates and compares the full string, so its checks do not match what was stored
Fix: Always use utf8mb4 in MySQL for any column that accepts user input.
ALTER TABLE users
MODIFY COLUMN name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
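While a migration to utf8mb4 is in progress, an application-side guard can reject or flag values that a 3-byte column cannot hold. This is a minimal sketch based on the fact that any code point above U+FFFF needs 4 bytes in UTF-8; the function names are illustrative.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character needs 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(char) > 0xFFFF for char in text)

def first_4byte_index(text: str) -> int:
    """Return the index of the first 4-byte character, or -1 if there is none."""
    return next((i for i, char in enumerate(text) if ord(char) > 0xFFFF), -1)

name = "Hello \U0001F525 World"  # U+1F525 is the fire emoji

print(needs_utf8mb4(name))      # True
print(first_4byte_index(name))  # 6
# The prefix a truncating server would keep:
print(repr(name[:first_4byte_index(name)]))  # 'Hello '
```

Rejecting such input with a clear error is usually safer than letting the server decide, because the application and the database then never hold different versions of the same value.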
Python Parameterized Queries (Safe Pattern)
import psycopg2

DSN = "dbname=app user=app"  # placeholder connection string; replace with your own

# SAFE: parameterized query handles emoji and special chars correctly
def create_user(name: str) -> None:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO users (name) VALUES (%s)",
                (name,)  # emoji handled safely by driver
            )
        conn.commit()
    finally:
        conn.close()

# UNSAFE: string formatting; never do this
def create_user_unsafe(name: str) -> None:
    query = f"INSERT INTO users (name) VALUES ('{name}')"  # DO NOT DO THIS
    # A quote in name (for example right after an emoji) ends the string
    # literal, and everything after it is parsed as SQL
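The same pattern can be tried without a database server. The sketch below uses the standard library's `sqlite3` module as a stand-in for PostgreSQL, purely so that it runs anywhere; note that `sqlite3` uses `?` placeholders where psycopg2 uses `%s`. The function name and table are made up for the example.

```python
import sqlite3

def demo_parameterized_insert(name: str) -> str:
    """Insert name with a bound parameter and read it back unchanged."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (name TEXT)")
        # The ? placeholder keeps the value out of the SQL text entirely
        conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        (stored,) = conn.execute("SELECT name FROM users").fetchone()
        return stored
    finally:
        conn.close()

hostile = "safe_user \U0001F525' OR '1'='1"
print(demo_parameterized_insert(hostile) == hostile)  # True
```

Because the value travels separately from the statement, the quote and the emoji are stored as data and never reach the SQL parser.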
Emoji in Filenames and Path Traversal
Emoji in user-supplied filenames are valid on most modern filesystems (APFS, ext4, NTFS) but can cause issues:
import os
import unicodedata

def sanitize_filename(filename: str) -> str:
    """
    Sanitize a filename: keep only the final path component and drop
    control and other invisible characters. Emoji are preserved.
    """
    # Treat backslashes as separators too, then keep only the last component.
    # Doing this first is what turns "../../../etc/passwd" into "passwd".
    filename = os.path.basename(filename.replace('\\', '/'))
    # Remove every category-C character: null bytes, tabs, newlines,
    # BiDi controls, and other format characters (this includes ZWJ)
    filename = ''.join(c for c in filename if unicodedata.category(c)[0] != 'C')
    # Avoid hidden files and the special names "." and ".."
    if filename.startswith('.'):
        filename = '_' + filename
    # Never return an empty name
    return filename or '_'

# Test
print(sanitize_filename("../../../etc/passwd"))  # "passwd"
print(sanitize_filename("report 📊 2025.pdf"))   # "report 📊 2025.pdf"
print(sanitize_filename("file\x00name.txt"))     # "filename.txt"
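Sanitizing the name is one layer; a second layer that is often added is to resolve the final path and confirm it still sits inside the intended directory. The sketch below assumes Python 3.9 or later (for `Path.is_relative_to`), and `safe_join` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

def safe_join(base_dir: Path, filename: str) -> Path:
    """Join filename onto base_dir, rejecting results that escape base_dir."""
    base = base_dir.resolve()
    candidate = (base / filename).resolve()
    if not candidate.is_relative_to(base):  # requires Python 3.9+
        raise ValueError(f"Path escapes upload directory: {filename!r}")
    return candidate

with tempfile.TemporaryDirectory() as upload_dir:
    base = Path(upload_dir)
    print(safe_join(base, "report \U0001F4CA 2025.pdf").name)
    try:
        safe_join(base, "../../etc/passwd")
    except ValueError as exc:
        print("rejected:", exc)
```

Checking the resolved path catches traversal attempts that survive string-level cleaning, including absolute paths, regardless of whether the name contains emoji.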
Content Moderation and Filtering
Emoji can be used to convey harmful intent while bypassing keyword filters. 🔪 (kitchen knife), 💊 (pill), and combinations of them can represent drug dealing or violence. Content moderation systems must treat emoji as meaningful content, not noise.
import regex  # third-party: pip install regex

# Emoji-aware profanity/content filter
BLOCKED_SEQUENCES = {
    "🔪💊",  # weapon + drug
    "💉🩸",  # injection + blood (context-dependent)
    "👊🤜",  # fighting combo
}

def contains_blocked_sequence(text: str, blocks: set[str]) -> bool:
    # Strip variation selectors for normalized comparison
    normalized = text.replace('\uFE0F', '').replace('\uFE0E', '')
    return any(seq in normalized for seq in blocks)

# Rate limiting emoji usage (spam detection)
def emoji_density(text: str) -> float:
    """Return ratio of emoji grapheme clusters to total grapheme clusters."""
    clusters = regex.findall(r'\X', text)
    # Count clusters, not code points, so a ZWJ sequence counts once
    emoji_count = sum(
        1 for cluster in clusters
        if regex.search(r'\p{Extended_Pictographic}', cluster)
    )
    return emoji_count / len(clusters) if clusters else 0.0

def flag_for_review(message: str) -> None:
    """Placeholder hook; wire this to your moderation queue."""
    print(f"Flagged for review: {message!r}")

# Flag high-emoji-density messages as potential spam
DENSITY_THRESHOLD = 0.25  # illustrative value; tune against real traffic
user_message = "💰💰💰 WIN BIG 🤑🤑🤑 CLICK HERE 💰💰💰"  # density is 9/30 = 0.3
if emoji_density(user_message) > DENSITY_THRESHOLD:
    flag_for_review(user_message)
Input Validation Checklist
For any user input field that accepts emoji:
- [ ] Use utf8mb4 in MySQL; UTF-8 in PostgreSQL
- [ ] Parameterize all database queries
- [ ] Strip or reject BiDi control characters
- [ ] Strip invisible characters (U+200B, U+FEFF, etc.) before comparison
- [ ] Normalize emoji (strip variation selectors) for identity comparisons
- [ ] Validate string length by grapheme cluster count, not byte or code unit count
- [ ] Check filenames for path traversal even when emoji are present
- [ ] Log and monitor unusual Unicode character categories in inputs
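The length item in the checklist is easier to appreciate with numbers. The sketch below, using only the standard library, measures one family emoji (a ZWJ sequence) three different ways; `length_metrics` is an illustrative name. Counting grapheme clusters themselves needs a third-party library, for example the `regex` module's `\X` pattern used in the density check earlier in this guide.

```python
def length_metrics(text: str) -> dict[str, int]:
    """Measure a string in code points, UTF-8 bytes, and UTF-16 code units."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }

# man + ZWJ + woman + ZWJ + girl: one user-perceived character where supported
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(length_metrics(family))
# {'code_points': 5, 'utf8_bytes': 18, 'utf16_code_units': 8}
```

A 20-character username limit enforced in bytes would allow barely one such emoji, while the same limit enforced in grapheme clusters would allow twenty, so the unit needs to be chosen deliberately and applied consistently on client and server.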