Emojis in URLs and Domain Names
EmojiEmoji
A Japanese word (็ตตๆๅญ) meaning 'picture character' โ small graphical symbols used in digital communication to express ideas, emotions, and objects. domains like ๐.ws and emoji in URL paths are real โ but the underlying technology involves multiple layers of encoding that can catch developers off guard. This guide explains how Punycode, Internationalized Domain Names (IDN), and percent-encoding interact with emoji characters.
Emoji in Domain Names
How IDN Works
The Domain Name System (DNS) was designed for ASCII only (letters, digits, and hyphens). Internationalized Domain Names (IDN) extend this to non-ASCII characters โ including emoji โ through Punycode encoding, defined in RFC 3492.
Every non-ASCII label in a domain name is converted to ASCII-Compatible Encoding (ACE) by:
- Mapping the UnicodeUnicode
Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji. characters using IDNA (Internationalized Domain Names in Applications) rules - Encoding the result as a Punycode string
- Prefixing with
xn--
For example:
๐.ws โ xn--vi8h.ws
๐.com โ xn--h28h.com
IDNA Standards: 2003 vsVariation Selector (VS)
Unicode characters (VS-15 U+FE0E and VS-16 U+FE0F) that modify whether a character renders in text (monochrome) or emoji (colorful) presentation. 2008
There are two IDNA standards with important differences for emoji:
- IDNA2003 (RFC 3490): Used Nameprep (based on Unicode 3.2). Many browsers initially followed this.
- IDNA2008 (RFC 5891): Stricter rules. Emoji are categorized as
DISALLOWEDorUNASSIGNED, meaning emoji domains are technically invalid under IDNA2008.
In practice, most registrars that sell emoji domains use IDNA2003 or their own extensions. Browser behavior varies.
Converting to Punycode in Python
# Python's built-in codec handles Punycode
domain = "๐.ws"
encoded = domain.encode("idna").decode("ascii")
print(encoded) # xn--vi8h.ws
# Decode back
decoded = encoded.encode("ascii").decode("idna")
print(decoded) # ๐.ws
# For multi-label domains
full_domain = "๐.๐.example"
parts = full_domain.split(".")
encoded_parts = []
for part in parts:
try:
encoded_parts.append(part.encode("idna").decode("ascii"))
except UnicodeError:
encoded_parts.append(part)
encoded_domain = ".".join(encoded_parts)
print(encoded_domain) # xn--yp8h.xn--54g.example
Converting to Punycode in JavaScript
// Node.js built-in
const { domainToASCII, domainToUnicode } = require('url');
console.log(domainToASCII('๐.ws')); // xn--vi8h.ws
console.log(domainToUnicode('xn--vi8h.ws')); // ๐.ws
// Browser โ use the URL API
const url = new URL('http://๐.ws/');
console.log(url.hostname); // xn--vi8h.ws (browsers normalize to Punycode)
// For just encoding a label
const encoded = new URL(`http://${encodeURIComponent('๐')}.ws/`).hostname;
// This won't work cleanly โ use domainToASCII in Node or a library
Emoji Domains in Practice
Popular registrars that have offered emoji domain registration include .ws (Samoa), .fm (Micronesia), and .to (Tonga). Some gTLD operators do not allow emoji under their registry policies.
When building link detection or URL parsers, you must handle both the display form (๐.ws) and the wire form (xn--vi8h.ws).
Emoji in URL Paths and Query Strings
Unlike domains, URL paths and query strings use percent-encoding (also called URL encoding), defined in RFC 3986.
Percent-Encoding Emoji
Emoji characters are encoded as their UTF-8UTF-8
A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites). byte sequence, with each byte represented as %XX:
๐ โ UTF-8 bytes: F0 9F 8E 89
Percent-encoded: %F0%9F%8E%89
๐ โ UTF-8 bytes: F0 9F 8C 8D
Percent-encoded: %F0%9F%8C%8D
So the URL https://example.com/celebrate/๐ becomes:
https://example.com/celebrate/%F0%9F%8E%89
Encoding in Python
from urllib.parse import quote, unquote, urlencode, quote_plus
emoji = "๐"
# Encode for URL path (safe='/' preserves slashes)
path = f"/celebrate/{quote(emoji)}"
print(path) # /celebrate/%F0%9F%8E%89
# Decode back
print(unquote("%F0%9F%8E%89")) # ๐
# For query parameters
params = {"emoji": "๐๐", "lang": "en"}
query = urlencode(params, quote_via=quote)
print(query) # emoji=%F0%9F%8E%89%F0%9F%9A%80&lang=en
# Encode for HTML forms (spaces โ +)
form_encoded = quote_plus("search ๐")
print(form_encoded) # search+%F0%9F%94%8D
Encoding in JavaScript
// encodeURIComponent encodes everything except unreserved chars
const emoji = "๐";
const encoded = encodeURIComponent(emoji);
console.log(encoded); // %F0%9F%8E%89
// Decode
console.log(decodeURIComponent("%F0%9F%8E%89")); // ๐
// Full URL with emoji path and query
const base = "https://example.com";
const path = `/tag/${encodeURIComponent("๐tech")}`;
const params = new URLSearchParams({ q: "emoji ๐", page: "1" });
const fullUrl = `${base}${path}?${params}`;
console.log(fullUrl);
// https://example.com/tag/%F0%9F%9A%80tech?q=emoji+%F0%9F%94%8D&page=1
// URL API handles encoding automatically
const url = new URL("https://example.com");
url.pathname = "/celebrate/๐";
url.searchParams.set("q", "๐ search");
console.log(url.toString());
// https://example.com/celebrate/%F0%9F%8E%89?q=%F0%9F%94%8D+search
Go
package main
import (
"fmt"
"net/url"
)
func main() {
base := "https://example.com"
emoji := "๐"
// Encode path segment
encoded := url.PathEscape(emoji)
fmt.Println(encoded) // %F0%9F%8E%89
// Build full URL
u, _ := url.Parse(base)
u.Path = "/celebrate/" + emoji // url.URL handles encoding on output
q := u.Query()
q.Set("tag", "๐ rocket")
u.RawQuery = q.Encode()
fmt.Println(u.String())
// https://example.com/celebrate/%F0%9F%8E%89?tag=%F0%9F%9A%80+rocket
// Decode
decoded, _ := url.PathUnescape("%F0%9F%8E%89")
fmt.Println(decoded) // ๐
}
Security Considerations
Homoglyph Attacks in Domains
While Punycode is transparent in the address bar of modern browsers (which display the Unicode form for known-safe scripts), mixing scripts can create spoofing opportunities. An emoji like ๐ ฐ๏ธ (U+1F170, negative squared Latin A) looks like the letter A but encodes differently. This is less dangerous in domains than look-alike Latin characters, but worth being aware of in user-generated content.
Double-Encoding
A common bug is double-encoding emoji in URLs:
# WRONG: encoding an already-encoded string
encoded = quote("๐") # "%F0%9F%8E%89"
double = quote(encoded) # "%25F0%259F%258E%2589" โ wrong!
# RIGHT: encode once, decode once
encoded = quote("๐") # "%F0%9F%8E%89"
decoded = unquote(encoded) # "๐"
Django URL Patterns
Django automatically decodes percent-encoded URL paths before passing them to views. You receive the raw emoji character, not the encoded form:
# urls.py
from django.urls import path
from . import views
urlpatterns = [
path("tag/<str:tag>/", views.tag_view),
]
# views.py
def tag_view(request, tag):
# tag = "๐" (already decoded by Django)
print(repr(tag)) # '๐'
Testing Emoji URLs
# curl handles emoji encoding automatically
curl -v "https://example.com/๐"
# Or encode manually
curl -v "https://example.com/%F0%9F%8E%89"
# Test Punycode conversion
python3 -c "print('๐.ws'.encode('idna').decode('ascii'))"
Explore More on EmojiFYI
- Analyze emoji code points and their byte sequences: Sequence Analyzer
- Look up emoji Unicode data: Glossary
- Access emoji data for your URL handling code: API Reference
- Search for specific emoji: Search