Emojis in URLs and Domain Names: Punycode, IDN, and Percent-Encoding

Emoji domains like 🍕.ws and emoji in URL paths are real, but the underlying technology involves multiple layers of encoding that can catch developers off guard. This guide explains how Punycode, Internationalized Domain Names (IDN), and percent-encoding interact with emoji characters.

Emoji in Domain Names

How IDN Works

The Domain Name System (DNS) was designed for ASCII only (letters, digits, and hyphens). Internationalized Domain Names (IDN) extend this to non-ASCII characters, including emoji, through Punycode encoding, defined in RFC 3492.

Every non-ASCII label in a domain name is converted to ASCII-Compatible Encoding (ACE) by:

  1. Mapping the Unicode characters using IDNA (Internationalized Domain Names in Applications) rules
  2. Encoding the result as a Punycode string
  3. Prefixing with xn--

For example:

🍕.ws  →  xn--vi8h.ws
😀.com →  xn--e28h.com
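
The three steps above can be sketched with Python's standard library. This is a simplified illustration rather than a full IDNA implementation: `label_to_ace` is a hypothetical helper, and NFKC normalization plus lowercasing is only an assumed stand-in for the real IDNA mapping tables.

```python
import unicodedata

ACE_PREFIX = "xn--"


def label_to_ace(label: str) -> str:
    """Convert one domain label to its ASCII-Compatible Encoding (sketch)."""
    # ASCII-only labels need no encoding; DNS compares them case-insensitively.
    if label.isascii():
        return label.lower()
    # Step 1 (simplified assumption): NFKC + lowercase stands in for IDNA mapping.
    mapped = unicodedata.normalize("NFKC", label).lower()
    # Step 2: Punycode-encode the mapped label (RFC 3492).
    punycode = mapped.encode("punycode").decode("ascii")
    # Step 3: prepend the ACE prefix.
    return ACE_PREFIX + punycode


print(label_to_ace("🍕"))  # xn--vi8h
print(".".join(label_to_ace(part) for part in "🍕.ws".split(".")))  # xn--vi8h.ws
```

In real code, prefer a maintained IDNA implementation over hand-rolled mapping; the sketch only shows where each step fits.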

IDNA Standards: 2003 vs 2008

There are two IDNA standards with important differences for emoji:

  • IDNA2003 (RFC 3490): Used Nameprep (based on Unicode 3.2). Many browsers initially followed this.
  • IDNA2008 (RFC 5891): Stricter rules. Emoji are categorized as DISALLOWED or UNASSIGNED, meaning emoji domains are technically invalid under IDNA2008.

In practice, most registrars that sell emoji domains use IDNA2003 or their own extensions. Browser behavior varies.
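
One way to see why stricter rules reject emoji is to look at their Unicode general category. The sketch below is a rough heuristic, not the actual IDNA2008 derivation: `has_symbol_chars` is a hypothetical helper that only flags characters in the "Symbol, other" (So) category, which is where many emoji live.

```python
import unicodedata


def has_symbol_chars(label: str) -> bool:
    """Return True if any character is in the 'Symbol, other' (So) category."""
    return any(unicodedata.category(ch) == "So" for ch in label)


for label in ("example", "🍕", "😀"):
    print(label, has_symbol_chars(label))
# example False
# 🍕 True
# 😀 True
```

A check like this can be useful as an early warning in registration or validation code, but it does not replace a real IDNA2008 validator.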

Converting to Punycode in Python

# Python's built-in "idna" codec (IDNA2003) handles Punycode
domain = "🍕.ws"
encoded = domain.encode("idna").decode("ascii")
print(encoded)  # xn--vi8h.ws

# Decode back
decoded = encoded.encode("ascii").decode("idna")
print(decoded)  # 🍕.ws

# For multi-label domains, encode label by label
full_domain = "🚀.🌍.example"
parts = full_domain.split(".")
encoded_parts = []
for part in parts:
    try:
        encoded_parts.append(part.encode("idna").decode("ascii"))
    except UnicodeError:
        # Keep labels the codec rejects (e.g. empty or too long) unchanged
        encoded_parts.append(part)
encoded_domain = ".".join(encoded_parts)
print(encoded_domain)  # xn--158h.xn--tg8h.example

Converting to Punycode in JavaScript

// Node.js built-in
const { domainToASCII, domainToUnicode } = require('url');

console.log(domainToASCII('🍕.ws'));      // xn--vi8h.ws
console.log(domainToUnicode('xn--vi8h.ws')); // 🍕.ws

// Browser: use the URL API
const url = new URL('http://🍕.ws/');
console.log(url.hostname); // xn--vi8h.ws (browsers normalize to Punycode)

// For just one label, parse a throwaway URL and take the first label.
// encodeURIComponent is not needed here: percent-encoding is for paths and queries.
const label = new URL('http://🍕.example/').hostname.split('.')[0];
console.log(label); // xn--vi8h

Emoji Domains in Practice

Country-code TLDs (ccTLDs) that have offered emoji domain registration include .ws (Samoa), .fm (Micronesia), and .to (Tonga). Some gTLD operators do not allow emoji under their registry policies.

When building link detection or URL parsers, you must handle both the display form (🍕.ws) and the wire form (xn--vi8h.ws).
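
A minimal sketch of that idea: normalize every hostname to its wire form before comparing or storing it. The helper names `to_wire_form` and `same_host` are hypothetical, and the sketch assumes Python's built-in IDNA2003 codec is acceptable for your inputs.

```python
def to_wire_form(host: str) -> str:
    """Return the ASCII (xn--) form of a hostname, lowercased."""
    return host.encode("idna").decode("ascii").lower()


def same_host(a: str, b: str) -> bool:
    """Compare two hostnames regardless of display or wire form."""
    return to_wire_form(a) == to_wire_form(b)


print(to_wire_form("🍕.ws"))              # xn--vi8h.ws
print(same_host("🍕.ws", "XN--VI8H.WS"))  # True
```

Comparing in wire form avoids treating the same domain as two different hosts in allowlists, deduplication, or link previews.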

Emoji in URL Paths and Query Strings

Unlike domains, URL paths and query strings use percent-encoding (also called URL encoding), defined in RFC 3986.

Percent-Encoding Emoji

Emoji characters are encoded as their UTF-8 byte sequence, with each byte represented as %XX:

🎉 → UTF-8 bytes: F0 9F 8E 89
     Percent-encoded: %F0%9F%8E%89

🌍 → UTF-8 bytes: F0 9F 8C 8D
     Percent-encoded: %F0%9F%8C%8D

So the URL https://example.com/celebrate/🎉 becomes:

https://example.com/celebrate/%F0%9F%8E%89
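
The byte-by-byte transformation can be reproduced by hand. In this sketch, `percent_encode_utf8` is a hypothetical helper that escapes every byte, including ASCII, so it is only equivalent to a library encoder for non-ASCII input such as emoji.

```python
from urllib.parse import quote


def percent_encode_utf8(text: str) -> str:
    """Percent-encode every UTF-8 byte of text (nothing is left unescaped)."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


print("🎉".encode("utf-8").hex(" ").upper())      # F0 9F 8E 89
print(percent_encode_utf8("🎉"))                  # %F0%9F%8E%89
print(percent_encode_utf8("🎉") == quote("🎉"))   # True
```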

Encoding in Python

from urllib.parse import quote, unquote, urlencode, quote_plus

emoji = "🎉"

# Encode for URL path (the default safe='/' preserves slashes)
path = f"/celebrate/{quote(emoji)}"
print(path)  # /celebrate/%F0%9F%8E%89

# Decode back
print(unquote("%F0%9F%8E%89"))  # 🎉

# For query parameters
params = {"emoji": "🎉🚀", "lang": "en"}
query = urlencode(params, quote_via=quote)
print(query)  # emoji=%F0%9F%8E%89%F0%9F%9A%80&lang=en

# Encode for HTML forms (spaces → +)
form_encoded = quote_plus("search 🔍")
print(form_encoded)  # search+%F0%9F%94%8D
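
The two layers meet in a full URL: the host goes through IDNA, while the path and query are percent-encoded. A minimal sketch, assuming a hypothetical helper `build_url`, an `https` scheme, and Python's built-in IDNA2003 codec:

```python
from urllib.parse import quote, urlencode, urlunsplit


def build_url(host: str, path: str, params: dict) -> str:
    """Assemble an https URL with an IDN host and percent-encoded path/query."""
    netloc = host.encode("idna").decode("ascii")   # domain layer: Punycode
    encoded_path = quote(path)                     # path layer: "/" is preserved
    query = urlencode(params, quote_via=quote)     # query layer: %20 style, not "+"
    return urlunsplit(("https", netloc, encoded_path, query, ""))


print(build_url("🍕.ws", "/menu/🎉", {"q": "🚀"}))
# https://xn--vi8h.ws/menu/%F0%9F%8E%89?q=%F0%9F%9A%80
```

Keeping the host and path encoders separate makes it harder to apply percent-encoding to a hostname or Punycode to a path by mistake.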

Encoding in JavaScript

// encodeURIComponent encodes everything except unreserved chars
const emoji = "🎉";
const encoded = encodeURIComponent(emoji);
console.log(encoded); // %F0%9F%8E%89

// Decode
console.log(decodeURIComponent("%F0%9F%8E%89")); // 🎉

// Full URL with emoji path and query
const base = "https://example.com";
const path = `/tag/${encodeURIComponent("🚀tech")}`;
const params = new URLSearchParams({ q: "emoji 🔍", page: "1" });
const fullUrl = `${base}${path}?${params}`;
console.log(fullUrl);
// https://example.com/tag/%F0%9F%9A%80tech?q=emoji+%F0%9F%94%8D&page=1

// URL API handles encoding automatically
const url = new URL("https://example.com");
url.pathname = "/celebrate/🎉";
url.searchParams.set("q", "🔍 search");
console.log(url.toString());
// https://example.com/celebrate/%F0%9F%8E%89?q=%F0%9F%94%8D+search

Encoding in Go

package main

import (
    "fmt"
    "net/url"
)

func main() {
    base := "https://example.com"
    emoji := "🎉"

    // Encode path segment
    encoded := url.PathEscape(emoji)
    fmt.Println(encoded) // %F0%9F%8E%89

    // Build full URL
    u, _ := url.Parse(base)
    u.Path = "/celebrate/" + emoji  // url.URL handles encoding on output
    q := u.Query()
    q.Set("tag", "🚀 rocket")
    u.RawQuery = q.Encode()
    fmt.Println(u.String())
    // https://example.com/celebrate/%F0%9F%8E%89?tag=%F0%9F%9A%80+rocket

    // Decode
    decoded, _ := url.PathUnescape("%F0%9F%8E%89")
    fmt.Println(decoded) // 🎉
}

Security Considerations

Homoglyph Attacks in Domains

Modern browsers show the Unicode form of a domain in the address bar only for scripts they consider safe and fall back to the Punycode form otherwise, but look-alike characters can still create spoofing opportunities. An emoji like 🅰️ (U+1F170, NEGATIVE SQUARED LATIN CAPITAL LETTER A) looks like the letter A but encodes differently. In domains this is less dangerous than look-alike letters from other scripts, but it is worth being aware of in user-generated content.
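
When reviewing suspicious labels or user-generated links, it can help to list exactly which non-ASCII characters are present. The sketch below assumes a hypothetical helper `describe_non_ascii`; the sample string is made up for illustration.

```python
import unicodedata


def describe_non_ascii(text: str) -> list:
    """List (code point, Unicode name) pairs for each non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not ch.isascii()
    ]


# "\U0001F170" is the emoji that resembles the letter A
for code_point, name in describe_non_ascii("p\U0001F170ypal"):
    print(code_point, name)
# U+1F170 NEGATIVE SQUARED LATIN CAPITAL LETTER A
```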

Double-Encoding

A common bug is double-encoding emoji in URLs:

from urllib.parse import quote, unquote

# WRONG: encoding an already-encoded string
encoded = quote("🎉")          # "%F0%9F%8E%89"
double = quote(encoded)        # "%25F0%259F%258E%2589" (each "%" became "%25")

# RIGHT: encode once, decode once
encoded = quote("🎉")          # "%F0%9F%8E%89"
decoded = unquote(encoded)     # "🎉"
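
Double-encoded values leave a recognizable trace: every "%" turns into "%25", so "%25" followed by two hex digits is a strong hint. The sketch below uses a hypothetical helper `looks_double_encoded`; it is a heuristic and can give false positives when the original data legitimately contained text such as "%F0".

```python
import re
from urllib.parse import quote

# "%25" followed by two hex digits is what "%XX" becomes when encoded twice.
DOUBLE_ENCODED = re.compile(r"%25[0-9A-Fa-f]{2}")


def looks_double_encoded(value: str) -> bool:
    """Heuristically detect a percent-encoded string that was encoded again."""
    return bool(DOUBLE_ENCODED.search(value))


once = quote("🎉")
twice = quote(once)
print(looks_double_encoded(once))   # False
print(looks_double_encoded(twice))  # True
```

A check like this fits well in logging or tests, where it can flag the bug without silently rewriting URLs.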

Django URL Patterns

Django automatically decodes percent-encoded URL paths before passing them to views. You receive the raw emoji character, not the encoded form:

# urls.py
from django.urls import path
from . import views

urlpatterns = [
    path("tag/<str:tag>/", views.tag_view),
]

# views.py
from django.http import HttpResponse

def tag_view(request, tag):
    # tag = "🚀" (already decoded by Django)
    print(repr(tag))  # '🚀'
    return HttpResponse(tag)

Testing Emoji URLs

# Many curl builds accept a raw emoji in the path (assumes a UTF-8 terminal)
curl -v "https://example.com/🎉"

# Encoding manually is unambiguous
curl -v "https://example.com/%F0%9F%8E%89"

# Test Punycode conversion
python3 -c "print('🍕.ws'.encode('idna').decode('ascii'))"
