Unicode Lookup

Enter a codepoint like U+1F600 and get the emoji, encoding details, UTF-8/16 bytes, and HTML entities.

How to Use

1. Enter a codepoint or emoji
Type a Unicode codepoint in U+XXXX format (e.g., U+1F600), paste an emoji character directly, or enter a descriptive name to search. The tool accepts hex values with or without the U+ prefix and handles multi-codepoint sequences.
2. Read the full Unicode metadata
Review the official Unicode character name, the block name, the Unicode version that introduced the character, its General Category, and all relevant emoji properties (Emoji, Emoji_Presentation, Emoji_Modifier_Base, etc.) from Unicode's emoji-data.txt.
3. Copy any of the 8 encoding formats
Select and copy the encoding you need: UTF-8 bytes, UTF-16 code units, HTML decimal or hex entity, CSS escape, JavaScript string escape, Python escape, Java escape, or URL percent-encoding. Each is displayed in ready-to-use format.
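
The formats listed in step 3 can be reproduced in a few lines of Python. The following is a minimal sketch under stated assumptions, not the tool's actual implementation: the function names parse_codepoint and encoding_formats and the dictionary keys are made up for illustration, and the HTML decimal and hex entities are returned as separate entries.

```python
def parse_codepoint(text: str) -> int:
    """Parse 'U+1F600', 'u+1f600', or bare hex '1F600' into an integer."""
    cleaned = text.strip().upper()
    if cleaned.startswith("U+"):
        cleaned = cleaned[2:]
    value = int(cleaned, 16)
    if not 0 <= value <= 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {text!r}")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError(f"surrogate codepoints cannot be encoded alone: {text!r}")
    return value


def encoding_formats(cp: int) -> dict[str, str]:
    """Return illustrative copy-ready representations of one codepoint."""
    ch = chr(cp)
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    # Split the big-endian UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return {
        "utf8_bytes": " ".join(f"{b:02X}" for b in utf8),
        "utf16_units": " ".join(f"{u:04X}" for u in units),
        "html_decimal": f"&#{cp};",
        "html_hex": f"&#x{cp:X};",
        # The trailing space terminates the CSS escape safely.
        "css": f"\\{cp:X} ",
        "javascript": f"\\u{{{cp:X}}}",
        "python": f"\\U{cp:08X}" if cp > 0xFFFF else f"\\u{cp:04X}",
        # Java string escapes are written per UTF-16 code unit.
        "java": "".join(f"\\u{u:04X}" for u in units),
        "url": "".join(f"%{b:02X}" for b in utf8),
    }


if __name__ == "__main__":
    for label, value in encoding_formats(parse_codepoint("U+1F600")).items():
        print(f"{label:13} {value}")
```

Multi-codepoint sequences could be handled by applying the same function to each codepoint in turn and concatenating the results.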

About

At the most fundamental level, every emoji is a sequence of one or more Unicode codepoints, each a number in the Unicode codespace (U+0000 to U+10FFFF) assigned and maintained by the Unicode Consortium. The mapping from a codepoint to a displayed glyph passes through several layers: the encoding form (UTF-8, UTF-16, or UTF-32) used to store and transmit the character as bytes, the rendering engine that looks up the glyph in a font, and the font itself (which may be an OS system font or a custom emoji font). Understanding codepoints and encodings is foundational to building applications that handle emoji correctly.

Emoji codepoints are spread across more than a dozen Unicode blocks, reflecting the history of their incorporation into Unicode. Many early emoji were originally proprietary characters from Japanese mobile carriers (NTT DoCoMo, au, SoftBank) that were harmonized into Unicode 6.0 in 2010 using codepoints in previously unassigned ranges. Most modern emoji reside in the Supplementary Multilingual Plane (SMP, U+10000–U+1FFFF). Like every codepoint above U+FFFF, they require surrogate pair encoding in UTF-16, a detail that causes persistent bugs in JavaScript (which uses UTF-16 internally) and other languages with similar string models.
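
The surrogate pair arithmetic mentioned above is short enough to write out by hand. This is an illustrative sketch; the function names to_surrogate_pair and from_surrogate_pair are assumptions chosen for this example, and production code would normally rely on the language's built-in UTF-16 encoder instead.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a codepoint above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only codepoints U+10000..U+10FFFF use surrogate pairs")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04X} {low:04X}")
    print(f"round trip -> U+{from_surrogate_pair(high, low):X}")
```

The two code units are the reason a single emoji above U+FFFF counts as length 2 in UTF-16-based string models.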

Beyond the codepoint itself, each emoji carries a set of Unicode character properties defined in emoji-data.txt, UnicodeData.txt, and related data files. These properties — Emoji, Emoji_Presentation, Emoji_Modifier_Base, Emoji_Component, Extended_Pictographic — are what text shaping engines use to determine how to process sequences, apply modifiers, and segment grapheme clusters. The Unicode Character Database (UCD) is the authoritative source for all of this metadata, available for download at unicode.org and updated with each annual Unicode release.

FAQ

What is the Unicode codepoint system and how are emoji codepoints assigned?

Unicode assigns every character a unique numerical identifier called a codepoint, written as U+ followed by four to six hexadecimal digits (U+XXXX, or U+XXXXX and longer for codepoints above U+FFFF). The Unicode codespace spans from U+0000 to U+10FFFF, which is 1,114,112 possible positions, of which approximately 150,000 are currently assigned. Emoji are concentrated in several ranges: U+2600–U+27BF (Miscellaneous Symbols and Dingbats), U+1F300–U+1F9FF (Miscellaneous Symbols and Pictographs, Emoticons, Transport and Map Symbols, Supplemental Symbols and Pictographs, and neighboring blocks), and U+1FA70–U+1FAFF (Symbols and Pictographs Extended-A). New emoji codepoints are allocated by the Unicode Technical Committee during the annual release process, taking into account factors such as avoiding collisions with reserved ranges and keeping related characters grouped together.
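
As a small illustration of the notation, the sketch below formats codepoints in U+ form and looks up character names with Python's standard unicodedata module. The helper names format_codepoint and describe are assumptions made for this example, and the names returned depend on the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

# Number of positions in the codespace U+0000..U+10FFFF.
CODESPACE_SIZE = 0x10FFFF + 1


def format_codepoint(cp: int) -> str:
    """Format an integer as U+XXXX, padding to at least four hex digits."""
    return f"U+{cp:04X}"


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME (category)' for a single character."""
    cp = ord(ch)
    name = unicodedata.name(ch, "<no name in this Unicode version>")
    return f"{format_codepoint(cp)} {name} ({unicodedata.category(ch)})"


if __name__ == "__main__":
    print(CODESPACE_SIZE)
    print(describe("A"))
    print(describe(chr(0x1F600)))
```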

Why do emoji require multiple bytes in UTF-8?

UTF-8 is a variable-width encoding that uses 1 to 4 bytes per Unicode codepoint. The number of bytes depends on the codepoint value: U+0000–U+007F (ASCII) uses 1 byte, U+0080–U+07FF uses 2 bytes, U+0800–U+FFFF uses 3 bytes, and U+10000–U+10FFFF uses 4 bytes. Most emoji codepoints fall in the U+1F300–U+1FAFF range (Supplementary Multilingual Plane), which requires 4 bytes in UTF-8. For example, 😀 (U+1F600) encodes as F0 9F 98 80 in hex. A ZWJ sequence emoji like 👩‍💻 consists of two 4-byte codepoints (U+1F469 and U+1F4BB) joined by the ZWJ U+200D (3 bytes), for a total of 11 bytes for a single displayed emoji.
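
The byte-length rule above can be checked with a hand-written encoder. This is a teaching sketch only; utf8_length and utf8_encode are hypothetical names, and real code should simply call str.encode("utf-8").

```python
def utf8_length(cp: int) -> int:
    """Number of UTF-8 bytes needed for one codepoint."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def utf8_encode(cp: int) -> bytes:
    """Encode one codepoint to UTF-8 by hand."""
    n = utf8_length(cp)
    if n == 1:
        return bytes([cp])
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n]
    out = []
    # Peel off six bits at a time for the continuation bytes (10xxxxxx).
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # The remaining high bits go into the lead byte.
    out.append(lead_marker | cp)
    return bytes(reversed(out))


if __name__ == "__main__":
    print(utf8_encode(0x1F600).hex(" ").upper())
    woman_technologist = "\U0001F469\u200D\U0001F4BB"
    print(sum(utf8_length(ord(c)) for c in woman_technologist))
```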

What is the difference between UTF-8, UTF-16, and UTF-32 for emoji?

All three are Unicode encoding forms that represent the same codepoints using code units of different sizes. UTF-8 uses 1–4 bytes per codepoint and is the dominant encoding for web content and databases. UTF-16 uses 2 or 4 bytes: emoji above U+FFFF require a surrogate pair (two 2-byte code units), which is why the length property of a JavaScript string counts these emoji as 2. UTF-32 always uses exactly 4 bytes per codepoint, making random access by codepoint O(1) but consuming more memory. For emoji processing, UTF-16's surrogate pair handling is the most common source of programming bugs, since naive string length and indexing operations produce incorrect results for any emoji above U+FFFF.
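
The size differences can be compared directly using Python's built-in codecs. A minimal sketch follows; encoded_sizes is a hypothetical helper name, and the little-endian codec variants are used only so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Count codepoints and encoded sizes of a string in each encoding form."""
    return {
        "codepoints": len(text),  # Python strings are indexed by codepoint
        "utf8_bytes": len(text.encode("utf-8")),
        # Two bytes per UTF-16 code unit; this matches JavaScript's length.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in ("A", "\u2600", "\U0001F600"):
        print(f"U+{ord(sample):04X}", encoded_sizes(sample))
```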

What is a Unicode block and which blocks contain emoji?

Unicode blocks are contiguous, named ranges of codepoints, usually grouped by script or character type. Emoji are distributed across multiple blocks: the Miscellaneous Symbols block (U+2600–U+26FF) contains many weather, chess, and religious symbols that predate the emoji designation; Dingbats (U+2700–U+27BF) includes decorative marks; Miscellaneous Symbols and Pictographs (U+1F300–U+1F5FF), Emoticons (U+1F600–U+1F64F), Transport and Map Symbols (U+1F680–U+1F6FF), Supplemental Symbols and Pictographs (U+1F900–U+1F9FF), and Symbols and Pictographs Extended-A (U+1FA70–U+1FAFF) cover the bulk of modern emoji. The distribution across multiple blocks reflects the incremental history of emoji addition to Unicode.
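
A block lookup reduces to a range search. The sketch below hard-codes only the blocks named in this answer, with Symbols and Pictographs Extended-A starting at U+1FA70 as listed in Unicode's Blocks.txt; the table name EMOJI_BLOCKS and the function block_of are assumptions for illustration, and a complete tool would load every block from Blocks.txt instead.

```python
from typing import Optional

# (first codepoint, last codepoint, block name); a partial table for illustration.
EMOJI_BLOCKS = [
    (0x2600, 0x26FF, "Miscellaneous Symbols"),
    (0x2700, 0x27BF, "Dingbats"),
    (0x1F300, 0x1F5FF, "Miscellaneous Symbols and Pictographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x1F680, 0x1F6FF, "Transport and Map Symbols"),
    (0x1F900, 0x1F9FF, "Supplemental Symbols and Pictographs"),
    (0x1FA70, 0x1FAFF, "Symbols and Pictographs Extended-A"),
]


def block_of(cp: int) -> Optional[str]:
    """Return the block name for a codepoint, or None if it is not in the table."""
    for first, last, name in EMOJI_BLOCKS:
        if first <= cp <= last:
            return name
    return None


if __name__ == "__main__":
    for cp in (0x2600, 0x1F4BB, 0x1F600, 0x1FAE0, 0x41):
        print(f"U+{cp:04X}: {block_of(cp)}")
```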

What Unicode properties are most important for emoji handling?

Unicode's emoji-data.txt file defines the properties most critical for emoji processing. The 'Emoji' property marks codepoints that are emoji or can appear in emoji sequences. 'Emoji_Presentation' marks codepoints that default to emoji rendering without a variation selector. 'Emoji_Modifier_Base' identifies codepoints that can accept Fitzpatrick skin tone modifiers. 'Emoji_Modifier' identifies the five skin tone modifier codepoints themselves. 'Emoji_Component' marks codepoints like ZWJ, variation selectors, and tag characters that serve as building blocks of emoji sequences rather than as standalone emoji. 'Extended_Pictographic' is a broad set covering pictographic emoji as well as unassigned codepoints in reserved ranges; it is used in grapheme cluster breaking rules so that even future emoji are segmented correctly.
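
Each data line in emoji-data.txt has the layout "codepoint-or-range ; property # comment", so the file can be read with a short parser. The sketch below runs on a hand-written excerpt in that layout, not a verbatim copy of the file, and the excerpt deliberately omits other properties these codepoints carry in the real data. The names SAMPLE and parse_emoji_data are assumptions for illustration.

```python
from collections import defaultdict

# Hand-written excerpt in the emoji-data.txt layout (incomplete on purpose).
SAMPLE = """\
# codepoint(s) ; property # comment
200D          ; Emoji_Component        # zero width joiner
1F3FB..1F3FF  ; Emoji_Modifier         # skin tone modifiers
1F44B         ; Emoji_Modifier_Base    # waving hand
1F600         ; Emoji                  # grinning face
1F600         ; Emoji_Presentation     # grinning face
1F600         ; Extended_Pictographic  # grinning face
"""


def parse_emoji_data(text: str) -> dict[int, set[str]]:
    """Map each codepoint to the set of property names listed for it."""
    props: defaultdict[int, set[str]] = defaultdict(set)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        range_part, prop = (field.strip() for field in line.split(";", 1))
        start, _, end = range_part.partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first
        for cp in range(first, last + 1):
            props[cp].add(prop)
    return dict(props)


if __name__ == "__main__":
    table = parse_emoji_data(SAMPLE)
    for cp in sorted(table):
        print(f"U+{cp:04X}: {', '.join(sorted(table[cp]))}")
```

Pointing the same function at a downloaded copy of the real file should yield the full property table, assuming the file keeps this layout.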