What Is the Unicode Lookup Tool?
Every character you see on screen, including all 3,900+ emoji, has a unique numeric identifier assigned by the Unicode Consortium. The Unicode Lookup tool is a quick reference utility that takes either a code point (like U+1F600) or a pasted emoji character (like 😀) and returns the complete encoding picture: UTF-8 bytes, UTF-16 code units, HTML entities, CSS content values, and language-specific escape sequences.
For developers working with emoji in strings, databases, APIs, or user interfaces, this kind of instant breakdown eliminates the guesswork of converting between representations. Instead of reaching for a Python shell or searching scattered documentation, you get the full encoding story in one place.
Understanding Code Points
The U+ Notation
A code point is the canonical identifier for a Unicode character. It is written in the form U+ followed by four to six hexadecimal digits. The U+ prefix is universal shorthand: it does not represent bytes in memory, only an abstract numeric position within the Unicode codespace.
For example:
- U+1F600 — 😀 Grinning Face
- U+2764 — ❤ Heavy Black Heart
- U+1F1FA U+1F1F8 — 🇺🇸 Flag: United States (two code points)
When you copy a code point from a specification, a font tool, or another reference site, you can paste it directly into the Unicode Lookup tool exactly as written.
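As a rough illustration of the notation, the sketch below prints the U+ form of each code point in a string. The helper name `to_u_plus` is hypothetical and is not part of the tool.

```python
def to_u_plus(text: str) -> list[str]:
    """Return the U+ notation for every code point in text (hypothetical helper)."""
    # ord() yields the abstract code point; pad to at least four hex digits.
    return [f"U+{ord(ch):04X}" for ch in text]


print(to_u_plus("\U0001F600"))            # grinning face: one code point
print(to_u_plus("\u2764"))                # heavy black heart: one code point
print(to_u_plus("\U0001F1FA\U0001F1F8"))  # US flag: two code points
```

The flag example returns two entries, which matches the two-code-point listing above.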
Hexadecimal Code Points
Code points are expressed in hexadecimal (base 16) because the Unicode codespace runs from U+0000 to U+10FFFF — over 1.1 million possible positions. Decimal would be cumbersome for both humans and tools. Hex maps cleanly to the byte-level representations used by UTF-8 and UTF-16, making it the natural notation for encoding work.
If you are more comfortable with decimal, the tool displays the decimal equivalent alongside the hex value. U+1F600 in decimal is 128512.
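The hex and decimal forms are the same number written in different bases. A short Python check, using only built-ins, might look like this:

```python
code_point = int("1F600", 16)  # parse the hex digits that follow "U+"
print(code_point)              # 128512
print(f"{code_point:X}")       # 1F600

# Size of the codespace U+0000..U+10FFFF, inclusive.
codespace_size = 0x10FFFF + 1
print(f"{codespace_size:,}")   # 1,114,112
```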
Code Point Ranges for Emojis
Emoji code points are scattered across several Unicode blocks rather than grouped in a single contiguous range:
| Range | Content |
|---|---|
| U+2000 – U+27FF | Symbols, punctuation, and some legacy emoji (✅, ⚡) |
| U+1F300 – U+1F9FF | Core emoji block (faces, animals, food, activities) |
| U+1FA00 – U+1FAFF | Extended-A (newer additions, chess pieces, household objects) |
| U+1F1E6 – U+1F1FF | Regional Indicator Symbols (used in flag sequences) |
Understanding where an emoji lives in the codespace matters when you encounter database errors about character encoding ranges, which often surface because MySQL's utf8 charset only covers U+0000 through U+FFFF and rejects emoji above that range.
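One way to see whether a character falls outside the U+0000 to U+FFFF range is to compute its plane number. The sketch below assumes a hypothetical helper `plane_of`; plane 0 is the range the legacy charset covers, and anything in plane 1 or higher is outside it.

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane of a single character (0 = U+0000..U+FFFF)."""
    # Each plane holds 0x10000 code points, so the plane is the value above bit 16.
    return ord(ch) >> 16


for ch in ["\u26A1", "\u2705", "\U0001F600", "\U0001FA91"]:
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")
```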
How to Use the Tool
Enter a Code Point (e.g., U+1F600)
Type or paste a code point into the lookup field. The tool accepts several common input formats:
- U+1F600: canonical form with prefix
- 1F600: bare hex digits
- 0x1F600: C-style hex prefix
After submission, the tool resolves the code point to a character, displays the rendered glyph, and expands all encoding representations below.
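The tool's actual input handling is not documented here, but accepting these three formats could be sketched as follows. The function name `parse_code_point` and the regular expression are assumptions made for illustration.

```python
import re

# Optional "U+" or "0x" prefix, then one to six hex digits (case-insensitive).
_CODE_POINT = re.compile(r"^(?:U\+|0x)?([0-9A-F]{1,6})$", re.IGNORECASE)


def parse_code_point(text: str) -> int:
    """Parse 'U+1F600', '1F600', or '0x1F600' into an integer code point."""
    match = _CODE_POINT.match(text.strip())
    if not match:
        raise ValueError(f"not a code point: {text!r}")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"outside the Unicode codespace: {text!r}")
    return value


for raw in ["U+1F600", "1F600", "0x1F600"]:
    print(raw, "->", chr(parse_code_point(raw)))
```

All three inputs resolve to the same character, and values beyond U+10FFFF are rejected.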
Enter an Emoji Character
If you have an emoji character and want to reverse-engineer its code point, simply paste the character directly into the input field. The tool detects whether the input is a character or a code point notation and routes accordingly. Pasting 🥑 will resolve to U+1F951 and show all encoding formats automatically.
This reverse lookup is particularly useful when you receive an emoji in a log file, a database export, or an API response and need to identify exactly what character it is.
View Full Encoding Breakdown
Once the lookup resolves, the tool displays a structured breakdown of every major encoding format. Each value is individually copyable so you can paste the exact representation you need into your code without manual conversion.
Encoding Formats Explained
UTF-8 Bytes
UTF-8 is the dominant encoding on the web and in most modern applications. It encodes each code point as one to four bytes, with higher code points requiring more bytes.
The grinning face emoji 😀 (U+1F600) encodes to four bytes: F0 9F 98 80. You will see this representation when inspecting raw HTTP responses, binary file contents, or low-level network streams. If your application is stripping or mangling emoji, the byte sequence is the place to start diagnosing.
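The same byte sequence can be reproduced in a Python shell. This is a minimal sketch with a hypothetical helper `utf8_hex`; it needs Python 3.8 or later for the separator argument of `bytes.hex`.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()


print(utf8_hex("A"))           # 41           (one byte)
print(utf8_hex("\u00E9"))      # C3 A9        (two bytes)
print(utf8_hex("\u2764"))      # E2 9D A4     (three bytes)
print(utf8_hex("\U0001F600"))  # F0 9F 98 80  (four bytes)
```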
UTF-16 Surrogate Pairs
UTF-16 uses 16-bit code units. Characters in the Basic Multilingual Plane (U+0000 to U+FFFF) fit in a single 16-bit unit. Characters above U+FFFF, which include most emoji, require two 16-bit units called a surrogate pair.
For U+1F600, the surrogate pair is D83D DE00. The first unit (D83D) is the high surrogate and the second (DE00) is the low surrogate. JavaScript strings are internally UTF-16, which is why '😀'.length evaluates to 2 rather than 1: each surrogate counts as a separate code unit. This is a common source of off-by-one errors in string manipulation code that was written without emoji in mind.
The Unicode Lookup tool shows the surrogate pair for any emoji in the supplementary planes so you can anticipate and handle these length discrepancies.
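The arithmetic behind a surrogate pair is short enough to work through by hand. The sketch below uses a hypothetical helper `surrogate_pair` and cross-checks the result against Python's own UTF-16 encoder.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check with the built-in encoder (big-endian, no byte order mark).
print("\U0001F600".encode("utf-16-be").hex(" ", 2).upper())  # D83D DE00
```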
HTML Entities (Decimal and Hex)
HTML supports numeric character references in two forms:
- Decimal: &#128512;
- Hexadecimal: &#x1F600;
Both render identically in the browser. The hex form is more readable for anyone cross-referencing Unicode documentation, while the decimal form occasionally appears in older codebases and XML documents. The tool provides both so you can match whatever convention your project already uses.
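Both reference forms can be derived from the code point directly. The following sketch uses a hypothetical helper `html_references`; the standard-library `html.unescape` confirms that the two forms decode to the same character.

```python
import html


def html_references(ch: str) -> tuple[str, str]:
    """Return the (decimal, hexadecimal) numeric character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


decimal_ref, hex_ref = html_references("\U0001F600")
print(decimal_ref)  # &#128512;
print(hex_ref)      # &#x1F600;

# Both forms decode back to the same character.
print(html.unescape(decimal_ref) == html.unescape(hex_ref) == "\U0001F600")  # True
```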
CSS content Property
When rendering emoji through CSS (for decorative icons, pseudo-elements, or font-icon replacements), the content property uses a backslash-prefixed hex escape without the U+ prefix:
.emoji-icon::before {
  content: "\1F600";
}
This format strips the U+ and replaces it with \. The Unicode Lookup tool displays the ready-to-paste CSS value alongside the other representations.
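Generating the CSS value is a matter of formatting the code point as hex after a backslash. This is a sketch with a hypothetical helper `css_content_escape`, not the tool's own code.

```python
def css_content_escape(ch: str) -> str:
    """Return the backslash-prefixed hex escape used in CSS strings."""
    return f"\\{ord(ch):X}"


print(f'content: "{css_content_escape(chr(0x1F600))}";')  # content: "\1F600";
```

If the escape is immediately followed by another hex digit or a space in the same string, CSS expects a single space after the escape to terminate it.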
Python and JavaScript Literals
Each language has its own string escape format for Unicode characters:
Python uses \U followed by eight hex digits for supplementary plane characters:
grinning_face = "\U0001F600"
# or use chr() with the code point as an integer (hex literal shown; chr(128512) is equivalent):
grinning_face = chr(0x1F600)
JavaScript uses \u{...} in ES2015+ string and template literals:
const grinningFace = "\u{1F600}";
// older environments: write the surrogate pair explicitly
const grinningFaceLegacy = "\uD83D\uDE00";
The tool outputs both the modern \u{...} form and the surrogate pair escape so you can target any JavaScript environment.
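The escape strings themselves can be built from the code point. The sketch below is an assumption about how such output could be produced; the three helper names are hypothetical.

```python
def python_literal(ch: str) -> str:
    """\\UXXXXXXXX for supplementary characters, \\uXXXX otherwise."""
    cp = ord(ch)
    return f"\\U{cp:08X}" if cp > 0xFFFF else f"\\u{cp:04X}"


def js_literal(ch: str) -> str:
    """ES2015+ code point escape, e.g. \\u{1F600}."""
    return f"\\u{{{ord(ch):X}}}"


def js_legacy_literal(ch: str) -> str:
    """One \\uXXXX escape per UTF-16 code unit (surrogate pair if needed)."""
    units = ch.encode("utf-16-be")
    return "".join(
        f"\\u{int.from_bytes(units[i:i + 2], 'big'):04X}"
        for i in range(0, len(units), 2)
    )


face = "\U0001F600"
print(python_literal(face))     # \U0001F600
print(js_literal(face))         # \u{1F600}
print(js_legacy_literal(face))  # \uD83D\uDE00
```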
Practical Developer Use Cases
Debugging Emoji in Code
When an emoji appears as a replacement character (�), a question mark, or a sequence of garbage bytes, you need to identify the original code point and then trace which encoding layer is failing. Copy the broken output, paste it into the Unicode Lookup tool, and compare its byte sequence against what your application is producing. Mismatches in byte count or sequence often point to an encoding mismatch between the application layer and the database, or between the database and the client.
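Two common failure modes can be reproduced in a few lines of Python, which may help in recognizing them in real output. The first decodes UTF-8 bytes with the wrong single-byte charset (Latin-1 is used here as an example), and the second truncates the byte sequence.

```python
original = "\U0001F600"
raw = original.encode("utf-8")  # F0 9F 98 80

# Failure mode 1: UTF-8 bytes decoded as Latin-1 give one character per byte.
mojibake = raw.decode("latin-1")
print(len(mojibake))  # 4

# This kind of damage is reversible if no bytes were lost along the way.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# Failure mode 2: a truncated sequence cannot be decoded, so a lenient
# decoder substitutes the replacement character U+FFFD.
truncated = raw[:3].decode("utf-8", errors="replace")
print("\uFFFD" in truncated)  # True
```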
Cross-Platform Encoding Issues
Different platforms store and transmit Unicode differently. A string that round-trips cleanly between two Python services may break when passed through a Java layer that uses UTF-16 internally, or through a MySQL column using the utf8 charset instead of utf8mb4. The encoding breakdown from the Unicode Lookup tool gives you the concrete byte values and code unit sequences to verify that each layer is handling the character correctly.
Database Storage Considerations
MySQL's legacy utf8 charset is limited to three bytes per character, which excludes all emoji above U+FFFF. The correct charset for emoji storage is utf8mb4, which supports the full four-byte UTF-8 range. PostgreSQL uses UTF8 natively and handles all Unicode code points without special configuration.
If you see an "Incorrect string value" error in MySQL when inserting emoji, the Unicode Lookup tool lets you confirm whether the character is in the four-byte range (above U+FFFF) and verify that your column charset is set to utf8mb4.
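Before inserting text, an application could also scan it for characters in the four-byte range. This is a minimal sketch with a hypothetical helper `four_byte_code_points`; it does not talk to a database.

```python
def four_byte_code_points(text: str) -> list[str]:
    """List the code points in text that need four bytes in UTF-8 (above U+FFFF)."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFFFF]


print(four_byte_code_points("ok \u2764 \U0001F600"))  # ['U+1F600']
print(four_byte_code_points("plain ASCII"))           # []
```

An empty list means every character fits in three bytes or fewer.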
Multi-Code-Point Emojis
Many emoji that appear as a single character are actually sequences of multiple code points rendered as a single visual unit — a grapheme cluster. Common sequence types include:
- ZWJ sequences: A Zero Width Joiner (U+200D) connects two or more emoji into a combined form. For example, 👩‍💻 (woman technologist) is U+1F469 U+200D U+1F4BB.
- Skin tone modifiers: A Fitzpatrick modifier (U+1F3FB through U+1F3FF) is appended to a base emoji: 👍🏽 is U+1F44D U+1F3FD.
- Flag sequences: Pairs of Regional Indicator Symbols combine into country flags: 🇯🇵 is U+1F1EF U+1F1F5.
The Unicode Lookup tool handles single code points. For sequences composed of multiple code points, use the Sequence Analyzer, which breaks a complex emoji sequence into its constituent components and explains the role of each part.
Understanding grapheme clusters is especially important when implementing string length checks or character counting in user interfaces. A skin-toned emoji or a family emoji built from ZWJ connections may consist of three, five, or more code points while visually occupying the space of one character.
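The gap between what the user sees and what the code counts can be shown with plain Python. In the sketch below, `len()` counts code points and the hypothetical helper `utf16_units` counts UTF-16 code units; neither equals the single visible character. Counting grapheme clusters themselves requires a library that implements Unicode text segmentation, which the Python standard library does not provide.

```python
samples = {
    "woman technologist": "\U0001F469\u200D\U0001F4BB",
    "thumbs up, medium skin tone": "\U0001F44D\U0001F3FD",
    "flag: Japan": "\U0001F1EF\U0001F1F5",
}


def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


for name, seq in samples.items():
    print(f"{name}: {len(seq)} code points, {utf16_units(seq)} UTF-16 units")
```

The woman technologist sequence reports 3 code points and 5 UTF-16 units; the other two report 2 and 4.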
Related Tools and Resources
- Unicode Lookup — Enter a code point or paste an emoji for a full encoding breakdown
- Sequence Analyzer — Decompose multi-code-point emoji sequences (ZWJ, skin tones, flags)
Glossary reference:
- Code point: The numeric identity of a Unicode character
- UTF-8: Variable-width encoding used on the web
- UTF-16: Variable-width encoding (one or two 16-bit code units per character) used internally by JavaScript and Java
- Surrogate pair: Two UTF-16 code units representing a supplementary plane character
- Grapheme cluster: A sequence of code points that renders as a single visible character