The Silent Data Loss Problem
EmojiEmoji
Từ tiếng Nhật (絵文字) có nghĩa là 'ký tự hình ảnh' — các ký hiệu đồ họa nhỏ dùng trong giao tiếp kỹ thuật số để diễn đạt ý tưởng, cảm xúc và sự vật. have caused more silent data corruption bugs than almost any other character type. A user submits a message containing 🎉, your server accepts it without error, but the database quietly truncates the row — or throws an opaque constraint error. The cause is almost always the same: the database is configured to use a character encoding that cannot represent emoji.
This guide covers the correct configuration for the three most widely used databases.
MySQL: The utf8mb4 Problem
MySQL's utf8 character set is not real UTF-8UTF-8
Kiểu mã hóa Unicode có chiều rộng thay đổi, dùng từ 1 đến 4 byte cho mỗi ký tự, thống trị trên web (98%+ website sử dụng).. It only supports characters up to 3 bytes (U+0000 to U+FFFF). Since emoji code points start at U+1F300 and require 4 bytes in UTF-8, MySQL's utf8 silently drops or errors on them.
The fix is utf8mb4 — MySQL's name for actual UTF-8.
Server-Level Configuration
In my.cnf or my.ini:
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
[client]
default-character-set = utf8mb4
Database and Table Level
-- Create database with utf8mb4
CREATE DATABASE myapp
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- Alter existing database
ALTER DATABASE myapp
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- Alter specific table
ALTER TABLE posts
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
-- Alter specific column
ALTER TABLE posts
MODIFY content TEXT
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Collation Choice
| Collation | Notes |
|---|---|
utf8mb4_unicode_ci |
Standard, good sorting, case-insensitive |
utf8mb4_unicode_520_ci |
UnicodeUnicode Tiêu chuẩn mã hóa ký tự phổ quát gán một số duy nhất cho mỗi ký tự trong tất cả hệ thống chữ viết và bộ ký hiệu, bao gồm cả emoji. 5.2.0 rules |
utf8mb4_0900_ai_ci |
MySQL 8.0+, Unicode 9.0, recommended for new projects |
utf8mb4_bin |
Binary comparison, case-sensitive, fastest |
For most applications, utf8mb4_unicode_ci (MySQL 5.x) or utf8mb4_0900_ai_ci (MySQL 8.x) is correct.
Connection String
Ensure your application sets utf8mb4 on the connection:
# Python (PyMySQL / mysqlclient)
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.mysql',
'OPTIONS': {
'charset': 'utf8mb4',
},
}
}
// Node.js (mysql2)
const connection = mysql.createConnection({
charset: 'utf8mb4',
collation: 'utf8mb4_unicode_ci',
});
Checking Your Current Configuration
-- Check server defaults
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
-- Check a specific column
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'myapp'
AND TABLE_NAME = 'posts'
AND COLUMN_NAME = 'content';
PostgreSQL: Emoji Work Out of the Box
PostgreSQL uses UTF-8 by default for new databases, and real UTF-8 that supports all 4-byte emoji. There is no utf8mb4 problem.
-- Create database (UTF8 is the default for most installations)
CREATE DATABASE myapp WITH ENCODING 'UTF8';
-- Verify encoding
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'myapp';
-- UTF8
Column Types for Emoji Text
-- TEXT is unlimited-length UTF-8, handles all emoji
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
content TEXT,
title VARCHAR(255) -- VARCHAR also handles emoji in PostgreSQL
);
The main PostgreSQL gotcha is column length limits with VARCHAR(n). The limit is in characters, not bytes, so VARCHAR(10) can hold 10 emoji (each up to 4 bytes). This is usually correct, but be aware when porting logic from MySQL where limits may have been set with byte-length assumptions.
Collation and Searching
-- Case-insensitive search including emoji
SELECT * FROM posts
WHERE content ILIKE '%🎉%';
-- Using pg_trgm for fuzzy matching (works with emoji)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON posts USING gin(content gin_trgm_ops);
JSON Columns
PostgreSQL's jsonb type stores UTF-8 natively:
CREATE TABLE events (
id SERIAL PRIMARY KEY,
metadata JSONB
);
INSERT INTO events (metadata)
VALUES ('{"reaction": "🎉", "user": "alice"}');
SELECT metadata->>'reaction' FROM events;
-- 🎉
SQLite: Simple but Configuration-Dependent
SQLite stores text as UTF-8 by default, and it supports all Unicode code points including emoji. However, the behavior depends on how the SQLite library was compiled and how you connect to it.
import sqlite3
conn = sqlite3.connect('myapp.db')
conn.execute("PRAGMA encoding = 'UTF-8'") # Default, but explicit is good
conn.execute("""
CREATE TABLE IF NOT EXISTS posts (
id INTEGER PRIMARY KEY,
content TEXT
)
""")
# Insert emoji — works fine in Python 3
conn.execute("INSERT INTO posts (content) VALUES (?)", ('Hello 🎉',))
conn.commit()
# Retrieve
row = conn.execute("SELECT content FROM posts LIMIT 1").fetchone()
print(row[0]) # Hello 🎉
SQLite Caveats
Text affinity: SQLite's type system uses affinities, not strict types. TEXT affinity stores any string as UTF-8. As long as you use parameterized queries (not string concatenation), emoji pass through cleanly.
LIKE is ASCII-only by default: SQLite's built-in LIKE is case-insensitive only for ASCII. For emoji-aware searching, use equality checks or the glob function, or enable the ICUICU (ICU)
International Components for Unicode — thư viện mã nguồn mở được sử dụng rộng rãi, cung cấp hỗ trợ Unicode và quốc tế hóa, bao gồm xử lý emoji. extension.
-- This works reliably for emoji matching
SELECT * FROM posts WHERE content = '🎉';
-- This may not work as expected for non-ASCII case folding
SELECT * FROM posts WHERE content LIKE '%🎉%';
-- (Works for emoji since emoji have no case, but be careful with letters)
Sizing Columns Correctly
A common mistake is sizing columns by byte length (since MySQL's older utf8 counted bytes). With utf8mb4 and PostgreSQL, VARCHAR(n) limits are in characters.
| Emoji Type | Code Points | UTF-8 Bytes |
|---|---|---|
| Basic emoji (😀) | 1 | 4 |
| Emoji + variation selectorVariation Selector (VS) Các ký tự Unicode (VS-15 U+FE0E và VS-16 U+FE0F) xác định xem một ký tự được hiển thị dưới dạng văn bản (đơn sắc) hay emoji (có màu). (❤️) |
2 | 7 |
| Emoji + skin tone (👍🏽) | 2 | 8 |
| ZWJZero Width Joiner (ZWJ) Ký tự Unicode vô hình (U+200D) dùng để ghép nhiều emoji thành một emoji tổng hợp, chẳng hạn kết hợp người và vật thể thành emoji nghề nghiệp. sequence (👩💻) |
3 | 11 |
| Family emoji (👨👩👧👦) | 7 | 25 |
If you store user reactions or status fields, VARCHAR(10) on code points is enough for most use cases. For freeform text, use TEXT without a length limit.
ORM Considerations
Django
Django automatically uses utf8mb4 for MySQL when you include 'charset': 'utf8mb4' in your database OPTIONS. For PostgreSQL, no special configuration is needed.
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.mysql',
'OPTIONS': {
'charset': 'utf8mb4',
'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
},
}
}
Sequelize (Node.js)
const sequelize = new Sequelize('database', 'username', 'password', {
dialect: 'mysql',
dialectOptions: {
charset: 'utf8mb4',
},
});
Quick Diagnostic
To check if your database can store emoji, run this SQL:
-- MySQL/PostgreSQL/SQLite
CREATE TEMP TABLE emoji_test (val TEXT);
INSERT INTO emoji_test VALUES ('🎉');
SELECT val FROM emoji_test;
DROP TABLE emoji_test;
If you see 🎉 in the result, you are correctly configured. If you see ? or an empty string, your encoding is wrong. If you get an error, check your connection charset setting.