Storing Emojis in Databases: MySQL, PostgreSQL, and SQLite

The Silent Data Loss Problem

EmojiEmoji
Mot japonais (絵文字) signifiant 'caractère image' — petits symboles graphiques utilisés dans la communication numérique pour exprimer des idées, des émotions et des objets.
have caused more silent data corruption bugs than almost any other character type. A user submits a message containing 🎉, your server accepts it without error, but the database quietly truncates the row — or throws an opaque constraint error. The cause is almost always the same: the database is configured to use a character encoding that cannot represent emoji.

This guide covers the correct configuration for the three most widely used databases.

MySQL: The utf8mb4 Problem

MySQL's utf8 character set is not real UTF-8UTF-8
Encodage Unicode à largeur variable utilisant de 1 à 4 octets par caractère, dominant sur le web (utilisé par plus de 98 % des sites web).
. It only supports characters up to 3 bytes (U+0000 to U+FFFF). Since emoji code points start at U+1F300 and require 4 bytes in UTF-8, MySQL's utf8 silently drops or errors on them.

The fix is utf8mb4 — MySQL's name for actual UTF-8.

Server-Level Configuration

In my.cnf or my.ini:

[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

[client]
default-character-set = utf8mb4

Database and Table Level

-- Create database with utf8mb4
CREATE DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Alter existing database
ALTER DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Alter specific table
ALTER TABLE posts
  CONVERT TO CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Alter specific column
ALTER TABLE posts
  MODIFY content TEXT
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

Collation Choice

Collation Notes
utf8mb4_unicode_ci Standard, good sorting, case-insensitive
utf8mb4_unicode_520_ci UnicodeUnicode
Standard universel d'encodage des caractères qui attribue un numéro unique à chaque caractère de tous les systèmes d'écriture et ensembles de symboles, y compris les emoji.
5.2.0 rules
utf8mb4_0900_ai_ci MySQL 8.0+, Unicode 9.0, recommended for new projects
utf8mb4_bin Binary comparison, case-sensitive, fastest

For most applications, utf8mb4_unicode_ci (MySQL 5.x) or utf8mb4_0900_ai_ci (MySQL 8.x) is correct.

Connection String

Ensure your application sets utf8mb4 on the connection:

# Python (PyMySQL / mysqlclient)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'OPTIONS': {
            'charset': 'utf8mb4',
        },
    }
}
// Node.js (mysql2)
const connection = mysql.createConnection({
  charset: 'utf8mb4',
  collation: 'utf8mb4_unicode_ci',
});

Checking Your Current Configuration

-- Check server defaults
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Check a specific column
SELECT CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'myapp'
  AND TABLE_NAME = 'posts'
  AND COLUMN_NAME = 'content';

PostgreSQL: Emoji Work Out of the Box

PostgreSQL uses UTF-8 by default for new databases, and real UTF-8 that supports all 4-byte emoji. There is no utf8mb4 problem.

-- Create database (UTF8 is the default for most installations)
CREATE DATABASE myapp WITH ENCODING 'UTF8';

-- Verify encoding
SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = 'myapp';
-- UTF8

Column Types for Emoji Text

-- TEXT is unlimited-length UTF-8, handles all emoji
CREATE TABLE posts (
  id SERIAL PRIMARY KEY,
  content TEXT,
  title VARCHAR(255)  -- VARCHAR also handles emoji in PostgreSQL
);

The main PostgreSQL gotcha is column length limits with VARCHAR(n). The limit is in characters, not bytes, so VARCHAR(10) can hold 10 emoji (each up to 4 bytes). This is usually correct, but be aware when porting logic from MySQL where limits may have been set with byte-length assumptions.

Collation and Searching

-- Case-insensitive search including emoji
SELECT * FROM posts
WHERE content ILIKE '%🎉%';

-- Using pg_trgm for fuzzy matching (works with emoji)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON posts USING gin(content gin_trgm_ops);

JSON Columns

PostgreSQL's jsonb type stores UTF-8 natively:

CREATE TABLE events (
  id SERIAL PRIMARY KEY,
  metadata JSONB
);

INSERT INTO events (metadata)
VALUES ('{"reaction": "🎉", "user": "alice"}');

SELECT metadata->>'reaction' FROM events;
-- 🎉

SQLite: Simple but Configuration-Dependent

SQLite stores text as UTF-8 by default, and it supports all Unicode code points including emoji. However, the behavior depends on how the SQLite library was compiled and how you connect to it.

import sqlite3

conn = sqlite3.connect('myapp.db')
conn.execute("PRAGMA encoding = 'UTF-8'")  # Default, but explicit is good

conn.execute("""
  CREATE TABLE IF NOT EXISTS posts (
    id INTEGER PRIMARY KEY,
    content TEXT
  )
""")

# Insert emoji — works fine in Python 3
conn.execute("INSERT INTO posts (content) VALUES (?)", ('Hello 🎉',))
conn.commit()

# Retrieve
row = conn.execute("SELECT content FROM posts LIMIT 1").fetchone()
print(row[0])  # Hello 🎉

SQLite Caveats

Text affinity: SQLite's type system uses affinities, not strict types. TEXT affinity stores any string as UTF-8. As long as you use parameterized queries (not string concatenation), emoji pass through cleanly.

LIKE is ASCII-only by default: SQLite's built-in LIKE is case-insensitive only for ASCII. For emoji-aware searching, use equality checks or the glob function, or enable the ICUICU (ICU)
International Components for Unicode — une bibliothèque open-source très répandue offrant un support Unicode et d'internationalisation, y compris le traitement des emoji.
extension.

-- This works reliably for emoji matching
SELECT * FROM posts WHERE content = '🎉';

-- This may not work as expected for non-ASCII case folding
SELECT * FROM posts WHERE content LIKE '%🎉%';
-- (Works for emoji since emoji have no case, but be careful with letters)

Sizing Columns Correctly

A common mistake is sizing columns by byte length (since MySQL's older utf8 counted bytes). With utf8mb4 and PostgreSQL, VARCHAR(n) limits are in characters.

Emoji Type Code Points UTF-8 Bytes
Basic emoji (😀) 1 4
Emoji + variation selector (❤️) 2 7
Emoji + skin tone (👍🏽) 2 8
ZWJJointure sans chasse (ZWJ)
Caractère Unicode invisible (U+200D) utilisé pour combiner plusieurs emoji en un seul emoji composite, comme l'assemblage de personnes et d'objets pour former des emoji de professions.
sequence (👩‍💻)
3 11
Family emoji (👨‍👩‍👧‍👦) 7 25

If you store user reactions or status fields, VARCHAR(10) on code points is enough for most use cases. For freeform text, use TEXT without a length limit.

ORM Considerations

Django

Django automatically uses utf8mb4 for MySQL when you include 'charset': 'utf8mb4' in your database OPTIONS. For PostgreSQL, no special configuration is needed.

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'OPTIONS': {
            'charset': 'utf8mb4',
            'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        },
    }
}

Sequelize (Node.js)

const sequelize = new Sequelize('database', 'username', 'password', {
  dialect: 'mysql',
  dialectOptions: {
    charset: 'utf8mb4',
  },
});

Quick Diagnostic

To check if your database can store emoji, run this SQL:

-- MySQL/PostgreSQL/SQLite
CREATE TEMP TABLE emoji_test (val TEXT);
INSERT INTO emoji_test VALUES ('🎉');
SELECT val FROM emoji_test;
DROP TABLE emoji_test;

If you see 🎉 in the result, you are correctly configured. If you see ? or an empty string, your encoding is wrong. If you get an error, check your connection charset setting.

Termes du glossaire

Emoji Emoji
Mot japonais (絵文字) signifiant 'caractère image' — petits symboles graphiques utilisés dans la communication numérique pour exprimer des idées, des émotions et des objets.
ICU (ICU) ICU (ICU)
International Components for Unicode — une bibliothèque open-source très répandue offrant un support Unicode et d'internationalisation, y compris le traitement des emoji.
Jointure sans chasse (ZWJ) Jointure sans chasse (ZWJ)
Caractère Unicode invisible (U+200D) utilisé pour combiner plusieurs emoji en un seul emoji composite, comme l'assemblage de personnes et d'objets pour former des emoji de professions.
Sélecteur de variante (VS) Sélecteur de variante (VS)
Caractères Unicode (VS-15 U+FE0E et VS-16 U+FE0F) qui déterminent si un caractère s'affiche en présentation texte (monochrome) ou en présentation emoji (en couleur).
Unicode Unicode
Standard universel d'encodage des caractères qui attribue un numéro unique à chaque caractère de tous les systèmes d'écriture et ensembles de symboles, y compris les emoji.
UTF-8 UTF-8
Encodage Unicode à largeur variable utilisant de 1 à 4 octets par caractère, dominant sur le web (utilisé par plus de 98 % des sites web).

Articles associés