Emoji Regex Patterns: Matching Emojis in JavaScript and Python

Why EmojiEmoji
A Japanese word (็ตตๆ–‡ๅญ—) meaning 'picture character' โ€” small graphical symbols used in digital communication to express ideas, emotions, and objects.
Regex Is Hard

Writing a regex that correctly matches emoji is surprisingly difficult. A single visible emoji like ๐Ÿ‘จโ€๐Ÿ‘ฉโ€๐Ÿ‘งโ€๐Ÿ‘ฆ (family) is composed of 7 UnicodeUnicode
Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji.
code points joined by invisible characters. A regex that matches a single character will match fragments of emoji, or miss them entirely.

The root problems are:

  1. Variable length: emoji range from 1 code pointCode Point
    A unique numerical value assigned to each character in the Unicode standard, written in the format U+XXXX (e.g., U+1F600 for ๐Ÿ˜€).
    (๐Ÿ˜€) to 10+ code points (complex ZWJZero Width Joiner (ZWJ)
    An invisible Unicode character (U+200D) used to join multiple emoji into a single composite emoji, such as combining people and objects into profession emoji.
    sequences)
  2. Surrogate pairs: in UTF-16UTF-16
    A variable-width Unicode encoding that uses 2 or 4 bytes per character, used internally by JavaScript, Java, and Windows.
    environments (JavaScript), each emoji above U+FFFF is two code units
  3. Combining characters: variation selectors, skin tone modifiers, and ZWJ are invisible but part of the emoji
  4. Evolving standard: new emoji are added each Unicode release, so hardcoded ranges go stale

JavaScript: Using the Unicode Flag

The u flag enables Unicode mode in JavaScript regex, making . match a full code point rather than a single UTF-16 code unitCode Unit
The minimum bit combination used for encoding a character: 8-bit for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32.
.

// Without u flag: . matches one code unit (breaks emoji)
/^.$/.test('๐Ÿ˜€')   // false โ€” emoji is 2 code units
/^.$/u.test('๐Ÿ˜€')  // true โ€” u flag treats it as one code point

// Match any single emoji code point (basic, not sequences)
const basicEmoji = /\p{Emoji}/u;
basicEmoji.test('Hello ๐Ÿ˜€')  // true

// The v flag (ES2024) adds set operations and is stricter
const emojiV = /[\p{Emoji}--\p{Number}]/v;

Matching Full Emoji Grapheme Clusters

To match complete emoji including ZWJ sequences and skin tones, you need a pattern that handles all the components:

// Comprehensive emoji regex (covers most cases)
const emojiRegex = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

// Even better: use the emoji-regex npm package
// import emojiRegex from 'emoji-regex';
// const re = emojiRegex();

// Example usage
const text = 'Hello ๐Ÿ‘‹ World ๐ŸŒ from ๐Ÿ‘จโ€๐Ÿ’ป';
const matches = text.match(emojiRegex);
// ['๐Ÿ‘‹', '๐ŸŒ', '๐Ÿ‘จโ€๐Ÿ’ป']  โ† note: ZWJ sequence captured as one match

The emoji-regex Package

For production use, the emoji-regex npm package by Mathias Bynens generates a regex from the Unicode data and handles all edge cases:

import emojiRegex from 'emoji-regex';

const re = emojiRegex();
const str = '๐Ÿ’ƒ๐Ÿฝ dancing and ๐Ÿš€ launching';

let match;
while ((match = re.exec(str)) !== null) {
  console.log(`Found: ${match[0]} at index ${match.index}`);
}
// Found: ๐Ÿ’ƒ๐Ÿฝ at index 0
// Found: ๐Ÿš€ at index 14

Python: The emoji Library and Regex

Python 3 handles code points natively โ€” '๐Ÿ˜€' has length 1. But matching emoji sequences still requires care.

Using Unicode Property Escapes with regex

The built-in re module does not support Unicode property escapes. Install the regex module instead:

import regex

# Match emoji with Unicode property escapes
pattern = regex.compile(r'\p{Emoji}', regex.UNICODE)
pattern.findall('Hello ๐Ÿ˜€ World ๐ŸŒ')
# ['๐Ÿ˜€', '๐ŸŒ']

# Match grapheme clusters (handles ZWJ sequences)
grapheme_pattern = regex.compile(r'\X', regex.UNICODE)
grapheme_pattern.findall('๐Ÿ‘ฉโ€๐Ÿ’ป coding')
# ['๐Ÿ‘ฉโ€๐Ÿ’ป', ' ', 'c', 'o', 'd', 'i', 'n', 'g']

The \X pattern matches a full Unicode grapheme clusterGrapheme Cluster
A user-perceived character that may be composed of multiple Unicode code points displayed as a single visual unit.
โ€” the correct unit for "one visible character."

Using the emoji Library

For higher-level emoji operations, the emoji library is excellent:

import emoji

# Find all emoji in text
text = 'I love ๐Ÿ Python and โ˜• coffee'
emoji.emoji_list(text)
# [{'match_start': 7, 'match_end': 8, 'emoji': '๐Ÿ'},
#  {'match_start': 19, 'match_end': 20, 'emoji': 'โ˜•'}]

# Check if string is entirely emoji
emoji.is_emoji('๐Ÿ˜€')   # True
emoji.is_emoji('hello') # False

# Count distinct emoji
emoji.emoji_count('๐Ÿ๐Ÿ๐Ÿ')  # 3
emoji.emoji_count('๐Ÿ๐Ÿ๐Ÿ', unique=True)  # 1

Matching Specific Emoji Subsets

Flags Only

Country flags are Regional IndicatorRegional Indicator (RI)
Paired Unicode letters (U+1F1E6 to U+1F1FF) that form country flag emoji when combined according to ISO 3166-1 alpha-2 codes.
Symbol pairs (U+1F1E6โ€“U+1F1FF):

// Match flag emoji (two regional indicator letters)
const flagRegex = /[\u{1F1E6}-\u{1F1FF}]{2}/gu;
'I am from ๐Ÿ‡ฉ๐Ÿ‡ช and you from ๐Ÿ‡บ๐Ÿ‡ธ'.match(flagRegex);
// ['๐Ÿ‡ฉ๐Ÿ‡ช', '๐Ÿ‡บ๐Ÿ‡ธ']
import regex

flag_pattern = regex.compile(r'[\U0001F1E6-\U0001F1FF]{2}')
flag_pattern.findall('Visiting ๐Ÿ‡ฏ๐Ÿ‡ต and ๐Ÿ‡ฐ๐Ÿ‡ท')
# ['๐Ÿ‡ฏ๐Ÿ‡ต', '๐Ÿ‡ฐ๐Ÿ‡ท']

Keycap Sequences

Keycaps like 0๏ธโƒฃ through 9๏ธโƒฃ follow the pattern: digit + U+FE0F + U+20E3:

const keycapRegex = /[0-9#*]\uFE0F\u20E3/gu;
'Press 1๏ธโƒฃ or 2๏ธโƒฃ'.match(keycapRegex);
// ['1๏ธโƒฃ', '2๏ธโƒฃ']

Common Mistakes

Mistake 1: Using . without the u flag in JavaScript. It matches one code unit, splitting emoji.

Mistake 2: Checking str.length > 0 to detect emoji content. An emoji-only string can have .length of 8 or more.

Mistake 3: Using character class ranges like [\u0080-\uFFFF] โ€” this misses most modern emoji above U+FFFF and produces false positives for non-emoji Unicode characters.

Mistake 4: Forgetting variation selectorVariation Selector (VS)
Unicode characters (VS-15 U+FE0E and VS-16 U+FE0F) that modify whether a character renders in text (monochrome) or emoji (colorful) presentation.
U+FE0F. The character โค (U+2764) without VS16 is a text symbol; โค๏ธ with U+FE0F is the emoji presentationEmoji Presentation
The default rendering of a character as a colorful emoji glyph, either inherently or when triggered by Variation Selector-16.
.

Testing Your Pattern

Use our Sequence Analyzer to inspect any emoji's code points, then test your regex against it to verify full matches. Always test against ZWJ sequences, skin tone variants, and flag emoji before shipping emoji-handling code.

Related Tools

๐Ÿ” Sequence Analyzer Sequence Analyzer
Decode ZWJ sequences, skin tone modifiers, keycap sequences, and flag pairs into individual components.

Glossary Terms

Code Point Code Point
A unique numerical value assigned to each character in the Unicode standard, written in the format U+XXXX (e.g., U+1F600 for ๐Ÿ˜€).
Code Unit Code Unit
The minimum bit combination used for encoding a character: 8-bit for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32.
Emoji Emoji
A Japanese word (็ตตๆ–‡ๅญ—) meaning 'picture character' โ€” small graphical symbols used in digital communication to express ideas, emotions, and objects.
Emoji Presentation Emoji Presentation
The default rendering of a character as a colorful emoji glyph, either inherently or when triggered by Variation Selector-16.
Grapheme Cluster Grapheme Cluster
A user-perceived character that may be composed of multiple Unicode code points displayed as a single visual unit.
Regional Indicator (RI) Regional Indicator (RI)
Paired Unicode letters (U+1F1E6 to U+1F1FF) that form country flag emoji when combined according to ISO 3166-1 alpha-2 codes.
UTF-16 UTF-16
A variable-width Unicode encoding that uses 2 or 4 bytes per character, used internally by JavaScript, Java, and Windows.
Unicode Unicode
Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji.
Variation Selector (VS) Variation Selector (VS)
Unicode characters (VS-15 U+FE0E and VS-16 U+FE0F) that modify whether a character renders in text (monochrome) or emoji (colorful) presentation.
Zero Width Joiner (ZWJ) Zero Width Joiner (ZWJ)
An invisible Unicode character (U+200D) used to join multiple emoji into a single composite emoji, such as combining people and objects into profession emoji.

Related Stories