Emoji Detection in Text Strings: Algorithms and Libraries

EmojiEmoji
A Japanese word (็ตตๆ–‡ๅญ—) meaning 'picture character' โ€” small graphical symbols used in digital communication to express ideas, emotions, and objects.
Detection in Text Strings

Detecting whether a string contains emojisโ€”and extracting them accuratelyโ€”is harder than it looks. The Unicode standardUnicode Standard
The complete character encoding system maintained by the Unicode Consortium, defining characters, properties, algorithms, and encoding forms.
has grown to include over 3,700 emoji characters spread across multiple code pointCode Point
A unique numerical value assigned to each character in the Unicode standard, written in the format U+XXXX (e.g., U+1F600 for ๐Ÿ˜€).
ranges, with new ones added every year. A naive approach using a fixed range check will miss most of them.

This guide covers the algorithms, UnicodeUnicode
Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji.
properties, and production-ready libraries you need to detect emojis reliably.

Why Simple Range Checks Fail

A common first attempt is checking whether a code point falls in the range U+1F600โ€“U+1F64F (Emoticons block). This catches classics like ๐Ÿ˜€, ๐Ÿ˜‚, and ๐Ÿ˜Ž, but misses:

  • Basic emoji: ยฉ๏ธ (U+00A9), ยฎ (U+00AE), โ„ข๏ธ (U+2122) โ€” in the Latin Extended range
  • Dingbats: โœ… (U+2705), โŒ (U+274C) โ€” in the Dingbats block
  • Enclosed alphanumerics: ๐Ÿ…ฐ๏ธ, ๐Ÿ…ฑ๏ธ
  • ZWJZero Width Joiner (ZWJ)
    An invisible Unicode character (U+200D) used to join multiple emoji into a single composite emoji, such as combining people and objects into profession emoji.
    sequences
    : ๐Ÿ‘จโ€๐Ÿ’ป (man technologist) โ€” multiple code points joined by U+200D
  • Keycap sequences: 1๏ธโƒฃ โ€” digit + variation selectorVariation Selector (VS)
    Unicode characters (VS-15 U+FE0E and VS-16 U+FE0F) that modify whether a character renders in text (monochrome) or emoji (colorful) presentation.
    + combining enclosing keycap
  • Flag sequences: ๐Ÿ‡บ๐Ÿ‡ธ โ€” pairs of Regional IndicatorRegional Indicator (RI)
    Paired Unicode letters (U+1F1E6 to U+1F1FF) that form country flag emoji when combined according to ISO 3166-1 alpha-2 codes.
    letters

The only reliable approach is to use the official Unicode emoji property data.

The Unicode Emoji Properties Approach

Unicode defines several properties relevant to emoji detection, published in emoji-data.txt from the Unicode Character Database (UCD):

Property Meaning
Emoji The code point is an emoji
Emoji_Presentation Displayed as emoji by default (not text)
Emoji_Modifier A skin tone modifierSkin Tone Modifier
Five Unicode modifier characters based on the Fitzpatrick scale that change the skin color of human emoji (U+1F3FB to U+1F3FF).
(๐Ÿปโ€“๐Ÿฟ)
Emoji_Modifier_Base Can be modified by a skin tone modifier
Emoji_Component Used in emoji sequences (ZWJ, keycap, etc.)
Extended_Pictographic Broader set including reserved ranges

For most detection tasks you want Extended_Pictographic, which includes current emoji plus code points reserved for future emoji assignments.

Detection in Python

Using the emoji Library

The emoji library maintains an up-to-date Unicode dataset:

import emoji

text = "Hello ๐Ÿ‘‹ world! Check this out ๐Ÿš€"

# Check if string contains any emoji
has_emoji = emoji.emoji_count(text) > 0
print(has_emoji)  # True

# Count emojis
count = emoji.emoji_count(text)
print(count)  # 2

# Extract emoji with positions
for item in emoji.emoji_list(text):
    print(item)
# {'match_start': 6, 'match_end': 7, 'emoji': '๐Ÿ‘‹'}
# {'match_start': 26, 'match_end': 27, 'emoji': '๐Ÿš€'}

# Replace emojis
clean = emoji.replace_emoji(text, replace="")
print(clean)  # "Hello  world! Check this out "

Using the regex Module with Unicode Properties

The third-party regex module (not the built-in re) supports Unicode properties:

import regex

# Match any Extended_Pictographic character or emoji sequenceEmoji Sequence
An ordered set of one or more Unicode code points that together represent a single emoji character.
EMOJI_PATTERN = regex.compile( r'\p{Extended_Pictographic}' r'(?:\uFE0F)?' # optional variation selector-16 r'(?:\u20E3)?' # optional combining enclosing keycap r'(?:\uFE0F\u20E3)?' # keycap sequenceKeycap Sequence
An emoji sequence formed by a digit or symbol, followed by VS-16 (U+FE0F) and the combining enclosing keycap character (U+20E3).
r'(?:\u200D\p{Extended_Pictographic}(?:\uFE0F)?)*' # ZWJ sequences r'(?:[\U0001F1E0-\U0001F1FF]{2})?', # flag sequences regex.UNICODE ) text = "Deploying ๐Ÿš€ to production ๐Ÿ‘จโ€๐Ÿ’ป โ€” fingers crossed ๐Ÿคž๐Ÿฝ" matches = EMOJI_PATTERN.findall(text) print(matches) # ['๐Ÿš€', '๐Ÿ‘จโ€๐Ÿ’ป', '๐Ÿคž๐Ÿฝ']

Pure stdlib with unicodedata

For simpler cases without extra dependencies, check the unicodedata category:

import unicodedata

def contains_emoji_simple(text: str) -> bool:
    for char in text:
        cat = unicodedata.category(char)
        # So (Symbol, other) covers many but not all emoji
        if cat == "So":
            return True
    return False

This is fast but incomplete โ€” it misses many emoji that fall in other categories.

Detection in JavaScript

Using the emoji-regex Package

import emojiRegex from 'emoji-regex';

const regex = emojiRegex();
const text = "Meeting at 3pm ๐Ÿ“… โ€” bring your laptop ๐Ÿ’ป";

// Test for presence
console.log(regex.test(text)); // true

// Extract all emojis
const matches = [...text.matchAll(regex)];
matches.forEach(m => {
  console.log(`Found: ${m[0]} at index ${m.index}`);
});
// Found: ๐Ÿ“… at index 15
// Found: ๐Ÿ’ป at index 38

// Count
const count = [...text.matchAll(regex)].length;
console.log(count); // 2

Note that emoji-regex is generated directly from Unicode data, so it stays accurate across emoji versions.

Native Unicode Property Escapes (ES2018+)

Modern JavaScript engines support \p{} in regex with the u flag:

// Requires Node.js 10+ or modern browsers
const emojiRx = /\p{Emoji}/u;
const extPictoRx = /\p{Extended_Pictographic}/u;

console.log(emojiRx.test("Hello ๐ŸŒ")); // true
console.log(extPictoRx.test("No emoji here")); // false

// Extract using matchAll
const text = "Status: โœ… Build passed, ๐Ÿ”ด Tests failed";
const allEmoji = [...text.matchAll(/\p{Extended_Pictographic}/gu)];
console.log(allEmoji.map(m => m[0])); // ['โœ…', '๐Ÿ”ด']

Detection in Go

package main

import (
    "fmt"
    "unicode"
    "golang.org/x/text/unicode/rangetable"
)

// Basic check using unicode.Is
func containsEmoji(s string) bool {
    for _, r := range s {
        if unicode.Is(unicode.So, r) || // Symbol, other
           (r >= 0x1F600 && r <= 0x1FFFF) || // Supplemental symbols
           (r >= 0x2600 && r <= 0x27BF) {    // Misc symbols
            return true
        }
    }
    return false
}

func main() {
    texts := []string{
        "Hello world",
        "Rocket ๐Ÿš€ launched",
        "ยฉ๏ธ Copyright symbol",
    }
    for _, t := range texts {
        fmt.Printf("%q โ†’ %v\n", t, containsEmoji(t))
    }
}

For production Go code, consider the github.com/rivo/uniseg package, which handles grapheme clusterGrapheme Cluster
A user-perceived character that may be composed of multiple Unicode code points displayed as a single visual unit.
segmentation correctly and can identify emoji clusters.

Handling Edge Cases

Variation Selectors

Many emoji have both a text (VS15, U+FE0E) and emoji (VS16, U+FE0F) presentation. The digit โ˜Ž can appear as โ˜Ž๏ธŽ (text) or โ˜Ž๏ธ (emoji). Your detection should account for the variation selector:

phone_text = "\u260E\uFE0E"   # โ˜Ž๏ธŽ  text presentationText Presentation
The rendering of a character as a monochrome text symbol, either by default or when Variation Selector-15 is applied.
phone_emoji = "\u260E\uFE0F" # โ˜Ž๏ธ emoji presentationEmoji Presentation
The default rendering of a character as a colorful emoji glyph, either inherently or when triggered by Variation Selector-16.
import emoji print(emoji.emoji_count(phone_text)) # 0 print(emoji.emoji_count(phone_emoji)) # 1

ZWJ Sequences

๐Ÿ‘จโ€๐Ÿ’ป is a single grapheme cluster composed of ๐Ÿ‘จ + ZWJ (U+200D) + ๐Ÿ’ป. When counting or extracting emoji, treat ZWJ sequences as one unit. Libraries like emoji (Python) and emoji-regex (JS) handle this automatically.

Regional Indicator Flags

Country flags like ๐Ÿ‡ฉ๐Ÿ‡ช consist of two Regional Indicator letters (U+1F1E6โ€“U+1F1FF). They are only valid in pairs. A single ๐Ÿ‡ฉ without a following ๐Ÿ‡ช is not a flag.

Performance Considerations

For high-throughput text processing:

  1. Pre-compile your regex โ€” do it once at module load, not per call
  2. Short-circuit on ASCII โ€” if all bytes are < 128, there are no emoji (they are all non-ASCII)
  3. Use a library โ€” regex-based approaches with proper Unicode support are faster than custom range tables you maintain yourself
def fast_has_emoji(text: str) -> bool:
    # Short-circuit: emoji require non-ASCII bytes in UTF-8UTF-8
A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites).
if text.isascii(): return False return emoji.emoji_count(text) > 0

Explore More on EmojiFYI

Related Tools

๐Ÿ” Sequence Analyzer Sequence Analyzer
Decode ZWJ sequences, skin tone modifiers, keycap sequences, and flag pairs into individual components.

Glossary Terms

Code Point Code Point
A unique numerical value assigned to each character in the Unicode standard, written in the format U+XXXX (e.g., U+1F600 for ๐Ÿ˜€).
Emoji Emoji
A Japanese word (็ตตๆ–‡ๅญ—) meaning 'picture character' โ€” small graphical symbols used in digital communication to express ideas, emotions, and objects.
Emoji Presentation Emoji Presentation
The default rendering of a character as a colorful emoji glyph, either inherently or when triggered by Variation Selector-16.
Emoji Sequence Emoji Sequence
An ordered set of one or more Unicode code points that together represent a single emoji character.
Grapheme Cluster Grapheme Cluster
A user-perceived character that may be composed of multiple Unicode code points displayed as a single visual unit.
Keycap Sequence Keycap Sequence
An emoji sequence formed by a digit or symbol, followed by VS-16 (U+FE0F) and the combining enclosing keycap character (U+20E3).
Regional Indicator (RI) Regional Indicator (RI)
Paired Unicode letters (U+1F1E6 to U+1F1FF) that form country flag emoji when combined according to ISO 3166-1 alpha-2 codes.
Skin Tone Modifier Skin Tone Modifier
Five Unicode modifier characters based on the Fitzpatrick scale that change the skin color of human emoji (U+1F3FB to U+1F3FF).
Text Presentation Text Presentation
The rendering of a character as a monochrome text symbol, either by default or when Variation Selector-15 is applied.
UTF-8 UTF-8
A variable-width Unicode encoding that uses 1 to 4 bytes per character, dominant on the web (used by 98%+ of websites).
Unicode Unicode
Universal character encoding standard that assigns a unique number to every character across all writing systems and symbol sets, including emoji.
Unicode Standard Unicode Standard
The complete character encoding system maintained by the Unicode Consortium, defining characters, properties, algorithms, and encoding forms.
Variation Selector (VS) Variation Selector (VS)
Unicode characters (VS-15 U+FE0E and VS-16 U+FE0F) that modify whether a character renders in text (monochrome) or emoji (colorful) presentation.
Zero Width Joiner (ZWJ) Zero Width Joiner (ZWJ)
An invisible Unicode character (U+200D) used to join multiple emoji into a single composite emoji, such as combining people and objects into profession emoji.

Related Stories