Strings: Unicode and UTF-8


Unicode is an industry standard for consistent encoding of written text. It aims to provide a unique number to identify every character for every language, on any platform.

Code Points

Unicode maps every character to a specific code, called a code point. A code point takes the form of U+<hex-code>, ranging from U+0000 to U+10FFFF.

For example: U+004F represents the letter “O”.
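As a small illustration of the notation, the sketch below converts a character into its U+ form. The helper name toCodePointNotation is hypothetical, not a built-in; it relies only on the standard codePointAt(), toString(16) and padStart() methods.

```javascript
// Hypothetical helper: format a character's first code point in U+ notation.
function toCodePointNotation(char) {
  const hex = char.codePointAt(0).toString(16).toUpperCase()
  // The notation pads the hex value to at least four digits.
  return 'U+' + hex.padStart(4, '0')
}

toCodePointNotation('O')  // 'U+004F'
toCodePointNotation('🐶') // 'U+1F436'
```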

Character Encodings

Unicode defines different character encodings:

  • UTF-8: Variable width (1-4 bytes), most popular on the web
  • UTF-16: Variable width (2 or 4 bytes), used internally by JavaScript strings
  • UTF-32: Fixed width (4 bytes)
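To make the width differences concrete, here is a rough sketch that computes how many bytes a string would take in each encoding, ignoring any byte order mark. The encodedSizes helper is a made-up name for illustration. It assumes an environment where TextEncoder is available as a global (modern browsers and Node.js).

```javascript
// Hypothetical helper: byte size of a string in each encoding (no BOM).
function encodedSizes(str) {
  return {
    utf8: new TextEncoder().encode(str).length, // TextEncoder always emits UTF-8
    utf16: str.length * 2,                      // length counts 16-bit code units
    utf32: [...str].length * 4,                 // spreading yields one item per code point
  }
}

encodedSizes('A')  // { utf8: 1, utf16: 2, utf32: 4 }
encodedSizes('🐶') // { utf8: 4, utf16: 4, utf32: 4 }
```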

UTF-8

UTF-8 is the most popular encoding, used on over 90% of web pages. It’s backwards compatible with ASCII: the first 128 code points are encoded with the same single byte in both.

Bytes   Code point range
1       U+0000 - U+007F
2       U+0080 - U+07FF
3       U+0800 - U+FFFF
4       U+10000 - U+10FFFF
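The ranges above can be checked with a short sketch. The utf8ByteLength function is a hypothetical helper that mirrors the ranges; its result is compared with what TextEncoder (assumed to be available as a global) actually produces.

```javascript
// Hypothetical helper mirroring the ranges: UTF-8 byte count for one code point.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7f) return 1
  if (codePoint <= 0x7ff) return 2
  if (codePoint <= 0xffff) return 3
  return 4
}

const utf8Encoder = new TextEncoder()

// A (U+0041), é (U+00E9), € (U+20AC), 🐶 (U+1F436)
for (const char of ['A', '\u00E9', '\u20AC', '\u{1F436}']) {
  const predicted = utf8ByteLength(char.codePointAt(0))
  const actual = utf8Encoder.encode(char).length
  console.log(char, predicted, actual) // both numbers match: 1, 2, 3, 4
}
```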

Planes

Unicode organizes characters into 17 planes:

  • Plane 0, the Basic Multilingual Plane (BMP): U+0000 - U+FFFF, contains most modern characters
  • Planes 1-16 (astral planes, officially called supplementary planes): U+10000 - U+10FFFF

Characters in astral planes are called astral code points.
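Since each plane covers 0x10000 (65,536) code points, the plane of a character can be derived by integer division. A minimal sketch, with planeOf as a hypothetical helper name:

```javascript
// Hypothetical helper: each plane spans 0x10000 (65,536) code points.
function planeOf(char) {
  return Math.floor(char.codePointAt(0) / 0x10000)
}

planeOf('A')          // 0 (BMP)
planeOf('🐶')         // 1 (first astral plane)
planeOf('\u{10FFFF}') // 16 (last plane)
```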

Working with Unicode in JavaScript

Creating Strings from Code Points

String.fromCodePoint(70, 108, 97, 118, 105, 111) // 'Flavio'

Getting Code Points

'A'.codePointAt(0) // 65
'🐶'.codePointAt(0) // 128054

Unicode Escape Sequences

'\u0041' // 'A'
'\u{1F436}' // '🐶' (ES6 syntax for astral characters)

Combining Characters

Unicode allows combining characters to form graphemes:

'e\u0301' // 'é' (e + combining acute accent)
'\u00E9'  // 'é' (precomposed form)

Both represent the same visual character but are different strings.

Normalization

Because characters can be represented multiple ways, use normalize() for comparisons:

const a = '\u00E9'       // é (precomposed)
const b = 'e\u0301'      // é (combining)

a === b                  // false
a.normalize() === b.normalize() // true
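Called without an argument, normalize() uses the NFC form (composed). The sketch below shows NFC next to NFD (decomposed), along with a comparison helper. The name sameText is hypothetical, and picking NFC for the comparison is just one reasonable choice.

```javascript
const precomposed = '\u00E9' // é as one code point
const combining = 'e\u0301'  // e + combining acute accent

precomposed.normalize('NFD').length // 2 (decomposed into e + accent)
combining.normalize('NFC').length   // 1 (composed into a single code point)

// Hypothetical helper: compare strings regardless of how accents are encoded.
function sameText(x, y) {
  return x.normalize('NFC') === y.normalize('NFC')
}

sameText(precomposed, combining) // true
```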

String Length with Unicode

The length property counts UTF-16 code units, not characters, so an astral character counts as 2:

'🐶'.length;     // 2 (surrogate pair: two UTF-16 code units)
[...'🐶'].length // 1 (spreading iterates by code point)
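To see where the 2 comes from, this sketch reads the two UTF-16 code units of the dog emoji and recombines them into the original code point using the standard surrogate pair arithmetic. The variable names are illustrative only.

```javascript
const dog = '🐶' // U+1F436

// The two UTF-16 code units that length counts
const high = dog.charCodeAt(0) // 0xD83D (high surrogate)
const low = dog.charCodeAt(1)  // 0xDC36 (low surrogate)

// Recombine the pair into the original code point
const combined = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000

combined === dog.codePointAt(0) // true
combined.toString(16)           // '1f436'
```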

Emojis

Most emojis are Unicode astral plane characters, although some older ones, such as ❤ (U+2764), live in the BMP:

'🐶' // U+1F436
'👨‍👩‍👧' // Multiple code points combined

The family emoji is actually multiple code points joined with Zero Width Joiner (U+200D).
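The sketch below takes the family emoji apart. It builds the string from escape sequences so the joiners are visible. The last step uses Intl.Segmenter to count user-perceived characters, which is an assumption about the runtime: it exists in modern browsers and in Node.js 16 and later.

```javascript
// 👨 + ZWJ + 👩 + ZWJ + 👧
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'

const unitCount = family.length // 8 (UTF-16 code units)
const familyCodePoints = Array.from(family, (ch) => ch.codePointAt(0).toString(16))
// ['1f468', '200d', '1f469', '200d', '1f467']

// Assumes Intl.Segmenter is available; it groups code points into graphemes.
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' })
const graphemeCount = Array.from(graphemeSegmenter.segment(family)).length // 1
```

Neither length nor the spread operator returns 1 here, so counting what a reader sees as one character requires grapheme segmentation.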
