Encoding & Unicode
ASCII, UTF-8, code points, why emojis have length > 1.
Text on a computer is a stack of lies that mostly works. Your browser displays "hello 🤖", but what it really has is a sequence of bytes, decoded into code points, grouped into grapheme clusters. When that stack leaks, you get garbled characters like � in production. Let's peel it apart.
ASCII: the original 128 characters
ASCII assigned the numbers 0-127 to letters, digits, punctuation, and a set of control characters. Each character fits in 7 bits, so it's usually stored as one byte with the top bit unused. It's 1963 technology and still the foundation.
"A".charCodeAt(0) // 65
"a".charCodeAt(0) // 97
"0".charCodeAt(0) // 48
String.fromCharCode(65) // "A"
ASCII works for English. As soon as you need é, ñ, 中, or 🐱, it doesn't.
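One practical consequence: a string is plain ASCII exactly when every one of its code units is below 128. Here's a minimal sketch of that check. The helper name isAscii is made up for illustration, not a built-in.

```javascript
// Hypothetical helper: true when every UTF-16 code unit is in the ASCII range 0-127.
function isAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 127) return false;
  }
  return true;
}

console.log(isAscii("hello")); // true
console.log(isAscii("café"));  // false
```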
Unicode: one number per character, for every script
Unicode is a giant table that assigns a unique number (a code point) to each character, with the goal of covering every writing system people use. Code points are written as U+ followed by four to six hex digits. Some examples:
- U+0041 = A
- U+00E9 = é
- U+4E2D = 中
- U+1F916 = 🤖
Unicode itself is just the numbering. How those numbers turn into bytes on disk is a separate question, answered by encodings like UTF-8.
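In JavaScript, codePointAt and String.fromCodePoint convert between characters and their code points. A small sketch follows; toCodePoints is a hypothetical helper name chosen for this example.

```javascript
// Hypothetical helper: list the code points of a string in U+XXXX notation.
function toCodePoints(text) {
  // Array.from walks the string by code point, not by 16-bit unit.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(toCodePoints("A中🤖"));
// [ "U+0041", "U+4E2D", "U+1F916" ]

console.log(String.fromCodePoint(0x1f916)); // "🤖"
```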
UTF-8: variable-length, ASCII-compatible
UTF-8 encodes each code point as 1 to 4 bytes. ASCII characters stay 1 byte (so plain English text is identical to the ASCII version). Higher code points use more bytes:
new TextEncoder().encode("A")
// Uint8Array [ 65 ] -- 1 byte
new TextEncoder().encode("é")
// Uint8Array [ 195, 169 ] -- 2 bytes
new TextEncoder().encode("中")
// Uint8Array [ 228, 184, 173 ] -- 3 bytes
new TextEncoder().encode("🤖")
// Uint8Array [ 240, 159, 164, 150 ] -- 4 bytes
Code points vs UTF-16 code units (JS's quirk)
JavaScript strings are sequences of UTF-16 code units, not code points. Most characters fit in one 16-bit unit, but emoji and rare CJK characters need two units called a surrogate pair. This breaks your intuition about .length:
"A".length // 1
"é".length // 1
"中".length // 1
"🤖".length // 2 <-- !
// the robot is two UTF-16 code units glued together
for (const ch of "🤖") console.log(ch);
// 🤖 <-- iteration is smart enough
[..."🤖"].length // 1 <-- spread iterates by code point
// modern getter: count code points
[..."hello 🤖"].length // 7
Grapheme clusters: what users see as one character
Code points still don't match human intuition. A "👨‍👩‍👧" family emoji is three emoji code points glued together by two zero-width joiners (U+200D), five code points in total. Counting code points still gives the wrong answer. The right unit for "characters a human sees" is the grapheme cluster. Use Intl.Segmenter:
// man + ZWJ + woman + ZWJ + girl, written with escapes so the joiners are visible
const text = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467} family";
text.length // 15 -- UTF-16 code units
[...text].length // 12 -- code points
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment(text)].length // 8 -- user-perceived characters
Normalization: same text, different bytes
Worse news: there can be two different sequences of code points that render as the same character. é can be:
- One code point: U+00E9 (precomposed). This is NFC form.
- Two code points: U+0065 U+0301 (e + combining acute accent). This is NFD form.
const a = "\u00E9"; // "é" (precomposed)
const b = "\u0065\u0301"; // "é" (decomposed)
a === b // false!
a.length // 1
b.length // 2
// fix: normalize before comparing
a.normalize("NFC") === b.normalize("NFC") // true
Try it yourself: see the bytes
Try this with your own text and compare the UTF-8 bytes, code points, and grapheme clusters. Watch what happens with accents, CJK, and emoji.
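One way to see all three views at once is a small helper. This is only a sketch: inspect is a hypothetical name, and it assumes a runtime with TextEncoder and Intl.Segmenter (modern browsers and recent Node.js).

```javascript
// Hypothetical helper: report the UTF-8 bytes, code points, and grapheme clusters of a string.
function inspect(text) {
  const bytes = Array.from(new TextEncoder().encode(text));
  const codePoints = Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const graphemes = Array.from(segmenter.segment(text), (s) => s.segment);
  return { bytes, codePoints, graphemes };
}

// Decomposed é: the letter e followed by a combining acute accent.
const nfd = inspect("e\u0301");
console.log(nfd.bytes);            // [ 101, 204, 129 ]
console.log(nfd.codePoints);       // [ "U+0065", "U+0301" ]
console.log(nfd.graphemes.length); // 1
```

The same é in NFC form reports two bytes and one code point, but still one grapheme. That is the whole stack in one line of output.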
Base64: encoding, not encryption
Base64 converts arbitrary binary data into a string of safe ASCII characters. It's how you embed images in CSS, paste binary into JSON, or sneak files through email. It is not encryption. Anyone can decode it.
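Because Base64 works on bytes, text has to become bytes first. A minimal sketch, assuming UTF-8 as the byte encoding; textToBase64 and base64ToText are hypothetical helper names, not built-ins.

```javascript
// Hypothetical helpers: Base64-encode and decode Unicode text via its UTF-8 bytes.
function textToBase64(text) {
  const bytes = new TextEncoder().encode(text);
  // btoa expects a "binary string": one character per byte, each in the range 0-255.
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

function base64ToText(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const encoded = textToBase64("hello 🤖");
console.log(encoded);               // "aGVsbG8g8J+klg=="
console.log(base64ToText(encoded)); // "hello 🤖" -- no key needed, anyone can do this
```

Calling btoa("🤖") directly throws, because btoa rejects any character above U+00FF. Going through UTF-8 bytes avoids that.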
btoa("hello") // "aGVsbG8="
atob("aGVsbG8=") // "hello"
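// for binary data, turn the bytes into a binary string first
// (fine for small arrays; spreading a very large array can exceed the argument limit)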
const bytes = new Uint8Array([0xff, 0x00, 0x42]);
btoa(String.fromCharCode(...bytes)) // "/wBC"
// data URLs are base64 in the wild
// data:image/png;base64,iVBORw0KGgoAAAA...
URL percent-encoding
Some characters have a special meaning in a URL (?, &, /, #), and others, like spaces, aren't allowed at all. To put any of them inside a URL component, you have to percent-encode them.
encodeURIComponent("hello world")
// "hello%20world"
encodeURIComponent("name=ada&role=admin")
// "name%3Dada%26role%3Dadmin"
// build a safe URL with query params
const q = encodeURIComponent("café & croissants");
`https://example.com/search?q=${q}`
// "https://example.com/search?q=caf%C3%A9%20%26%20croissants"
// the URL API does this for you
const url = new URL("https://example.com/search");
url.searchParams.set("q", "café & croissants");
url.toString()
// "https://example.com/search?q=caf%C3%A9+%26+croissants"
// (searchParams uses form encoding, so a space becomes + instead of %20)
Use encodeURIComponent (not encodeURI) for individual parts. Or, even better, let the URL object handle it.
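The difference between the two functions is easiest to see side by side. This sketch uses a made-up value that contains several reserved characters.

```javascript
const value = "a/b?c=d&e";

// encodeURI assumes it is given a whole URL, so it leaves reserved characters alone.
console.log(encodeURI(value));          // "a/b?c=d&e"

// encodeURIComponent escapes them, which is what a single parameter value needs.
console.log(encodeURIComponent(value)); // "a%2Fb%3Fc%3Dd%26e"

// Round trip back to the original text.
console.log(decodeURIComponent("a%2Fb%3Fc%3Dd%26e")); // "a/b?c=d&e"

// URLSearchParams reads both %20 and + as a space when parsing a query string.
const params = new URLSearchParams("q=caf%C3%A9+%26+croissants");
console.log(params.get("q")); // "café & croissants"
```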
HTML entities
In HTML, <, >, &, and " are special. If you want to display them literally in your content, you escape them with HTML entities:
<!-- show a literal less-than sign -->
<p>5 &lt; 10</p> <!-- renders: 5 < 10 -->
<p>Tom &amp; Jerry</p> <!-- renders: Tom & Jerry -->
<p>She said &quot;hi&quot;</p> <!-- renders: She said "hi" -->
<!-- you can also use numeric references -->
<p>&#169; 2024</p> <!-- © 2024 -->
<p>&#x1F916;</p> <!-- 🤖 (hex) -->
React handles most of this automatically. The exception is text inside JSX, where you sometimes need an entity such as &apos; for an apostrophe, or you can just use a double-quoted string. (You're seeing this in action right now.)
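When you build HTML strings by hand, without a framework escaping for you, the same substitutions can be applied in code. A minimal sketch; escapeHtml is a hypothetical helper, and it only covers the five characters commonly escaped in text and attribute values.

```javascript
// Hypothetical helper: replace HTML-special characters with their entities.
function escapeHtml(text) {
  const entities = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

console.log(escapeHtml('Tom & Jerry said "5 < 10"'));
// Tom &amp; Jerry said &quot;5 &lt; 10&quot;
```

The ampersand is matched like any other character in a single pass, so already-produced entities are not escaped a second time.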
Quick quiz
Why does "🤖".length return 2 in JavaScript?
Recap
- Unicode = the numbering. UTF-8 = the bytes-on-disk encoding. Pick UTF-8 always.
- JS strings are UTF-16 code units. length counts those, not characters. Use Intl.Segmenter for true character counts.
- Two strings can look identical but have different bytes. Normalize input with .normalize("NFC").
- Base64 is encoding, not encryption. Percent-encoding escapes URL parts. HTML entities escape HTML.