Encoding & Unicode

ASCII, UTF-8, code points, why emojis have length > 1.

Text on a computer is a stack of lies that mostly works. Your browser displays "hello 🤖", but what it really has is a sequence of bytes, decoded into code points, grouped into grapheme clusters. When that stack leaks, you get � or ? characters in production. Let's peel it apart.

ASCII: the original 128 characters

ASCII assigns the numbers 0-127 to letters, digits, punctuation, and control characters. Each character fits in 7 bits, so it's usually stored as one byte with the top bit unused. It's 1963 technology and still the foundation.

js
"A".charCodeAt(0)   // 65
"a".charCodeAt(0)   // 97
"0".charCodeAt(0)   // 48

String.fromCharCode(65)    // "A"

ASCII works for English. As soon as you need é, ñ, 中, or 🐱, it doesn't.
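
One practical consequence: a string is pure ASCII exactly when every code unit is below 128. A minimal sketch of that check (the helper name isAscii is made up for this example):

```javascript
// Hypothetical helper: true when every character is in the ASCII range 0-127.
function isAscii(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 127) return false;
  }
  return true;
}

console.log(isAscii("hello"));       // true
console.log(isAscii("caf\u00E9"));   // false ("café")
console.log(isAscii("\u{1F431}"));   // false ("🐱")
```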

Unicode: one number per character, for every script

Unicode is a giant table that aims to assign a unique number (a code point) to every character in every writing system, plus symbols and emoji. Code points are written as U+XXXX, in hexadecimal. Some examples:

  • U+0041 = A
  • U+00E9 = é
  • U+4E2D = 中
  • U+1F916 = 🤖

Unicode itself is just the numbering. How those numbers turn into bytes on disk is a separate question, answered by encodings like UTF-8.
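
In JavaScript you can move between a character and its code point with codePointAt and String.fromCodePoint. A small sketch (toCodePointLabel is a hypothetical helper name, not a built-in):

```javascript
// Hypothetical helper: format a character's code point in U+XXXX notation.
function toCodePointLabel(char) {
  const cp = char.codePointAt(0);
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toCodePointLabel("A"));          // U+0041
console.log(toCodePointLabel("\u4E2D"));     // U+4E2D  (中)
console.log(toCodePointLabel("\u{1F916}"));  // U+1F916 (🤖)

// And back again: from number to character.
console.log(String.fromCodePoint(0x1F916) === "\u{1F916}");  // true
```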

UTF-8: variable-length, ASCII-compatible

UTF-8 encodes each code point as 1 to 4 bytes. ASCII characters stay 1 byte (so plain English text is identical to the ASCII version). Higher code points use more bytes:

js
new TextEncoder().encode("A")
// Uint8Array [ 65 ]              -- 1 byte

new TextEncoder().encode("é")
// Uint8Array [ 195, 169 ]        -- 2 bytes

new TextEncoder().encode("中")
// Uint8Array [ 228, 184, 173 ]   -- 3 bytes

new TextEncoder().encode("🤖")
// Uint8Array [ 240, 159, 164, 150 ]   -- 4 bytes

Just always pick UTF-8
For databases, files, HTTP bodies, and almost anything else, UTF-8 is the right default. It's backwards-compatible with ASCII, can encode every Unicode code point, and is what the web standardized on.
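
To see where the 1-to-4 byte counts come from, here is a minimal sketch of the UTF-8 bit layout for a single code point. The function name encodeCodePoint is made up for this example, and it skips validation (for instance, it doesn't reject surrogate values). In real code, use TextEncoder.

```javascript
// Sketch of the UTF-8 bit layout for one code point (no input validation).
function encodeCodePoint(cp) {
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    return [cp];
  }
  if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  }
  if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

console.log(encodeCodePoint(0x41));     // [ 65 ]
console.log(encodeCodePoint(0xE9));     // [ 195, 169 ]
console.log(encodeCodePoint(0x4E2D));   // [ 228, 184, 173 ]
console.log(encodeCodePoint(0x1F916));  // [ 240, 159, 164, 150 ]
```

The leading bits of the first byte tell a decoder how many bytes follow, and every continuation byte starts with 10, which is why ASCII bytes can never be mistaken for part of a longer sequence.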

Code points vs UTF-16 code units (JS's quirk)

JavaScript strings are sequences of UTF-16 code units, not code points. Every code point up to U+FFFF fits in one 16-bit unit, but anything above that, including most emoji and rare CJK characters, needs two units called a surrogate pair. This breaks your intuition about .length:

js
"A".length        // 1
"é".length        // 1
"中".length       // 1
"🤖".length       // 2   <-- !

// the robot is two UTF-16 code units glued together
for (const ch of "🤖") console.log(ch);
// 🤖                   <-- iteration is smart enough

[..."🤖"].length  // 1   <-- spread iterates by code point

// count code points instead of code units
[..."hello 🤖"].length    // 7
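
For the curious, the two halves of a surrogate pair are plain arithmetic on the code point. A sketch of the calculation, valid only for code points above U+FFFF (toSurrogatePair is a hypothetical helper name):

```javascript
// Sketch: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;            // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const robot = "\u{1F916}";  // 🤖
const [high, low] = toSurrogatePair(0x1F916);
console.log(high.toString(16), low.toString(16));     // d83e dd16
console.log(String.fromCharCode(high, low) === robot); // true

// charCodeAt sees the halves; codePointAt sees the whole character.
console.log(robot.charCodeAt(0).toString(16));   // d83e
console.log(robot.codePointAt(0).toString(16));  // 1f916
```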

Grapheme clusters: what users see as one character

Code points still don't match human intuition. A "👨‍👩‍👧" family emoji is multiple code points joined by a zero-width joiner. Counting code points still gives the wrong answer. The right unit for "characters a human sees" is the grapheme cluster. Use Intl.Segmenter:

js
const text = "👨‍👩‍👧 family";

text.length              // 15  -- UTF-16 code units
[...text].length         // 12  -- code points

const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
[...seg.segment(text)].length    // 8   -- user-perceived characters
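
A common place this bites is truncation: cutting a string by .length can slice an emoji in half. A minimal sketch of truncating by grapheme instead (truncateGraphemes is a made-up helper name, and the "en" locale is an assumption):

```javascript
// Hypothetical helper: keep at most `max` user-perceived characters.
function truncateGraphemes(text, max) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  let result = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === max) break;
    result += segment;
    count++;
  }
  return result;
}

// Family emoji (man + ZWJ + woman + ZWJ + girl) followed by " family".
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467} family";

console.log(truncateGraphemes(family, 1));         // the whole family emoji
console.log(truncateGraphemes(family, 1).length);  // 8 (code units, but one grapheme)
console.log(family.slice(0, 1).length);            // 1 (a lone surrogate half, which renders as garbage)
```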

Normalization: same text, different bytes

Worse news: there can be two different sequences of code points that render as the same character. é can be:

  • One code point: U+00E9 (precomposed). This is NFC form.
  • Two code points: U+0065 U+0301 (e + combining acute accent). This is NFD form.

js
const a = "\u00E9";          // "é" (precomposed)
const b = "\u0065\u0301";    // "é" (decomposed)

a === b                       // false!
a.length                      // 1
b.length                      // 2

// fix: normalize before comparing
a.normalize("NFC") === b.normalize("NFC")   // true

Why this matters in real apps
Username comparisons, search, and dedup all need normalization. macOS has historically stored filenames in a decomposed (NFD-like) form, while Windows typically keeps them as typed, which is usually NFC. Copy a file across and string equality can break. Always normalize input on the way in.
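
As a sketch of "normalize on the way in", here is a dedup pass that compares names by their NFC form. The helper name dedupeNames and the sample names are invented for illustration:

```javascript
// Hypothetical helper: drop names that are equal after NFC normalization.
function dedupeNames(names) {
  const seen = new Set();
  const unique = [];
  for (const name of names) {
    const key = name.normalize("NFC");
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(key);
    }
  }
  return unique;
}

// "José" twice (precomposed, then decomposed), plus "Ada".
const names = ["Jos\u00E9", "Jose\u0301", "Ada"];

console.log(new Set(names).size);        // 3 (raw comparison sees three names)
console.log(dedupeNames(names).length);  // 2
```

Storing the normalized form (rather than normalizing only at comparison time) means every later lookup compares like with like.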

Try it yourself: see the bytes

Type any text below. The playground shows the UTF-8 bytes, code points, and grapheme clusters. Watch what happens with accents, CJK, and emoji.

function describe(text) {
  const bytes = new TextEncoder().encode(text);
  const codePoints = [...text].map(c => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  const graphemes = [...seg.segment(text)].map(s => s.segment);

  console.log("Input:        ", JSON.stringify(text));
  console.log("UTF-8 bytes:  ", [...bytes].join(" "));
  console.log("Byte length:  ", bytes.length);
  console.log("string.length:", text.length, "(UTF-16 code units)");
  console.log("Code points:  ", codePoints.join(" "), "(" + codePoints.length + ")");
  console.log("Graphemes:    ", graphemes.join(" | "), "(" + graphemes.length + ")");
  console.log("---");
}

describe("A");
describe("café");
describe("中文");
describe("🤖");
describe("👨‍👩‍👧");
describe("नमस्ते");

Base64: encoding, not encryption

Base64 converts arbitrary binary data into a string of safe ASCII characters. It's how you embed images in CSS, paste binary into JSON, or sneak files through email. It is not encryption. Anyone can decode it.

js
btoa("hello")              // "aGVsbG8="
atob("aGVsbG8=")           // "hello"

// for binary data, build a "binary string" first (btoa only accepts code units 0-255)
const bytes = new Uint8Array([0xff, 0x00, 0x42]);
btoa(String.fromCharCode(...bytes))   // "/wBC"

// data URLs are base64 in the wild
// data:image/png;base64,iVBORw0KGgoAAAA...
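
Don't store secrets as base64
If API tokens or supposedly "encrypted" data turn out to be just Base64, that's plaintext with extra steps. The same goes for a standard signed JWT: its payload is only base64url-encoded, so anyone can read it. Encryption needs a key. Base64 is just a reversible transformation.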
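
One gotcha: btoa throws on characters above U+00FF, so btoa("🤖") fails. A common workaround is to go through the UTF-8 bytes first. The sketch below assumes an environment where btoa, atob, TextEncoder, and TextDecoder are globals (browsers and recent Node.js); toBase64 and fromBase64 are made-up helper names:

```javascript
// Sketch: Base64-encode any string by going through its UTF-8 bytes.
function toBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Reverse: Base64 -> binary string -> bytes -> UTF-8 decoded text.
function fromBase64(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const original = "caf\u00E9 \u{1F916}";  // "café 🤖"
const encoded = toBase64(original);

console.log(toBase64("hello"));                 // aGVsbG8=
console.log(fromBase64(encoded) === original);  // true
```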

URL percent-encoding

URLs have reserved characters such as ?, &, /, and #, and they can't contain raw spaces. To put any of those inside a URL component, you have to percent-encode them.

js
encodeURIComponent("hello world")
// "hello%20world"

encodeURIComponent("name=ada&role=admin")
// "name%3Dada%26role%3Dadmin"

// build a safe URL with query params
const q = encodeURIComponent("café & croissants");
`https://example.com/search?q=${q}`
// "https://example.com/search?q=caf%C3%A9%20%26%20croissants"

// the URL API does this for you
const url = new URL("https://example.com/search");
url.searchParams.set("q", "café & croissants");
url.toString()
// "https://example.com/search?q=caf%C3%A9+%26+croissants"

Use encodeURIComponent (not encodeURI) for individual parts. Or, even better, let the URL object handle it.
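
The difference is easiest to see side by side. In this sketch, buildQuery is a hypothetical helper wrapping URLSearchParams, and the sample values are invented:

```javascript
const value = "a/b?c=d&e";

// encodeURI assumes a whole URL, so it leaves ?, &, =, and / alone.
console.log(encodeURI(value));           // a/b?c=d&e

// encodeURIComponent escapes them, which is what a single parameter value needs.
console.log(encodeURIComponent(value));  // a%2Fb%3Fc%3Dd%26e

// Hypothetical helper: build a query string from a plain object.
function buildQuery(params) {
  return new URLSearchParams(params).toString();
}

console.log(buildQuery({ q: "a&b c", page: "2" }));  // q=a%26b+c&page=2

// Decoding reverses percent-encoding.
console.log(decodeURIComponent("a%2Fb%3Fc%3Dd%26e") === value);  // true
```

Note that URLSearchParams encodes spaces as + (form encoding), while encodeURIComponent uses %20. Both decode correctly in a query string.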

HTML entities

In HTML, <, >, &, and " are special. If you want to display them literally in your content, you escape them with HTML entities:

html
<!-- show a literal less-than sign -->
<p>5 &lt; 10</p>            <!-- renders: 5 < 10 -->
<p>Tom &amp; Jerry</p>      <!-- renders: Tom & Jerry -->
<p>She said &quot;hi&quot;</p>

<!-- you can also use numeric references -->
<p>&#169; 2024</p>          <!-- © 2024 -->
<p>&#x1F916;</p>            <!-- 🤖 (hex) -->

React handles most of this automatically: values you interpolate are escaped for you. The exception is literal text written directly in JSX, where you sometimes need &apos; or can just use a double-quoted string instead. (You're seeing this in action right now.)
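
If you build HTML strings by hand, without a framework escaping for you, the idea looks like this. This is a minimal sketch (escapeHtml is a hypothetical helper, not a built-in), suitable for text content and quoted attribute values only:

```javascript
// Hypothetical helper: escape the characters that are special in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")   // must run first, or it would re-escape the entities below
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml("5 < 10"));       // 5 &lt; 10
console.log(escapeHtml("Tom & Jerry"));  // Tom &amp; Jerry
console.log(escapeHtml('<a href="x">hi</a>'));
// &lt;a href=&quot;x&quot;&gt;hi&lt;/a&gt;
```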

Quick quiz

Why does "🤖".length return 2 in JavaScript?

Recap

  • Unicode = the numbering. UTF-8 = the bytes-on-disk encoding. Pick UTF-8 always.
  • JS strings are UTF-16 code units. length counts those, not characters. Use Intl.Segmenter for true character counts.
  • Two strings can look identical but have different bytes. Normalize input with .normalize("NFC").
  • Base64 is encoding, not encryption. Percent-encoding escapes URL parts. HTML entities escape HTML.