Regex You Won't Hate

Pattern building from primitives. Lookarounds. JS flags.

Regular expressions are a tiny language for matching patterns in text. At first they look like a cat walked across the keyboard, and they are a little ugly, but you'll read and write a handful of them every week. This lesson teaches the pieces, then turns you loose in a live playground.

Two ways to write a regex in JS

js
// literal syntax
const a = /hello/i;

// constructor (for dynamic patterns built from strings)
const b = new RegExp("hello", "i");

a.test("Hello world")      // true
"Hello world".match(a)     // ["Hello", index: 0, ...]
"Hello world".replace(a, "Hi")  // "Hi world"

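The constructor form has two traps: backslashes must be doubled inside a string, and any regex metacharacters in the string stay live. Here is a minimal sketch, assuming a hypothetical helper named escapeRegExp (not part of the lesson) that escapes them first:

```javascript
// Hypothetical helper: backslash-escape every regex metacharacter
// so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Backslashes must be doubled inside a string literal
const digits = new RegExp("\\d+");   // same as /\d+/

const userInput = "1+1";

// Unescaped: "+" is a quantifier, so this means "one or more 1s, then a 1"
const naive = new RegExp(userInput);
console.log(naive.test("1+1"));   // false
console.log(naive.test("11"));    // true

// Escaped: the pattern source becomes 1\+1 and matches the literal text
const safe = new RegExp(escapeRegExp(userInput));
console.log(safe.test("1+1"));    // true
console.log(safe.test("11"));     // false
```

Reach for the constructor only when part of the pattern comes from a variable; the literal syntax is easier to read for fixed patterns.
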
Character classes: what kinds of characters?

  • . any character except a line break (unless the s flag is set)
  • \d digit, \D non-digit
  • \w word character ([A-Za-z0-9_]), \W the opposite
  • \s whitespace, \S non-whitespace
  • [abc] any of a, b, c. [^abc] none of those.
  • [a-z] any lowercase letter. [0-9] same as \d.
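
The classes above can be tried on a single string. This is only an illustrative sketch; the sample text is made up for the example:

```javascript
// \u00fc is "ü", written as an escape so the example is encoding-proof
const sample = "Order #42 shipped to Z\u00fcrich on 2024-05-24";

// \d+ : runs of digits
const digitRuns = sample.match(/\d+/g);
console.log(digitRuns);     // ["42", "2024", "05", "24"]

// [A-Z] : a single uppercase ASCII letter
const capitals = sample.match(/[A-Z]/g);
console.log(capitals);      // ["O", "Z"]

// [^\w\s] : neither a word character nor whitespace.
// \w is ASCII-only, so "ü" lands here too.
const others = sample.match(/[^\w\s]/g);
console.log(others);        // ["#", "ü", "-", "-"]

// \s+ : split on any run of whitespace
const words = "a  b\tc".split(/\s+/);
console.log(words);         // ["a", "b", "c"]
```
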

Anchors: where in the string?

  • ^ start of string (or line with the m flag)
  • $ end of string (or line with m)
  • \b word boundary (between a \w and a \W, or between a \w and the start or end of the string)
js
/^hello/.test("hello world")    // true  (starts with hello)
/world$/.test("hello world")    // true  (ends with world)
/\bcat\b/.test("the cat sat") // true  (cat as a whole word)
/\bcat\b/.test("category")    // false (cat is part of a longer word)
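
Anchors are also what separates "contains a match" from "is a match", which matters for validation. A short sketch (the variable names are just for illustration):

```javascript
const fourDigits = /\d{4}/;        // unanchored: matches anywhere in the string
const exactlyFour = /^\d{4}$/;     // anchored: the whole string must match

console.log(fourDigits.test("abc12345"));   // true
console.log(exactlyFour.test("abc12345"));  // false
console.log(exactlyFour.test("2024"));      // true

// With the m flag, ^ and $ also match at line breaks
const lines = "alpha\nbeta\ngamma";
const withoutM = lines.match(/^\w+$/g);
const withM = lines.match(/^\w+$/gm);
console.log(withoutM);   // null
console.log(withM);      // ["alpha", "beta", "gamma"]
```
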

Quantifiers: how many?

  • ? 0 or 1
  • * 0 or more
  • + 1 or more
  • {n} exactly n
  • {n,} n or more
  • {n,m} between n and m

By default, quantifiers are greedy: they match as much as possible. Add a ? after a quantifier to make it lazy: match as little as possible.

js
"<b>hi</b> <i>there</i>".match(/<.+>/)[0]
// "<b>hi</b> <i>there</i>"   <-- greedy, eats everything

"<b>hi</b> <i>there</i>".match(/<.+?>/)[0]
// "<b>"                       <-- lazy, stops at the first ">"
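
A negated character class is often a clearer alternative to a lazy quantifier, and the counted forms are easiest to read next to an anchor. A small sketch of both:

```javascript
const html = "<b>hi</b> <i>there</i>";

// [^>]+ cannot run past a ">", so no lazy quantifier is needed
const tags = html.match(/<[^>]+>/g);
console.log(tags);   // ["<b>", "</b>", "<i>", "</i>"]

// Counted quantifiers
console.log(/^\d{3}$/.test("415"));        // true
console.log(/^\d{3}$/.test("4155"));       // false
console.log(/^\d{2,4}$/.test("4155"));     // true

// ? makes the preceding item optional
console.log(/^colou?r$/.test("color"));    // true
console.log(/^colou?r$/.test("colour"));   // true
```

This is for picking apart short, known strings only; it is not a way to parse real HTML.
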

Groups and alternation

Parentheses (...) capture matched content for later use. (?:...) groups without capturing. The pipe | is "or".

js
// alternation
/cat|dog|fish/.test("I have a dog")     // true

// capture groups for extraction
const m = "(415) 555-1212".match(/\((\d{3})\) (\d{3})-(\d{4})/);
m[0]   // "(415) 555-1212"  -- whole match
m[1]   // "415"              -- group 1
m[2]   // "555"              -- group 2
m[3]   // "1212"             -- group 3

// named groups
const r = /(?<year>\d{4})-(?<month>\d{2})/;
"2024-05".match(r).groups   // { year: "2024", month: "05" }

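Captured groups are just as useful on the replacement side. The sketch below shows numbered references, named references, and a replacer function; the sample strings are invented for the example:

```javascript
// Reorder a date with numbered references ($1, $2, $3)
const us = "2024-05-24".replace(/(\d{4})-(\d{2})-(\d{2})/, "$2/$3/$1");
console.log(us);   // "05/24/2024"

// Same idea with named references ($<name>)
const eu = "2024-05-24".replace(
  /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/,
  "$<day>.$<month>.$<year>"
);
console.log(eu);   // "24.05.2024"

// A replacer function receives the whole match, then each group
const doubled = "3 apples, 12 pears".replace(
  /(\d+)/g,
  (whole, n) => String(Number(n) * 2)
);
console.log(doubled);   // "6 apples, 24 pears"

// (?:...) groups the alternation without taking a capture number
const grouped = "catfish".match(/(?:cat|dog)(fish)/);
console.log(grouped[1]);   // "fish"
```
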
Lookarounds: match if surrounded by

Sometimes you want to match X but only when it's followed (or preceded) by Y, without including Y in the match itself.

  • (?=...) lookahead: must be followed by
  • (?!...) negative lookahead: must NOT be followed by
  • (?<=...) lookbehind: must be preceded by
  • (?<!...) negative lookbehind: must NOT be preceded by
js
// price in dollars only, capture just the number
"$99 or 50 yen".match(/(?<=\$)\d+/)[0]
// "99"

// digits that aren't followed by px
"width: 100px; size: 20".match(/\d+(?!px)/g)
// ["10", "20"]   <-- "100" fails because px follows, so the engine backtracks to "10",
//                    which is followed by "0", not "px" (lookarounds are tricky: test carefully!)
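
One way to stop the "100px" example from matching part of a number is to forbid a following digit as well. A sketch of that fix, plus the other lookaround forms on the same inputs:

```javascript
const css = "width: 100px; size: 20";

// Add \d to the negative lookahead so the match cannot stop mid-number
const notPx = css.match(/\d+(?!\d|px)/g);
console.log(notPx);    // ["20"]

// Positive lookahead: digits only when px follows; px itself is excluded
const onlyPx = css.match(/\d+(?=px)/g);
console.log(onlyPx);   // ["100"]

// Negative lookbehind: numbers not preceded by "$" (or by another digit,
// which would mean we are in the middle of a dollar amount)
const notDollars = "$99 or 50 yen".match(/(?<![$\d])\d+/g);
console.log(notDollars);   // ["50"]
```

The pattern is the same each time: whenever a lookaround sits next to a quantifier, ask what the engine can match after giving back a character.
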

JS flags

  • g global, find all matches not just the first
  • i case-insensitive
  • m multiline, ^ and $ match line boundaries
  • s dotall, . matches newlines too
  • u Unicode mode (treat code points correctly)
  • v Unicode v-mode (modern, smarter sets, ES2024+)
js
"Cat cat CAT".match(/cat/g)      // ["cat"]
"Cat cat CAT".match(/cat/gi)     // ["Cat", "cat", "CAT"]

// emoji counted as one character with u flag
/^.$/.test("🤖")                 // false (without u, robot is 2 code units)
/^.$/u.test("🤖")                // true
/\p{Emoji}/u.test("🤖")         // true (Unicode property)
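
Two flag behaviors are worth seeing in action: s changes what . matches, and g makes a regex object remember where it stopped. A short sketch:

```javascript
// s (dotall): "." also matches line breaks
console.log(/a.b/.test("a\nb"));    // false
console.log(/a.b/s.test("a\nb"));   // true

// g makes a regex object stateful: test() and exec() resume from lastIndex
const re = /cat/g;
const first = re.test("cat");
const second = re.test("cat");
console.log(first);    // true  (lastIndex is now 3)
console.log(second);   // false (search started at index 3, then lastIndex reset to 0)

// matchAll requires the g flag and yields one match object per hit
const found = [..."Cat cat CAT".matchAll(/cat/gi)].map((m) => m.index);
console.log(found);    // [0, 4, 8]
```

If a shared /g regex gives alternating true and false results from test(), this lastIndex behavior is the usual cause; dropping the g flag for plain yes/no checks avoids it.
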

Patterns you'll actually copy-paste

js
// hex color (#fff or #ffffff)
/^#([0-9a-f]{3}|[0-9a-f]{6})$/i

// ISO date (YYYY-MM-DD)
/^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/

// "email-ish" (NOT real validation - see callout below)
/^[^\s@]+@[^\s@]+\.[^\s@]+$/

// URL slug (lowercase letters, digits, hyphens)
/^[a-z0-9]+(?:-[a-z0-9]+)*$/

// extract YouTube video ID from various URL forms
/(?:youtu\.be\/|v=)([\w-]{11})/

// whitespace at end of line (useful for cleanup)
/[ \t]+$/gm

// IPv4 (loose - for tighter, you need to check each octet ≤ 255)
/\b(?:\d{1,3}\.){3}\d{1,3}\b/
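
For the IPv4 and ISO date patterns, a common approach is to let the regex check the shape and let ordinary code check the values. The helpers below, isIPv4 and isRealDate, are hypothetical names for this sketch, not part of any library:

```javascript
// Shape check by regex, range check in plain code
function isIPv4(text) {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(text);
  if (!match) return false;
  // match[1] to match[4] hold the four octets as strings
  return match.slice(1).every((octet) => Number(octet) <= 255);
}

// The ISO date pattern accepts "2024-02-31"; a round trip through Date catches it
function isRealDate(text) {
  if (!/^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/.test(text)) return false;
  const parsed = new Date(text + "T00:00:00Z");
  // Engines either reject the date or roll it over; both cases fail here
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === text;
}

console.log(isIPv4("192.168.0.1"));      // true
console.log(isIPv4("256.1.1.1"));        // false
console.log(isIPv4("1.2.3"));            // false
console.log(isRealDate("2024-02-29"));   // true
console.log(isRealDate("2024-02-31"));   // false
```

Note that isIPv4 as written still accepts leading zeros such as "01.2.3.4"; whether that is acceptable depends on where the value is used.
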
Don't regex email or HTML

The official email spec is hundreds of lines of grammar. Real email regexes are dozens of characters long and still wrong. To validate email, send a confirmation message. To "parse" HTML, use a parser (the browser's DOMParser, cheerio, jsdom).

HTML is not a regular language. Trying to regex it will work 80% of the time and silently corrupt your data the other 20%.

Live regex playground

Below is an interactive sandbox. Edit the regex in index.js and the text. The console shows what matches. Try the example patterns from above, or invent your own.

// EDIT ME: change the pattern and flags
const pattern = /\b\d{4}-\d{2}-\d{2}\b/g;

// EDIT ME: change the text to test against
const text = `
Meeting on 2024-05-24 at 09:00.
Follow-up on 2024-06-01.
Old date: 1999-12-31, ignore: 24-05-2024.
`;

// --- output ---
console.log("Pattern:", pattern.toString());
console.log("Flags:  ", pattern.flags || "(none)");

const matches = [...text.matchAll(pattern)];
console.log("Found  ", matches.length, "match(es):");
matches.forEach((m, i) => {
  console.log("  " + (i + 1) + ".", JSON.stringify(m[0]), "@ index", m.index);
  if (m.groups) console.log("     groups:", m.groups);
  for (let g = 1; g < m.length; g++) {
    console.log("     [" + g + "]:", JSON.stringify(m[g]));
  }
});

// try these:
//   /^#([0-9a-f]{3}|[0-9a-f]{6})$/i   on "#fff" / "#1A2B3C"
//   /\b\w+@\w+\.\w+\b/g          on "email me at a@b.co or x@y.com"
//   /(?<year>\d{4})-(?<month>\d{2})/  on "2024-05"

Performance: catastrophic backtracking

A regex with nested quantifiers like (a+)+$, run against a string of a's followed by a b, can take exponential time to fail. This is called catastrophic backtracking, and it's a known way to DoS a server (a ReDoS attack). Real outages have been traced to a single backtracking regex (Cloudflare, Stack Overflow).

js
// DON'T do this on untrusted input
/(a+)+$/.test("aaaaaaaaaaaaaaaaaaab")    // slow: the work roughly doubles with each extra "a"

// no flag fixes this (the v flag changes set syntax, not how matching backtracks)
// the real fix is: rewrite to avoid nested quantifiers,
// or use a proper parser, or set a time budget
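
As a sketch of the rewrite-plus-guard approach: the nested pattern can be flattened, and oversized input can be refused before any regex runs. The function name endsWithAs and the MAX_INPUT_LENGTH value are assumptions made for this example:

```javascript
// Hypothetical guard: refuse oversized input before running any regex on it
const MAX_INPUT_LENGTH = 1000;   // assumed limit, tune for your data

function endsWithAs(text) {
  if (text.length > MAX_INPUT_LENGTH) {
    throw new RangeError("input too long");
  }
  // /a+$/ accepts the same strings as /(a+)+$/, but there is only one way
  // to split a run of a's, so a failed match is at worst quadratic
  return /a+$/.test(text);
}

console.log(endsWithAs("baaa"));                  // true
console.log(endsWithAs("a".repeat(900) + "b"));   // false
```

A length cap does not make a bad pattern safe, but it bounds the damage while the pattern gets fixed.
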
Tools that help

regex101.com is the standard interactive regex tool. It explains your regex piece by piece and warns about catastrophic backtracking. Use it when stuck.

Quick quiz

What does the regex /foo+/ match?

Recap

  • Character classes (\d \w \s [abc]), anchors (^ $ \b), and quantifiers (? * + {n}) are the building blocks.
  • Groups (...) capture, (?:...) just groups, (?<name>...) names a capture.
  • Flags: g all, i case-blind, m per-line, s dotall, u/v Unicode-aware.
  • Don't regex email or HTML. Beware catastrophic backtracking on untrusted input.