
Regex cheat sheet: metacharacters, groups, and lookarounds

Regex cheat sheet: metacharacters, quantifiers, anchors, groups, lookarounds, plus 15+ JavaScript and Python patterns and concrete fixes for catastrophic backtracking.

12 min read

A regular expression is a small pattern language for matching text. \d+ means “one or more digits”, ^Error means “a line that starts with Error”. That is the whole job. This regex cheat sheet collects the syntax in one place: metacharacters, quantifiers, anchors, groups, lookarounds, and flags, plus 15+ patterns you can paste into JavaScript or Python.

It is written for developers who already know what a string is and want a reference, not a tour. Skip to the Quick reference table if you just need the symbols. Read the lookaround and pitfalls sections if you have ever had a regex hang a server.

1. What is regex and why you still need it in 2026

A regex is a pattern compiled into a state machine that scans a string and either matches or fails. The grammar is small. The list of uses is not.

AI can draft a pattern for you, but a few jobs still belong to a human writing regex by hand:

  • Log parsing. You have ten million lines of nginx access logs and need every 5xx request from a specific user agent. A 40-character regex over grep -E runs in seconds. An LLM call per line does not.
  • Form and field validation. Phone numbers, postal codes, ISO timestamps, license keys. The pattern lives next to the input and runs on every keystroke in the browser.
  • Bulk find-and-replace. Refactoring a thousand files where you need to capture a name and re-inject it. sed, ripgrep, and your editor’s “Replace in files” all speak regex natively.

For the JSON half of the same toolbox, see our jq command-line cheat sheet.

1.1 How to read a regex pattern (5-second regular expression tutorial)

Most patterns are easier to read left to right, one token at a time. Take ^[A-Z]\w+\d{2,4}$ as an example:

  • ^ anchors the match to the start of the string. Nothing can come before.
  • [A-Z] matches exactly one uppercase letter. Just one — there is no quantifier yet.
  • \w+ matches one or more word characters (letters, digits, underscore).
  • \d{2,4} matches between two and four digits.
  • $ anchors to the end of the string. Nothing can come after.

So the whole pattern matches Order42 and Job1999, but not X07: after the leading X, \w+ needs at least one word character and \d{2,4} needs at least two more digits. The skill is reading the anchors first, then the character classes, then the quantifiers, and finally the boundaries. Every pattern in this regular expression tutorial follows the same parse order.
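
The walk-through above can be checked directly. This is a minimal Python sketch using the three sample strings from the text; the names PATTERN, samples, and results are illustrative only.

```python
import re

# Same pattern as in the walk-through above.
PATTERN = re.compile(r'^[A-Z]\w+\d{2,4}$')

samples = ['Order42', 'Job1999', 'X07']
results = {s: bool(PATTERN.match(s)) for s in samples}
print(results)
# X07 fails: after the leading X, \w+ needs one character
# and \d{2,4} needs two more digits.
```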

2. Quick reference table

Copy what you need.

Metacharacters

Pattern   Matches
.         Any character except newline (or any character with the s/dotall flag)
\d        A digit ([0-9] in JavaScript, even with the u flag; any Unicode decimal digit in Python 3 str patterns)
\D        A non-digit
\w        A word character ([A-Za-z0-9_]; Unicode-aware in Python 3 str patterns)
\W        A non-word character
\s        Any whitespace (space, tab, newline, …)
\S        Any non-whitespace

Quantifiers

Pattern              Matches
*                    0 or more (greedy)
+                    1 or more (greedy)
?                    0 or 1 (greedy)
{n}                  Exactly n times
{n,m}                Between n and m times
{n,}                 n or more times
*?, +?, ??, {n,m}?   Lazy versions of each quantifier

Anchors

Pattern   Matches
^         Start of string (or start of line with m flag)
$         End of string (or end of line with m flag)
\b        Word boundary
\B        Non-word boundary
\A        Absolute start of string (Python)
\Z        Absolute end of string (Python)

Character classes

Pattern   Matches
[abc]     Any of a, b, c
[^abc]    Anything except a, b, c
[a-z]     Any lowercase letter
[0-9]     Any digit
\p{L}     Any Unicode letter (needs the u flag in JS; not supported by Python stdlib re, use the regex PyPI package)

Groups

Pattern        Matches
(...)          Capture group
(?:...)        Non-capture group
(?<name>...)   Named capture (JS ES2018+); Python uses (?P<name>...)
\1, \2         Backreference to group 1, 2

Lookaround

Pattern    Matches
(?=...)    Positive lookahead
(?!...)    Negative lookahead
(?<=...)   Positive lookbehind
(?<!...)   Negative lookbehind

Flags

Flag   Effect
i      Case-insensitive
m      Multiline: ^ and $ match per-line
s      Dotall: . matches newlines
g      Global (JS); find all matches
u      Unicode mode
y      Sticky (JS); anchor to lastIndex

3. Metacharacters and character classes

3.1 Literals vs special characters

Most characters are literal. The 14 metacharacters that you must escape when you want them as themselves are:

. ^ $ * + ? ( ) [ ] { } | \

Forgetting to escape . is the single most common regex bug. \. matches a literal dot. Inside a character class, [.] also matches a literal dot. Most metacharacters lose their power inside [...] except ], \, ^ (when first), and - (in the middle).
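
When the literal text comes from a variable, escaping by hand is error-prone. The following is a minimal Python sketch that uses the standard library's re.escape; the file name is a made-up example, not something from a real system.

```python
import re

filename = 'report.v2(final).txt'   # hypothetical literal text to search for

unescaped = re.compile(filename)            # '.', '(' and ')' keep their special meaning
escaped = re.compile(re.escape(filename))   # every metacharacter is matched literally

print(bool(unescaped.search('report.v2(final).txt')))  # False: the parens form a group
print(bool(escaped.search('report.v2(final).txt')))    # True
print(bool(unescaped.search('reportXv2finalYtxt')))    # True: each dot matches any character
```

The unescaped version fails on the exact file name and succeeds on a string that merely has the right shape, which is the same class of bug as the forgotten dot.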

3.2 Character shorthands

The shorthand classes look simple until Unicode shows up:

// JavaScript — without the u flag, \d is ASCII only
/\d/.test('5');    // true
/\d/.test('٥');    // false (Arabic-Indic digit)
/\d/u.test('٥');   // false — even with u, \d stays ASCII in JS
/\p{N}/u.test('٥'); // true — \p{N} is the Unicode-aware digit class

# Python: the re module treats \d as Unicode by default
import re
re.match(r'\d', '٥')  # <Match span=(0, 1)>
re.match(r'(?a)\d', '٥')  # None — (?a) forces ASCII

If you only deal with English ASCII input, \d and [0-9] are interchangeable. The moment a user pastes a name with an accent, you want \p{L} over \w.

3.3 Custom character classes

// JavaScript
/[A-Za-z][A-Za-z0-9_-]{2,29}/.test('valid_handle-1'); // true

// Negation and ranges combined
/[^aeiou\s]/g  // any non-vowel, non-whitespace character

For Unicode categories, \p{L} is “any letter”, \p{N} is “any number”, \p{Script=Han} is “any Han character”. JavaScript requires the u flag. Python supports \p{...} only via the regex PyPI package, not the stdlib re.

If you work on the command line, you may also see POSIX character classes:

POSIX class   Matches              ASCII equivalent
[[:alpha:]]   letters              [A-Za-z]
[[:digit:]]   digits               [0-9] (\d in JS/Python)
[[:alnum:]]   letters and digits   [A-Za-z0-9]
[[:space:]]   whitespace           \s
[[:upper:]]   uppercase letters    [A-Z]
[[:lower:]]   lowercase letters    [a-z]

POSIX classes work in grep -E, sed -E, and other tools that follow POSIX ERE. They do not work in JavaScript or Python re — use the shorthand equivalents (\d, \s, \w) instead.

4. Quantifiers and greedy vs lazy

4.1 Basic quantifiers

/a*/.exec('aaab')      // ['aaa']     — 0 or more
/a+/.exec('aaab')      // ['aaa']     — 1 or more
/a?/.exec('aaab')      // ['a']       — 0 or 1
/a{2,3}/.exec('aaaab') // ['aaa']     — 2 to 3

4.2 Greedy vs lazy

By default quantifiers are greedy. They grab as much as they can, then back off to make the whole pattern fit. Add ? to flip them lazy.

const html = '<p>one</p><p>two</p>';

html.match(/<p>.*<\/p>/)[0];   // '<p>one</p><p>two</p>'   (greedy eats both)
html.match(/<p>.*?<\/p>/)[0];  // '<p>one</p>'             (lazy stops at first)

The lazy version is almost always what you want when extracting tags or quoted strings. Better still, avoid . entirely and use a negated class. <p>[^<]*</p> is faster than <p>.*?</p> because there is nothing to backtrack into.
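
The same three variants can be compared side by side in Python. This is a small sketch reusing the sample string from the JavaScript snippet above; re.findall is used so that every match is visible.

```python
import re

html = '<p>one</p><p>two</p>'

greedy = re.findall(r'<p>.*</p>', html)       # one match spanning both paragraphs
lazy = re.findall(r'<p>.*?</p>', html)        # stops at the first closing tag each time
negated = re.findall(r'<p>[^<]*</p>', html)   # cannot run past a '<' at all

print(greedy)   # ['<p>one</p><p>two</p>']
print(lazy)     # ['<p>one</p>', '<p>two</p>']
print(negated)  # ['<p>one</p>', '<p>two</p>']
```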

4.3 Catastrophic backtracking

This is how regex hangs a server. Nest a quantifier inside another quantifier with an ambiguous overlap, and the engine explores an exponential number of paths before giving up.

// Don't do this
/(a+)+b/.test('a'.repeat(41) + '!'); // can hang for minutes or longer

For 41 as followed by !, the engine tries roughly 2^41 split points before deciding the b is missing. Three fixes:

  1. Flatten the pattern. /a+b/ does the same job with no nesting.
  2. Use an atomic group (PCRE, Java, Ruby, Python re 3.11+, Python regex). (?>a+)+b tells the engine that once a+ has matched, it will not backtrack into it.
  3. Switch engines. Go’s regexp, RE2, and Rust’s regex crate use a linear-time NFA and cannot backtrack catastrophically by design.

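JavaScript and Python re are both backtracking engines. JavaScript has no atomic groups or possessive quantifiers. Python’s re gained both in 3.11, and the regex PyPI package provides them on older versions. When you control the input length, backtracking is fine. When the input comes from a user, validate its length first or run the pattern on a linear-time engine such as RE2.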
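
The first fix, flattening, can be sanity-checked in Python. This sketch assumes the two patterns from the text and a handful of made-up inputs; it only confirms that both patterns agree on short strings and that the flat one stays fast on a long near-miss. It deliberately never runs the nested pattern on the long input.

```python
import re

nested = re.compile(r'(a+)+b')   # ambiguous: many ways to split a run of a's
flat = re.compile(r'a+b')        # matches the same strings, one way to match

# On short inputs both patterns agree.
for text in ['ab', 'aaab', 'xaab', 'aaa!', 'b']:
    assert bool(nested.search(text)) == bool(flat.search(text))

# The flat pattern stays fast on a long near-miss.
# Do not run `nested` on this input: it may not finish in reasonable time.
long_near_miss = 'a' * 2000 + '!'
print(flat.search(long_near_miss))  # None
```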

5. Anchors and word boundaries

5.1 ^ and $

By default, ^ is the start of the entire input and $ is the end. With the m (multiline) flag, they become the start and end of each line:

const log = 'INFO start\nERROR boom\nINFO done';
log.match(/^ERROR.*/);    // null    — single-line mode, ^ only matches index 0
log.match(/^ERROR.*/m);   // ['ERROR boom']

5.2 \b and \B

\b is a zero-width assertion. It matches the position between a word character (\w) and a non-word character. Useful for whole-word search:

/\bcat\b/.test('the cat sat');     // true
/\bcat\b/.test('concatenate');     // false

Word boundaries are defined on \w, which is ASCII-only in JavaScript (Python 3 treats \w as Unicode-aware for str patterns). Either way, Chinese, Japanese, and Korean text has no spaces between words, so \b does not detect word edges there. You need a tokenizer (jieba, MeCab) before regex, not instead of one.

5.3 Multiline mode

import re
text = "INFO ok\nERROR fail\nINFO done\n"

re.findall(r'^ERROR.*$', text)              # []
re.findall(r'^ERROR.*$', text, re.MULTILINE) # ['ERROR fail']

In JavaScript the same thing reads text.match(/^ERROR.*$/gm). Combine m with g to grab every matching line.

6. Groups, capture, and backreferences

6.1 Capturing groups

Parentheses do two jobs. They group sub-patterns for quantifiers, and they capture the match for later use.

'2026-05-13'.match(/(\d{4})-(\d{2})-(\d{2})/);
// ['2026-05-13', '2026', '05', '13', index: 0, ...]

Groups are numbered left-to-right by their opening paren, starting at 1.
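
Nested groups make the numbering rule easier to see. This is an illustrative Python sketch with a deliberately over-parenthesized date pattern; the nesting is there only to show the order, not because it is a useful way to write it.

```python
import re

# Opening parens, left to right:
# 1 = whole date, 2 = year, 3 = month-day pair, 4 = month, 5 = day
m = re.match(r'((\d{4})-((\d{2})-(\d{2})))', '2026-05-13')

print(m.group(0))  # '2026-05-13' (group 0 is always the full match)
print(m.groups())  # ('2026-05-13', '2026', '05-13', '05', '13')
```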

6.2 Non-capturing groups

When you only need grouping, not capture, use (?:...). It is faster and keeps numbered groups tidy:

/(?:https?):\/\/(\S+)/.exec('see https://go-tools.org');
// ['https://go-tools.org', 'go-tools.org']
// — the protocol is grouped but not captured; group 1 is the host

6.3 Named groups

Naming groups makes patterns readable and safe to refactor.

// JavaScript (ES2018+)
const m = '2026-05-13'.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
m.groups.year;  // '2026'

# Python: note the (?P<...>) syntax
import re
m = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', '2026-05-13')
m.group('year')  # '2026'

6.4 Backreferences

Backreferences let a later part of the pattern repeat what an earlier capture matched.

// Find any character that repeats consecutively
'bookkeeper'.match(/(\w)\1/g);   // ['oo', 'kk', 'ee']

// Match paired HTML tags by name
const tag = /<(\w+)>(.*?)<\/\1>/;
'<b>bold</b>'.match(tag);
// ['<b>bold</b>', 'b', 'bold']

In Python, \1 works in both pattern and replacement. Named references read (?P=name) in the pattern and \g<name> in re.sub replacements.
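
The three Python forms mentioned above look like this in practice. This is a minimal sketch; the sample sentences and the square-bracket output format are invented for illustration.

```python
import re

# Numbered backreference in the pattern: a doubled word such as "the the".
doubled = re.search(r'\b(\w+) \1\b', 'it was the the best of times')
print(doubled.group(0))  # 'the the'

# Named backreference in the pattern: the closing tag must repeat the opening name.
tag = re.compile(r'<(?P<name>\w+)>(?P<body>.*?)</(?P=name)>')
print(tag.search('<b>bold</b>').group('body'))  # 'bold'
print(tag.search('<b>bold</i>'))                # None

# Named reference in the replacement: swap angle brackets for square brackets.
converted = tag.sub(r'[\g<name>]\g<body>[/\g<name>]', '<b>bold</b>')
print(converted)  # '[b]bold[/b]'
```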

7. Lookarounds: lookahead and lookbehind

Lookarounds are zero-width assertions. They check a condition without consuming characters, so you can chain them.

7.1 Lookahead

// Password: at least 8 chars, one digit, one uppercase, one lowercase
const strong = /^(?=.*\d)(?=.*[A-Z])(?=.*[a-z]).{8,}$/;
strong.test('Hunter2!');   // true
strong.test('hunter2!');   // false — no uppercase

// Negative lookahead — file names that are not .tmp
/^[\w-]+(?!\.tmp$)\.[a-z]+$/.test('report.csv'); // true
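
The same lookahead checks port to Python unchanged. The sketch below reuses the password pattern as written and, as a variation that is not in the original snippet, places the negative lookahead at the start of the file-name pattern, which is a common way to express "the whole string must not end in .tmp". The names STRONG and NOT_TMP are illustrative.

```python
import re

# Each (?=...) scans ahead from position 0 without consuming anything.
STRONG = re.compile(r'^(?=.*\d)(?=.*[A-Z])(?=.*[a-z]).{8,}$')

print(bool(STRONG.match('Hunter2!')))  # True
print(bool(STRONG.match('hunter2!')))  # False: no uppercase letter
print(bool(STRONG.match('Hunter2')))   # False: only 7 characters

# Variation: negative lookahead up front rejects any name ending in .tmp.
NOT_TMP = re.compile(r'^(?!.*\.tmp$)[\w-]+\.[a-z]+$')

print(bool(NOT_TMP.match('report.csv')))  # True
print(bool(NOT_TMP.match('report.tmp')))  # False
```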

7.2 Lookbehind

Lookbehind is the mirror image. It asserts what comes before the current position.

// Extract a price after a currency symbol — keep the number, drop the $
'price: $42.50'.match(/(?<=\$)\d+(\.\d+)?/);   // ['42.50', '.50']

// Negative lookbehind — match Bond but not James Bond
'Mr. Bond'.match(/(?<!James )Bond/);  // ['Bond']
'James Bond'.match(/(?<!James )Bond/); // null

7.3 JavaScript vs Python lookbehind

This is one of the few places where the two engines diverge enough to break a pattern when you port it.

Engine                                                           Lookbehind length
JavaScript (V8, SpiderMonkey, JavaScriptCore in Safari 16.4+)    Variable-width since ES2018. (?<=\d+) is valid.
Python stdlib re                                                 Fixed-width only. (?<=\d+) raises error: look-behind requires fixed-width pattern.
Python regex PyPI package                                        Variable-width supported. import regex; regex.search(r'(?<=\d+)abc', '12abc').

Workaround in Python: rewrite the lookbehind with a known repeat ((?<=\d{3})), or capture the prefix and slice it off after matching. The MDN lookbehind reference documents the JavaScript syntax in depth.
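
Both workarounds can be sketched with the stdlib alone. The sample text is invented; the first part only demonstrates that re rejects the variable-width form at compile time.

```python
import re

text = 'order 12abc and 3456abc'

# Variable-width lookbehind is rejected by the stdlib re module at compile time.
try:
    re.compile(r'(?<=\d+)abc')
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False

# Workaround 1: a fixed-width lookbehind, when a known number of digits is enough.
fixed = re.findall(r'(?<=\d)abc', text)
print(fixed)  # ['abc', 'abc']

# Workaround 2: match the prefix too, but capture only the part you want.
captured = re.findall(r'\d+(abc)', text)
print(captured)  # ['abc', 'abc']
```

In this particular case (?<=\d) is enough, because "preceded by one or more digits" and "preceded by a digit" describe the same positions. The capture approach is the more general one.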

8. Flags and modifiers

8.1 i for case-insensitive

// JavaScript
/error/i.test('FATAL ERROR'); // true

# Python
import re
re.search(r'error', 'FATAL ERROR', re.IGNORECASE)  # <Match span=(6, 11)>

8.2 m and s

m flips ^ and $ into per-line anchors. s (dotall) lets . match newlines. They are independent. Combine them when you want both.

/<script>(.*?)<\/script>/s.exec('<script>\nalert(1)\n</script>')[1];
// '\nalert(1)\n'  — without s, the . would refuse the newlines
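
Combining the two flags in Python looks like this. It is a sketch over an invented BEGIN/END block format, chosen only because it needs per-line anchors and a dot that crosses newlines at the same time.

```python
import re

text = 'BEGIN\nline one\nline two\nEND\nBEGIN\nline three\nEND\n'

# MULTILINE anchors ^ and $ to each line; DOTALL lets . cross the newlines in between.
blocks = re.findall(r'^BEGIN$(.*?)^END$', text, re.MULTILINE | re.DOTALL)
print(blocks)  # ['\nline one\nline two\n', '\nline three\n']

# Without DOTALL the dot cannot cross a newline, so nothing matches.
without_dotall = re.findall(r'^BEGIN$(.*?)^END$', text, re.MULTILINE)
print(without_dotall)  # []
```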

8.3 g for global

In JavaScript, g changes the API rather than the match itself. Without g, String.match returns capture groups. With g, it returns every match string. Use matchAll to keep capture groups across all matches.

const text = 'a=1 b=2 c=3';

text.match(/(\w)=(\d)/);     // first match with groups
text.match(/(\w)=(\d)/g);    // ['a=1', 'b=2', 'c=3'] — no groups
[...text.matchAll(/(\w)=(\d)/g)]; // every match, with groups

Python does not use g. re.findall, re.finditer, and re.sub are the global variants.
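
For comparison, here is the same input run through the Python functions named above. A minimal sketch; the settings dictionary is just one illustrative way to consume the match objects.

```python
import re

text = 'a=1 b=2 c=3'
pair = re.compile(r'(\w)=(\d)')

print(pair.search(text).groups())   # ('a', '1'), first match only
print(pair.findall(text))           # [('a', '1'), ('b', '2'), ('c', '3')], groups as tuples
print([m.group(0) for m in pair.finditer(text)])  # ['a=1', 'b=2', 'c=3']

# finditer yields full match objects, the closest analogue of matchAll.
settings = {m.group(1): int(m.group(2)) for m in pair.finditer(text)}
print(settings)  # {'a': 1, 'b': 2, 'c': 3}
```

Note that findall returns tuples of groups when the pattern has more than one group, and plain strings when it has none.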

8.4 u for Unicode and \p{...}

// Match any Han character (Chinese, Japanese kanji)
/\p{Script=Han}+/gu.test('Hello 世界'); // true

// Match emoji (extended pictographic)
/\p{Extended_Pictographic}/u.test('👋'); // true

In Python, Unicode is on by default. re.findall(r'[一-鿿]+', text) is a rough equivalent for Han: the range U+4E00 to U+9FFF covers the basic CJK Unified Ideographs block, not the extension blocks. For full Unicode property escapes, use the regex PyPI package: regex.findall(r'\p{Script=Han}+', text). See MDN’s Unicode character class escape for the complete property list.

9. Common patterns you’ll use daily

9.1 Email validation

Be honest about which version you need.

// The 95% pattern — what most form validators use
const email = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
email.test('a@b.co');  // true

// The "I really want to be RFC 5322-ish" pattern
const rfc = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

Honest truth: full RFC 5322 email validation in pure regex is ~6000 characters and still wrong on edge cases. Use the 95% pattern, then send a verification email. That is the only test that really works.

9.2 URL extraction

const urlPattern = /https?:\/\/[^\s<>"]+/g;
const found = 'See https://example.com/a?b=1 and http://x.io'.match(urlPattern);
// ['https://example.com/a?b=1', 'http://x.io']

Once you extract a URL, you usually want to inspect its query string. Paste it into our URL decoder/encoder and you can read percent-encoded parameters at a glance. For the full picture of when to encode versus decode, read the URL encoding & decoding guide.

9.3 Phone numbers

// E.164 — international, optional + and 1-3 digit country code
const e164 = /^\+?[1-9]\d{1,14}$/;
e164.test('+14155551234');  // true

// North American Number Plan with separators
const nanp = /^(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$/;
nanp.test('(415) 555-1234'); // true

For anything beyond “is this shape plausible”, use libphonenumber. Regex cannot tell you whether an area code exists.

9.4 IPv4 and IPv6

// IPv4 — strict 0-255 per octet
const ipv4 = /^((25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(25[0-5]|2[0-4]\d|1?\d?\d)$/;
ipv4.test('192.168.1.1');   // true
ipv4.test('999.0.0.1');     // false

// IPv6 — the simplified form. The full RFC 4291 pattern is ~600 chars.
const ipv6simple = /^([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}$/;
ipv6simple.test('2001:0db8:85a3:0000:0000:8a2e:0370:7334'); // true

For real IPv6 with :: shorthand, embedded IPv4, and zone identifiers, use node:net’s isIP() or Python’s ipaddress.ip_address(). Trying to do it in pure regex is a rite of passage, then a maintenance burden.
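
On the Python side, the standard-library route takes only a few lines. This is a minimal sketch; ip_version is a hypothetical helper name, and the addresses are documentation-style examples.

```python
import ipaddress

def ip_version(text):
    """Return 4 or 6 for a valid IP address, or None if it does not parse."""
    try:
        return ipaddress.ip_address(text).version
    except ValueError:
        return None

print(ip_version('192.168.1.1'))              # 4
print(ip_version('999.0.0.1'))                # None
print(ip_version('2001:db8::8a2e:370:7334'))  # 6, with the :: shorthand
print(ip_version('::ffff:192.0.2.1'))         # 6, with an embedded IPv4 address
```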

9.5 ISO 8601 dates and timestamps

// Date only — YYYY-MM-DD
const isoDate = /^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$/;
isoDate.test('2026-05-13'); // true

// Date + time + optional fractional seconds + Z or offset
const iso = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;
iso.test('2026-05-13T09:30:00.123Z'); // true

ISO 8601 looks simple and is full of traps: leap seconds, week dates (2026-W19), ordinal dates (2026-133). For epoch seconds versus milliseconds and timezone shifts, see the Unix timestamp guide.
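
One trap the date-only pattern cannot catch is a day that does not exist in the given month. A possible approach, sketched here in Python, is to use the regex for the shape and let datetime.date check the calendar. The helper name is_real_date is hypothetical, and the re.ASCII flag is an added assumption so that \d accepts only 0-9.

```python
import re
from datetime import date

# Same shape as the date-only pattern above; re.ASCII keeps \d to 0-9.
ISO_DATE = re.compile(r'^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$', re.ASCII)

def is_real_date(text):
    """Shape check with regex, then calendar check with datetime.date."""
    if not ISO_DATE.match(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True

print(is_real_date('2026-05-13'))  # True
print(is_real_date('2026-02-30'))  # False: passes the regex, but February has no 30th
print(is_real_date('2026-13-01'))  # False: rejected by the regex
```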

10. Find/replace workflows with regex

10.1 JavaScript: String.replace with $1

// Reformat US dates: MM/DD/YYYY -> YYYY-MM-DD
'05/13/2026'.replace(/(\d{2})\/(\d{2})\/(\d{4})/, '$3-$1-$2');
// '2026-05-13'

// Use a callback when the replacement is conditional
'price 42 dollars'.replace(/(\d+) dollars/, (_, n) => `$${n}`);
// 'price $42'

$1, $2, … reference numbered groups. $<name> references named groups. $& is the full match. $$ is a literal $.

10.2 Python: re.sub with \1 and callbacks

import re

# Same date reformat as above
re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', '05/13/2026')
# '2026-05-13'

# Callback — uppercase every email address in a string
def upper_email(m):
    return m.group(0).upper()

re.sub(r'[\w.-]+@[\w.-]+', upper_email, 'mail me at hi@go-tools.org')
# 'mail me at HI@GO-TOOLS.ORG'

In replacements, Python uses \1 or \g<name>. The raw string prefix r'...' matters: without it, Python reads '\1' as an octal escape and hands re.sub the control character U+0001 instead of a group reference.
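
The effect of the raw-string prefix is easy to demonstrate. This sketch reuses the date reformat from above with named groups; the variable names are illustrative.

```python
import re

pattern = r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})'

# Raw string: the backslash reaches the re module, which reads \g<year> as a group reference.
good = re.sub(pattern, r'\g<year>-\g<month>-\g<day>', '05/13/2026')
print(good)  # '2026-05-13'

# Non-raw string: Python itself turns '\3', '\1', '\2' into control characters
# before re.sub ever sees them, so no group is substituted.
bad = re.sub(r'(\d{2})/(\d{2})/(\d{4})', '\3-\1-\2', '05/13/2026')
print(bad == '\x03-\x01-\x02')  # True
```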

10.3 CLI: sed, grep, ripgrep, jq

For batch refactors on the command line, regex moves from the script into the shell:

# ripgrep: find every TODO with a name attached and print only the name
rg -n '\bTODO\(([^)]+)\)' --replace '$1'

# grep -E with anchors — failed login lines from auth.log
grep -E '^[A-Z][a-z]{2} +[0-9]+ .*Failed password' /var/log/auth.log

# sed: strip trailing whitespace, in-place, across a tree (GNU sed; BSD/macOS sed needs -i '')
find . -name '*.md' -print0 | xargs -0 sed -i -E 's/[[:space:]]+$//'

ripgrep uses Rust’s regex crate (RE2-style, linear time, no lookaround or backreferences in the default engine), and its replacement strings reference groups as $1, not \1. grep -E and sed -E use POSIX extended regex, which does not define \d. Use [0-9] or [[:digit:]] instead. When the data is JSON, swap regex for jq. See the jq command-line cheat sheet for a parallel reference card.

11. Common pitfalls

11.1 Forgetting to escape .

A real bug we have shipped: a log redactor was supposed to mask IP addresses.

// Wrong — matches '192a168b1c1' too
/(\d+).(\d+).(\d+).(\d+)/.test('192a168b1c1');  // true

// Right
/(\d+)\.(\d+)\.(\d+)\.(\d+)/.test('192a168b1c1'); // false

Inside a character class, . is already literal, so [.] and \. both work. Anywhere else, escape it.

11.2 Greedy .* eats too much

'<a href="x"><b>bold</b></a>'.match(/<(.*)>/)[1];
// 'a href="x"><b>bold</b></a'  — the whole thing!

Greedy .* scans to the end of the string, then backs up until > matches, which is the last > in the input. Either go lazy (.*?) or, faster and clearer, use a negated class ([^>]*).

11.3 Multiline anchors

One common confusion: ^ and $ do not match newline characters by default. They match positions at the start and end of the entire input. Adding the m flag is what turns them into per-line anchors. Adding the s flag is what lets . cross newlines. They are orthogonal, and you usually want both for log parsing.

11.4 ReDoS and how to defuse it

ReDoS (regex denial of service) is the production version of catastrophic backtracking. The fixes:

  1. Static analysis. Tools like safe-regex, recheck, and the no-super-linear-backtracking rule from eslint-plugin-regexp catch dangerous patterns before they ship.
  2. Atomic groups (PCRE, Ruby, Java, Python re 3.11+, Python regex). (?>...) prevents the engine from re-entering the group on backtrack.
  3. Possessive quantifiers (*+, ++, ?+ in PCRE, Java, and Python re 3.11+). Same idea, terser syntax.
  4. Switch to a non-backtracking engine. Go’s regexp, RE2, Rust’s regex crate, and the re2 Python binding all run in linear time. ripgrep, which is built on the Rust regex crate rather than RE2 itself, is a widely used example of an RE2-style engine. The google/re2 syntax wiki lists the supported features and the intentional gaps (no lookaround, no backreferences).
  5. Validate input length first. A 10 KB regex bomb is a bug. A 10 byte cap on the input is one line of code.
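
The last fix in the list above really is only a few lines. This is a minimal Python sketch; bounded_fullmatch and MAX_INPUT_CHARS are hypothetical names, the 256-character limit is an assumption to adjust per field, and the pattern is the 95% email shape from section 9.1.

```python
import re

MAX_INPUT_CHARS = 256  # assumed limit; pick one that fits the field being validated

EMAIL_SHAPE = re.compile(r'[^\s@]+@[^\s@]+\.[^\s@]+')

def bounded_fullmatch(pattern, text, max_chars=MAX_INPUT_CHARS):
    """Reject over-long input before the regex engine ever sees it."""
    if len(text) > max_chars:
        return None
    return pattern.fullmatch(text)

print(bool(bounded_fullmatch(EMAIL_SHAPE, 'a@b.co')))                # True
print(bool(bounded_fullmatch(EMAIL_SHAPE, 'a' * 10_000 + '@b.co')))  # False: too long, never matched
```

A length cap does not make a bad pattern safe, but it bounds the worst case while the pattern itself is being fixed.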

For a broader inventory of the daily-driver tools that pair with regex (formatters, decoders, converters), see our developer tools guide.

Before you ship a complex pattern, test it interactively. regex101.com switches between PCRE, JavaScript, Python, and Go flavors, explains each token in plain English, and shows step-by-step backtracking so you can spot catastrophic patterns before production does.

12. FAQ

What’s the difference between regex * and +?

* matches zero or more occurrences and can match an empty string. + matches one or more and needs at least one character. a* matches '', 'a', 'aaaa'. a+ matches 'a' and 'aaaa' but not ''.

How do I match across multiple lines with regex?

Turn on the multiline flag (/.../m in JavaScript, re.MULTILINE in Python) so ^ and $ anchor to each line. To let . also cross newlines, add the dotall flag (s in JavaScript, re.DOTALL in Python).

Is regex the same in JavaScript and Python?

The core syntax (quantifiers, anchors, character classes, basic groups) is largely the same. Two real differences: JavaScript (ES2018+) supports variable-length lookbehind and writes named groups as (?<name>...), while Python stdlib re requires fixed-width lookbehind and uses (?P<name>...). For variable-length lookbehind in Python, install the regex package from PyPI.

Why does my regex have catastrophic backtracking?

You have nested quantifiers with overlapping matches, like (a+)+ or (a|a)*. On input that almost matches but fails near the end, the engine tries every split of the inner quantifier, which can be an exponential number of paths. Fix it with an atomic group (?>a+)+, a possessive quantifier a++, or by switching to a non-backtracking engine like RE2 or Go’s regexp.

Can I use lookbehind in JavaScript?

Yes. Positive (?<=...) and negative (?<!...) lookbehind were standardized in ES2018 and are supported in V8 (Chrome 62+, Node.js), SpiderMonkey (Firefox 78+), and JavaScriptCore (Safari 16.4+). Variable-length lookbehind is supported. For older Safari, feature-detect with a try/catch around new RegExp and fall back to a pattern that captures the prefix instead.

How do I match a literal dot . in regex?

Escape it with a backslash. \. matches a literal dot. Inside a character class, the dot is already literal, so both [.] and [\.] work. Outside a class, an unescaped . is a metacharacter meaning “any character except newline” (or any character at all with the dotall flag).

What does \s mean in regex?

\s matches any whitespace character: space, tab (\t), newline (\n), carriage return (\r), vertical tab, and form feed. In JavaScript (with or without the u flag) and in Python 3 str patterns it also matches NBSP (U+00A0) and the other Unicode whitespace code points. Python’s re.ASCII flag restricts it to the ASCII set. The inverse \S matches any non-whitespace character.
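
A quick Python check of the NBSP behavior, as a small sketch. The re.ASCII comparison is added here to show how to opt out of the Unicode-aware default.

```python
import re

nbsp = '\u00a0'  # no-break space

unicode_match = re.fullmatch(r'\s', nbsp)            # str patterns are Unicode-aware by default
ascii_match = re.fullmatch(r'\s', nbsp, re.ASCII)    # ASCII mode limits \s to [ \t\n\r\f\v]
parts = re.split(r'\s+', 'a\tb\u00a0c\nd')

print(bool(unicode_match))  # True
print(bool(ascii_match))    # False
print(parts)                # ['a', 'b', 'c', 'd']
```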

Are regular expressions case sensitive?

By default, yes. /cat/ will not match Cat. Turn on the case-insensitive flag to ignore case: JavaScript adds i (/cat/i), Python uses re.IGNORECASE or the inline (?i) flag. In Unicode mode, case folding works one character at a time, so ß does not match SS. A few non-ASCII characters do match ASCII letters, though, such as the Kelvin sign K (U+212A) matching k and the long s ſ (U+017F) matching s, which is sometimes the source of surprising matches.
