Regex in depth: capture groups, lookaround, catastrophic backtracking, cross-language differences

Regex is something most engineers use daily but never feel fully fluent in. These are the techniques I actually reach for after years of QA / SRE work, plus the real cases where one careless quantifier locked up my browser.

Capture vs non-capturing vs named groups

All three use parentheses but mean different things:

() Capture group: captures, referenceable as $1, $2
(?:) Non-capturing group: groups without capturing, which skips the capture bookkeeping and keeps the $1, $2 numbering stable, so use these freely in complex patterns
(?<name>) Named group: captures and can be referenced by name (Python's stdlib re spells it (?P<name>...)); wins on readability because you don't have to count positions weeks later

Example: matching an email

Bad: (\w+)@(\w+)\.(\w+) — using $1 $2 $3 later, no idea which is which after a month.
Good: (?<user>\w+)@(?<host>\w+)\.(?<tld>\w+) — using groups.user is self-documenting.
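To make the readability point concrete, here is a minimal Python sketch of the named-group version. It assumes the stdlib re module, which spells named groups (?P<name>...) rather than (?<name>...), and it keeps the deliberately simplified email pattern from above; split_email is a hypothetical helper name.

```python
import re

# Python's stdlib re uses (?P<name>...); (?<name>...) is the JS / .NET spelling.
EMAIL = re.compile(r"(?P<user>\w+)@(?P<host>\w+)\.(?P<tld>\w+)")

def split_email(text):
    """Return the named parts of the first email-like match, or None."""
    m = EMAIL.search(text)
    return m.groupdict() if m else None

parts = split_email("contact: alice@example.com")
print(parts)  # {'user': 'alice', 'host': 'example', 'tld': 'com'}
```

Reading parts["host"] a month later needs no counting of parentheses, which is the whole argument for named groups.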

Lookahead / lookbehind: conditions without consuming

Lookaround tests a condition inside the regex engine without advancing the cursor — perfect when you want "simultaneous conditions":

(?=...) Positive lookahead: what follows must match
(?!...) Negative lookahead: what follows must not match
(?<=...) Positive lookbehind: what precedes must match
(?<!...) Negative lookbehind: what precedes must not match

Classic example — password with at least one digit AND one uppercase:

^(?=.*\d)(?=.*[A-Z]).{8,}$

Two lookaheads in parallel, no characters "consumed", just conditions checked before .{8,} does the actual matching. Much cleaner than chaining multiple regexes.
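A quick way to see the lookaheads acting as independent conditions is to run the pattern against inputs that each break exactly one rule. This is a small Python sketch; is_valid_password and the sample strings are made up for illustration.

```python
import re

# Same pattern as above: a digit somewhere, an uppercase letter somewhere, length >= 8.
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[A-Z]).{8,}$")

def is_valid_password(candidate):
    """True when the candidate satisfies every lookahead and the length rule."""
    return PASSWORD.match(candidate) is not None

for candidate in ["Passw0rdX", "password1", "PASSWORD", "Ab1"]:
    print(candidate, is_valid_password(candidate))
# Passw0rdX True    (digit, uppercase, 9 chars)
# password1 False   (no uppercase)
# PASSWORD False    (no digit)
# Ab1 False         (too short)
```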

Catastrophic backtracking: how regex freezes your browser

Nested quantifiers are the classic foot-gun. Anti-patterns: (a+)+, (a*)*, (a|a)*

With (a+)+b against aaaaaaaaaaaaaaaaa (no trailing b), the engine tries every possible way to split the run of a's into groups before giving up: O(2^n). I've seen 30 characters of test input lock up a browser for 8 seconds.
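The growth is easy to observe without freezing anything if n stays small. The sketch below assumes Python's stdlib re, which is a backtracking engine, and times a failing match for the nested pattern against a flat pattern that accepts the same strings. time_failure is a hypothetical helper, and the exact timings depend on your machine, so none are shown.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")  # nested quantifier: many ways to split a run of a's
FLAT = re.compile(r"a+b")       # accepts the same strings with a single quantifier

def time_failure(pattern, n):
    """Match against n 'a' characters with no trailing 'b'; return (result, seconds)."""
    subject = "a" * n
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Keep n small: each extra character roughly doubles the nested pattern's work.
for n in (12, 14, 16, 18):
    _, slow = time_failure(NESTED, n)
    _, fast = time_failure(FLAT, n)
    print(f"n={n:2d}  nested={slow:.4f}s  flat={fast:.6f}s")
```

On a typical run the nested column climbs steeply with each step in n while the flat column stays flat.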

How to avoid:

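Atomic groups (?>...): not available in JavaScript (Node included, since it runs the same V8 regex engine); Java, .NET, PCRE, and Python 3.11+ have them
Possessive quantifiers ++ *+: also missing from JavaScript; Java, PCRE, and Python 3.11+ support them, .NET does not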
Audit your quantifiers for overlap ((\w+)+ collapses to \w+)
Cap input length (this site's tools cap regex input at 100,000 chars for exactly this reason)

JS-land usually relies on the last two: auditing quantifiers and capping input length.
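Those two fallbacks can be sketched in a few lines of Python. The 100,000-character cap is the figure mentioned above; guarded_search, InputTooLong, and WORD_THEN_BANG are hypothetical names for illustration. A length cap only bounds the damage, so it works best alongside an audited pattern rather than instead of one.

```python
import re

MAX_INPUT_CHARS = 100_000  # the cap mentioned above; tune it for your own tool

class InputTooLong(ValueError):
    """Raised when the subject string exceeds the configured cap."""

def guarded_search(pattern, text, max_chars=MAX_INPUT_CHARS):
    """Refuse oversized input before the regex engine ever sees it."""
    if len(text) > max_chars:
        raise InputTooLong(f"input is {len(text)} chars, cap is {max_chars}")
    return re.search(pattern, text)

# Audited pattern: (\w+)+! rewritten as \w+!, so each run of word characters
# can only be matched one way.
WORD_THEN_BANG = r"\w+!"

print(guarded_search(WORD_THEN_BANG, "hello!").group(0))  # hello!
try:
    guarded_search(WORD_THEN_BANG, "x" * (MAX_INPUT_CHARS + 1))
except InputTooLong as exc:
    print("rejected:", exc)
```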

JavaScript vs Python: differences that bite

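Start / end anchors: JS doesn't have \A / \Z; plain ^ and $ without the m flag already anchor to the whole input, and adding m switches them to line boundaries
Unicode: JS needs the u flag to get \p{Letter}; Python's \w and \d are Unicode-aware by default on str patterns, but the stdlib re has no \p{...} at all (the third-party regex module adds it)
Lookbehind: Safari < 16.4 has no lookbehind at all, so your site breaks for those users. A regex literal with unsupported syntax fails when the script is parsed, so build the pattern with new RegExp(...) inside try/catch and fall back to a lookbehind-free regex.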
re vs regex module (Python): the stdlib re doesn't support variable-length lookbehind; install the third-party regex module if you need it
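Sticky flag y: JS only, useful when writing tokenizers / lexers; the nearest Python equivalent is calling a compiled pattern's match(text, pos), which anchors the match at pos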
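The Python side of these differences can be checked directly. The sketch below assumes the stdlib re module and shows three things: how \A / \Z differ from ^ / $, a compile-with-fallback helper (the Python analogue of the try/catch advice, triggered here by a variable-length lookbehind), and a tiny tokenizer that uses match(text, pos) the way JS code would use the y flag. compile_with_fallback, tokenize, and the sample patterns are hypothetical.

```python
import re

text = "first line\nsecond line\n"

# With re.MULTILINE, ^ matches at the start of every line ...
print(re.findall(r"^\w+", text, flags=re.MULTILINE))   # ['first', 'second']
# ... while \A only ever matches at the very start of the string.
print(re.findall(r"\A\w+", text, flags=re.MULTILINE))  # ['first']
# \Z matches only at the very end; $ also matches just before a trailing newline.
print(bool(re.search(r"line\Z", text)), bool(re.search(r"line$", text)))  # False True

def compile_with_fallback(preferred, fallback):
    """Compile preferred; if this engine rejects the syntax, compile fallback."""
    try:
        return re.compile(preferred)
    except re.error:
        return re.compile(fallback)

# Stdlib re rejects variable-length lookbehind, so the fallback matches the
# prefix normally and still exposes the amount as group 1.
price = compile_with_fallback(r"(?<=USD\s*)(\d+)", r"USD\s*(\d+)")
print(price.search("total: USD  42").group(1))  # 42

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(src):
    """Split an arithmetic expression into tokens, matching strictly at pos."""
    src = src.rstrip()
    pos, tokens = 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)  # anchored at pos, like JS /y with lastIndex
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(tokenize("12 + (3*4)"))  # ['12', '+', '(', '3', '*', '4', ')']
```

Both the preferred and the fallback pattern expose the number as group 1, so calling code does not need to know which one compiled.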

Real QA scenarios where I use regex

The patterns I reach for most weeks:

nginx access log parsing: extract IP / status / response time → feed into percentile analysis (response time only appears if $request_time has been added to your log_format)
API response body checks: Robot Framework's Should Match Regexp is much sharper than Should Contain
Test data validation: confirm the credit card test data you generated matches the expected (\d{4}) (\d{4}) (\d{4}) (\d{4}) format
Selenium dynamic IDs: grab userdata-([a-f0-9]{8}) and use the captured suffix
Error log classification: pull file path + line number out of stack traces to rank flakiest modules
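A few of these scenarios, sketched in Python under stated assumptions: the access-log pattern assumes nginx's combined format with $request_time appended as the last field, and the stack-trace pattern assumes Python-style traceback frames. Adjust both to whatever your systems actually emit. All function names and sample values are hypothetical.

```python
import re
from collections import Counter

# Assumed log_format: nginx "combined" with $request_time appended at the end.
ACCESS_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "[^"]*" (?P<request_time>\d+\.\d+)$'
)
CARD_FORMAT = re.compile(r"\d{4} \d{4} \d{4} \d{4}")
DYNAMIC_ID = re.compile(r"userdata-([a-f0-9]{8})")
# Assumed frame format: File "path", line N (Python tracebacks).
FRAME = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+)')

def parse_access_line(line):
    """Return (ip, status, request_time) or None if the line does not match."""
    m = ACCESS_LINE.match(line)
    if m is None:
        return None
    return m["ip"], int(m["status"]), float(m["request_time"])

def is_card_format(value):
    """True only when the whole value is four space-separated groups of 4 digits."""
    return CARD_FORMAT.fullmatch(value) is not None

def dynamic_suffix(element_id):
    """Return the 8-hex-digit suffix of a userdata-... id, or None."""
    m = DYNAMIC_ID.search(element_id)
    return m.group(1) if m else None

def rank_files(log_text):
    """Count traceback frames per file path, most frequent first."""
    return Counter(m["path"] for m in FRAME.finditer(log_text)).most_common()

sample = ('203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] '
          '"GET /api/items HTTP/1.1" 200 512 "-" "curl/8.0" 0.123')
print(parse_access_line(sample))                # ('203.0.113.7', 200, 0.123)
print(is_card_format("4111 1111 1111 1111"))    # True
print(dynamic_suffix("userdata-3fa9c01b-row"))  # 3fa9c01b
```

The request_time values returned by parse_access_line can be collected into a list and handed to whatever percentile routine you already use.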

Try the patterns: paste each one into the Regex tool and confirm matches live. The lookbehind-on-Safari case is the one that catches everyone.