Python Regular Expressions: Master Guide

Comprehensive guide to pattern matching and text processing in Python

1. Pattern Matching Fundamentals
2. Search and Replace Operations
3. Common Regex Patterns
4. Advanced Regex Techniques
5. Debugging & Tools

1. Pattern Matching Fundamentals

Regex Syntax Basics

Meta Characters

Special characters that define the matching rules:

Character	Description	Example
`.`	Matches any character except newline	`a.c` matches "abc", "a1c"
`^`	Matches start of string	`^Hello` matches "Hello" at start
`$`	Matches end of string (or just before a trailing newline)	`end$` matches "end" at end
`*`	0 or more repetitions	`ab*c` matches "ac", "abc", "abbc"
`+`	1 or more repetitions	`ab+c` matches "abc", "abbc" (not "ac")
`?`	0 or 1 repetition (optional)	`colou?r` matches "color" and "colour"
`{m,n}`	Between m and n repetitions	`a{2,4}` matches "aa", "aaa", "aaaa"
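
As a quick check of the rows above, the following sketch runs a few of the table's patterns through `re.fullmatch`. The `cases` list and the variable names are illustrative only.

```python
import re

# (pattern, candidate, expected) triples based on the table above.
cases = [
    (r'a.c', 'abc', True),
    (r'a.c', 'a\nc', False),      # . does not match a newline by default
    (r'ab*c', 'ac', True),        # * allows zero b's
    (r'ab+c', 'ac', False),       # + needs at least one b
    (r'colou?r', 'colour', True),
    (r'a{2,4}', 'aaaaa', False),  # five a's exceed the upper bound
]

# fullmatch succeeds only when the whole candidate fits the pattern.
results = [bool(re.fullmatch(p, s)) for p, s, _ in cases]
for (p, s, _), got in zip(cases, results):
    print(f'{p!r} vs {s!r}: {got}')
```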

Character Classes

Special sequences and custom character sets:

# Predefined character classes
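\d  # Any digit ([0-9] with re.ASCII; str patterns also match other Unicode digits)
\w  # Word character ([a-zA-Z0-9_] with re.ASCII; str patterns also match Unicode letters and digits)
\s  # Whitespace ([ \t\n\r\f\v] with re.ASCII; str patterns also match other Unicode whitespace)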

# Custom character classes
[aeiou]    # Any vowel
[A-Z]      # Any uppercase letter
[^0-9]     # Negation - anything NOT a digit
[0-9a-fA-F]  # Hexadecimal digits
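
A small sketch of these classes in use, assuming a made-up hex colour format (`#` followed by six hex digits). It also shows how `re.ASCII` narrows `\d` for `str` patterns.

```python
import re

# Hypothetical format for illustration: '#' plus exactly six hex digits.
hex_color = re.compile(r'#[0-9a-fA-F]{6}')
is_hex = bool(hex_color.fullmatch('#1a2B3c'))
not_hex = bool(hex_color.fullmatch('#1a2B3g'))   # 'g' is not a hex digit

# For str patterns, \d also matches non-ASCII decimal digits
# unless re.ASCII is set.
arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE
unicode_digit = bool(re.fullmatch(r'\d', arabic_three))
ascii_digit = bool(re.fullmatch(r'\d', arabic_three, re.ASCII))
```

If input must be restricted to 0-9, writing `[0-9]` explicitly or passing `re.ASCII` are both reasonable choices.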

Grouping & Capturing

Organize patterns and extract matched portions:

import re

# Basic capturing group
match = re.search(r'(\d{3})-(\d{3})', 'Phone: 123-456')
print(match.group(1))  # '123'
print(match.group(2))  # '456'

# Non-capturing group
re.findall(r'(?:Mr|Ms|Mrs) (\w+)', 'Mr Smith and Ms Doe')  # ['Smith', 'Doe']

# Named group
match = re.search(r'(?P<area>\d{3})-(?P<number>\d{3})', '123-456')
print(match.group('area'))    # '123'
print(match.group('number'))  # '456'
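
Beyond `group()`, a match object can return all captures at once. The sketch below reuses the named pattern from above; the result variable names are illustrative.

```python
import re

match = re.search(r'(?P<area>\d{3})-(?P<number>\d{3})', 'Phone: 123-456')
if match:
    all_groups = match.groups()     # tuple of every captured group
    by_name = match.groupdict()     # dict keyed by group name
    whole = match.group(0)          # the full matched text
    span = match.span('number')     # (start, end) offsets of one group
```

Checking `if match:` first avoids an `AttributeError` when `re.search` returns `None`.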

Python's re Module

Search Functions

Function	Description	Example
`re.match()`	Match only at string start	`re.match(r'\d+', '123abc')`
`re.search()`	Search anywhere in string	`re.search(r'\d+', 'abc123def')`
`re.findall()`	Return all matches as list	`re.findall(r'\d+', '1a22b333') # ['1', '22', '333']`
`re.finditer()`	Return iterator of match objects	`for match in re.finditer(r'\d+', text):`
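
The differences between these functions are easiest to see on one input. This is a minimal sketch; the sample string is made up, and `re.fullmatch` is included for comparison.

```python
import re

text = 'abc123def45'

at_start = re.match(r'\d+', text)      # None: the string starts with letters
anywhere = re.search(r'\d+', text)     # first run of digits
whole = re.fullmatch(r'\d+', '123')    # succeeds only if the entire string fits

# finditer yields match objects, so positions are available as well.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]
```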

Flags

Modify regex behavior:

# Case insensitive matching
re.findall(r'hello', 'Hello WORLD', re.IGNORECASE)  # ['Hello']

# Multiline mode (^/$ match start/end of line)
text = "first\nsecond\nthird"
re.findall(r'^\w+', text, re.MULTILINE)  # ['first', 'second', 'third']

# DOTALL mode (make . match newlines)
re.search(r'a.*b', 'a\nb', re.DOTALL).group()  # 'a\nb'
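
Flags can be combined with the bitwise OR operator or written inline at the start of the pattern. The log-like text below is made up for illustration.

```python
import re

text = "Error: disk full\nerror: retrying\nOK"

# Combine flags with |.
lines = re.findall(r'^error: (.+)$', text, re.IGNORECASE | re.MULTILINE)

# The same two flags written inline: i = IGNORECASE, m = MULTILINE.
inline = re.findall(r'(?im)^error: (.+)$', text)
```

Inline flags keep the behaviour attached to the pattern itself, which can help when patterns are stored in configuration rather than code.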

2. Search and Replace Operations

Basic Search/Replace

re.sub()

Replace all occurrences of a pattern:

# Simple replacement
text = "Today is 2023-01-15"
new_text = re.sub(r'\d{4}-\d{2}-\d{2}', 'DATE', text)
# "Today is DATE"

# With backreferences
text = "Smith, John"
new_text = re.sub(r'(\w+), (\w+)', r'\2 \1', text)
# "John Smith"

# Named backreferences
text = "Area code: 123-456"
new_text = re.sub(r'(?P<area>\d{3})-(?P<number>\d{3})', 
                 r'\g<number>-\g<area>', text)
# "Area code: 456-123"

# Limited replacements
text = "a a a a a"
new_text = re.sub(r'a', 'b', text, count=2)
# "b b a a a"

Advanced Replacement Techniques

Replacement with Functions

Use a callable to generate replacement strings:

def to_upper(match):
    return match.group().upper()

text = "hello world"
new_text = re.sub(r'\w+', to_upper, text)
# "HELLO WORLD"

# With lambda
new_text = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), "2 apples")
# "4 apples"
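
A callable replacement is also a convenient place for a table lookup. The sketch below assumes a hypothetical `abbreviations` dictionary and leaves unknown words untouched.

```python
import re

# Hypothetical lookup table, used only for illustration.
abbreviations = {'st': 'Street', 'ave': 'Avenue', 'rd': 'Road'}

def expand(match):
    word = match.group()
    # Fall back to the original text when the word is not in the table.
    return abbreviations.get(word.lower(), word)

address = "12 Main St and 5 Park Ave"
expanded = re.sub(r'\b[A-Za-z]+\b', expand, address)
```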

Conditional Replacements

Use lookarounds for context-sensitive replacements:

# Replace comma only if followed by 3 digits
text = "Amounts: 1,234 and 5,67"
new_text = re.sub(r',(?=\d{3})', '.', text)
# "Amounts: 1.234 and 5,67"

# Add a space after a dot that is not followed by a space or the end of the string
text = "Hello.World.Python. "
new_text = re.sub(r'\.(?! |$)', '. ', text)
# "Hello. World. Python. "

3. Common Regex Patterns

Validation Patterns

Email: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Matches most common email formats

Phone (US): ^\+1\d{10}$

Matches US phone numbers written as +1 followed by exactly 10 digits, with no spaces or separators

URL: https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+

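Matches the scheme and host of HTTP/HTTPS URLs, including percent-encoded characters; the match stops at the first character outside the class, such as / or ?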
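
One way to try the validation patterns above is to compile them once and test a few inputs. The constant names and sample values are assumptions made for this sketch.

```python
import re

EMAIL = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
US_PHONE = re.compile(r'^\+1\d{10}$')
URL = re.compile(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+')

email_ok = bool(EMAIL.match('user.name+tag@example.com'))
email_bad = bool(EMAIL.match('user@localhost'))        # no dot plus TLD
phone_ok = bool(US_PHONE.match('+12025550123'))
phone_bad = bool(US_PHONE.match('+1 202 555 0123'))    # separators not allowed

# The URL pattern has no anchors, so it is used with search();
# '/' is not in the character class, so the match ends before the path.
url_hit = URL.search('See https://example.com/docs for details').group()
```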

Extraction Patterns

Dates: \d{2}-\d{2}-\d{4}

Matches dates in DD-MM-YYYY format (adjust as needed)

HTML tags: <([a-z]+)([^<]+)*(?:>(.*)<\/\1>| *\/>)

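Matches simple HTML tags and their content (simplified; the nested quantifier in ([^<]+)* can backtrack heavily on malformed input, so prefer an HTML parser for real documents)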

Log parsing: (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (ERROR|WARN)

Captures the date, time, and log level (ERROR or WARN only) from log entries
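
As a sketch of the log-parsing pattern in use, the version below adds named groups (an addition not in the pattern above) and runs over a few made-up log lines.

```python
import re

# Same structure as the log pattern above, with names added for readability.
LOG_LINE = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>ERROR|WARN)'
)

# Made-up log lines, for illustration only.
log = """2023-01-15 10:30:00 INFO Service started
2023-01-15 10:31:12 WARN Disk usage at 91%
2023-01-15 10:32:45 ERROR Connection lost"""

# The INFO line is skipped because its level is not in the alternation.
entries = [m.groupdict() for m in LOG_LINE.finditer(log)]
```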

Text Processing Patterns

Splitting Strings

# Split on multiple delimiters
text = "apple,banana;cherry/orange"
re.split(r'[,;/]', text)  # ['apple', 'banana', 'cherry', 'orange']

# Split and keep delimiters
re.split(r'([,;/])', text)  # ['apple', ',', 'banana', ';', 'cherry', '/', 'orange']

Whitespace Normalization

text = "Hello     world\n\nPython"
re.sub(r'\s+', ' ', text).strip()  # "Hello world Python"

Removing Comments

code = """
def func(): # This is a comment
    return 42  # Another comment
"""
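# Strips everything from '#' to end of line; spaces before the comment remain,
# and a '#' inside a string literal would be removed too
re.sub(r'#.*$', '', code, flags=re.MULTILINE)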

4. Advanced Regex Techniques

Lookaheads/Lookbehinds

Assertions that don't consume characters:

Pattern	Type	Description
`(?=...)`	Positive lookahead	Matches if ... matches next
`(?!...)`	Negative lookahead	Matches if ... doesn't match next
`(?<=...)`	Positive lookbehind	Matches if ... matches before
`(?<!...)`	Negative lookbehind	Matches if ... doesn't match before
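
The four assertions in the table can be compared on one made-up string. The `\b` anchors in the negative cases are a choice made for this sketch: they stop the engine from matching only part of a number.

```python
import re

text = "price: $100, cost: 200 USD, fee: $30"

# Positive lookahead: digits followed by " USD".
before_usd = re.findall(r'\d+(?= USD)', text)

# Negative lookahead: whole numbers NOT followed by " USD".
not_before_usd = re.findall(r'\b\d+\b(?! USD)', text)

# Positive lookbehind: digits preceded by a dollar sign.
after_dollar = re.findall(r'(?<=\$)\d+', text)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_after_dollar = re.findall(r'(?<!\$)\b\d+', text)
```

Without `\b`, a pattern such as `\d+(?! USD)` would still match "20" out of "200", because the engine backtracks to a shorter run of digits.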

Practical Example: Password Validation

# At least 8 chars; must contain uppercase, lowercase, a digit and a special char from @$!%*?&
pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'

re.fullmatch(pattern, 'Password1!')  # Match
re.fullmatch(pattern, 'weak')        # No match

Optimization & Performance

Compiling Patterns

# Compile once, use many times
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')  # Phone number pattern
pattern.search('Call 123-456-7890')
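
A compiled pattern exposes the same operations as the module-level functions. The sketch below reuses one pattern object across several made-up lines; the `PHONE` name and the mask text are illustrative.

```python
import re

PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')

lines = [
    'Call 123-456-7890',
    'No number here',
    'Fax: 555-000-1111 or 555-000-2222',
]

# One compiled object serves both searching and substitution.
found = [PHONE.findall(line) for line in lines]
masked = [PHONE.sub('XXX-XXX-XXXX', line) for line in lines]
```

The `re` module caches recently used patterns internally, so the benefit is often as much about naming and reuse as raw speed.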

Avoiding Catastrophic Backtracking

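Warning: Nested quantifiers such as (a+)+ can take exponential time when the overall match fails, because the engine tries every way of splitting the input between the inner and outer repetition.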

# Problematic pattern (exponential time)
re.match(r'(a+)+$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')

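# Better approach: atomic grouping (Python 3.11+; earlier versions raise re.error)
re.match(r'(?>a+)+$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')  # Fails fast, returns None
# Possessive quantifier (also Python 3.11+)
re.match(r'a++$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')  # Fails fast, returns None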
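
Where atomic groups are unavailable, or simply as a simpler alternative, the nesting can often be removed altogether. The sketch below assumes the intent is "one or more a's up to the end of the string", which both patterns express.

```python
import re

subject = 'a' * 29 + '!'

# (a+)+$ and a+$ accept exactly the same strings, but the flat version
# has only one way to divide a run of a's, so a failed match is found quickly.
flat = re.compile(r'a+$')

rejected = flat.match(subject)     # the trailing '!' prevents a match
accepted = flat.match('a' * 29)    # all a's up to the end of the string
```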

When Not to Use Regex

Simple string operations (str.startswith(), str.split())
Parsing nested structures (HTML/XML - use proper parsers)
When readability suffers (complex patterns can be unmaintainable)
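
To illustrate the first two points, here is a sketch that uses plain string methods and the standard library's `html.parser` instead of regex. The `TextCollector` class and the sample inputs are hypothetical.

```python
from html.parser import HTMLParser

filename = 'report_2023.csv'
is_report = filename.startswith('report_')    # no pattern needed for a fixed prefix
stem, extension = filename.rsplit('.', 1)     # split once, from the right

class TextCollector(HTMLParser):
    """Collects non-empty text nodes; a hypothetical helper for illustration."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

collector = TextCollector()
collector.feed('<div>outer <div>inner</div> tail</div>')
collector.close()
```

The nested `<div>` here is the kind of input a single pattern handles poorly, while the parser tracks it without extra effort.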

5. Debugging & Tools

Online Testers

Regex101 - Interactive tester with explanation
Pythex - Python-specific regex tester
Debuggex - Visual regex debugger

re.DEBUG Flag

See how Python interprets your pattern:

re.compile(r'\d{3}-\d{3}', re.DEBUG)

# Output begins with the parsed pattern (recent Python versions also print the compiled opcodes after it):
# MAX_REPEAT 3 3
#   IN
#     CATEGORY CATEGORY_DIGIT
# LITERAL 45
# MAX_REPEAT 3 3
#   IN
#     CATEGORY CATEGORY_DIGIT

Common Pitfalls

Greedy vs Lazy Matching

text = "<div>content</div><div>more</div>"

# Greedy match (default)
re.findall(r'<div>.*</div>', text)  # Matches entire string

# Lazy match (add ? after quantifier)
re.findall(r'<div>.*?</div>', text)  # Matches each div separately
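
A negated character class is a common alternative to a lazy quantifier when the content cannot contain the delimiter. This sketch assumes the div bodies hold no nested tags.

```python
import re

text = "<div>content</div><div>more</div>"

lazy = re.findall(r'<div>.*?</div>', text)

# [^<]* cannot run past the next '<', so each match stays inside one div
# without relying on lazy expansion.
negated = re.findall(r'<div>[^<]*</div>', text)
```

Both give the same result here; the negated class makes the "no tags inside" assumption explicit in the pattern.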

Other Common Issues

Forgetting to use raw strings (r'\d' vs '\d')
Overusing regex when simpler string methods would suffice
Not considering Unicode characters (\w matches word chars in any language)
Assuming . matches newlines (need re.DOTALL flag)
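
Three of the issues above can be reproduced in a few lines. The sample strings are made up, and the accented letters are written as escapes so the example does not depend on file encoding.

```python
import re

# Without the r prefix, '\b' is a backspace character, not a word boundary.
plain = re.search('\bcat\b', 'a cat sat')
raw = re.search(r'\bcat\b', 'a cat sat')

# \w is Unicode-aware for str patterns; re.ASCII restricts it to [a-zA-Z0-9_].
words = 'na\u00efve caf\u00e9'
unicode_words = re.findall(r'\w+', words)
ascii_words = re.findall(r'\w+', words, re.ASCII)

# . stops at newlines unless re.DOTALL is set.
no_dotall = re.search(r'start.*end', 'start\nend')
with_dotall = re.search(r'start.*end', 'start\nend', re.DOTALL)
```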
