Python Regular Expressions: Master Guide
Comprehensive guide to pattern matching and text processing in Python
1. Pattern Matching Fundamentals
Regex Syntax Basics
Meta Characters
Special characters that define the matching rules:
Character | Description | Example |
---|---|---|
. | Matches any character except newline | a.c matches "abc", "a1c" |
^ | Matches start of string | ^Hello matches "Hello" at start |
$ | Matches end of string | end$ matches "end" at end |
* | 0 or more repetitions | ab*c matches "ac", "abc", "abbc" |
+ | 1 or more repetitions | ab+c matches "abc", "abbc" (not "ac") |
? | 0 or 1 repetition (optional) | colou?r matches "color" and "colour" |
{m,n} | Between m and n repetitions | a{2,4} matches "aa", "aaa", "aaaa" |
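The quantifier rows above can be checked directly. The following is a minimal sketch that uses re.fullmatch so that the whole sample string has to fit the pattern; the sample strings are made up for illustration.

```python
import re

samples = ["ac", "abc", "abbc"]

# Keep only the samples that each pattern matches in full.
star = [s for s in samples if re.fullmatch(r"ab*c", s)]      # zero or more "b"
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]      # one or more "b"
optional = [s for s in ["color", "colour", "colouur"]
            if re.fullmatch(r"colou?r", s)]                  # at most one "u"
bounded = [s for s in ["a", "aa", "aaaa", "aaaaa"]
           if re.fullmatch(r"a{2,4}", s)]                    # two to four "a"

print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['color', 'colour']
print(bounded)   # ['aa', 'aaaa']
```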
Character Classes
Special sequences and custom character sets:
# Predefined character classes
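\d # Any decimal digit; exactly [0-9] only with re.ASCII (Unicode digits also match by default)
\w # Word character; exactly [a-zA-Z0-9_] only with re.ASCII (Unicode letters and digits also match by default)
\s # Whitespace; [ \t\n\r\f\v] plus other Unicode whitespace by default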
# Custom character classes
[aeiou] # Any vowel
[A-Z] # Any uppercase letter
[^0-9] # Negation - anything NOT a digit
[0-9a-fA-F] # Hexadecimal digits
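For str patterns in Python 3, the predefined classes are Unicode-aware unless re.ASCII is passed. The sketch below illustrates the difference; the sample text is an assumption chosen to contain an accented letter (U+00E9) and an Arabic-Indic digit (U+0663).

```python
import re

# "caf\u00e9" ends in an accented e; "\u0663" is the Arabic-Indic digit three.
text = "caf\u00e9 \u0663 x_1"

unicode_words = re.findall(r"\w+", text)             # default: Unicode-aware
ascii_words = re.findall(r"\w+", text, re.ASCII)     # restricted to [a-zA-Z0-9_]
unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, re.ASCII)     # restricted to [0-9]

print(ascii_words)    # ['caf', 'x_1']
print(ascii_digits)   # ['1']
print(len(unicode_words), len(unicode_digits))  # 3 2
```

When input must be plain ASCII (for example numeric IDs), passing re.ASCII or spelling out [0-9] avoids accepting digits from other scripts.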
Grouping & Capturing
Organize patterns and extract matched portions:
import re
# Basic capturing group
match = re.search(r'(\d{3})-(\d{3})', 'Phone: 123-456')
print(match.group(1)) # '123'
print(match.group(2)) # '456'
# Non-capturing group
re.findall(r'(?:Mr|Ms|Mrs) (\w+)', 'Mr Smith and Ms Doe') # ['Smith', 'Doe']
# Named group
match = re.search(r'(?P<area>\d{3})-(?P<number>\d{3})', '123-456')
print(match.group('area')) # '123'
print(match.group('number')) # '456'
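Beyond group(), a match object exposes all captures at once. A short sketch, reusing the named-group pattern from above on a made-up input string:

```python
import re

match = re.search(r"(?P<area>\d{3})-(?P<number>\d{3})", "Phone: 123-456")

# re.search returns None when nothing matches, so guard before using it.
if match is not None:
    whole = match.group(0)        # the entire matched text
    as_tuple = match.groups()     # all captured groups, in order
    as_dict = match.groupdict()   # named groups only
    area_span = match.span("area")  # (start, end) indexes of one group

    print(whole)      # 123-456
    print(as_tuple)   # ('123', '456')
    print(as_dict)    # {'area': '123', 'number': '456'}
    print(area_span)  # (7, 10)
```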
Python's re Module
Search Functions
Function | Description | Example |
---|---|---|
re.match() | Match only at string start | re.match(r'\d+', '123abc') |
re.search() | Search anywhere in string | re.search(r'\d+', 'abc123def') |
re.findall() | Return all matches as list | re.findall(r'\d+', '1a22b333') # ['1', '22', '333'] |
re.finditer() | Return iterator of match objects | for match in re.finditer(r'\d+', text): |
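To see how these functions differ on the same input, here is a small side-by-side sketch. The sample string is made up, and re.fullmatch is included for comparison because it requires the entire string to match.

```python
import re

text = "abc123def45"

at_start = re.match(r"\d+", text)              # None: text does not start with a digit
first = re.search(r"\d+", text).group()        # first run of digits anywhere
whole = re.fullmatch(r"\d+", "123")            # match object: every character is a digit
everything = re.findall(r"\d+", text)          # list of matched strings
positions = [(m.group(), m.start(), m.end())   # match objects carry positions too
             for m in re.finditer(r"\d+", text)]

print(at_start)     # None
print(first)        # 123
print(everything)   # ['123', '45']
print(positions)    # [('123', 3, 6), ('45', 9, 11)]
```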
Flags
Modify regex behavior:
# Case insensitive matching
re.findall(r'hello', 'Hello WORLD', re.IGNORECASE) # ['Hello']
# Multiline mode (^/$ match start/end of line)
text = "first\nsecond\nthird"
re.findall(r'^\w+', text, re.MULTILINE) # ['first', 'second', 'third']
# DOTALL mode (make . match newlines)
re.search(r'a.*b', 'a\nb', re.DOTALL).group() # 'a\nb'
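Flags can be combined with the bitwise OR operator, or written inline at the start of the pattern. A minimal sketch with made-up log text:

```python
import re

text = "Error: disk full\nerror: retry\nWARNING: low memory"

# IGNORECASE and MULTILINE together: "^" and "$" work per line, case is ignored.
combined = re.findall(r"^error: (.+)$", text, re.IGNORECASE | re.MULTILINE)

# The same two flags expressed inline as (?im).
inline = re.findall(r"(?im)^error: (.+)$", text)

print(combined)            # ['disk full', 'retry']
print(inline == combined)  # True
```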
2. Search and Replace Operations
Basic Search/Replace
re.sub()
Replace all occurrences of a pattern:
# Simple replacement
text = "Today is 2023-01-15"
new_text = re.sub(r'\d{4}-\d{2}-\d{2}', 'DATE', text)
# "Today is DATE"
# With backreferences
text = "Smith, John"
new_text = re.sub(r'(\w+), (\w+)', r'\2 \1', text)
# "John Smith"
# Named backreferences
text = "Area code: 123-456"
new_text = re.sub(r'(?P<area>\d{3})-(?P<number>\d{3})',
                  r'\g<number>-\g<area>', text)
# "Area code: 456-123"
# Limited replacements
text = "a a a a a"
new_text = re.sub(r'a', 'b', text, count=2)
# "b b a a a"
Advanced Replacement Techniques
Replacement with Functions
Use a callable to generate replacement strings:
def to_upper(match):
    return match.group().upper()
text = "hello world"
new_text = re.sub(r'\w+', to_upper, text)
# "HELLO WORLD"
# With lambda
new_text = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), "2 apples")
# "4 apples"
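A replacement function is also handy for table-driven substitutions. The sketch below assumes a hypothetical abbreviations dictionary; the names abbreviations and expand are illustrative only.

```python
import re

# Hypothetical lookup table, used only for illustration.
abbreviations = {"btw": "by the way", "imo": "in my opinion"}

def expand(match):
    word = match.group()
    # Fall back to the original text if the word is not in the table.
    return abbreviations.get(word.lower(), word)

result = re.sub(r"\b(?:btw|imo)\b", expand, "BTW, imo regex is fine",
                flags=re.IGNORECASE)
print(result)  # by the way, in my opinion regex is fine
```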
Conditional Replacements
Use lookarounds for context-sensitive replacements:
# Replace comma only if followed by 3 digits
text = "Amounts: 1,234 and 5,67"
new_text = re.sub(r',(?=\d{3})', '.', text)
# "Amounts: 1.234 and 5,67"
# Add space after dot not followed by space or end
text = "Hello.World.Python. "
new_text = re.sub(r'\.(?! |$)', '. ', text)
# "Hello. World. Python. "
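Lookarounds can also be combined so that the pattern matches a position rather than any text. As a sketch, the following inserts thousands separators; it assumes the input holds whole numbers only, since digits after a decimal point would be grouped as well.

```python
import re

# Zero-width position with a digit before it and one or more complete
# groups of three digits between it and the end of the number.
THOUSANDS = re.compile(r"(?<=\d)(?=(?:\d{3})+\b)")

def add_commas(text):
    # Nothing is consumed, so the comma is inserted rather than substituted.
    return THOUSANDS.sub(",", text)

formatted = add_commas("Total 1234567 units, 999 spare")
print(formatted)  # Total 1,234,567 units, 999 spare
```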
3. Common Regex Patterns
Validation Patterns
Email: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Matches most common email formats
Phone (US): ^\+1\d{10}$
Matches US phone numbers with country code
URL: https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+
Matches HTTP/HTTPS URLs with encoded characters
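One way to apply the anchored validation patterns is through small helper functions. This is a sketch only: the helper names and test values are made up. It uses fullmatch because "$" on its own also matches just before a trailing newline.

```python
import re

EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
US_PHONE = re.compile(r"^\+1\d{10}$")

def is_email(value):
    return EMAIL.fullmatch(value) is not None

def is_us_phone(value):
    return US_PHONE.fullmatch(value) is not None

print(is_email("user@example.com"))    # True
print(is_email("user@example"))        # False
print(is_us_phone("+12025550123"))     # True
print(is_us_phone("202-555-0123"))     # False

# search() accepts a trailing newline because "$" matches before it;
# fullmatch() does not.
print(EMAIL.search("user@example.com\n") is not None)  # True
print(is_email("user@example.com\n"))                  # False
```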
Extraction Patterns
Dates: \d{2}-\d{2}-\d{4}
Matches dates in DD-MM-YYYY format (adjust as needed)
HTML tags: <([a-z]+)([^<]+)*(?:>(.*)<\/\1>| *\/>)
Matches simple, non-nested HTML tags and their content (simplified; use a proper parser for real HTML)
Log parsing: (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (ERROR|WARN)
Extracts timestamp and log level from log entries
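As an illustration of the log-parsing pattern, the sketch below adds named groups and collects one dictionary per matching entry. The group names and the sample log lines are assumptions for the example.

```python
import re

LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>ERROR|WARN)"
)

# Made-up log lines for illustration.
log = """2023-01-15 10:30:00 ERROR Disk full
2023-01-15 10:31:12 INFO Heartbeat
2023-01-15 10:32:45 WARN Low memory"""

# The INFO line is skipped because the pattern only accepts ERROR or WARN.
entries = [m.groupdict() for m in LOG_LINE.finditer(log)]

for entry in entries:
    print(entry["date"], entry["time"], entry["level"])
# 2023-01-15 10:30:00 ERROR
# 2023-01-15 10:32:45 WARN
```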
Text Processing Patterns
Splitting Strings
# Split on multiple delimiters
text = "apple,banana;cherry/orange"
re.split(r'[,;/]', text) # ['apple', 'banana', 'cherry', 'orange']
# Split and keep delimiters
re.split(r'([,;/])', text) # ['apple', ',', 'banana', ';', 'cherry', '/', 'orange']
Whitespace Normalization
text = "Hello world\n\nPython"
re.sub(r'\s+', ' ', text).strip() # "Hello world Python"
Removing Comments
code = """
def func(): # This is a comment
    return 42 # Another comment
"""
# Naive: this also strips any '#' that appears inside a string literal
re.sub(r'#.*$', '', code, flags=re.MULTILINE)
4. Advanced Regex Techniques
Lookaheads/Lookbehinds
Assertions that don't consume characters:
Pattern | Type | Description |
---|---|---|
(?=...) | Positive lookahead | Matches if ... matches next |
(?!...) | Negative lookahead | Matches if ... doesn't match next |
(?<=...) | Positive lookbehind | Matches if ... matches before |
(?<!...) | Negative lookbehind | Matches if ... doesn't match before |
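The four assertion types can be compared on one input. The sample string below is made up, and each pattern is a small sketch rather than a production rule.

```python
import re

text = "price: $100, cost: 200 EUR, id 300"

followed_by_eur = re.findall(r"\d+(?= EUR)", text)         # positive lookahead
not_followed_by_eur = re.findall(r"\b\d+\b(?! EUR)", text)  # negative lookahead
after_dollar = re.findall(r"(?<=\$)\d+", text)             # positive lookbehind
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)       # negative lookbehind

print(followed_by_eur)      # ['200']
print(not_followed_by_eur)  # ['100', '300']
print(after_dollar)         # ['100']
print(not_after_dollar)     # ['200', '300']
```

The \b anchors in the negative cases stop the engine from satisfying the assertion by matching only part of a number.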
Practical Example: Password Validation
# At least 8 chars, contains uppercase, lowercase, digit and special char
pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'
re.fullmatch(pattern, 'Password1!') # Match
re.fullmatch(pattern, 'weak') # No match
Optimization & Performance
Compiling Patterns
# Compile once, use many times
pattern = re.compile(r'\d{3}-\d{3}-\d{4}') # Phone number pattern
pattern.search('Call 123-456-7890')
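A compiled pattern exposes the same methods as the module-level functions, so it can be reused across many inputs. A brief sketch with made-up lines:

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

lines = [
    "Call 123-456-7890",
    "No number here",
    "Fax: 555-000-1111 or 555-000-2222",
]

# The pattern is compiled once and applied to every line.
found = [PHONE.findall(line) for line in lines]
print(found)  # [['123-456-7890'], [], ['555-000-1111', '555-000-2222']]
```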
Avoiding Catastrophic Backtracking
Warning: Nested quantifiers can cause exponential time complexity.
# Problematic pattern (exponential time)
re.match(r'(a+)+$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')
# Better approach: atomic grouping or possessive quantifiers (both require Python 3.11+)
re.match(r'(?>a+)+$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')
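On any Python version, another option is to remove the nesting altogether, since (a+)+ and a+ accept the same strings. The sketch below times both forms on a deliberately short input; the helper name time_match is made up and the timings will vary by machine.

```python
import re
import time

def time_match(pattern, text):
    """Return the match result and the elapsed time in seconds."""
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

# Kept short on purpose: each extra "a" roughly doubles the nested case.
text = "a" * 18 + "!"

nested_result, nested_time = time_match(r"(a+)+$", text)
flat_result, flat_time = time_match(r"a+$", text)

# Both fail because of the trailing "!", but the nested form tries
# many ways of splitting the run of "a" characters before giving up.
print(nested_result, flat_result)  # None None
print(f"nested: {nested_time:.4f}s, flat: {flat_time:.6f}s")
```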
When Not to Use Regex
- Simple string operations (str.startswith(), str.split())
- Parsing nested structures (HTML/XML - use proper parsers)
- When readability suffers (complex patterns can be unmaintainable)
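For the first two cases in the list, the standard library already covers the job. The following sketch uses plain string methods and html.parser; the LinkCollector class and the sample inputs are hypothetical.

```python
from html.parser import HTMLParser

# Simple prefix, suffix, and delimiter checks need no regex.
filename = "report_2023.csv"
print(filename.startswith("report_"))  # True
print(filename.endswith(".csv"))       # True
print("a,b,c".split(","))              # ['a', 'b', 'c']

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags using a real HTML parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkCollector()
parser.feed('<p><a href="/docs">Docs</a> and <a href="/faq">FAQ</a></p>')
print(parser.links)  # ['/docs', '/faq']
```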
5. Debugging & Tools
Online Testers
- Regex101 - Interactive tester with explanation
- Pythex - Python-specific regex tester
- Debuggex - Visual regex debugger
re.DEBUG Flag
See how Python interprets your pattern:
re.compile(r'\d{3}-\d{3}', re.DEBUG)
# Output shows parsing steps:
# MAX_REPEAT 3 3
#   IN
#     CATEGORY CATEGORY_DIGIT
# LITERAL 45
# MAX_REPEAT 3 3
#   IN
#     CATEGORY CATEGORY_DIGIT
Common Pitfalls
Greedy vs Lazy Matching
text = "<div>content</div><div>more</div>"
# Greedy match (default): one match spanning both divs
re.findall(r'<div>.*</div>', text) # ['<div>content</div><div>more</div>']
# Lazy match (add ? after quantifier): each div matched separately
re.findall(r'<div>.*?</div>', text) # ['<div>content</div>', '<div>more</div>']
Other Common Issues
- Forgetting to use raw strings (r'\d' vs '\d')
- Overusing regex when simpler string methods would suffice
- Not considering Unicode characters (\w matches word chars in any language)
- Assuming . matches newlines (need re.DOTALL flag)
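Two of these pitfalls are easy to reproduce. The raw-string problem is most visible with \b, which in an ordinary string literal is the backspace character and not a word boundary. A short sketch with made-up inputs:

```python
import re

text = "a word here"

# Without the r prefix, "\b" becomes a backspace character (\x08).
plain = re.findall("\bword\b", text)
raw = re.findall(r"\bword\b", text)
print(plain)  # []
print(raw)    # ['word']

# "." does not cross a newline unless re.DOTALL is set.
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)
print(without_dotall)           # None
print(with_dotall is not None)  # True
```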