
Regular Expressions | Basics of Python Developer

Python Regular Expressions: Master Guide

Comprehensive guide to pattern matching and text processing in Python

1. Pattern Matching Fundamentals

Regex Syntax Basics

Meta Characters

Special characters that define the matching rules:

Character  Description                            Example
.          Matches any character except newline   a.c matches "abc", "a1c"
^          Matches start of string                ^Hello matches "Hello" at start
$          Matches end of string                  end$ matches "end" at end
*          0 or more repetitions                  ab*c matches "ac", "abc", "abbc"
+          1 or more repetitions                  ab+c matches "abc", "abbc" (not "ac")
?          0 or 1 repetition (optional)           colou?r matches "color" and "colour"
{m,n}      Between m and n repetitions            a{2,4} matches "aa", "aaa", "aaaa"
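
The quantifier rows above can be checked directly. This is a minimal sketch: the candidate strings are illustrative, and re.fullmatch is used so that each candidate must match in its entirety.

```python
import re

# Each pattern is paired with candidate strings; fullmatch requires the
# whole candidate to match, which makes the quantifier limits visible.
cases = {
    r'ab*c': ['ac', 'abc', 'abbc'],
    r'ab+c': ['ac', 'abc', 'abbc'],
    r'colou?r': ['color', 'colour', 'colouur'],
    r'a{2,4}': ['a', 'aa', 'aaaa', 'aaaaa'],
}

# Keep only the candidates that each pattern accepts.
results = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern, candidates in cases.items()
}

for pattern, matched in results.items():
    print(pattern, '->', matched)
```

Here ab+c rejects "ac" because + needs at least one "b", and a{2,4} rejects both "a" and "aaaaa".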

Character Classes

Special sequences and custom character sets:

# Predefined character classes
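\d  # Any digit: [0-9] with re.ASCII; also other Unicode digits by default
\w  # Word character: [a-zA-Z0-9_] with re.ASCII; also Unicode letters and digits by default
\s  # Whitespace: [ \t\n\r\f\v] with re.ASCII; also other Unicode whitespace by default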

# Custom character classes
[aeiou]    # Any vowel
[A-Z]      # Any uppercase letter
[^0-9]     # Negation - anything NOT a digit
[0-9a-fA-F]  # Hexadecimal digits
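
To see these classes at work, the sketch below applies several of them to one sample string. The sample text and variable names are made up for illustration.

```python
import re

text = "Order 66: color #FF8800, qty 3"  # illustrative sample

digits = re.findall(r'\d+', text)                  # runs of digits
hex_codes = re.findall(r'#([0-9a-fA-F]+)', text)   # hex digits after '#'
uppercase = re.findall(r'[A-Z]', text)             # single uppercase letters
vowels = re.findall(r'[aeiou]', text)              # lowercase vowels only
digits_only = re.sub(r'[^0-9]', '', text)          # drop everything that is not a digit

print(digits)       # ['66', '8800', '3']
print(hex_codes)    # ['FF8800']
print(uppercase)    # ['O', 'F', 'F']
print(vowels)       # ['e', 'o', 'o']
print(digits_only)  # '6688003'
```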

Grouping & Capturing

Organize patterns and extract matched portions:

import re

# Basic capturing group
match = re.search(r'(\d{3})-(\d{3})', 'Phone: 123-456')
print(match.group(1))  # '123'
print(match.group(2))  # '456'

# Non-capturing group
re.findall(r'(?:Mr|Ms|Mrs) (\w+)', 'Mr Smith and Ms Doe')  # ['Smith', 'Doe']

# Named group
match = re.search(r'(?P<area>\d{3})-(?P<number>\d{3})', '123-456')
print(match.group('area'))    # '123'
print(match.group('number'))  # '456'

Python's re Module

Search Functions

Function       Description                        Example
re.match()     Match only at string start         re.match(r'\d+', '123abc')
re.search()    Search anywhere in string          re.search(r'\d+', 'abc123def')
re.findall()   Return all matches as list         re.findall(r'\d+', '1a22b333')  # ['1', '22', '333']
re.finditer()  Return iterator of match objects   for match in re.finditer(r'\d+', text):
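
The difference between these four functions is easiest to see when they run on the same input. A small sketch, using an illustrative string:

```python
import re

text = "abc123def45"  # illustrative input

at_start = re.match(r'\d+', text)     # None: the string starts with letters
anywhere = re.search(r'\d+', text)    # match object for the first digit run
all_runs = re.findall(r'\d+', text)   # every digit run, as plain strings

# finditer yields match objects, so positions are available as well.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\d+', text)]

print(at_start)          # None
print(anywhere.group())  # '123'
print(all_runs)          # ['123', '45']
print(spans)             # [('123', 3, 6), ('45', 9, 11)]
```

Because re.match() and re.search() return None when nothing matches, check the result before calling .group() on it.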

Flags

Modify regex behavior:

# Case insensitive matching
re.findall(r'hello', 'Hello WORLD', re.IGNORECASE)  # ['Hello']

# Multiline mode (^/$ match start/end of line)
text = "first\nsecond\nthird"
re.findall(r'^\w+', text, re.MULTILINE)  # ['first', 'second', 'third']

# DOTALL mode (make . match newlines)
re.search(r'a.*b', 'a\nb', re.DOTALL).group()  # 'a\nb'

2. Search and Replace Operations

Basic Search/Replace

re.sub()

Replace all occurrences of a pattern:

# Simple replacement
text = "Today is 2023-01-15"
new_text = re.sub(r'\d{4}-\d{2}-\d{2}', 'DATE', text)
# "Today is DATE"

# With backreferences
text = "Smith, John"
new_text = re.sub(r'(\w+), (\w+)', r'\2 \1', text)
# "John Smith"

# Named backreferences
text = "Area code: 123-456"
new_text = re.sub(r'(?P<area>\d{3})-(?P<number>\d{3})', 
                 r'\g<number>-\g<area>', text)
# "Area code: 456-123"

# Limited replacements
text = "a a a a a"
new_text = re.sub(r'a', 'b', text, count=2)
# "b b a a a"

Advanced Replacement Techniques

Replacement with Functions

Use a callable to generate replacement strings:

def to_upper(match):
    return match.group().upper()

text = "hello world"
new_text = re.sub(r'\w+', to_upper, text)
# "HELLO WORLD"

# With lambda
new_text = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), "2 apples")
# "4 apples"

Conditional Replacements

Use lookarounds for context-sensitive replacements:

# Replace comma only if followed by 3 digits
text = "Amounts: 1,234 and 5,67"
new_text = re.sub(r',(?=\d{3})', '.', text)
# "Amounts: 1.234 and 5,67"

# Add space after dot not followed by space or end
text = "Hello.World.Python. "
new_text = re.sub(r'\.(?! |$)', '. ', text)
# "Hello. World. Python. "

3. Common Regex Patterns

Validation Patterns

Email: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Matches most common email formats
Phone (US): ^\+1\d{10}$
Matches US phone numbers with country code
URL: https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+
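Matches the scheme and host of HTTP/HTTPS URLs, including percent-encoded characters (the match stops at the first "/", which the character class does not include)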
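
One way to apply the email and phone patterns above is sketched here. The helper name is_valid and the sample values are assumptions made for illustration. The sketch uses fullmatch because, on its own, $ also matches just before a trailing newline.

```python
import re

EMAIL = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
US_PHONE = re.compile(r'^\+1\d{10}$')

def is_valid(pattern, value):
    """Hypothetical helper: True only if the whole value matches."""
    return pattern.fullmatch(value) is not None

print(is_valid(EMAIL, 'user@example.com'))   # True
print(is_valid(EMAIL, 'user@example'))       # False: no top-level domain
print(is_valid(US_PHONE, '+12025550123'))    # True
print(is_valid(US_PHONE, '2025550123'))      # False: missing +1 prefix

# With search(), '$' accepts a trailing newline; fullmatch() does not.
sneaky = '+12025550123\n'
print(US_PHONE.search(sneaky) is not None)   # True
print(is_valid(US_PHONE, sneaky))            # False
```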

Extraction Patterns

Dates: \d{2}-\d{2}-\d{4}
Matches dates in DD-MM-YYYY format (adjust as needed)
HTML tags: <([a-z]+)([^<]+)*(?:>(.*)<\/\1>| *\/>)
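Matches simple lowercase HTML tags and their content (simplified; the nested quantifier ([^<]+)* can backtrack heavily on malformed input, so use an HTML parser for real documents)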
Log parsing: (\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (ERROR|WARN)
Extracts timestamp and log level from log entries
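
As a sketch of the log-parsing pattern in use, the snippet below runs it over a few invented log lines. The log content is hypothetical and not taken from any real system.

```python
import re

LOG_LINE = re.compile(r'(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (ERROR|WARN)')

# Hypothetical log excerpt; the INFO line should be skipped.
log = """2023-01-15 10:30:00 ERROR Disk full
2023-01-15 10:31:12 INFO Heartbeat
2023-01-15 10:32:45 WARN Retrying upload"""

# Each match yields a (date, time, level) tuple.
entries = [m.groups() for m in LOG_LINE.finditer(log)]

for date, time, level in entries:
    print(date, time, level)
```

Only the ERROR and WARN lines produce entries, because the alternation (ERROR|WARN) does not include INFO.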

Text Processing Patterns

Splitting Strings

# Split on multiple delimiters
text = "apple,banana;cherry/orange"
re.split(r'[,;/]', text)  # ['apple', 'banana', 'cherry', 'orange']

# Split and keep delimiters
re.split(r'([,;/])', text)  # ['apple', ',', 'banana', ';', 'cherry', '/', 'orange']

Whitespace Normalization

text = "Hello     world\n\nPython"
re.sub(r'\s+', ' ', text).strip()  # "Hello world Python"

Removing Comments

code = """
def func(): # This is a comment
    return 42  # Another comment
"""
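# Removes everything from '#' to the end of each line. This also strips a '#'
# that appears inside a string literal, so treat it as a rough cleanup only.
re.sub(r'#.*$', '', code, flags=re.MULTILINE)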

4. Advanced Regex Techniques

Lookaheads/Lookbehinds

Assertions that don't consume characters:

Pattern    Type                  Description
(?=...)    Positive lookahead    Matches if ... matches next
(?!...)    Negative lookahead    Matches if ... doesn't match next
(?<=...)   Positive lookbehind   Matches if ... matches before
(?<!...)   Negative lookbehind   Matches if ... doesn't match before
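
The four assertion types can be compared on a single string. This is a minimal sketch with an invented price string. Note that the negative forms need care: without the extra \d in the lookbehind or the \b anchors, they would match fragments of a number.

```python
import re

text = "Price: $100, cost: 200 EUR, refund: $35"  # illustrative sample

# Positive lookbehind: digits that directly follow a dollar sign.
dollars = re.findall(r'(?<=\$)\d+', text)

# Positive lookahead: digits that are directly followed by " EUR".
euros = re.findall(r'\d+(?= EUR)', text)

# Negative lookbehind: digit runs not preceded by '$' (or by another digit,
# which stops the match from starting in the middle of "$100").
not_dollars = re.findall(r'(?<![$\d])\d+', text)

# Negative lookahead: whole numbers not followed by " EUR".
not_euros = re.findall(r'\b\d+\b(?! EUR)', text)

print(dollars)      # ['100', '35']
print(euros)        # ['200']
print(not_dollars)  # ['200']
print(not_euros)    # ['100', '35']
```

In every case the assertion only checks context; the "$" and " EUR" text is never part of the returned match.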

Practical Example: Password Validation

# At least 8 chars, contains uppercase, lowercase, digit and special char
pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$'

re.fullmatch(pattern, 'Password1!')  # Match
re.fullmatch(pattern, 'weak')        # No match

Optimization & Performance

Compiling Patterns

# Compile once, use many times
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')  # Phone number pattern
pattern.search('Call 123-456-7890')

Avoiding Catastrophic Backtracking

Warning: Nested quantifiers can cause exponential time complexity.

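# Problematic pattern: nested quantifiers backtrack exponentially when the match fails
re.match(r'(a+)+$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')

# Better approach: atomic grouping or possessive quantifiers
# (Python 3.11+ only; earlier versions raise re.error for this syntax)
re.match(r'(?>a+)+$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')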
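
When older Python versions must be supported, removing the nesting is often enough. The sketch below assumes the inner group is not needed for capturing, so the pattern can be flattened. The possessive variant is guarded by a version check because that syntax only exists from Python 3.11.

```python
import re
import sys

subject = 'a' * 29 + '!'

# Assumption: the nested group is not needed for capturing, so a single
# quantifier expresses the same set of strings.
flat = re.compile(r'a+$')

flat_fail = flat.match(subject)    # None, after a linear number of steps
flat_ok = flat.match('a' * 29)     # matches all 29 characters

# Possessive quantifiers never give characters back once matched.
# They are only available in Python 3.11 and later.
if sys.version_info >= (3, 11):
    possessive_fail = re.compile(r'a++$').match(subject)
else:
    possessive_fail = None

print(flat_fail, flat_ok.group() == 'a' * 29, possessive_fail)
```

The flattened pattern fails quickly on the same input because there is only one way to split the run of "a" characters.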

When Not to Use Regex

  • Simple string operations (str.startswith(), str.split())
  • Parsing nested structures (HTML/XML - use proper parsers)
  • When readability suffers (complex patterns can be unmaintainable)
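
The first two points can be sketched with the standard library alone. The file name, the HTML snippet, and the class name TagCollector are invented for illustration. html.parser is one possible parser here, not the only option.

```python
from html.parser import HTMLParser

# Simple string operations: no regex needed.
filename = "report_2023.csv"              # hypothetical file name
is_report = filename.startswith("report_")
fields = "a,b,c".split(",")

# Nested markup: let a real parser track the structure.
class TagCollector(HTMLParser):
    """Hypothetical helper that records every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed("<div><p>Hello <b>world</b></p></div>")

print(is_report)       # True
print(fields)          # ['a', 'b', 'c']
print(collector.tags)  # ['div', 'p', 'b']
```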

5. Debugging & Tools

Online Testers

  • Regex101 - Interactive tester with explanation
  • Pythex - Python-specific regex tester
  • Debuggex - Visual regex debugger

re.DEBUG Flag

See how Python interprets your pattern:

re.compile(r'\d{3}-\d{3}', re.DEBUG)

# Output begins with the parsed pattern (45 is the code point of '-'):
# MAX_REPEAT 3 3
#   IN
#     CATEGORY CATEGORY_DIGIT
# LITERAL 45
# MAX_REPEAT 3 3
#   IN
#     CATEGORY CATEGORY_DIGIT

Common Pitfalls

Greedy vs Lazy Matching

text = "<div>content</div><div>more</div>"

# Greedy match (default): .* grabs as much as possible
re.findall(r'<div>.*</div>', text)  # ['<div>content</div><div>more</div>']

# Lazy match (add ? after quantifier): .*? stops at the first closing tag
re.findall(r'<div>.*?</div>', text)  # ['<div>content</div>', '<div>more</div>']

Other Common Issues

  • Forgetting to use raw strings (r'\d' vs '\d')
  • Overusing regex when simpler string methods would suffice
  • Not considering Unicode characters (by default \w matches word characters in any script; pass re.ASCII to restrict it)
  • Assuming . matches newlines (need re.DOTALL flag)
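
Three of these pitfalls can be reproduced in a few lines. The sketch uses \b rather than \d for the raw-string case, because '\b' in a normal string silently becomes a backspace character. The sample strings are illustrative.

```python
import re

# Raw strings: r'\b' is the word-boundary token, '\b' is a backspace.
raw_hits = re.findall(r'\bcat\b', 'cat concat cat')
plain_hits = re.findall('\bcat\b', 'cat concat cat')

# Unicode: \w matches accented letters unless re.ASCII is set.
sample = 'na\u00efve caf\u00e9'   # "naive cafe" with accented letters
unicode_words = re.findall(r'\w+', sample)
ascii_words = re.findall(r'\w+', sample, re.ASCII)

# Newlines: '.' skips '\n' unless re.DOTALL is set.
no_dotall = re.search(r'a.b', 'a\nb')
with_dotall = re.search(r'a.b', 'a\nb', re.DOTALL)

print(raw_hits)         # ['cat', 'cat']
print(plain_hits)       # []
print(unicode_words)    # two words, accents kept
print(ascii_words)      # ['na', 've', 'caf']
print(no_dotall)        # None
print(with_dotall is not None)  # True
```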
