Python Regular Expression (RegEx)
1. Introduction to Regex
What is Regex?
Regular Expressions (Regex) are a powerful tool used in programming for searching, manipulating, and editing text based on specific patterns. These patterns are constructed using a sequence of characters and special symbols that define the search criterion. Regex is widely supported across various programming languages, text editors, and other tools, making it an indispensable skill for anyone working with text or data.
Uses of Regex
- Data Cleaning and Preparation: Regex is extensively used in data preprocessing to clean and transform text data. This includes tasks like removing unnecessary characters, formatting strings, extracting useful information, and more.
- Text Processing: It facilitates complex text processing tasks such as searching for specific patterns, replacing substrings, splitting strings into arrays based on a pattern, and more.
- Information Extraction: Regex is instrumental in extracting specific pieces of information from large texts, such as email addresses, phone numbers, URLs, and other custom patterns that signify valuable data.
- Validation: Regular expressions are used to validate text to ensure it matches a predefined pattern, such as email syntax, phone number formats, and password strength requirements.
- Syntax Highlighting: Many text editors and IDEs use regex to identify keywords, variables, and other elements of programming languages to apply syntax highlighting, improving the readability of the code.
The Power of Regex
While regex is incredibly powerful and versatile, it's also known for its steep learning curve due to its cryptic syntax. However, once mastered, it allows for very efficient text processing and can reduce complex tasks into a single line of code. Understanding regex opens up a wide array of possibilities for automating text-related tasks, making it a valuable skill in your programming toolkit.
2. Basics of Regular Expressions
Basic Concepts
In the world of regular expressions, the patterns you define to search within text can include various types of characters: literals, character classes, and special characters. Understanding these basics is key to harnessing the full power of regex.
- Literals: These are the simplest form of pattern matching in regex. When you use literals, you're searching for the exact character(s) in the text. For example, the regex cat will match the substring "cat" in the string "The cat sat on the mat".
- Character Classes: Character classes allow you to match any one out of several characters at a specific position. For instance, the regex [cat] will match any single 'c', 'a', or 't' character in a string, not the word "cat".
- Special Characters: These characters have special meanings in regex syntax. They can represent types of characters (like any digit, any whitespace, etc.), specify the quantity (like zero or more occurrences of a character), or assert positions within the text (like the start or end of a word). Some common special characters include:
  - . (Dot): Matches any character except newline.
  - \d: Matches any digit (equivalent to [0-9]).
  - \w: Matches any word character (equivalent to [a-zA-Z0-9_]).
  - *: Matches 0 or more occurrences of the preceding element.
  - +: Matches 1 or more occurrences of the preceding element.
  - ?: Makes the preceding element optional (0 or 1 occurrence).
  - ^: Matches the start of a string.
  - $: Matches the end of a string.
Introduction to Python's re Module
Python provides support for regex through the re module, which is part of the standard library. This module offers a wide range of functions to perform queries on strings. Here are some of the most commonly used functions:
- re.match(): Checks for a match only at the beginning of the string.
- re.search(): Searches the entire string for a match.
- re.findall(): Finds all substrings where the regex pattern matches and returns them as a list.
- re.sub(): Replaces the matches with a string of choice.
Before you can use these functions, you need to import the re module:
import re
Let's try a simple example to see re in action. We'll search for the first run of digits in a string:
import re
text = "The year 2024"
match = re.search(r'\d+', text)  # r'...' denotes a raw string, which treats backslashes as literal characters
if match:
    print("Found:", match.group())  # Output: Found: 2024
else:
    print("No match found")
This brief overview introduces you to the building blocks of regex in Python. As you get comfortable with these concepts, you'll be able to tackle more complex text processing tasks with ease.
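To make the difference between the four functions concrete, here is a small sketch that applies the same pattern with each of them. The sample text is invented for illustration.

```python
import re

text = "Order 66 shipped on day 12"

# re.match() only looks at the start of the string, so the digits are not found here
at_start = re.match(r"\d+", text)

# re.search() scans forward and returns the first match anywhere in the string
first = re.search(r"\d+", text)

# re.findall() returns every non-overlapping match as a list of strings
all_numbers = re.findall(r"\d+", text)

# re.sub() replaces each match with the given replacement string
masked = re.sub(r"\d+", "#", text)

print(at_start)        # None
print(first.group())   # 66
print(all_numbers)     # ['66', '12']
print(masked)          # Order # shipped on day #
```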
Pattern matching in Python can be done using the re module, which provides functions like re.match() and re.search() for searching patterns in strings using regular expressions.
Matching Literal Strings in a Text:
Let's say we want to find occurrences of a specific word or phrase within a text.
import re
text = "Python is a powerful programming language. Python is also easy to learn."
# Using re.search() to find occurrences of the word "Python"
match = re.search("Python", text)
if match:
    print("Found:", match.group())  # Output: Found: Python
else:
    print("Not found")
In this example, re.search() is used to find the first occurrence of the word "Python" in the text.
Using re.match() and re.search():
re.match() checks for a match only at the beginning of the string, while re.search() scans the entire string for a match.
import re
text = "Python is a powerful programming language. Python is also easy to learn."
# Using re.match() to find "Python" at the beginning of the text
match = re.match("Python", text)
if match:
    print("Found at the beginning:", match.group())  # Output: Found at the beginning: Python
else:
    print("Not found")
# Using re.search() to find "Python" anywhere in the text
match = re.search("Python", text)
if match:
    print("Found anywhere:", match.group())  # Output: Found anywhere: Python
else:
    print("Not found")
In this example, re.match() is used to find "Python" only at the beginning of the text, while re.search() finds "Python" anywhere in the text.
Both re.match() and re.search() return a match object if the pattern is found, or None if not found. The group() method of the match object returns the matched string.
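Besides group(), a match object also reports where the match was found. The following sketch shows a few of its methods and one way to guard against a None result. The variable names are illustrative only.

```python
import re

text = "Python is a powerful programming language."

match = re.search(r"powerful", text)
if match:
    print(match.group())   # the matched text
    print(match.start())   # index where the match begins
    print(match.end())     # index just past the end of the match
    print(match.span())    # (start, end) as a tuple

# When nothing matches, the result is None, so check before calling group()
missing = re.search(r"Java", text)
found_text = missing.group() if missing else None
print(found_text)          # None
```

Calling group() directly on a None result raises an AttributeError, which is why the examples in this section test the result with if match: first.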
Metacharacters in regular expressions are characters that have a special meaning and are used to specify patterns to match text. Here's an explanation of commonly used metacharacters:
- . (Dot): Matches any single character except a newline.
- ^ (Caret): Matches the start of the string.
- $ (Dollar): Matches the end of the string.
- * (Asterisk): Matches zero or more occurrences of the preceding character.
- + (Plus): Matches one or more occurrences of the preceding character.
- ? (Question Mark): Matches zero or one occurrence of the preceding character.
- {} (Braces): Specifies the minimum and maximum number of occurrences of the preceding character or group.
- [] (Square Brackets): Specifies a character class, matches any one of the enclosed characters.
- \ (Backslash): Escapes special characters, or indicates a special sequence.
- | (Pipe): Acts as an OR operator, matches either the expression before or after the pipe.
- () (Parentheses): Creates a capturing group, used to capture and remember a matched expression.
Examples:
Let's see some examples demonstrating the use of these metacharacters:
import re
text = "The quick brown fox jumps over the lazy dog."
# . (Dot): Matches any single character except newline
pattern = r".ck"
matches = re.findall(pattern, text)
print(matches) # Output: ['ick']
# ^ (Caret): Matches the start of the string
pattern = r"^The"
match = re.search(pattern, text)
print(match.group()) # Output: The
# $ (Dollar): Matches the end of the string
pattern = r"dog.$"
match = re.search(pattern, text)
print(match.group()) # Output: dog.
# * (Asterisk): Matches zero or more occurrences of the preceding character
pattern = r"o*"
matches = re.findall(pattern, text)
print(matches) # Output (Python 3.7+): a list of 45 items, 'o' for each of the four o's and '' (an empty match) at every other position, including the end of the string
# + (Plus): Matches one or more occurrences of the preceding character
pattern = r"o+"
matches = re.findall(pattern, text)
print(matches) # Output: ['o', 'o', 'o', 'o']
# ? (Question Mark): Matches zero or one occurrence of the preceding character
pattern = r"o?"
matches = re.findall(pattern, text)
print(matches) # Output (Python 3.7+): the same 45 items as for o*, because the text never contains two o's in a row
# {} (Braces): Specifies the minimum and maximum number of occurrences
pattern = r"o{1,2}"
matches = re.findall(pattern, text)
print(matches) # Output: ['o', 'o', 'o', 'o']
# [] (Square Brackets): Specifies a character class
pattern = r"[aeiou]"
matches = re.findall(pattern, text)
print(matches) # Output: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']
# \ (Backslash): Escapes special characters or indicates a special sequence
pattern = r"\."
matches = re.findall(pattern, text)
print(matches) # Output: ['.']
# | (Pipe): Acts as an OR operator
pattern = r"fox|dog"
matches = re.findall(pattern, text)
print(matches) # Output: ['fox', 'dog']
# () (Parentheses): Creates a capturing group
pattern = r"brown (fox|dog)"
match = re.search(pattern, text)
print(match.group(1)) # Output: fox
These examples illustrate the use of various metacharacters in regular expressions to match specific patterns in text.
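Metacharacters become more useful in combination. As a sketch, the anchors ^ and $ together with braces can check that a whole string has a given shape. The date format and the function name looks_like_date are assumptions chosen for illustration, not part of the re module.

```python
import re

# Hypothetical format for illustration: four digits, dash, two digits, dash, two digits
DATE_PATTERN = r"^\d{4}-\d{2}-\d{2}$"

def looks_like_date(value):
    """Return True if value has the shape YYYY-MM-DD (shape only, not a calendar check)."""
    return re.search(DATE_PATTERN, value) is not None

print(looks_like_date("2024-03-15"))     # True
print(looks_like_date("15/03/2024"))     # False
print(looks_like_date("on 2024-03-15"))  # False, because ^ anchors the pattern at the start
```

Without the anchors, the third call would succeed, since re.search() would find the date inside the longer string.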
Special sequences in regular expressions are predefined character classes that represent common types of characters. Here are some commonly used special sequences:
- \d: Matches any digit (equivalent to [0-9] for ASCII text).
- \D: Matches any non-digit character (equivalent to [^0-9] for ASCII text).
- \s: Matches any whitespace character (space, tab, newline, etc.).
- \S: Matches any non-whitespace character.
- \w: Matches any word character, that is, a letter, digit, or underscore (equivalent to [a-zA-Z0-9_] for ASCII text).
- \W: Matches any non-word character (equivalent to [^a-zA-Z0-9_] for ASCII text).
Practical Examples:
Let's see some practical examples using these special sequences:
import re
text = "The price of the book is $20.99, and it weighs 2.5 kg."
# \d: Matches any digit
pattern = r"\d+"
matches = re.findall(pattern, text)
print("Digits:", matches) # Output: Digits: ['20', '99', '2', '5']
# \D: Matches any non-digit character
pattern = r"\D+"
matches = re.findall(pattern, text)
print("Non-digits:", matches) # Output: Non-digits: ['The price of the book is $', '.', ', and it weighs ', '.', ' kg.']
# \s: Matches any whitespace character
pattern = r"\s+"
matches = re.findall(pattern, text)
print("Whitespace characters:", matches) # Output: Whitespace characters: [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
# \S: Matches any non-whitespace character
pattern = r"\S+"
matches = re.findall(pattern, text)
print("Non-whitespace characters:", matches) # Output: Non-whitespace characters: ['The', 'price', 'of', 'the', 'book', 'is', '$20.99,', 'and', 'it', 'weighs', '2.5', 'kg.']
# \w: Matches any word character
pattern = r"\w+"
matches = re.findall(pattern, text)
print("Word characters:", matches) # Output: Word characters: ['The', 'price', 'of', 'the', 'book', 'is', '20', '99', 'and', 'it', 'weighs', '2', '5', 'kg']
# \W: Matches any non-word character
pattern = r"\W+"
matches = re.findall(pattern, text)
print("Non-word characters:", matches) # Output: Non-word characters: [' ', ' ', ' ', ' ', ' ', ' $', '.', ', ', ' ', ' ', ' ', '.', ' ', '.']
These examples demonstrate how to use special sequences in regular expressions to match digits, non-digits, whitespace characters, non-whitespace characters, word characters, and non-word characters in a given text.
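Special sequences are usually combined with literals and quantifiers to pull out a whole value rather than isolated digits. The sketch below reuses the sentence from the examples above. The patterns are one possible choice for this particular text, not general-purpose price or weight parsers.

```python
import re

text = "The price of the book is $20.99, and it weighs 2.5 kg."

# A literal dollar sign (escaped), digits, a literal dot, then more digits
prices = re.findall(r"\$\d+\.\d+", text)

# A decimal number followed by optional whitespace and the unit "kg"
weights = re.findall(r"\d+\.\d+\s*kg", text)

# \s+ also works as a separator: split on any run of whitespace
parts = re.split(r"\s+", "a  b\tc\nd")

print(prices)   # ['$20.99']
print(weights)  # ['2.5 kg']
print(parts)    # ['a', 'b', 'c', 'd']
```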
Sets in regular expressions are defined using square brackets ([]). They allow you to specify a set of characters to match. Additionally, you can specify character ranges within the square brackets.
Examples of Sets and Character Ranges:
Let's see some examples of sets and character ranges in regular expressions:
import re
text = "The quick brown fox jumps over the lazy dog."
# Matching individual characters in a set
pattern = r"[aeiou]" # Matches any vowel
matches = re.findall(pattern, text)
print("Vowels:", matches) # Output: Vowels: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']
# Matching characters in a range
pattern = r"[a-z]" # Matches any lowercase letter
matches = re.findall(pattern, text)
print("Lowercase letters:", matches) # Output: Lowercase letters: ['h', 'e', 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o', 'w', 'n', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 's', 'o', 'v', 'e', 'r', 't', 'h', 'e', 'l', 'a', 'z', 'y', 'd', 'o', 'g']
# Matching characters in multiple ranges
pattern = r"[a-zA-Z]" # Matches any uppercase or lowercase letter
matches = re.findall(pattern, text)
print("Letters:", matches) # Output: Letters: ['T', 'h', 'e', 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o', 'w', 'n', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 's', 'o', 'v', 'e', 'r', 't', 'h', 'e', 'l', 'a', 'z', 'y', 'd', 'o', 'g']
# Using negation (^) to match characters not in the set
pattern = r"[^aeiou]" # Matches any character that is not a vowel
matches = re.findall(pattern, text)
print("Non-vowels:", matches) # Output: Non-vowels: ['T', 'h', ' ', 'q', 'c', 'k', ' ', 'b', 'r', 'w', 'n', ' ', 'f', 'x', ' ', 'j', 'm', 'p', 's', ' ', 'v', 'r', ' ', 't', 'h', ' ', 'l', 'z', 'y', ' ', 'd', 'g', '.']
In these examples:
- [aeiou] matches any lowercase vowel in the text.
- [a-z] matches any lowercase letter.
- [a-zA-Z] matches any uppercase or lowercase letter.
- [^aeiou] matches any character that is not a lowercase vowel (using negation with ^), which includes spaces and punctuation as well as consonants.
- We use re.findall() to find all occurrences of the pattern in the text.
Sets and character ranges provide a powerful way to specify groups of characters to match in regular expressions, allowing for flexible and precise pattern matching.
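As a further sketch, several ranges can be combined in one set and followed by a quantifier. The hex color format and the sample strings here are assumptions made for illustration.

```python
import re

text = "Colors: #FF5733, #00ff00 and #12G456 (invalid)."

# Three ranges in one set: digits plus the letters a-f in either case, exactly six times
hex_colors = re.findall(r"#[0-9a-fA-F]{6}", text)

# Inside a set, most metacharacters lose their special meaning: '.' and '?' are literal here
punctuation = re.findall(r"[.,!?]", "Wait, what? No way!")

print(hex_colors)   # ['#FF5733', '#00ff00']
print(punctuation)  # [',', '?', '!']
```

The string #12G456 is skipped because G falls outside all three ranges.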
Groups and grouping in regular expressions allow you to capture and extract specific parts of a matched pattern by enclosing them in parentheses (()). This is useful for extracting meaningful information from strings that match a certain pattern.
Examples of Extracting Data from Strings:
Let's see some examples of using groups and grouping to extract data from strings:
import re
# Example string containing phone numbers with area codes
text = "John's phone number is (123) 456-7890, and Jane's phone number is (456) 789-0123."
# Extracting phone numbers using groups
pattern = r"\((\d{3})\) (\d{3})-(\d{4})" # Pattern to match phone numbers with area codes
matches = re.findall(pattern, text)
for match in matches:
    area_code, first_three_digits, last_four_digits = match
    print(f"Area Code: {area_code}, Number: {first_three_digits}-{last_four_digits}")
# Example string containing email addresses
text = "Contact us at info@example.com or support@example.org."
# Extracting email addresses using groups
pattern = r"(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)" # Pattern to match email addresses
matches = re.findall(pattern, text)
for match in matches:
    print("Email:", match)
In these examples:
- () is used to define groups in the regular expression pattern.
- We use \d to match digits in phone numbers, and character classes such as [A-Za-z0-9._%+-] to match the characters allowed in email addresses.
- We use {} to specify the number of occurrences for digit sequences.
- We use re.findall() to find all occurrences of the pattern in the text. When the pattern contains several groups, each result is a tuple of the captured groups.
- For each match, we extract and print the captured groups.
Output:
Area Code: 123, Number: 456-7890
Area Code: 456, Number: 789-0123
Email: info@example.com
Email: support@example.org
In summary, groups and grouping in regular expressions allow us to capture and extract specific parts of a matched pattern, enabling us to extract meaningful data from strings that match a certain pattern.
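When working with a single match object rather than re.findall(), the captured groups are available by number. This is a minimal sketch using the same phone number pattern as above.

```python
import re

text = "John's phone number is (123) 456-7890."
match = re.search(r"\((\d{3})\) (\d{3})-(\d{4})", text)

if match:
    whole = match.group(0)      # group 0 is always the entire match
    area_code = match.group(1)  # groups are numbered from 1, left to right
    parts = match.groups()      # all captured groups as a tuple
    print(whole)      # (123) 456-7890
    print(area_code)  # 123
    print(parts)      # ('123', '456', '7890')
```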
12. Practical Exercises
After exploring the capabilities and syntax of regular expressions in Python, it's time to put your knowledge into practice. Below are some exercises designed to help you apply what you've learned. These exercises involve extracting phone numbers, email addresses, and URLs from a block of text, showcasing the practical applications of regex in data extraction and text processing tasks.
Exercise 1: Extract Phone Numbers
Given a block of text, extract all phone numbers. Assume phone numbers can be in the formats: (123) 456-7890, 123-456-7890, 123.456.7890, or 1234567890.
import re
text_to_search = """
Call me at (123) 456-7890 tomorrow.
My office number is 123-456-7890.
Reach me at 123.456.7890 for urgent matters.
Emergency contact: 1234567890.
"""
pattern = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
phone_numbers = pattern.findall(text_to_search)
print("Extracted phone numbers:", phone_numbers)
Exercise 2: Extract Email Addresses
Extract all email addresses from a given text. Assume email addresses follow the pattern username@domain.com, where the domain can vary (.com, .net, .org, etc.).
text_to_search = """
For support, contact support@example.com or helpdesk@example.net.
For inquiries, email us at info@example.org.
"""
pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
emails = pattern.findall(text_to_search)
print("Extracted email addresses:", emails)
Exercise 3: Extract URLs
Extract all URLs from a block of text. Consider both http and https protocols and optional www. prefixes.
text_to_search = """
Visit our website at http://www.example.com.
Our secure site is at https://example.net, and our blog is located at http://blog.example.org.
"""
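# (?:...) is a non-capturing group, so findall() returns each full URL rather than only the "www." part
pattern = re.compile(r'https?://(?:www\.)?[\w-]+(?:\.[\w-]+)+')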
urls = pattern.findall(text_to_search)
print("Extracted URLs:", urls)
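One detail matters for URL patterns like the one in this exercise: when a pattern contains a capturing group, re.findall() returns what the group captured instead of the full match. The sketch below contrasts a capturing group with a non-capturing group, written (?:...). The sample text is invented for illustration.

```python
import re

text = "Visit http://www.example.com or https://example.net today."

# With a capturing group, findall() returns only what the group captured
capturing = re.findall(r"https?://(www\.)?\w+\.\w+", text)

# With a non-capturing group (?:...), findall() returns the full match
non_capturing = re.findall(r"https?://(?:www\.)?\w+\.\w+", text)

print(capturing)      # ['www.', '']
print(non_capturing)  # ['http://www.example.com', 'https://example.net']
```

The empty string in the first result comes from the second URL, where the optional www. group did not take part in the match.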
These exercises are designed to help reinforce the concepts you've learned about regex. As you work through them, you'll gain a deeper understanding of how to construct regex patterns to match specific criteria, a skill that is invaluable for text processing and data extraction tasks.
# Python Program to find the area of triangle
a = 5
b = 6
c = 7
# Uncomment below to take inputs from the user
# a = float(input('Enter first side: '))
# b = float(input('Enter second side: '))
# c = float(input('Enter third side: '))
# calculate the semi-perimeter
s = (a + b + c) / 2
# calculate the area
area = (s*(s-a)*(s-b)*(s-c)) ** 0.5
print('The area of the triangle is %0.2f' %area)