Regular Expressions (Regex) are a powerful tool used in programming for searching, manipulating, and editing text based on specific patterns. These patterns are constructed using a sequence of characters and special symbols that define the search criterion. Regex is widely supported across various programming languages, text editors, and other tools, making it an indispensable skill for anyone working with text or data.
While regex is incredibly powerful and versatile, it's also known for its steep learning curve due to its cryptic syntax. However, once mastered, it allows for very efficient text processing and can reduce complex tasks into a single line of code. Understanding regex opens up a wide array of possibilities for automating text-related tasks, making it a valuable skill in your programming toolkit.
In the world of regular expressions, the patterns you define to search within text can include various types of characters: literals, character classes, and special characters. Understanding these basics is key to harnessing the full power of regex.
cat: A literal pattern. It will match the substring "cat" in the string "The cat sat on the mat".
[cat]: A character class. It will match any single 'c', 'a', or 't' character in a string, not the word "cat".
. (Dot): Matches any character except newline.
\d: Matches any digit (equivalent to [0-9] for ASCII text).
\w: Matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text).
*: Matches 0 or more occurrences of the preceding element.
+: Matches 1 or more occurrences of the preceding element.
?: Makes the preceding element optional (0 or 1 occurrence).
^: Matches the start of a string.
$: Matches the end of a string.
The re Module
Python provides support for regex through the re module, which is part of the standard library. This module offers a wide range of functions to perform queries on strings. Here are some of the most commonly used functions:
re.match(): Checks for a match only at the beginning of the string.
re.search(): Searches the entire string for a match.
re.findall(): Finds all substrings where the regex pattern matches and returns them as a list.
re.sub(): Replaces the matches with a string of choice.
Before you can use these functions, you need to import the re module:
import re
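As a quick side-by-side comparison, the four functions listed above can be sketched on one sample string. The sample text and variable names are made up for illustration.

```python
import re

sample = "cats and dogs and cats"  # illustrative text, not from the article

# re.match() only looks at the start of the string
at_start = re.match(r"cats", sample)       # match: the string starts with "cats"
not_at_start = re.match(r"dogs", sample)   # None: "dogs" is not at the start

# re.search() scans forward and stops at the first match
first_dogs = re.search(r"dogs", sample)

# re.findall() returns every non-overlapping match as a list of strings
all_cats = re.findall(r"cats", sample)

# re.sub() returns a new string with each match replaced
replaced = re.sub(r"cats", "birds", sample)

print(at_start.group())    # cats
print(not_at_start)        # None
print(first_dogs.start())  # 9
print(all_cats)            # ['cats', 'cats']
print(replaced)            # birds and dogs and birds
```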
Here is re in action. We'll search for a run of digits in a string:
import re
text = "The year 2024"
match = re.search(r'\d+', text)  # r'...' is a raw string, so backslashes reach the regex engine unchanged
if match:
    print("Found:", match.group())  # Output: Found: 2024
else:
    print("No match found")
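To see why the raw-string prefix matters, here is a small sketch comparing a plain string literal with a raw one. The sample text is an assumption chosen for illustration; \b is used because Python itself interprets it as a backspace character in a plain literal.

```python
import re

# In a plain string literal, "\b" is the backspace character (one character).
# In a raw string, r"\b" is a backslash followed by "b" (two characters),
# which the regex engine reads as a word boundary.
print(len("\b"), len(r"\b"))  # 1 2

text = "a cat sat"  # illustrative text

plain = re.search("\bcat\b", text)   # looks for backspace + "cat" + backspace
raw = re.search(r"\bcat\b", text)    # looks for "cat" as a whole word

print(plain)        # None
print(raw.group())  # cat
```

Writing every pattern as a raw string avoids having to remember which escapes Python would otherwise consume first.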
Basic pattern matching in Python is done with the re module, which provides functions such as re.match() and re.search() for finding patterns in strings.
Matching Literal Strings in a Text:
Let's say we want to find occurrences of a specific word or phrase within a text.
import re
text = "Python is a powerful programming language. Python is also easy to learn."
# Using re.search() to find occurrences of the word "Python"
match = re.search("Python", text)
if match:
    print("Found:", match.group())  # Output: Found: Python
else:
    print("Not found")
In this example, re.search() is used to find the first occurrence of the word "Python" in the text.
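Since re.search() stops at the first occurrence, locating every occurrence needs a different call. A minimal sketch using re.finditer() and re.findall() on the same text (the variable names are illustrative):

```python
import re

text = "Python is a powerful programming language. Python is also easy to learn."

# re.finditer() yields one match object per occurrence, left to right
positions = [m.start() for m in re.finditer("Python", text)]

# re.findall() returns the matched strings themselves
count = len(re.findall("Python", text))

print(positions)  # [0, 43]
print(count)      # 2
```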
Using re.match() and re.search():
re.match() checks for a match only at the beginning of the string, while re.search() scans the entire string for a match.
text = "Python is a powerful programming language. Python is also easy to learn."
# Using re.match() to find "Python" at the beginning of the text
match = re.match("Python", text)
if match:
    print("Found at the beginning:", match.group())  # Output: Found at the beginning: Python
else:
    print("Not found")
# Using re.search() to find "Python" anywhere in the text
match = re.search("Python", text)
if match:
    print("Found anywhere:", match.group())  # Output: Found anywhere: Python
else:
    print("Not found")
In this example, re.match() is used to find "Python" only at the beginning of the text, while re.search() finds "Python" anywhere in the text.
Both re.match() and re.search() return a match object if the pattern is found, or None if not found. The group() method of the match object returns the matched string.
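Besides group(), a match object also reports where the match was found. The following sketch shows start(), end(), and span(), plus the None check that guards against a failed search. The sample strings are made up for illustration.

```python
import re

text = "Order 66 shipped"  # illustrative text

match = re.search(r"\d+", text)
if match is not None:
    print(match.group())  # 66
    print(match.start())  # 6
    print(match.end())    # 8
    print(match.span())   # (6, 8)

# A failed search returns None, so calling .group() on it directly
# would raise AttributeError; check the result first.
missing = re.search(r"\d+", "no digits here")
print(missing is None)    # True
```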
Metacharacters in regular expressions are characters that have a special meaning and are used to specify patterns to match text. Here's an explanation of commonly used metacharacters:
. (Dot): Matches any single character except a newline.
^ (Caret): Matches the start of the string.
$ (Dollar): Matches the end of the string.
* (Asterisk): Matches zero or more occurrences of the preceding character.
+ (Plus): Matches one or more occurrences of the preceding character.
? (Question Mark): Matches zero or one occurrence of the preceding character.
{} (Braces): Specifies the minimum and maximum number of occurrences of the preceding character or group.
[] (Square Brackets): Specifies a character class, matches any one of the enclosed characters.
\ (Backslash): Escapes special characters, or indicates a special sequence.
| (Pipe): Acts as an OR operator, matches either the expression before or after the pipe.
() (Parentheses): Creates a capturing group, used to capture and remember a matched expression.
Examples:
Let's see some examples demonstrating the use of these metacharacters:
import re
text = "The quick brown fox jumps over the lazy dog."
# . (Dot): Matches any single character except newline
pattern = r".ck"
matches = re.findall(pattern, text)
print(matches)  # Output: ['ick']
# ^ (Caret): Matches the start of the string
pattern = r"^The"
match = re.search(pattern, text)
print(match.group())  # Output: The
# $ (Dollar): Matches the end of the string (the dot is escaped so it matches a literal period)
pattern = r"dog\.$"
match = re.search(pattern, text)
print(match.group())  # Output: dog.
# * (Asterisk): Matches zero or more occurrences of the preceding character
pattern = r"o*"
matches = re.findall(pattern, text)
print(matches)  # Output: a long list with 'o' for each of the four o's and '' at every other position, because o* also matches the empty string
# + (Plus): Matches one or more occurrences of the preceding character
pattern = r"o+"
matches = re.findall(pattern, text)
print(matches)  # Output: ['o', 'o', 'o', 'o']
# ? (Question Mark): Matches zero or one occurrence of the preceding character
pattern = r"o?"
matches = re.findall(pattern, text)
print(matches)  # Output: the same list as for o*, since no two o's are adjacent in this text
# {} (Braces): Specifies the minimum and maximum number of occurrences
pattern = r"o{1,2}"
matches = re.findall(pattern, text)
print(matches)  # Output: ['o', 'o', 'o', 'o']
# [] (Square Brackets): Specifies a character class
pattern = r"[aeiou]"
matches = re.findall(pattern, text)
print(matches)  # Output: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']
# \ (Backslash): Escapes special characters or indicates a special sequence
pattern = r"\."
matches = re.findall(pattern, text)
print(matches)  # Output: ['.']
# | (Pipe): Acts as an OR operator
pattern = r"fox|dog"
matches = re.findall(pattern, text)
print(matches)  # Output: ['fox', 'dog']
# () (Parentheses): Creates a capturing group
pattern = r"brown (fox|dog)"
match = re.search(pattern, text)
print(match.group(1))  # Output: fox
These examples illustrate the use of various metacharacters in regular expressions to match specific patterns in text.
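The quantifiers are easier to compare on a very short string. The sketch below uses the made-up word "foo" to show that * and ? can match the empty string (so re.findall() returns '' at positions where nothing else matches), while + and {2} require at least one real character.

```python
import re

word = "foo"  # short illustrative string

star = re.findall(r"o*", word)       # zero or more: empty match before "f" and at the end
optional = re.findall(r"o?", word)   # zero or one: each "o" is matched separately
plus = re.findall(r"o+", word)       # one or more: the whole run of o's
exact = re.findall(r"o{2}", word)    # exactly two o's

print(star)      # ['', 'oo', '']
print(optional)  # ['', 'o', 'o', '']
print(plus)      # ['oo']
print(exact)     # ['oo']
```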
Special sequences in regular expressions are predefined character classes that represent common types of characters. Here are some commonly used special sequences:
\d: Matches any digit (equivalent to [0-9] for ASCII text).
\D: Matches any non-digit character (equivalent to [^0-9] for ASCII text).
\s: Matches any whitespace character (space, tab, newline, etc.).
\S: Matches any non-whitespace character.
\w: Matches any word character, that is, any alphanumeric character or underscore (equivalent to [a-zA-Z0-9_] for ASCII text).
\W: Matches any non-word character (equivalent to [^a-zA-Z0-9_] for ASCII text).
Practical Examples:
Let's see some practical examples using these special sequences:
import re
text = "The price of the book is $20.99, and it weighs 2.5 kg."
# \d: Matches any digit
pattern = r"\d+"
matches = re.findall(pattern, text)
print("Digits:", matches)  # Output: Digits: ['20', '99', '2', '5']
# \D: Matches any non-digit character
pattern = r"\D+"
matches = re.findall(pattern, text)
print("Non-digits:", matches)  # Output: Non-digits: ['The price of the book is $', '.', ', and it weighs ', '.', ' kg.']
# \s: Matches any whitespace character
pattern = r"\s+"
matches = re.findall(pattern, text)
print("Whitespace characters:", matches)  # Output: Whitespace characters: [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
# \S: Matches any non-whitespace character
pattern = r"\S+"
matches = re.findall(pattern, text)
print("Non-whitespace characters:", matches)  # Output: Non-whitespace characters: ['The', 'price', 'of', 'the', 'book', 'is', '$20.99,', 'and', 'it', 'weighs', '2.5', 'kg.']
# \w: Matches any word character
pattern = r"\w+"
matches = re.findall(pattern, text)
print("Word characters:", matches)  # Output: Word characters: ['The', 'price', 'of', 'the', 'book', 'is', '20', '99', 'and', 'it', 'weighs', '2', '5', 'kg']
# \W: Matches any non-word character
pattern = r"\W+"
matches = re.findall(pattern, text)
print("Non-word characters:", matches)  # Output: Non-word characters: [' ', ' ', ' ', ' ', ' ', ' $', '.', ', ', ' ', ' ', ' ', '.', ' ', '.']
These examples demonstrate how to use special sequences in regular expressions to match digits, non-digits, whitespace characters, non-whitespace characters, word characters, and non-word characters in a given text.
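Special sequences become more useful when combined with literals and quantifiers. As a sketch on the same sample sentence, the patterns below keep decimal numbers whole instead of splitting them at the period. The last two lines illustrate, as a side note, that \d on a Python 3 string also matches non-ASCII digits unless the re.ASCII flag is given; the mixed-digit string is made up for the demonstration.

```python
import re

text = "The price of the book is $20.99, and it weighs 2.5 kg."

# digits, a literal period, then more digits
decimals = re.findall(r"\d+\.\d+", text)

# a literal dollar sign followed by an amount with two decimal places
price = re.search(r"\$\d+\.\d{2}", text)

print(decimals)       # ['20.99', '2.5']
print(price.group())  # $20.99

# "\u0663" is the Arabic-Indic digit three, followed here by an ASCII 4
mixed = "\u06634"
unicode_digits = re.findall(r"\d", mixed)
ascii_digits = re.findall(r"\d", mixed, flags=re.ASCII)
print(len(unicode_digits), ascii_digits)  # 2 ['4']
```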
Sets are defined using square brackets ([]). They allow you to specify a set of characters to match. Additionally, you can specify character ranges within the square brackets.
Examples of Sets and Character Ranges:
Let's see some examples of sets and character ranges in regular expressions:
import re
text = "The quick brown fox jumps over the lazy dog."
# Matching individual characters in a set
pattern = r"[aeiou]"  # Matches any lowercase vowel
matches = re.findall(pattern, text)
print("Vowels:", matches)  # Output: Vowels: ['e', 'u', 'i', 'o', 'o', 'u', 'o', 'e', 'e', 'a', 'o']
# Matching characters in a range
pattern = r"[a-z]"  # Matches any lowercase letter
matches = re.findall(pattern, text)
print("Lowercase letters:", matches)  # Output: Lowercase letters: ['h', 'e', 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o', 'w', 'n', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 's', 'o', 'v', 'e', 'r', 't', 'h', 'e', 'l', 'a', 'z', 'y', 'd', 'o', 'g']
# Matching characters in multiple ranges
pattern = r"[a-zA-Z]"  # Matches any uppercase or lowercase letter
matches = re.findall(pattern, text)
print("Letters:", matches)  # Output: Letters: ['T', 'h', 'e', 'q', 'u', 'i', 'c', 'k', 'b', 'r', 'o', 'w', 'n', 'f', 'o', 'x', 'j', 'u', 'm', 'p', 's', 'o', 'v', 'e', 'r', 't', 'h', 'e', 'l', 'a', 'z', 'y', 'd', 'o', 'g']
# Using negation (^) to match characters not in the set
pattern = r"[^aeiou]"  # Matches any character that is not a lowercase vowel, including spaces and punctuation
matches = re.findall(pattern, text)
print("Non-vowels:", matches)  # Output: Non-vowels: ['T', 'h', ' ', 'q', 'c', 'k', ' ', 'b', 'r', 'w', 'n', ' ', 'f', 'x', ' ', 'j', 'm', 'p', 's', ' ', 'v', 'r', ' ', 't', 'h', ' ', 'l', 'z', 'y', ' ', 'd', 'g', '.']
In these examples:
[aeiou] matches any vowel in the text.
[a-z] matches any lowercase letter.
[a-zA-Z] matches any uppercase or lowercase letter.
[^aeiou] matches any character that is not a vowel (using negation with ^).
We use re.findall() to find all occurrences of the pattern in the text.
Sets and character ranges provide a powerful way to specify groups of characters to match in regular expressions, allowing for flexible and precise pattern matching.
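As one possible application of sets and ranges, the sketch below checks whether a string is a six-digit hexadecimal color code. The helper name is_hex_color and the sample values are hypothetical; re.fullmatch() is used so that the whole string must fit the pattern. The final line shows that most metacharacters are treated as ordinary characters inside a set.

```python
import re

# "#" followed by exactly six characters from three ranges: 0-9, a-f, A-F
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")

def is_hex_color(value):
    """Return True if the whole string looks like #RRGGBB (hypothetical helper)."""
    return HEX_COLOR.fullmatch(value) is not None

print(is_hex_color("#1A2b3C"))   # True
print(is_hex_color("#12345G"))   # False: G is outside a-f
print(is_hex_color("#1234567"))  # False: one character too many

# Inside square brackets, ".", "+" and "*" are literal characters
symbols = re.findall(r"[.+*]", "1.5 + 2 * 3")
print(symbols)  # ['.', '+', '*']
```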
Groups in regular expressions allow you to capture and extract specific parts of a matched pattern by enclosing them in parentheses (()). This is useful for extracting meaningful information from strings that match a certain pattern.
Examples of Extracting Data from Strings:
Let's see some examples of using groups and grouping to extract data from strings:
import re
# Example string containing phone numbers with area codes
text = "John's phone number is (123) 456-7890, and Jane's phone number is (456) 789-0123."
# Extracting phone numbers using groups
pattern = r"\((\d{3})\) (\d{3})-(\d{4})"  # Pattern to match phone numbers with area codes
matches = re.findall(pattern, text)  # With three groups, each match is a 3-tuple
for match in matches:
    area_code, first_three_digits, last_four_digits = match
    print(f"Area Code: {area_code}, Number: {first_three_digits}-{last_four_digits}")
# Example string containing email addresses
text = "Contact us at info@example.com or support@example.org."
# Extracting email addresses using groups
# [A-Za-z] is used for the top-level domain; a "|" inside [] would match a literal pipe
pattern = r"(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)"  # Pattern to match email addresses
matches = re.findall(pattern, text)
for match in matches:
    print("Email:", match)
In these examples:
() is used to define groups in the regular expression pattern.
We use \d to match digits in the phone numbers, and character sets such as [A-Za-z0-9._%+-] to match the parts of an email address.
We use {} to specify the number of occurrences for digit sequences.
We use re.findall() to find all occurrences of the pattern in the text. When the pattern contains several groups, it returns one tuple per match.
Output:
Area Code: 123, Number: 456-7890
Area Code: 456, Number: 789-0123
Email: info@example.com
Email: support@example.org
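Groups can also be given names, which can make extraction code easier to read than positional unpacking. This is a sketch of the same phone-number pattern using Python's (?P<name>...) syntax; the group names area, prefix, and line are chosen here for illustration.

```python
import re

text = "John's phone number is (123) 456-7890."

# Same structure as the numbered-group pattern, with a name on each group
pattern = r"\((?P<area>\d{3})\) (?P<prefix>\d{3})-(?P<line>\d{4})"
match = re.search(pattern, text)

if match:
    print(match.group(0))       # (123) 456-7890  (the whole match)
    print(match.group("area"))  # 123
    print(match.groups())       # ('123', '456', '7890')
    print(match.groupdict())    # {'area': '123', 'prefix': '456', 'line': '7890'}
```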
In summary, groups and grouping in regular expressions allow us to capture and extract specific parts of a matched pattern, enabling us to extract meaningful data from strings that match a certain pattern.
After exploring the capabilities and syntax of regular expressions in Python, it's time to put your knowledge into practice. Below are some exercises designed to help you apply what you've learned. These exercises involve extracting phone numbers, email addresses, and URLs from a block of text, showcasing the practical applications of regex in data extraction and text processing tasks.
Given a block of text, extract all phone numbers. Assume phone numbers can be in the formats: (123) 456-7890, 123-456-7890, 123.456.7890, or 1234567890.
import re
text_to_search = """
Call me at (123) 456-7890 tomorrow.
My office number is 123-456-7890.
Reach me at 123.456.7890 for urgent matters.
Emergency contact: 1234567890.
"""
pattern = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')
phone_numbers = pattern.findall(text_to_search)
print("Extracted phone numbers:", phone_numbers)
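Once the numbers are extracted, a common follow-up step is to bring them into a single format. A minimal sketch, assuming ten-digit numbers like the four sample formats above; the helper name normalize and the target format 123-456-7890 are choices made for this illustration.

```python
import re

# The four formats from the exercise text
found = ["(123) 456-7890", "123-456-7890", "123.456.7890", "1234567890"]

def normalize(number):
    """Strip every non-digit, then re-insert dashes (assumes 10 digits)."""
    digits = re.sub(r"\D", "", number)
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

normalized = [normalize(n) for n in found]
print(normalized)
# ['123-456-7890', '123-456-7890', '123-456-7890', '123-456-7890']
```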
Extract all email addresses from a given text. Assume email addresses follow the pattern username@domain.com, where the domain can vary (.com, .net, .org, etc.).
text_to_search = """
For support, contact support@example.com or helpdesk@example.net.
For inquiries, email us at info@example.org.
"""
# [A-Za-z] is used for the top-level domain; a "|" inside [] would match a literal pipe
pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
emails = pattern.findall(text_to_search)
print("Extracted email addresses:", emails)
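If the username and domain are needed separately, the same kind of pattern can be split into two capturing groups. This is a sketch under the exercise's simplified assumption about what an email address looks like; it is not a full validator, and the variable names are illustrative.

```python
import re

text = "For support, contact support@example.com or helpdesk@example.net."

# Group 1 captures the username, group 2 the domain
pattern = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")

pairs = pattern.findall(text)  # one (username, domain) tuple per address
domains = sorted({domain.lower() for _, domain in pairs})

print(pairs)    # [('support', 'example.com'), ('helpdesk', 'example.net')]
print(domains)  # ['example.com', 'example.net']
```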
Extract all URLs from a block of text. Consider both http and https protocols and optional www. prefixes.
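text_to_search = """
Visit our website at http://www.example.com.
Our secure site is at https://example.net, and our blog is located at http://blog.example.org.
"""
# (?:...) is a non-capturing group, so findall() returns whole URLs rather than just the "www." part
# (?:\.\w+)+ allows more than one dot-separated label, as in blog.example.org
pattern = re.compile(r'https?://(?:www\.)?\w+(?:\.\w+)+')
urls = pattern.findall(text_to_search)
print("Extracted URLs:", urls)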
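One thing to watch with re.findall(): when a pattern contains a capturing group, it returns the group's contents instead of the whole match. The sketch below contrasts a capturing group, a non-capturing group, and re.finditer() with group(0) on a made-up sentence.

```python
import re

text = "Visit http://www.example.com or https://example.net today."

# Capturing group: findall() returns only what (www\.) captured,
# and '' where the optional group did not take part in the match
capturing = re.findall(r"https?://(www\.)?\w+\.\w+", text)

# Non-capturing group (?:...): findall() returns the whole match
non_capturing = re.findall(r"https?://(?:www\.)?\w+\.\w+", text)

# finditer() with group(0) returns the whole match either way
whole = [m.group(0) for m in re.finditer(r"https?://(www\.)?\w+\.\w+", text)]

print(capturing)      # ['www.', '']
print(non_capturing)  # ['http://www.example.com', 'https://example.net']
print(whole)          # ['http://www.example.com', 'https://example.net']
```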
These exercises are designed to help reinforce the concepts you've learned about regex. As you work through them, you'll gain a deeper understanding of how to construct regex patterns to match specific criteria, a skill that is invaluable for text processing and data extraction tasks.