[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Allan Visochek titled Practical Data Wrangling. This book covers practical data wrangling techniques in Python and R to turn your noisy data into relevant, insight-ready information.[/box]
In today’s tutorial, we will learn how to manipulate text data using regular expressions in Python.
A regular expression, or regex for short, is simply a sequence of characters that specifies a certain search pattern. Regular expressions have been around for quite a while and are a field of computer science in and of themselves.
In Python, regular expression operations are handled using Python's built-in re module. In this section, I will walk through the basics of creating regular expressions and using them. You can implement a regular expression with the following steps:
The first step to creating a regular expression in Python is to import the re module:
import re
Python regular expressions are expressed using pattern strings, which are strings that specify the desired search pattern. In its simplest form, a pattern string can consist only of letters, numbers, and spaces. The following pattern string expresses a search query for an exact sequence of characters. You can think of each character as an individual pattern. In later examples, I will discuss more sophisticated patterns:
import re
pattern_string = "this is the pattern"
The next step is to process the pattern string into an object that Python can use in order to search for the pattern. This is done using the compile() method of the re module. The compile() method takes the pattern string as an argument and returns a regex object:
import re
pattern_string = "this is the pattern"
regex = re.compile(pattern_string)
Once you have a regex object, you can use it to search within a search string for the pattern specified in the pattern string. A search string is just the name for the string in which you are looking for a pattern. To search for the pattern, you can use the search() method of the regex object as follows:
import re
pattern_string = "this is the pattern"
regex = re.compile(pattern_string)
match = regex.search("this is the pattern")
If the pattern specified in the pattern string is in the search string, the search() method will return a match object. Otherwise, it returns None, which is Python's empty value.
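As a minimal sketch of the two outcomes, the following checks what search() returns for a matching and a non-matching search string. The search strings and variable names are illustrative choices, and the group() and start() methods of the match object are shown only to make the result visible.

```python
import re

regex = re.compile("this is the pattern")

# A search string that contains the pattern yields a match object.
found = regex.search("*** this is the pattern ***")
# A search string without the pattern yields None.
missing = regex.search("something else entirely")

print(found is not None)   # True
print(missing is None)     # True
# The match object records what was matched and where it starts.
print(found.group())       # this is the pattern
print(found.start())       # 4
```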
Since Python interprets True and False values rather loosely, the result of the search function can be used like a Boolean value in an if statement, which can be rather convenient:
....
match = regex.search("this is the pattern")
if match:
    print("this was a match!")
The search string this is the pattern should produce a match, because it matches exactly the pattern specified in the pattern string. The search function will produce a match if the pattern is found at any point in the search string as the following demonstrates:
....
match = regex.search("this is the pattern")
if match:
    print("this was a match!")
if regex.search("*** this is the pattern ***"):
    print("this was a match!")
if not regex.search("this is not the pattern"):
    print("this was not a match!")
Regular expressions depend on the use of certain special characters in order to express patterns. Due to this, the following characters should not be used directly unless they are used for their intended purpose:
. ^ $ * + ? {} () [] |
If you do need to search for one of these characters literally, write it in the pattern string preceded by a backslash character. This is called escaping the character. Writing the pattern string as a raw string (with an r prefix) stops Python from interpreting the backslash before the regex engine sees it. Here's an example:
pattern_string = r"c\*b"
## matches "c*b"
If you need to search for the backslash character itself, you use two backslash characters in a raw pattern string, as follows:
pattern_string = r"c\\b"
## matches "c\b"
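The escaping rules above can be sketched as a short runnable check. The variable names are hypothetical, and re.escape() is included only as a convenience from the re module that is not covered in the text above.

```python
import re

# Raw strings (r"...") keep backslashes as written, so the regex engine sees them.
star = re.compile(r"c\*b")        # escaped *: matches the literal text c*b
backslash = re.compile(r"c\\b")   # escaped \: matches the literal text c\b

print(bool(star.search("c*b")))        # True
print(bool(star.search("ccb")))        # False
print(bool(backslash.search("c\\b")))  # True

# re.escape() adds the backslashes for you.
print(re.escape("c*b"))                # c\*b
```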
Using \s at any point in the pattern string matches a whitespace character. This is more general than the space character, as it also applies to tabs and newline characters:
....
a_space_b = re.compile(r"a\sb")
if a_space_b.search("a b"):
    print("'a b' is a match!")
if a_space_b.search("1234 a b 1234"):
    print("'1234 a b 1234' is a match")
if a_space_b.search("ab"):
    print("'ab' is a match")
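To illustrate that the whitespace pattern covers more than the space character, here is a small sketch that tries a tab and a newline as well. The list of test strings is an assumption made for the example.

```python
import re

a_space_b = re.compile(r"a\sb")

# Each of these separators counts as a whitespace character.
for text in ["a b", "a\tb", "a\nb"]:
    print(repr(text), bool(a_space_b.search(text)))  # True for all three

# No whitespace between the letters, so no match.
print(bool(a_space_b.search("ab")))                   # False
```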
If the ^ character is used at the beginning of the pattern string, the regular expression will only produce a match if the pattern is found at the beginning of the search string:
....
a_at_start = re.compile("^a")
if a_at_start.search("a"):
    print("'a' is a match")
if a_at_start.search("a 1234"):
    print("'a 1234' is a match")
if a_at_start.search("1234 a"):
    print("'1234 a' is a match")
Similarly, if the $ symbol is used at the end of the pattern string, the regular expression will only produce a match if the pattern appears at the end of the search string:
....
a_at_end = re.compile("a$")
if a_at_end.search("a"):
    print("'a' is a match")
if a_at_end.search("a 1234"):
    print("'a 1234' is a match")
if a_at_end.search("1234 a"):
    print("'1234 a' is a match")
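The two anchors can be compared side by side. The sketch below also combines them in one pattern string, which is an extension not shown in the text; the name only_abc and its pattern are hypothetical.

```python
import re

starts_with_a = re.compile("^a")
ends_with_a = re.compile("a$")
only_abc = re.compile("^abc$")   # both anchors in one pattern string

print(bool(starts_with_a.search("a 1234")))  # True
print(bool(starts_with_a.search("1234 a")))  # False
print(bool(ends_with_a.search("1234 a")))    # True
print(bool(ends_with_a.search("a 1234")))    # False
print(bool(only_abc.search("abc")))          # True
print(bool(only_abc.search("xabcx")))        # False
```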
It is possible to match a range of characters instead of just one. This can add some flexibility to the pattern:
[A-Z] matches all capital letters
[a-z] matches all lowercase letters
[0-9] matches all digits
....
lower_case_letter = re.compile("[a-z]")
if lower_case_letter.search("a"):
    print("'a' is a match")
if lower_case_letter.search("B"):
    print("'B' is a match")
if lower_case_letter.search("123 A B 2"):
    print("'123 A B 2' is a match")
digit = re.compile("[0-9]")
if digit.search("1"):
    print("'1' is a match")
if digit.search("342"):
    print("'342' is a match")
if digit.search("asdf abcd"):
    print("'asdf abcd' is a match")
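To see which character a range actually matched, one option is to inspect the match object with group(). This is a minimal sketch with made-up search strings.

```python
import re

lower_case_letter = re.compile("[a-z]")
digit = re.compile("[0-9]")

# search() stops at the first character that falls inside the range.
print(lower_case_letter.search("ABC def").group())  # d
print(digit.search("order 342").group())            # 3

# No character in the range means no match.
print(lower_case_letter.search("123 A B 2"))        # None
print(digit.search("asdf abcd"))                    # None
```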
If there is a fixed number of patterns that would constitute a match, they can be combined using the following syntax:
(<pattern1>|<pattern2>|<pattern3>)
The following a_or_b regular expression will match any string where there is either an a character or a b character:
....
a_or_b = re.compile("(a|b)")
if a_or_b.search("a"):
    print("'a' is a match")
if a_or_b.search("b"):
    print("'b' is a match")
if a_or_b.search("c"):
    print("'c' is a match")
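The alternatives do not have to be single characters. As a sketch, the same syntax works with whole words; the words and the name cat_or_dog are hypothetical examples.

```python
import re

# Hypothetical example words; any fixed set of alternatives works the same way.
cat_or_dog = re.compile("(cat|dog)")

print(cat_or_dog.search("my dog barks").group())    # dog
print(cat_or_dog.search("the cat sleeps").group())  # cat
print(cat_or_dog.search("a bird sings"))            # None
```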
If the + character comes after another character or pattern, the regular expression will match one or more consecutive repetitions of that pattern. This is quite useful, because it makes it easy to express something like a word or number that can be of arbitrary length.
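As a minimal sketch of this behavior, the following pattern string matches a run of digits of any length. The name digits and the search strings are illustrative.

```python
import re

# [0-9]+ matches one or more digits in a row.
digits = re.compile("[0-9]+")

print(digits.search("room 1024, floor 3").group())  # 1024
print(digits.search("7").group())                   # 7
print(digits.search("no numbers here"))             # None
```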
More sophisticated patterns can be produced by combining pattern strings one after the other. In the following example, I've created a regular expression that searches for a number strictly followed by a word. The pattern string that generates the regular expression is composed of the following:
A pattern string that matches a sequence of digits: [0-9]+
A pattern string that matches a whitespace character: \s
A pattern string that matches a sequence of letters: [a-z]+
A pattern string that matches either the end of the string or a whitespace character: (\s|$)
....
number_then_word = re.compile(r"[0-9]+\s[a-z]+(\s|$)")
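To check how the combined pattern behaves, here is a short sketch that runs it against a few search strings. The search strings are made up for the example.

```python
import re

number_then_word = re.compile(r"[0-9]+\s[a-z]+(\s|$)")

print(bool(number_then_word.search("1234 asdf")))          # True
print(bool(number_then_word.search("buy 12 apples now")))  # True
print(bool(number_then_word.search("asdf 1234")))          # False: word comes first
print(bool(number_then_word.search("1234asdf")))           # False: no whitespace
print(bool(number_then_word.search("1234 ASDF")))          # False: [a-z] is lowercase only
```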
Regex objects in Python also have a split() method. The split() method splits the search string into a list of substrings. The splits occur at each location along the string where the pattern is found, so the result contains the strings that occur between instances of the pattern. If the pattern occurs at the beginning or end of the search string, an empty string is included at the beginning or end of the resulting list, respectively. When the pattern contains a group in parentheses, as a_or_b does, the matched text is also included in the list:
....
print(a_or_b.split("123a456b789"))
print(a_or_b.split("a1b"))
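One detail worth checking when you run this: because the a_or_b pattern string wraps its alternatives in parentheses, Python also returns the matched separators. The sketch below compares it with an ungrouped version; the name a_or_b_plain is hypothetical.

```python
import re

a_or_b = re.compile("(a|b)")      # grouped, as in the examples above
a_or_b_plain = re.compile("a|b")  # same alternatives without the group

# With the group, the matched separators appear in the result.
print(a_or_b.split("123a456b789"))        # ['123', 'a', '456', 'b', '789']
# Without the group, only the text between matches is returned.
print(a_or_b_plain.split("123a456b789"))  # ['123', '456', '789']
# A match at the start or end produces an empty string there.
print(a_or_b_plain.split("a1b"))          # ['', '1', '']
```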
If you are interested, the Python documentation has more complete coverage of regular expressions. It can be found at https://docs.python.org/3.6/library/re.html.
We saw various ways of using regular expressions in Python. To learn more about data wrangling techniques using simple and real-world datasets, check out the book Practical Data Wrangling.