Manipulating text data using Python Regular Expressions (regex)

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Allan Visochek titled Practical Data Wrangling. This book covers practical data wrangling techniques in Python and R to turn your noisy data into relevant, insight-ready information.[/box]

In today’s tutorial, we will learn how to manipulate text data using regular expressions in Python.

What is a Regular expression

A regular expression, or regex for short, is simply a sequence of characters that specifies a certain search pattern. Regular expressions have been around for quite a while and are a field of computer science in and of themselves.

In Python, regular expression operations are handled using Python's built in re module. In this section, I will walk through the basics of creating regular expressions and using them to You can implement a regular expression with the following steps:

Specify a pattern string.
Compile the pattern string to a regular expression object.
Use the regular expression object to search a string for the pattern.
Optional: Extract the matched pattern from the string.

Writing and using a regular expression

The first step to creating a regular expression in Python is to import the re module:

import re

Python regular expressions are expressed using pattern strings, which are strings that specify the desired search pattern. In its simplest form, a pattern string can consist only of letters, numbers, and spaces. The following pattern string expresses a search query for an exact sequence of characters. You can think of each character as an individual pattern. In later examples, I will discuss more sophisticated patterns:

import re

pattern_string = "this is the pattern"

The next step is to process the pattern string into an object that Python can use in order to search for the pattern. This is done using the compile() method of the re module. The compile() method takes the pattern string as an argument and returns a regex object:

import re

pattern_string = "this is the pattern" regex = re.compile(pattern_string)

Once you have a regex object, you can use it to search within a search string for the pattern specified in the pattern string. A search string is just the name for the string in which you are looking for a pattern. To search for the pattern, you can use the search() method of the regex object as follows:

import re

pattern_string = "this is the pattern" regex = re.compile(pattern_string)

match = regex.search("this is the pattern")

If the pattern specified in the pattern string is in the search string, the search() method will return a match object. Otherwise, it returns the None data type, which is an empty value.

Since Python interprets True and False values rather loosely, the result of the search function can be used like a Boolean value in an if statement, which can be rather convenient:

....

match = regex.search("this is the pattern") if match:

print("this was a match!")

The search string this is the pattern should produce a match, because it matches exactly the pattern specified in the pattern string. The search function will produce a match if the pattern is found at any point in the search string as the following demonstrates:

....

match = regex.search("this is the pattern") if match:

print("this was a match!")

if regex.search("*** this is the pattern ***"): print("this was not a match!")

if not regex.search("this is not the pattern"): print("this was not a match!")

Special Characters

Regular expressions depend on the use of certain special characters in order to express patterns. Due to this, the following characters should not be used directly unless they are used for their intended purpose:

. ^ $ * + ? {} () [] |

If you do need to use any of the previously mentioned characters in a pattern string to search for that character, you can write the character preceded by a backslash character. This is called escaping characters. Here's an example:

pattern string = "c*b"

## matches "c*b"

If you need to search for the backslash character itself, you use two backslash characters, as follows:

pattern string = "cb"

## matches "cb"

Matching whitespace

Using s at any point in the pattern string matches a whitespace character. This is more general then the space character, as it applies to tabs and newline characters:

....

a_space_b = re.compile("asb") if a_space_b.search("a b"):

print("'a b' is a match!")

if a_space_b.search("1234 a b 1234"): print("'1234 a b 1234' is a match")

if a_space_b.search("ab"):

print("'1234 a b 1234' is a match")

Matching the start of string

If the ^ character is used at the beginning of the pattern string, the regular expression will only produce a match if the pattern is found at the beginning of the search string:

....

a_at_start = re.compile("^a") if a_at_start.search("a"):

print("'a' is a match")

if a_at_start.search("a 1234"): print("'a 1234' is a match")

if a_at_start.search("1234 a"): print("'1234 a' is a match")

Matching the end of a string

Similarly, if the $ symbol is used at the end of the pattern string, the regular expression will only produce a match if the pattern appears at the end of the search string:

....

a_at_end = re.compile("a$") if a_at_end.search("a"):

print("'a' is a match") if a_at_end.search("a 1234"):

print("'a 1234' is a match") if a_at_end.search("1234 a"):

print("'1234 a' is a match")

Matching a range of characters

It is possible to match a range of characters instead of just one. This can add some flexibility to the pattern:

[A-Z] matches all capital letters

[a-z] matches all lowercase letters

[0-9] matches all digits

....

lower_case_letter = re.compile("[a-z]") if lower_case_letter.search("a"):

print("'a' is a match")

if lower_case_letter.search("B"): print("'B' is a match")

if lower_case_letter.search("123 A B 2"): print("'123 A B 2' is a match")

digit = re.compile("[0-9]") if digit.search("1"):

print("'a' is a match") if digit.search("342"):

print("'a' is a match") if digit.search("asdf abcd"):

print("'a' is a match")

Matching any one of several patterns

If there is a fixed number of patterns that would constitute a match, they can be combined using the following syntax:

(<pattern1>|<pattern2>|<pattern3>)

The following a_or_b regular expression will match any string where there is either an a character or a b character:

....

a_or_b = re.compile("(a|b)") if a_or_b.search("a"):

print("'a' is a match") if a_or_b.search("b"):

print("'b' is a match") if a_or_b.search("c"):

print("'c' is a match")

Matching a sequence instead of just one character

If the + character comes after another character or pattern, the regular expression will match an arbitrarily long sequence of that pattern. This is quite useful, because it makes it easy to express something like a word or number that can be of arbitrary length.

Putting patterns together

More sophisticated patterns can be produced by combining pattern strings one after the other. In the following example, I've created a regular expression that searches for a number strictly followed by a word. The pattern string that generates the regular expression is composed of the following:

A pattern string that matches a sequence of digits: [0-9]+ A pattern string that matches a whitespace character: s A pattern string that matches a sequence of letters: [a-z]+

A pattern string that matches either the end of the string or a whitespace character: (s|$)

....

number_then_word = re.compile("[0-9]+s[a-z]+(s|$)")

The regex split() function

Regex objects in Python also have a split() method. The split method splits the search string into an array of substrings. The splits occur at each location along the string where the pattern is identified. The result is an array of strings that occur between instances of the pattern. If the pattern occurs at the beginning or end of the search string, an empty string is included at the beginning or end of the resulting array, respectively:

....

print(a_or_b.split("123a456b789")) print(a_or_b.split("a1b"))

If you are interested, the Python documentation has a more complete coverage of regular expressions. It can be found at https://docs.python.org/3.6/library/re.html.

We saw various ways of using regular expressions in Python. To know more about data wrangling techniques using simple and real-world data-sets you may check out this book Practical Data Wrangling.