Going deeper into regular expressions
In this recipe, we'll learn more about how to deal with regular expressions. After introducing the basics, we will dig a little deeper into pattern elements, introduce groups as a better way to retrieve and parse strings, learn how to search for multiple occurrences of the same string, and deal with longer texts.
How to do it…
- Import re:
>>> import re
- Match a phone pattern as part of a group (in parentheses). Note the use of \d as a special character for any digit:
>>> match = re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
>>> match.group()
'the phone number is 1234-567-890'
>>> match.group(1)
'1234-567-890'
- Compile a case-insensitive pattern that captures a yes|no option:
>>> pattern = re.compile(r'The answer to question (\w+) is (yes|no)', re.IGNORECASE)
>>> pattern.search('Naturally, the answer to question 3b is YES')
<_sre.SRE_Match object; span=(11, 43), match='the answer to question 3b is YES'>
>>> pattern.search('Naturally, the answer to question 3b is YES').groups()
('3b', 'YES')
- Match all the occurrences of cities and state abbreviations in the text. Note that they are separated by a single character, and the name of the city always starts with an uppercase letter. Only four states are matched for simplicity:
>>> PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')
>>> TEXT = 'the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
>>> list(PATTERN.finditer(TEXT))
[<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>, <_sre.SRE_Match object; span=(73, 85), match='Corvallis OR'>, <_sre.SRE_Match object; span=(113, 122), match='Toledo.OH'>, <_sre.SRE_Match object; span=(157, 172), match='Grand Rapids,MI'>]
>>> _[0].groups()
('Odessa', 'TX')
How it works…
The new special characters that were introduced are as follows:
- \d: Marks any digit (0 to 9).
- \s: Marks any whitespace character, including tabs and other whitespace special characters. Note that this is the reverse of \S, which was introduced in the previous recipe.
- \w: Marks any word character, that is, letters, digits, and the underscore, but not characters such as periods.
- . (dot): Marks any character (by default, any character except a newline).
Note that the same letter in uppercase or lowercase means the opposite match. For example, \d matches a digit, while \D matches a non-digit.
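As a quick illustration of these character classes, the following sketch applies each one to a made-up sample string. SAMPLE and the variable names are assumptions for this example only, not part of the recipe:

```python
import re

# Hypothetical sample text containing digits, spaces, a tab, and punctuation.
SAMPLE = 'Room 42,\tfloor 3.'

# \d matches digits; \D matches everything that is not a digit.
digits = re.findall(r'\d+', SAMPLE)
non_digits = re.findall(r'\D+', SAMPLE)

# \s matches whitespace (here two spaces and a tab).
whitespace = re.findall(r'\s', SAMPLE)

# \w matches word characters, so punctuation splits the results.
words = re.findall(r'\w+', SAMPLE)

print(digits)      # ['42', '3']
print(non_digits)  # ['Room ', ',\tfloor ', '.']
print(whitespace)  # [' ', '\t', ' ']
print(words)       # ['Room', '42', 'floor', '3']
```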
To define a group, enclose part of the pattern in parentheses. Groups can be retrieved individually, which makes them perfect for matching a bigger pattern that contains a variable part to be processed in the next step, as demonstrated in step 2. Note the difference with the step 6 pattern in the previous recipe. In this case, the pattern is not only the number; it also includes the prefix text, even if we then extract only the number:
>>> re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(4, 36), match='the phone number is 1234-567-890'>
>>> _.group(1)
'1234-567-890'
>>> re.search(r'[0123456789-]+', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(0, 2), match='37'>
>>> _.group()
'37'
Remember that group 0 (.group() or .group(0)) is always the whole match. The rest of the groups are ordered as they appear.
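The group numbering can be sketched with a shortened version of the step 3 pattern. The input string and variable names here are illustrative assumptions:

```python
import re

# Two groups: the question identifier and the yes/no answer.
match = re.search(r'question (\w+) is (yes|no)', 'question 3b is yes')

whole = match.group(0)   # group 0 is the whole match, same as match.group()
first = match.group(1)   # first pair of parentheses
second = match.group(2)  # second pair of parentheses

print(whole)            # question 3b is yes
print(first, second)    # 3b yes
print(match.groups())   # ('3b', 'yes')
```

Note that .groups() returns only the numbered groups, starting at group 1, and leaves out the whole match.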
Patterns can be compiled as well. This saves some time if the pattern needs to be matched over and over. To use it that way, compile the pattern and then use the resulting object to perform searches, as shown in steps 3 and 4. Extra flags can be passed when compiling, such as re.IGNORECASE to make the pattern case insensitive.
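A minimal sketch of reusing a compiled pattern over several inputs might look like the following. The ANSWER name and the LINES list are hypothetical and only serve this example:

```python
import re

# Compile once and reuse; re.IGNORECASE is passed as an optional flag.
ANSWER = re.compile(r'the answer to question (\w+) is (yes|no)', re.IGNORECASE)

# Made-up input lines; the last one does not match the pattern.
LINES = [
    'The answer to question 1 is no',
    'Naturally, the answer to question 3b is YES',
    'No answer recorded for question 4',
]

results = []
for line in LINES:
    match = ANSWER.search(line)
    if match:  # search() returns None when nothing matches
        results.append(match.groups())

print(results)  # [('1', 'no'), ('3b', 'YES')]
```

Checking the result of .search() before calling .groups() avoids an AttributeError on lines that don't match.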
Step 4's pattern requires a little explanation. It's composed of two groups, separated by a single character. The special character . (dot) matches any character. In our example, it matches a comma, a space, and a period. The second group is a straightforward selection of defined options, in this case, US state abbreviations.
The first group starts with an uppercase letter ([A-Z]) and accepts any combination of letters or spaces ([\w\s]+?), but not punctuation marks such as periods or commas. This matches the cities, including those that are composed of more than one word.
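To see which character the dot consumes in each case, the following sketch runs the step 4 pattern on each city separately. The separators dictionary and the sample list are assumptions added for illustration:

```python
import re

PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')

separators = {}
for sample in ['Odessa,TX', 'Corvallis OR', 'Toledo.OH', 'Grand Rapids,MI']:
    match = PATTERN.search(sample)
    city, state = match.groups()
    # The character right after group 1 is the one matched by the dot.
    separators[city] = sample[match.end(1)]

print(separators)
# {'Odessa': ',', 'Corvallis': ' ', 'Toledo': '.', 'Grand Rapids': ','}
```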
The final +? makes the match of letters non-greedy, matching as few characters as possible. This avoids problems when there is no punctuation between the cities. Look at the second match when we drop the non-greedy qualifier, and note how it spans two cities:
>>> PATTERN = re.compile(r'([A-Z][\w\s]+).(TX|OR|OH|MI)')
>>> TEXT ='the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
>>> list(PATTERN.finditer(TEXT))[1]
<re.Match object; span=(73, 122), match='Corvallis OR and the mud hens come from Toledo.OH'>
Note that the pattern starts on any uppercase letter and keeps matching until it finds a state, unless a punctuation mark gets in the way. This may not be what's expected, for example:
>>> re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test, Escanaba MI')
<_sre.SRE_Match object; span=(16, 27), match='Escanaba MI'>
>>> re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test with Escanaba MI')
<_sre.SRE_Match object; span=(0, 31), match='This is a test with Escanaba MI'>
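The greedy and non-greedy quantifiers discussed above can be compared side by side. This is a minimal sketch on a shortened, made-up text with only two states in the pattern:

```python
import re

# Hypothetical text with no punctuation between the two cities.
TEXT = 'Corvallis OR and Toledo OH'

# Greedy: [\w\s]+ grabs as much as it can, then backtracks to the last state.
greedy = re.search(r'([A-Z][\w\s]+).(OR|OH)', TEXT)

# Non-greedy: [\w\s]+? stops at the first place where a state can follow.
lazy = re.search(r'([A-Z][\w\s]+?).(OR|OH)', TEXT)

print(greedy.groups())  # ('Corvallis OR and Toledo', 'OH')
print(lazy.groups())    # ('Corvallis', 'OR')
```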
Step 4 also shows you how to find more than one occurrence in a long text. While the .findall() method exists, it returns only the matched strings (or tuples of groups) rather than full match objects, while .finditer() returns the match objects. As is common in Python 3, .finditer() returns an iterator that can be used in a for loop or list comprehension. Note that .search() returns only the first occurrence of the pattern, even if more matches appear:
>>> PATTERN.search(TEXT)
<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>
>>> PATTERN.findall(TEXT)
[('Odessa', 'TX'), ('Corvallis', 'OR'), ('Toledo', 'OH'), ('Grand Rapids', 'MI')]
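As a sketch of how .finditer() might be used in a loop, the following example collects each city, state, and match position from a shortened version of the text. The found list is a name introduced only for this illustration:

```python
import re

PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')
TEXT = ('the jackalopes are the team of Odessa,TX while the knights are '
        'native of Corvallis OR')

# finditer() yields full match objects, so positions are available as well.
found = [(m.group(1), m.group(2), m.span()) for m in PATTERN.finditer(TEXT)]
for city, state, span in found:
    print(city, state, span)
# Odessa TX (31, 40)
# Corvallis OR (73, 85)

# findall() returns only the captured groups as tuples.
print(PATTERN.findall(TEXT))  # [('Odessa', 'TX'), ('Corvallis', 'OR')]
```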
There's more…
Most special characters can be reversed by swapping their case. For example, the reverses of the ones we used are as follows:
- \D: Marks any non-digit.
- \W: Marks any non-word character, that is, anything that is not a letter, a digit, or an underscore.
- \B: Marks a position that's not at the start or end of a word. For example, r'thing\B' will match things but not thing.
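The \W and \B behavior can be checked with a short sketch. The sample strings are made up for this example:

```python
import re

# \W matches anything that is not a letter, digit, or underscore.
non_word = re.findall(r'\W', 'a-b c_d')
print(non_word)  # ['-', ' ']

# \B requires that "thing" is followed by another word character.
inside_word = re.search(r'thing\B', 'two things')
at_word_end = re.search(r'thing\B', 'one thing')

print(inside_word.group())  # thing
print(at_word_end)          # None
```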
The most commonly used special characters are typically \d (digits) and \w (letters and digits), as they mark common patterns to search for.
Groups can be assigned names as well. This makes them more explicit, at the expense of a more verbose syntax: (?P<groupname>PATTERN). Named groups can be referred to by name with .group(groupname) or by calling .groupdict(), and they keep their numeric positions.
For example, the step 4 pattern can be described as follows:
>>> PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MI)')
>>> match = PATTERN.search(TEXT)
>>> match.groupdict()
{'city': 'Odessa', 'state': 'TX'}
>>> match.group('city')
'Odessa'
>>> match.group('state')
'TX'
>>> match.group(1), match.group(2)
('Odessa', 'TX')
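Named groups combine naturally with .finditer(). As a sketch, assuming a shortened text and a records list introduced only for this example, each match can be turned into a dictionary:

```python
import re

PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MI)')
TEXT = 'Toledo.OH; the whitecaps have their base in Grand Rapids,MI'

# groupdict() maps each group name to the text it captured.
records = [m.groupdict() for m in PATTERN.finditer(TEXT)]

print(records)
# [{'city': 'Toledo', 'state': 'OH'}, {'city': 'Grand Rapids', 'state': 'MI'}]
```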
Regular expressions are a very extensive topic. There are whole technical books devoted to them, and they can be notoriously deep. The Python documentation (https://docs.python.org/3/library/re.html) is a good reference for learning more.
If you feel a little intimidated at the start, that's perfectly natural. Analyze each of the patterns with care, dividing them into smaller parts, and they will start to make sense. Don't be afraid to try an interactive regex analyzer!
Regexes can be really powerful and generic, but they may not be the proper tool for what you are trying to achieve. We've seen some caveats and patterns that have subtleties. As a rule of thumb, if a pattern starts to feel complicated, it's time to search for a different tool. Remember the previous recipes as well and the options they presented, such as parse.
See also
- The Introducing regular expressions recipe, covered earlier in the chapter, to learn the basics of using regular expressions.
- The Using a third-party tool—parse recipe, covered earlier in the chapter, to learn a different technique to extract information from text.