Going deeper into regular expressions
In this recipe, we'll see more about how to deal with regular expressions. After introducing the basics, we will dig a little deeper into pattern elements, introduce groups as a better way to retrieve and parse strings, see how to search for multiple occurrences of the same pattern, and deal with longer texts.
How to do it...
- Import re:
>>> import re
- Match a phone pattern as part of a group (defined in parentheses). Note the use of \d as a special character for any digit:
>>> match = re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
>>> match.group()
'the phone number is 1234-567-890'
>>> match.group(1)
'1234-567-890'
- Compile a pattern and match it case-insensitively, capturing a yes|no option:
>>> pattern = re.compile(r'The answer to question (\w+) is (yes|no)', re.IGNORECASE)
>>> pattern.search('Naturally, the answer to question 3b is YES')
<_sre.SRE_Match object; span=(11, 43), match='the answer to question 3b is YES'>
>>> _.groups()
('3b', 'YES')
- Match all the occurrences of cities and state abbreviations in the text. Note that they are separated by a single character and the name of the city always starts with an uppercase letter. Only four states are matched for simplicity:
>>> PATTERN = re.compile(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)')
>>> TEXT = 'the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
>>> list(PATTERN.finditer(TEXT))
[<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>, <_sre.SRE_Match object; span=(73, 85), match='Corvallis OR'>, <_sre.SRE_Match object; span=(113, 122), match='Toledo.OH'>, <_sre.SRE_Match object; span=(157, 172), match='Grand Rapids,MI'>]
>>> _[0].groups()
('Odessa', 'TX')
How it works...
The new special characters that were introduced are as follows. Note that the same letter in uppercase and lowercase means the opposite match; for example, \d matches a digit, while \D matches a non-digit:
- \d: Marks any digit (0 to 9).
- \s: Marks any character that's a whitespace, including tabs and other whitespace special characters. Note that this is the reverse of \S, introduced in the previous recipe.
- \w: Marks any word character (letters, digits, and the underscore, but not punctuation such as periods).
- .: Marks any character (except a newline, by default).
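As a quick sketch of how these special characters behave in isolation (the strings here are invented for illustration, and the exact repr of the match object depends on your Python version):
>>> re.search(r'\d+', 'item 42 is ready')
<_sre.SRE_Match object; span=(5, 7), match='42'>
>>> re.search(r'\w+', '...hello, world')
<_sre.SRE_Match object; span=(3, 8), match='hello'>
>>> re.split(r'\s+', 'one  two\tthree')
['one', 'two', 'three']
>>> re.search(r'a.c', 'abc a-c a c')
<_sre.SRE_Match object; span=(0, 3), match='abc'>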
To define a group, enclose the relevant part of the pattern in parentheses. Groups can be retrieved individually, making them perfect for matching a bigger pattern that contains a variable part we'll treat later, as demonstrated in step 2. Note the difference with the step 6 pattern in the previous recipe. In this case, the pattern is not only the number, but includes the prefix, even if we then extract only the number. Check out this difference, where there's a number that's not the one we want to capture:
>>> re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(4, 36), match='the phone number is 1234-567-890'>
>>> _.group(1)
'1234-567-890'
>>> re.search(r'[0123456789-]+', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(0, 2), match='37'>
>>> _.group()
'37'
Remember that group 0 (.group() or .group(0)) is always the whole match. The rest of the groups are ordered as they appear.
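To illustrate, here's a minimal sketch with two groups (the pattern and string are made up for this example). Note that .groups() returns all the captured groups, but not group 0:
>>> match = re.search(r'(\w+)-(\w+)', 'code ABC-123 found')
>>> match.group(0)
'ABC-123'
>>> match.group(1), match.group(2)
('ABC', '123')
>>> match.groups()
('ABC', '123')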
Patterns can be compiled as well. This saves some time if the pattern needs to be matched over and over. To use it that way, compile the pattern and then use that object to perform searches, as shown in steps 3 and 4. Some extra flags can be added, such as making the pattern case insensitive.
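As a small sketch of reusing a compiled pattern over several strings (the pattern and strings are invented for illustration):
>>> pattern = re.compile(r'the answer is (yes|no)', re.IGNORECASE)
>>> pattern.search('The answer is NO').group(1)
'NO'
>>> pattern.search('I think the answer is yes').group(1)
'yes'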
The pattern in step 4 deserves a little explanation. It's composed of two groups separated by a single character. The special character . matches any character; in our example, it matches a period, a whitespace, and a comma. The second group is a straightforward selection of defined options, in this case US state abbreviations.
The first group starts with an uppercase letter ([A-Z]) and accepts any combination of letters or spaces ([\w\s]+?), but not punctuation marks such as periods or commas. The trailing ? makes the quantifier non-greedy, so the group stops as soon as the rest of the pattern can match, instead of extending as far as possible. This matches the cities, including those composed of more than one word.
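As a quick sketch of the difference the ? makes, here's the greedy version of the pattern next to the non-greedy one, run against a shortened fragment based on the step 4 text. The greedy version extends the group as far as it can while still completing the match, so it runs all the way to the second state abbreviation, while the non-greedy one stops at the first:
>>> re.search(r'([A-Z][\w\s]+).(TX|OR|OH|MI)', 'Corvallis OR and Toledo.OH').groups()
('Corvallis OR and Toledo', 'OH')
>>> re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'Corvallis OR and Toledo.OH').groups()
('Corvallis', 'OR')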
Note that the pattern starts at any uppercase letter and keeps matching until it finds a state abbreviation, unless a punctuation mark cuts the first group short, which may not be what you expect, for example:
>>> re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test, Escanaba MI')
<_sre.SRE_Match object; span=(16, 27), match='Escanaba MI'>
>>> re.search(r'([A-Z][\w\s]+?).(TX|OR|OH|MI)', 'This is a test with Escanaba MI')
<_sre.SRE_Match object; span=(0, 31), match='This is a test with Escanaba MI'>
Step 4 also shows how to find more than one occurrence in a long text. While the .findall() method exists, it doesn't return the full match objects, while .finditer() does. In line with the general use of iterators in Python 3, .finditer() returns an iterator that can be used in a for loop or a list comprehension. Note that .search() returns only the first occurrence of the pattern, even if more matches appear:
>>> PATTERN.search(TEXT)
<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>
>>> PATTERN.findall(TEXT)
[('Odessa', 'TX'), ('Corvallis', 'OR'), ('Toledo', 'OH'), ('Grand Rapids', 'MI')]
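Because .finditer() yields full match objects, you can loop over them and use their groups directly. For example, with the step 4 pattern and text:
>>> for match in PATTERN.finditer(TEXT):
...     print(match.group(2), match.group(1))
...
TX Odessa
OR Corvallis
OH Toledo
MI Grand Rapids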
There's more...
The special characters can be reversed by swapping their case. For example, the reversed versions of some of them are as follows:
- \D: Marks any non-digit
- \W: Marks any non-word character (anything not matched by \w)
- \B: Matches a position that's not at the start or end of a word (the opposite of \b, the word boundary)
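A quick sketch of each (again, the strings are made up for illustration):
>>> re.search(r'\D+', '1234 item: pending')
<_sre.SRE_Match object; span=(4, 18), match=' item: pending'>
>>> re.findall(r'\W+', 'yes, no; maybe')
[', ', '; ']
>>> re.findall(r'\Bcat\B', 'concatenate')
['cat']
>>> re.findall(r'\Bcat\B', 'the cat')
[]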
Groups can be assigned names as well. This makes them more explicit, at the cost of making the pattern more verbose, using the form (?P<groupname>PATTERN). Named groups can be retrieved by name with .group(groupname) or all at once with .groupdict(), while still keeping their numeric positions.
For example, the step 4 pattern can be described as follows:
>>> PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MI)')
>>> match = PATTERN.search(TEXT)
>>> match.groupdict()
{'city': 'Odessa', 'state': 'TX'}
>>> match.group('city')
'Odessa'
>>> match.group('state')
'TX'
>>> match.group(1), match.group(2)
('Odessa', 'TX')
Regular expressions are a very extensive topic. There are whole technical books devoted to them, and they can be notoriously deep. The Python documentation (https://docs.python.org/3/library/re.html) is a good reference and a good place to learn more.
If you feel a little intimidated at the start, it's a perfectly natural feeling. Analyze each pattern with care, dividing it into parts, and it will start to make sense. Don't be afraid to run a regex interactive analyzer!
Regexes can be really powerful and generic, but they may not be the proper tool for what you are trying to achieve. We've seen some caveats and patterns that have subtleties. As a rule of thumb, if a pattern starts to feel complicated, it's time to search for a different tool. Remember the previous recipes as well and the options they presented, such as parse.
See also
- The Introducing regular expressions recipe
- The Using a third-party tool—parse recipe