Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Python Automation Cookbook

You're reading from   Python Automation Cookbook Explore the world of automation using Python recipes that will enhance your skills

Arrow left icon
Product type Paperback
Published in Sep 2018
Publisher Packt
ISBN-13 9781789133806
Length 398 pages
Edition 1st Edition
Languages
Tools
Concepts
Arrow right icon
Author (1):
Arrow left icon
Jaime Buelta Jaime Buelta
Author Profile Icon Jaime Buelta
Jaime Buelta
Arrow right icon
View More author details
Toc

Table of Contents (12) Chapters Close

Preface 1. Let Us Begin Our Automation Journey FREE CHAPTER 2. Automating Tasks Made Easy 3. Building Your First Web Scraping Application 4. Searching and Reading Local Files 5. Generating Fantastic Reports 6. Fun with Spreadsheets 7. Developing Stunning Graphs 8. Dealing with Communication Channels 9. Why Not Automate Your Marketing Campaign? 10. Debugging Techniques 11. Other Books You May Enjoy

Going deeper into regular expressions

In this recipe, we'll see more about how to deal with regular expressions. After introducing the basics, we will dig a little deeper into pattern elements, introduce groups as a better way to retrieve and parse strings, see how to search for multiple occurrences of the same string, and deal with longer texts.

How to do it...

  1. Import re:
>>> import re
  1. Match a phone pattern as part of a group (in brackets). Note the use of \d as a special character for any digit:
>>> match = re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
>>> match.group()
'the phone number is 1234-567-890'
>>> match.group(1)
'1234-567-890'
  1. Compile a pattern and capture a case insensitive pattern with a yes|no option:
>>> pattern = re.compile(r'The answer to question (\w+) is (yes|no)', re.IGNORECASE)
>>> pattern.search('Naturaly, the answer to question 3b is YES')
<_sre.SRE_Match object; span=(10, 42), match='the answer to question 3b is YES'>
>>> _.groups()
('3b', 'YES')
  1. Match all the occurrences of cities and state abbreviations in the text. Note that they are separated by a single character and the name of the city always starts with an uppercase letter. Only four states are matched for simplicity:
>>> PATTERN = re.compile(r'([A-Z][\w\s]+).(TX|OR|OH|MI)')
>>> TEXT ='the jackalopes are the team of Odessa,TX while the knights are native of Corvallis OR and the mud hens come from Toledo.OH; the whitecaps have their base in Grand Rapids,MI'
>>> list(PATTERN.finditer(TEXT))
[<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>, <_sre.SRE_Match object; span=(73, 85), match='Corvallis OR'>, <_sre.SRE_Match object; span=(113, 122), match='Toledo.OH'>, <_sre.SRE_Match object; span=(157, 172), match='Grand Rapids,MI'>]
>>> _[0].groups()
('Odessa', 'TX')

How it works...

The new special characters that were introduced are as follows. Note that the same letter in uppercase or lowercase means the opposite match, for example \d matches a digit, while \D matches a non digit.:

  • \d: Marks any digit (0 to 9).
  • \s: Marks any character that's a whitespace, including tabs and other whitespace special characters. Note that this is the reverse of \S, introduced in the previous recipe.
  • \w: Marks any letter (includes digits, but excludes characters such as periods).
  • .: Marks any character.

To define groups, put the defined groups in brackets. Groups can be retrieved individually, making them perfect for matching a bigger pattern that contains a variable part that we'll treat later, as demonstrated in step 2. Note the difference with the step 6 pattern in the previous recipe. In this case, the pattern is not only the number, but includes the prefix, even if we then extract the number. Check out this difference, where there's a number that's not the number we want to capture:

>>> re.search(r'the phone number is ([\d-]+)', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(4, 36), match='the phone number is 1234-567-890'>
>>> _.group(1)
'1234-567-890'
>>> re.search(r'[0123456789-]+', '37: the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(0, 2), match='37'>
>>> _.group()
'37'

Remember that group 0 (.group() or .group(0)) is always the whole match. The rest of the groups are ordered as they appear.

Patterns can be compiled as well. This saves some time if the pattern needs to be matched over and over. To use it that way, compile the pattern and then use that object to perform searches, as shown in steps 3 and 4. Some extra flags can be added, such as making the pattern case insensitive.

Step 4's pattern requires a little bit of information. It's composed of two groups, separated by a single character. The special character . means it matches everything, in our example a period, a whitespace, and a comma. The second group is a straightforward selection of defined options, in this case US state abbreviations.

The first group starts with an uppercase letter ([A-Z]), and accepts any combination of letters or spaces ([\w\s]+), but not punctuation marks such as periods or commas. This matches the cities, including when composed of more than one word.

Note that this pattern starts on any uppercase letter and keeps matching until finding a state, unless separated by a punctuation mark, which may not be what's expected, for example:

>>> re.search(r'([A-Z][\w\s]+).(TX|OR|OH|MI)', 'This is a test, Escanaba MI')
<_sre.SRE_Match object; span=(16, 27), match='Escanaba MI'>
>>> re.search(r'([A-Z][\w\s]+).(TX|OR|OH|MI)', 'This is a test with Escanaba MI')
<_sre.SRE_Match object; span=(0, 31), match='This is a test with Escanaba MI'>

Step 4 also shows how to find more than one occurrence in a long text. While the .findall() method exists, it doesn't return the full match object, while .findalliter() does. Commonplace now in Python 3, .findalliter() returns an iterator that can be used in a for loop or list comprehension. Note that .search() returns only the first occurrence of the pattern, even if more matches appear:

>>> PATTERN.search(TEXT)
<_sre.SRE_Match object; span=(31, 40), match='Odessa,TX'>
>>> PATTERN.findall(TEXT)
[('Odessa', 'TX'), ('Corvallis', 'OR'), ('Toledo', 'OH')]

There's more...

The special characters can be reversed if they are case swapped. For example, the reverse of the ones we used are as follows:

  • \D: Marks any non-digit
  • \W: Marks any non-letter
  • \B: Marks any character that's not at the start or end of a word
The most commonly used special characters are typically \d (digits) and \w (letters and digits), as they mark common patterns to search for, and the plus sign for one or more.

Groups can be assigned names as well. This makes them more explicit at the expense of making the group more verbose in the following shape—(?P<groupname>PATTERN). Groups can be referred to by name with .group(groupname) or by calling .groupdict() while maintaining its numeric position.

For example, the step 4 pattern can be described as follows:

>>> PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MN)')
>>> match = PATTERN.search(TEXT)
>>> match.groupdict()
{'city': 'Odessa', 'state': 'TX'}
>>> match.group('city')
'Odessa'
>>> match.group('state')
'TX'
>>> match.group(1), match.group(2)
('Odessa', 'TX')

Regular expressions are a very extensive topic. There are whole technical books devoted to them and they can be notoriously deep. The Python documentation is good to be used as reference (https://docs.python.org/3/library/re.html) and to learn more.

If you feel a little intimidated at the start, it's a perfectly natural feeling. Analyze each of the patterns with care, dividing it into different parts, and they will start to make sense. Don't be afraid to run a regex interactive analyzer!

Regexes can be really powerful and generic, but they may not be the proper tool for what you are trying to achieve. We've seen some caveats and patterns that have subtleties. As a rule of thumb, if a pattern starts to feel complicated, it's time to search for a different tool. Remember the previous recipes as well and the options they presented, such as parse.

See also

  • The Introducing regular expressions recipe
  • The Using a third-party tool—parse recipe
You have been reading a chapter from
Python Automation Cookbook
Published in: Sep 2018
Publisher: Packt
ISBN-13: 9781789133806
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image