Python Automation Cookbook

Explore the world of automation using Python recipes that will enhance your skills

Product type Paperback
Published in Sep 2018
Publisher Packt
ISBN-13 9781789133806
Length 398 pages
Edition 1st Edition
Author: Jaime Buelta

Introducing regular expressions

A regular expression, or regex, is a pattern to match text. In other words, it allows us to define an abstract string (typically the definition of a structured kind of text) and check other strings against it to see if they match or not.

It is better to describe them with an example. Think of a pattern defined as a word that starts with an uppercase A and contains only lowercase Ns and As after that. The word Anna matches it, but Bob, Alice, and James do not. The words Aaan, Ana, Annnn, and Aaaan will also be matches, but ANNA won't.

If this sounds complicated, that's because it is. Regexes have a reputation for being intricate and difficult to follow. But they are very useful, because they allow us to perform incredibly powerful pattern matching.

Some common uses of regexes are as follows:

  • Validating input data: For example, checking that a phone number contains only digits, dashes, and brackets.
  • String parsing: Retrieve data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
  • Scraping: Find the occurrences of something in a long text. For example, find all emails in a web page.
  • Replacement: Find and replace a word or words with others. For example, replace the owner with John Smith.
"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
– Jamie Zawinski

Regular expressions are at their best when they are kept very simple. In general, if there is a specific tool to do it, prefer it over regexes. A very clear example of this is HTML parsing; check Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.

Some text editors allow us to search using regexes as well. While most are editors aimed at writing code, such as Vim, BBEdit, or Notepad++, regexes are also present in more general tools, such as MS Office, OpenOffice, or Google Docs. But be careful, as the particular syntax may be slightly different.

Getting ready

The Python module to deal with regexes is called re. The main function we'll cover is re.search(), which returns a match object with information about what matched the pattern.

As regex patterns are also defined as strings, we'll differentiate them by prefixing them with an r, such as r'pattern'. This is the Python way of marking a raw string literal, meaning that the string within is taken literally, without any escaping. This means that a \ is kept as a plain backslash instead of starting an escape sequence. For example, without the r prefix, \n means a newline character.
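A quick check makes the difference visible. This is only a minimal sketch: the same two characters, typed with and without the r prefix, produce strings of different lengths.

```python
# Without the prefix, \n is a single character: a newline.
normal = '\n'
# With the r prefix, the backslash and the n are kept as two characters.
raw = r'\n'

print(len(normal), len(raw))  # 1 2
print(raw == '\\n')           # True
```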

Some characters are special, and refer to concepts such as the end of the string, any digit, any character, any whitespace character, and so on.

The simplest form is just a literal string. For example, the regex pattern r'LOG' matches the string 'LOGS', but not the string 'NOT A MATCH'. If there's not a match, search returns None:

>>> import re
>>> re.search(r'LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'LOG', 'NOT A MATCH')
>>>
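Because search can return None, it is usually safer to check the result before using it. The following is a small sketch of that check; first_match is a hypothetical helper name rather than part of the re module, and it relies on group(), which is covered later in this recipe.

```python
import re

def first_match(pattern, text, default=None):
    """Return the matched text, or `default` when nothing matches.

    Hypothetical helper for illustration; not part of the re module.
    """
    match = re.search(pattern, text)
    # A match object is always truthy, and None is falsy.
    if match:
        return match.group()
    return default

print(first_match(r'LOG', 'LOGS'))                    # LOG
print(first_match(r'LOG', 'NOT A MATCH', 'nothing'))  # nothing
```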

How to do it...

  1. Import the re module:
>>> import re
  2. Then, match a pattern that is not at the start of the string:
>>> re.search(r'LOG', 'SOME LOGS')
<_sre.SRE_Match object; span=(5, 8), match='LOG'>
  3. Match a pattern that is only at the start of the string. Note the ^ character:
>>> re.search(r'^LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'^LOG', 'SOME LOGS')
>>>
  4. Match a pattern only at the end of the string. Note the $ character:
>>> re.search(r'LOG$', 'SOME LOG')
<_sre.SRE_Match object; span=(5, 8), match='LOG'>
>>> re.search(r'LOG$', 'SOME LOGS')
>>>
  5. Match the word thing (including when it starts a longer word, such as things), but not when it is part of something or anything. Note the \b at the start of the second pattern:
>>> STRING = 'something in the things she shows me'
>>> match = re.search(r'thing', STRING)
>>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
('some', 'thing', ' in the things she shows me')
>>> match = re.search(r'\bthing', STRING)
>>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
('something in the ', 'thing', 's she shows me')
  6. Match a pattern that's only numbers and dashes (for example, a phone number). Retrieve the matched string:
>>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(20, 32), match='1234-567-890'>
>>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890').group()
'1234-567-890'
  7. Match an email address naively:
>>> re.search(r'\S+@\S+', 'my email is email.123@test.com').group()
'email.123@test.com'
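The start and end markers can be combined with the digits-and-dashes pattern to validate a whole string instead of finding a match somewhere inside it. This is a sketch under that assumption; is_phone_like is a hypothetical helper name introduced only for illustration.

```python
import re

def is_phone_like(text):
    """Hypothetical helper: True if text is made only of digits and dashes."""
    # ^ and $ force the character class to cover the whole string.
    return re.search(r'^[0123456789-]+$', text) is not None

print(is_phone_like('1234-567-890'))                      # True
print(is_phone_like('the phone number is 1234-567-890'))  # False
```

Keep in mind that $ also matches just before a newline at the very end of the string, so this check would accept '1234-567-890\n' as well.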

How it works...

The re.search function matches a pattern, no matter its position in the string. As explained previously, it returns a match object if the pattern is found, or None if it is not.

The following special characters are used:

  • ^: Marks the start of the string
  • $: Marks the end of the string
  • \b: Marks the start or end of a word
  • \S: Matches any character that's not whitespace, including special characters

More special characters are shown in the next recipe.

In step 6 in the How to do it... section, the r'[0123456789-]+' pattern is composed of two parts. The first one is between square brackets, and matches any single character that is either a digit (0 to 9) or a dash (-). The + sign after it means that this character can be present one or more times. This is called a quantifier in regexes. Together, they match any run of digits and dashes, no matter how long it is.
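To see what each of the two parts contributes, the following sketch compares the pattern with and without the quantifier. It also uses the range form [0-9-], which is standard regex shorthand for the same set of characters, although it is not the form used in step 6.

```python
import re

TEXT = 'the phone number is 1234-567-890'

# The explicit class from step 6 and the range form match the same characters.
explicit = re.search(r'[0123456789-]+', TEXT)
ranged = re.search(r'[0-9-]+', TEXT)
print(explicit.group() == ranged.group())  # True

# Without the + quantifier, the class matches exactly one character.
single = re.search(r'[0-9-]', TEXT)
print(single.group())  # 1
```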

Step 7 again uses the + sign to match as many characters as necessary before the @ and again after it. In this case, the character match is \S, which matches any non-whitespace character.

Please note that the pattern for emails described here is very naive, as it will match invalid emails such as john@smith@test.com. A better regex for most uses is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)". You can go to http://emailregex.com/ to find it, along with links to more information.
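As a quick comparison, the sketch below runs both patterns against a valid address and the invalid one mentioned above. The names NAIVE, STRICTER, and results are only labels chosen for this example.

```python
import re

NAIVE = r'\S+@\S+'
STRICTER = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

results = {}
for email in ['email.123@test.com', 'john@smith@test.com']:
    naive_ok = re.search(NAIVE, email) is not None
    stricter_ok = re.search(STRICTER, email) is not None
    results[email] = (naive_ok, stricter_ok)
    print(email, naive_ok, stricter_ok)
# email.123@test.com True True
# john@smith@test.com True False
```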

Note that validating an email, including corner cases, is actually a difficult and challenging problem. The previous regex should be fine for most uses covered in this book, but in a general framework project such as Django, email validation relies on a very long and hard-to-read regex.

The resulting match object exposes the positions where the matched pattern starts and ends (using the start and end methods), as shown in step 5, which splits the string into matched parts, showing the distinction between the two matching patterns.
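The slicing from step 5 can be wrapped in a small function. This is a sketch only; split_around is a hypothetical helper name, and it uses the match object's span() method, which returns the start and end positions as a pair.

```python
import re

def split_around(pattern, text):
    """Hypothetical helper: return (before, matched, after), or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # span() returns the same values as match.start() and match.end().
    start, end = match.span()
    return text[:start], text[start:end], text[end:]

STRING = 'something in the things she shows me'
print(split_around(r'thing', STRING))
# ('some', 'thing', ' in the things she shows me')
print(split_around(r'\bthing', STRING))
# ('something in the ', 'thing', 's she shows me')
```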

The difference displayed in step 5 is a very common one. Trying to capture GP while ignoring case can end up capturing eggplant and bagpipe! Similarly, thing\b won't capture things. Be sure to test and make the proper adjustments, such as capturing \bGP\b for just the word GP.
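The effect of the word boundaries can be checked directly. The sketch below assumes a case-insensitive search, using the re.IGNORECASE flag, and a made-up list of sample words.

```python
import re

WORDS = ['eggplant', 'bagpipe', 'GP', 'ask your GP today']

# Without boundaries, the letters are found inside longer words.
loose = [w for w in WORDS if re.search(r'GP', w, re.IGNORECASE)]
# With \b on both sides, only the standalone word matches.
strict = [w for w in WORDS if re.search(r'\bGP\b', w, re.IGNORECASE)]

print(loose)   # ['eggplant', 'bagpipe', 'GP', 'ask your GP today']
print(strict)  # ['GP', 'ask your GP today']

# A trailing \b requires the word to end right after thing.
print(re.search(r'thing\b', 'things'))  # None
```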

The specific matched text can be retrieved by calling group(), as shown in step 6. Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, for example, by splitting the phone number into groups by dashes:

>>> match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
>>> [int(n) for n in match.group().split('-')]
[1234, 567, 890]

There's more...

Dealing with regexes can be difficult and complex. Please allow time to test your matches and be sure that they work as you expect in order to avoid nasty surprises.

You can check your regexes interactively with some tools. A good one that's freely available online is https://regex101.com/, which displays each of the elements and explains the regex. Double-check that you're using the Python flavor.

For a pattern such as \bthing, the EXPLANATION panel describes that \b matches a word boundary (start or end of a word), and that thing matches these characters literally.

Regexes can, in some cases, be very slow, or even expose us to what's called regex denial-of-service: a string crafted to confuse a particular regex so that it takes an enormous amount of time to evaluate, in the worst case blocking the computer. While automating tasks probably won't get you into those problems, keep an eye out in case a regex takes too long.
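To get a feeling for this, the following sketch times a well-known problematic pattern, (a+)+$, which is not taken from this recipe. The input sizes are kept deliberately small; the actual timings depend on the machine and the Python version, so none are shown here.

```python
import re
import time

# Nested quantifiers followed by an anchor: a classic slow pattern.
PATTERN = r'(a+)+$'

def time_search(size):
    """Search a string of `size` letters a followed by a b; return (result, seconds)."""
    # The trailing b guarantees failure, forcing the engine to backtrack.
    text = 'a' * size + 'b'
    start = time.perf_counter()
    result = re.search(PATTERN, text)
    return result, time.perf_counter() - start

for size in (10, 14, 18):
    result, elapsed = time_search(size)
    print(size, result, f'{elapsed:.4f}s')
```

The elapsed time should grow quickly as the size increases, which is why larger inputs are best avoided when experimenting.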

See also

  • The Extracting data from structured strings recipe
  • The Using a third-party tool—parse recipe
  • The Going deeper into regular expressions recipe