Python Automation Cookbook

75 Python automation recipes for web scraping; data wrangling; and Excel, report, and email processing

Product type Paperback
Published in May 2020
Publisher Packt
ISBN-13 9781800207080
Length 526 pages
Edition 2nd Edition
Author (1):
Jaime Buelta
Introducing regular expressions

A regular expression, or regex, is a pattern to match text. In other words, it allows us to define an abstract string (typically, the definition of a structured kind of text) to check with other strings to see if they match or not.

It is better to describe them with an example. Think of defining a pattern of text as a word that starts with an uppercase A and contains only lowercase "n"s and "a"s after that. Let's show some possible comparisons and results:

Text to compare   Result
Anna              Match
Bob               No match (No initial A)
Alice             No match (l is not n or a after initial A)
James             No match (No initial A)
Aaan              Match
Ana               Match
Annnn             Match
Aaaan             Match
ANNA              No match (N is not n or a)

Table 1.1: A pattern matching example
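
As a rough sketch, the pattern from the table could be spelled as the regex below. The exact spelling r'^A[na]+$' is an assumption (the text does not give one); in particular, it requires at least one letter after the initial A. The anchors, square brackets, and + sign it uses are explained later in this recipe.

```python
import re

# Assumed spelling of the pattern from Table 1.1: an uppercase A followed by
# one or more lowercase "n" or "a" characters, and nothing else.
PATTERN = r'^A[na]+$'

names = ['Anna', 'Bob', 'Alice', 'James', 'Aaan', 'Ana', 'Annnn', 'Aaaan', 'ANNA']

# re.search returns a match object or None, so bool() turns it into True/False.
results = {name: bool(re.search(PATTERN, name)) for name in names}

for name, matched in results.items():
    print(f'{name:<6} {"Match" if matched else "No match"}')
```

Running it should reproduce the Match and No match column of the table.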

If this sounds complicated, that's because it is. Regexes are notorious for becoming intricate and difficult to follow. But they are also very useful because they allow us to perform powerful pattern matching with very little code.

Some common uses of regexes are:

  • Validating input data: For example, a phone number that is only numbers, dashes, and brackets.
  • String parsing: Retrieve data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
  • Scraping: Find the occurrences of something in a long piece of text. For example, find all of the emails in a web page.
  • Replacement: Find and replace a word or words with others. For example, replace the owner with John Smith.
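
To make these uses a bit more concrete, here is a minimal sketch of three of them. It relies on re.fullmatch, re.findall, and re.sub, which are standard functions of the re module but are not covered in this recipe; the sample text and phone number are made up for illustration.

```python
import re

# Hypothetical sample data, made up for this illustration.
TEXT = 'Write to the owner at owner@example.com or to admin@example.com for help'
PHONE = '(555)123-4567'

# Validating input: the whole string must be digits, dashes, and brackets.
is_valid_phone = re.fullmatch(r'[0-9()-]+', PHONE) is not None

# Scraping: collect every (naively defined) email address in the text.
emails = re.findall(r'\S+@\S+', TEXT)

# Replacement: swap the words "the owner" for a name.
replaced = re.sub(r'\bthe owner\b', 'John Smith', TEXT)

print(is_valid_phone)
print(emails)
print(replaced)
```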

Getting ready

The Python module to deal with regexes is called re. The main function we'll cover is re.search(), which returns a match object with information about what matched the pattern.

As regex patterns are also defined as strings, we'll differentiate them by prefixing them with an r, such as r'pattern'. This is the Python way of marking a string as a raw string literal, meaning that the text within is taken literally, without any escape processing. This means that a "\" is kept as a backslash instead of starting an escape sequence. For example, without the r prefix, \n means a newline character.
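
The difference between a regular and a raw string literal is easy to check directly. The short sketch below is only an illustration; the variable names are arbitrary.

```python
# A regular string literal processes escape sequences; a raw one does not.
normal = '\n'    # one character: a newline
raw = r'\n'      # two characters: a backslash and the letter n

print(len(normal), len(raw))

# Without the r prefix, every backslash meant for the regex must be doubled.
same_pattern = (r'\bthing' == '\\bthing')
print(same_pattern)
```

This is why regex patterns are normally written with the r prefix: sequences such as \b reach the re module untouched.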

Some characters are special and refer to concepts such as the end of the string, any digit, any character, any whitespace character, and so on.

The simplest form is just a literal string. For example, the regex pattern r'LOG' matches the string 'LOGS', but not the string 'NOT A MATCH'. If there's no match, re.search returns None. If there is, it returns a special Match object:

>>> import re
>>> re.search(r'LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'LOG', 'NOT A MATCH')
>>>

How to do it…

  1. Import the re module:
    >>> import re
    
  2. Then, match a pattern that is not at the start of the string:
    >>> re.search(r'LOG', 'SOME LOGS')
    <_sre.SRE_Match object; span=(5, 8), match='LOG'>
    
  3. Match a pattern that is only at the start of the string. Note the ^ character:
    >>> re.search(r'^LOG', 'LOGS')
    <_sre.SRE_Match object; span=(0, 3), match='LOG'>
    >>> re.search(r'^LOG', 'SOME LOGS')
    >>>
    
  4. Match a pattern only at the end of the string. Note the $ character:
    >>> re.search(r'LOG$', 'SOME LOG')
    <_sre.SRE_Match object; span=(5, 8), match='LOG'>
    >>> re.search(r'LOG$', 'SOME LOGS')
    >>>
    
  5. Match thing only at the start of a word, so that things still matches but something and anything do not. Note the \b at the start of the second pattern:
    >>> STRING = 'something in the things she shows me'
    >>> match = re.search(r'thing', STRING)
    >>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
    ('some', 'thing', ' in the things she shows me')
    >>> match = re.search(r'\bthing', STRING)
    >>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
    ('something in the ', 'thing', 's she shows me')
    
  6. Match a pattern that's only numbers and dashes (for example, a phone number). Retrieve the matched string:
    >>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
    <_sre.SRE_Match object; span=(20, 32), match='1234-567-890'>
    >>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890').group()
    '1234-567-890'
    
  7. Match an email address naively:
    >>> re.search(r'\S+@\S+', 'my email is email.123@test.com').group()
    'email.123@test.com'
    

How it works…

The re.search function matches a pattern, no matter its position in the string. As explained previously, it returns None if the pattern is not found, or a Match object if it is.
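
Because the result may be None, it is usually worth checking it before using the match. The following is a minimal sketch; find_log is a hypothetical helper name used only for this illustration.

```python
import re

def find_log(text):
    """Return the matched text, or None when 'LOG' does not appear."""
    match = re.search(r'LOG', text)
    if match is None:
        # Calling match.group() here would raise an AttributeError.
        return None
    return match.group()

print(find_log('SOME LOGS'))
print(find_log('NOT A MATCH'))
```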

The following special characters are used:

  • ^: Marks the start of the string
  • $: Marks the end of the string
  • \b: Marks the start or end of a word
  • \S: Marks any character that's not a whitespace, including characters like * or $

More special characters are shown in the next recipe, Going deeper into regular expressions.

In step 6 of the How to do it section, the r'[0123456789-]+' pattern is composed of two parts. The first one, between square brackets, matches any single character that is a digit (0 to 9) or a dash (-). The + sign after it means that this character can be present one or more times. This is called a quantifier in regexes. Together, they match any combination of digits and dashes, no matter how long it is.
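
As a small illustration of both parts, the sketch below compares the pattern from step 6 with an equivalent one that uses a 0-9 range inside the brackets (a shorthand not used in the steps above), and shows what happens when the quantifier is left out.

```python
import re

TEXT = 'the phone number is 1234-567-890'

# The pattern from step 6, and the same character set written as a range.
explicit = re.search(r'[0123456789-]+', TEXT).group()
with_range = re.search(r'[0-9-]+', TEXT).group()

# Without the + quantifier, only a single character is matched.
single = re.search(r'[0-9-]', TEXT).group()

print(explicit, with_range, single)
```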

Step 7 again uses the + sign to match as many characters as necessary before the @ and again after it. In this case, the character match is \S, which matches any non-whitespace character.

Note that the pattern for emails shown here is very naive, as it will match invalid emails such as john@smith@test.com. A better regex for most uses is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)". You can find it at http://emailregex.com/, along with links to more information.
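
The two patterns can be compared side by side. This is only a quick check on the two sample addresses mentioned in this recipe, not a full validation test; the variable names are arbitrary.

```python
import re

NAIVE = r'\S+@\S+'
STRICTER = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

checks = {}
for candidate in ['email.123@test.com', 'john@smith@test.com']:
    naive_ok = re.search(NAIVE, candidate) is not None
    stricter_ok = re.search(STRICTER, candidate) is not None
    checks[candidate] = (naive_ok, stricter_ok)
    print(candidate, naive_ok, stricter_ok)
```

Both patterns accept the first address, but only the naive one accepts the address with two @ signs.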

Note that validating an email address, including all corner cases, is actually a difficult problem. The previous regex should be fine for most uses covered in this book, but in a general framework project such as Django, email validation is done with a very long and hard-to-read regex.

The resulting match object gives the positions where the matched pattern starts and ends (using the start and end methods), as shown in step 5, which splits the string into the text before the match, the match itself, and the text after it, showing the distinction between the two patterns.

The difference displayed in step 5 is a very common one. Trying to capture GP (as in General Practitioner, for a medical doctor) in a case-insensitive search can end up capturing eggplant and bagpipe! Similarly, thing\b won't capture things. Be sure to test and make the proper adjustments, such as using \bGP\b to capture just the word GP.
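
A short sketch of this pitfall follows. It assumes a case-insensitive search through the re.IGNORECASE flag and uses re.findall, neither of which is covered in this recipe; the sample sentence is made up.

```python
import re

# Hypothetical sample sentence, made up for this illustration.
TEXT = 'The GP recommended eggplant'

# Ignoring case and without boundaries, the "gp" inside "eggplant" is found too.
loose = re.findall(r'GP', TEXT, flags=re.IGNORECASE)

# \b on both sides keeps only the standalone word.
strict = re.findall(r'\bGP\b', TEXT, flags=re.IGNORECASE)

# thing\b needs a word boundary right after "thing", so "things" does not match.
no_plural = re.search(r'thing\b', 'things')

print(loose, strict, no_plural)
```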

The specific matched pattern can be retrieved by calling group(), as shown in step 6. Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, such as by splitting the phone number into groups by dashes, for example:

>>> match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
>>> [int(n) for n in match.group().split('-')]
[1234, 567, 890]

There's more…

Dealing with regexes can be difficult and complex. Please allow time to test your matches and be sure that they work as you expect in order to avoid nasty surprises.

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

– Jamie Zawinski

Regular expressions are at their best when they are kept very simple. In general, if there is a specific tool to do it, prefer it over regexes. A very clear example of this is with HTML parsing; refer to Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.

Some text editors allow us to search using regexes as well. Most are editors aimed at writing code, such as Vim, BBEdit, or Notepad++, but regex search is also present in more general tools, such as MS Office, OpenOffice, or Google Docs. Be careful, though, as the particular syntax may be slightly different.

You can check your regexes interactively with some tools. A good one that's freely available online is https://regex101.com/, which displays each of the elements and explains the regex. Double-check that you're using the Python flavor:

Figure 1.1: An example using RegEx101

Note that the EXPLANATION box in the preceding image explains that \b matches a word boundary (the start or end of a word), and that thing matches these characters literally.

Regexes can, in some cases, be very slow, or even susceptible to what's called a regex denial-of-service attack: a string crafted to confuse a particular regex so that it takes an enormous amount of time to process. In the worst-case scenario, it can even block the computer. While automating tasks probably won't get you into those problems, keep an eye out in case a regex takes too long to process.
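
To get a feeling for the problem, the sketch below times a pattern that is a commonly cited example of this behavior, (a+)+$, against inputs that can never match. The pattern and the input sizes are chosen for illustration only, and the sizes are kept small on purpose; the timings depend on the machine, but they should grow quickly with each size increase.

```python
import re
import time

# Nested quantifiers followed by an anchor: prone to heavy backtracking.
SLOW_PATTERN = re.compile(r'(a+)+$')

timings = {}
for size in (10, 14, 18):
    text = 'a' * size + 'b'   # the trailing "b" guarantees there is no match
    start = time.perf_counter()
    result = SLOW_PATTERN.search(text)
    timings[size] = time.perf_counter() - start
    print(size, result, f'{timings[size]:.4f}s')
```

If a regex in an automation script shows this kind of growth, simplifying the pattern (here, a+$ matches the same strings) is usually the first thing to try.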

See also

  • The Extracting data from structured strings recipe, covered earlier in the chapter, to learn simple techniques to extract information from text.
  • The Using a third-party tool—parse recipe, covered earlier in the chapter, to use a third-party tool to extract information from text.
  • The Going deeper into regular expressions recipe, covered later in the chapter, to further your knowledge of regular expressions.