You're reading from Python Automation Cookbook Explore the world of automation using Python recipes that will enhance your skills

Product type Paperback

Published in Sep 2018

Publisher Packt

ISBN-13 9781789133806

Length 398 pages

Edition 1st Edition

Languages

Python

Tools

Excel

Concepts

DevOps

Author (1):

Jaime Buelta

View More author details

Introducing regular expressions

A regular expression, or regex, is a pattern to match text. In other words, it allows us to define an abstract string (typically the definition of a structured kind of text) to check with other strings to see if they match or not.

It is better to describe them with an example. Think of defining a pattern of text as a word that starts with an uppercase A and contains only lowercase Ns and As after that. The word Anna matches it, but Bob, Alice, and James does not. The words Aaan, Ana, Annnn, and Aaaan will also be matches, but ANNA won't.

If this sounds complicated, that's because it is. Regexes can be notoriously complicated because they may be incredibly intricate and difficult to follow. But they are very useful, because they allow us to perform incredibly powerful pattern matching.

Some common uses of regexes are as follow:

Validating input data: For example, that a phone number is only numbers, dashes, and brackets.
String parsing: Retrieve data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
Scrapping: Find the occurrences of something in a long text. For example, find all emails in a web page.
Replacement: Find and replace a word or words with others. For example, replace the owner with John Smith.

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
– Jamie Zawinski

Regular expressions are at their best when they are kept very simple. In general, if there is a specific tool to do it, prefer it over regexes. A very clear example of this is HTML parsing; check Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.

Some text editors allow us to search using regexes as well. While most are editors aimed at writing code, such as Vim, BBEdit, or Notepad++, they're also present in more general tools, such as MS Office, Open Office, or Google Documents. But be careful, as the particular syntax may be slightly different.

Getting ready

The python module to deal with regexes is called re. The main function we'll cover is re.search(), which returns a match object with information about what matched the pattern.

As regex patterns are also defined as strings, we'll differentiate them by prefixing them with an r, such as r'pattern'. This is the Python way of labeling a text as raw string literals, meaning that the string within is taken literally, without any escaping. This means that a \ is used as a backslash instead of a sequence. For example, without the r prefix, \n means newline character.

Some characters are special, and refer to concepts such as the end of the string, any digit, any character, any whitespace character, and so on.

The simplest form is just a literal string. For example, the regex pattern r'LOG' matches the string 'LOGS', but not the string 'NOT A MATCH'. If there's not a match, search returns None:

>>> import re
>>> re.search(r'LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'LOG', 'NOT A MATCH')
>>>

How to do it...

Import the re module:

>>> import re

Then, match a pattern that is not at the start of the string:

>>> re.search(r'LOG', 'SOME LOGS')
<_sre.SRE_Match object; span=(5, 8), match='LOG'>

Match a pattern that is only at the start of the string. Note the ^ character:

>>> re.search(r'^LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'^LOG', 'SOME LOGS')
>>>

Match a pattern only at the end of the string. Note the $ character:

>>> re.search(r'LOG$', 'SOME LOG')
<_sre.SRE_Match object; span=(5, 8), match='LOG'>
>>> re.search(r'LOG$', 'SOME LOGS')
>>>

Match the word 'thing' (not excluding things), but not something or anything. Note the \b at the start of the second pattern:

>>> STRING = 'something in the things she shows me'
>>> match = re.search(r'thing', STRING)
>>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
('some', 'thing', ' in the things she shows me')
>>> match = re.search(r'\bthing', STRING)
>>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
('something in the ', 'thing', 's she shows me')

Match a pattern that's only numbers and dashes (for example, a phone number). Retrieve the matched string:

>>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
<_sre.SRE_Match object; span=(20, 32), match='1234-567-890'>
>>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890').group()
'1234-567-890'

Match an email address naively:

>>> re.search(r'\S+@\S+', 'my email is email.123@test.com').group()
'email.123@test.com'

How it works...

The re.search function matches a pattern, no matter its position in the string. As explained previously, this will return None if the pattern is not found, or a match object.

The following special characters are used:

^: Marks the start of the string
$: Marks the end of the string

\b: Marks the start or end of a word
\S: Marks any character that's not a whitespace, including special characters

More special characters are shown in the next recipe.

In step 6 in the How to do it... section, the r'[0123456789-]+' pattern is composed of two parts. The first one is between square brackets, and matches any single character between 0 and 9 (any number) and the dash (-) character. The + sign after that means that this character can be present one or more times. This is called a quantifier in regexes. This makes a match on any combination of numbers and dashes, no matter how long it is.

Step 7 again uses the + sign to match as many characters as necessary before the @ and again after it. In this case, the character match is \S, which matches any non-whitespace character.

Please note that the naive pattern for emails described here is very naive, as it will match invalid emails such as john@smith@test.com. A better regex for most uses is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)". You can go to http://emailregex.com/ for find it and links to more information.

Note that parsing a valid email including corner cases is actually a difficult and challenging problem. The previous regex should be fine for most uses covered in this book, but in a general framework project such as Django, email validation is a very long and very unreadable regex.

The resulting matching object returns the position where the matched pattern starts and ends (using the start and end methods), as shown in step 5, which splits the string into matched parts, showing the distinction between the two matching patterns.

The difference displayed in step 5 is a very common one. Trying to capture GP can end up capturing eggplant and bagpipe! Similarly, things\b won't capture things. Be sure to test and make the proper adjustments, such as capturing \bGP\b for just the word GP.

The specific matched pattern can be retrieved by calling group(), as shown in step 6. Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, such as by splitting the phone number into groups by dashes, for example:

>>> match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
>>> [int(n) for n in match.group().split('-')]
[1234, 567, 890]

There's more...

Dealing with regexes can be difficult and complex. Please allow time to test your matches and be sure that they work as you expect in order to avoid nasty surprises.

You can check your regexes interactively with some tools. A good one that's freely available online is https://regex101.com/, which displays each of the elements and explains the regex. Double-check that you're using the Python flavor:

See that the EXPLANATION describes that \b matches a word boundary (start or end of a word), and that thing matches literally these characters.

Regexes, in some cases, can be very slow, or even produce what's called regex denial-of-service, a string created to confuse a particular regex so that it takes an enormous amount of time, even in the worst case blocking the computer. While automating tasks probably won't get you into those problems, keep an eye out in case a regex takes too long.

You're reading from Python Automation Cookbook Explore the world of automation using Python recipes that will enhance your skills

Table of Contents (12) Chapters

Introducing regular expressions

Getting ready

How to do it...

How it works...

There's more...

See also

Authors (1)

Other recommended products

Personalised recommendations for you