Example of a regular expression
To get a feeling for how regular expressions are used, let's start with a real-life example so that you can see how a regex works when put to use on a common task.
Identifying an email address
Suppose you wanted to extract all email addresses from an HTML document. You'd need a regular expression that would match email addresses but not all the other text in the document. An email address consists of a username, an @ character, and a domain name. The domain name in turn consists of a company or organization name, a dot, and a top-level domain name such as com, edu, or de.
Knowing this, here are the parts that we need to put together to create a regular expression:
User name
This consists of alphanumeric characters (0-9, a-z, A-Z) as well as dots, plus signs, dashes, and underscores. Other characters are allowed by the RFC specification for email addresses, but these are very rarely used, so I have not included them here.
@ character
The @ character is mandatory in an email address, which makes it a good way to help distinguish an email address from other text.
Domain Name
For example, cnn.com. The domain name can also contain sub-domains, so mail.cnn.com would be valid.
Top-Level Domain
This is the final part of the email address, and is part of the domain name. The top-level domain usually indicates what country the domain is located in (though domains such as .com or .org are used in countries all around the world). The expression in this example assumes that the top-level domain is between two and four letters long (excluding the dot character that precedes it).
Putting all these parts together, we end up with the following regular expression:
\b[-\w.+]+@[\w.]+\.[a-zA-Z]{2,4}\b
The [-\w.+]+ part corresponds to the username, the @ character is matched literally against the @ in the email address, and the domain name part corresponds to [\w.]+\.[a-zA-Z]{2,4}. Unless you are already familiar with regular expressions, none of this will make sense to you right now, but by the time you've finished reading this appendix, you will know exactly what this regular expression does.
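To see the expression at work before its parts are explained, here is a minimal sketch in Python that applies it to a small HTML fragment. The choice of Python, the extract_emails helper, and the sample document are assumptions made for illustration; any regex engine that supports \b, \w, and {2,4} should behave similarly.

```python
import re

# The expression from the text, written as a raw string so the
# backslashes reach the regex engine unchanged.
EMAIL_RE = re.compile(r"\b[-\w.+]+@[\w.]+\.[a-zA-Z]{2,4}\b")


def extract_emails(html):
    """Return every substring of `html` that the pattern matches, in order."""
    return EMAIL_RE.findall(html)


# Made-up sample document, for illustration only.
sample_html = """
<p>Contact <a href="mailto:jane.doe@example.com">Jane</a> or
write to support+web@mail.example.org for help.</p>
"""

print(extract_emails(sample_html))
# ['jane.doe@example.com', 'support+web@mail.example.org']
```

The surrounding tags and the mailto: prefix are left out of the results because they contain characters, such as the colon and the quotation mark, that no part of the expression accepts.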
Let's get started learning about regular expressions, and at the end of the appendix we'll come back to this example to see exactly how it works.