Our email address regex
At the beginning of the chapter I introduced a regular expression for extracting email addresses from web pages. As promised, let's use our newfound knowledge of regexes to see exactly how it works. Here it is again:
\b[-\w.+]+@[\w.]+\.[a-zA-Z]{2,4}\b
We noted that an email address consists of a username, an @ character, and a domain name. The first part of the regex is \b, which makes sure that the email address starts at a word boundary. Following that, the [-\w.+] character class allows a word character, a dash, a dot, or a plus sign. The dash comes first in the class so that it is read as a literal dash rather than as part of a range. The dot does not need to be escaped here because it sits inside a character class. For the same reason, the plus sign inside the class is a literal plus and not a repetition quantifier. The plus sign immediately following the character class, on the other hand, is an actual quantifier: it matches one or more occurrences of the characters in the class.
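The behaviour of the dot and the plus sign inside a character class can be checked with a short sketch. It assumes Python's re module; the variable name username and the sample strings are made up for illustration and are not part of the chapter's code.

```python
import re

# Username portion of the chapter's pattern, tested on its own.
username = re.compile(r"[-\w.+]+")

# Inside the class, '.' and '+' are literal characters.
print(username.fullmatch("john.doe+news") is not None)  # True
print(username.fullmatch("john doe") is not None)       # False: a space is not in the class

# Outside a class, an unescaped '.' matches any character.
print(re.fullmatch(r"a.c", "abc") is not None)    # True
print(re.fullmatch(r"a[.]c", "abc") is not None)  # False
print(re.fullmatch(r"a[.]c", "a.c") is not None)  # True
```

Using fullmatch here forces the whole string to fit the class, which makes it easy to see which characters are accepted.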
Following this, the @ character is matched literally, since it must be present in every email address. After this, a similar but narrower character class, [\w.]+, matches one or more word characters or dots (this time without the dash and the plus sign). This allows an arbitrary number of sub-domains (for example, misec.net and support.misec.net are both allowed using this construct).
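As a quick check of the domain portion, the following sketch tests it in isolation together with the top-level domain part that follows it in the pattern. It assumes Python's re module; the name domain and the third sample value are hypothetical and chosen only for illustration.

```python
import re

# Domain portion of the chapter's pattern: sub-domains, then the top-level domain.
domain = re.compile(r"[\w.]+\.[a-zA-Z]{2,4}")

for candidate in ["misec.net", "support.misec.net", "my-site.com"]:
    matched = domain.fullmatch(candidate) is not None
    print(candidate, matched)
```

The first two candidates match. The third does not, because the domain character class contains neither the dash nor the plus sign.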
The second-to-last part of the regular expression is \.[a-zA-Z]{2,4}, which corresponds to the top-level domain in the email address (such as .com). The dot is required, and it is escaped so that it matches only a literal dot and not any character. It must be followed by two to four letters, which allows two-letter top-level domains such as de, three-letter ones such as com, and four-letter ones such as info. Finally, the last part of the regex is another \b word-boundary assertion, which makes sure the email address ends at a word boundary, for example just before a space or a punctuation mark.
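Putting the pieces together, here is a minimal sketch of the complete pattern at work, assuming Python's re module. The name EMAIL_RE and the sample text are invented for illustration.

```python
import re

# The full pattern from the chapter, compiled once for reuse.
EMAIL_RE = re.compile(r"\b[-\w.+]+@[\w.]+\.[a-zA-Z]{2,4}\b")

sample = "Contact alice@misec.net or bob.smith+news@support.misec.net for details."
print(EMAIL_RE.findall(sample))
# ['alice@misec.net', 'bob.smith+news@support.misec.net']

# A top-level domain longer than four letters is rejected: {2,4} stops after
# four letters, and the closing \b then fails in the middle of the word.
print(EMAIL_RE.findall("Write to user@example.museum today."))
# []
```

Because findall scans the whole string, the same call can be applied to the text of a downloaded page to collect every address the pattern recognizes.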