Verifying the format of a string using regular expressions
Regular expressions are a language intended for performing pattern matching and replacements in texts. C++11 provides support for regular expressions within the standard library through a set of classes, algorithms, and iterators available in the header <regex>
. In this recipe, we will learn how regular expressions can be used to verify that a string matches a pattern (examples can include verifying an email or IP address formats).
Getting ready
Throughout this recipe, we will explain, whenever necessary, the details of the regular expressions that we use. However, you should have at least some basic knowledge of regular expressions in order to use the C++ standard library for regular expressions. A description of regular expressions syntax and standards is beyond the purpose of this book; if you are not familiar with regular expressions, it is recommended that you read more about them before continuing with the recipes that focus on regular expressions. Good online resources for learning, building, and debugging regular expressions can be found at https://regexr.com and https://regex101.com.
How to do it...
In order to verify that a string matches a regular expression, perform the following steps:
- Include the headers
<regex>
and<string>
and the namespacestd::string_literals
for C++14 standard user-defined literals for strings:#include <regex> #include <string> using namespace std::string_literals;
- Use raw string literals to specify the regular expression to avoid escaping backslashes (which can occur frequently). The following regular expression validates most email formats:
auto pattern {R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s};
- Create an
std::regex
/std::wregex
object (depending on the character set that is used) to encapsulate the regular expression:auto rx = std::regex{pattern};
- To ignore casing or specify other parsing options, use an overloaded constructor that has an extra parameter for regular expression flags:
auto rx = std::regex{pattern, std::regex_constants::icase};
- Use
std::regex_match()
to match the regular expression with an entire string:auto valid = std::regex_match("marius@domain.com"s, rx);
How it works...
Considering the problem of verifying the format of email addresses, even though this may look like a trivial problem, in practice, it is hard to find a simple regular expression that covers all the possible cases for valid email formats. In this recipe, we will not try to find that ultimate regular expression, but rather apply a regular expression that is good enough for most cases. The regular expression we will use for this purpose is this:
^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$
The following table explains the structure of the regular expression:
Part |
Description |
|
Start of string. |
|
At least one character in the range |
|
The character |
|
At least one character in the range |
|
A dot that separates the domain hostname and label. |
|
The DNS label of a domain that can have between 2 and 63 characters. |
|
End of the string. |
Bear in mind that, in practice, a domain name is composed of a hostname followed by a dot-separated list of DNS labels. Examples include localhost
, gmail.com
and yahoo.co.uk
. This regular expression we are using does not match domains without DNS labels, such as localhost
(an email, such as root@localhost
, is a valid email). The domain name can also be an IP address specified in brackets, such as [192.168.100.11]
(as in john.doe@[192.168.100.11]
). Email addresses containing such domains will not match the regular expression defined previously. Even though these rather rare formats will not be matched, the regular expression can cover most email formats.
The regular expression for the example in this chapter is provided for didactical purposes only, and is not intended to be used as it is in production code. As explained earlier, this sample does not cover all possible email formats.
We began by including the necessary headers; that is, <regex>
for regular expressions and <string>
for strings. The is_valid_email()
function, shown in the following code (which basically contains the samples from the How to do it... section), takes a string representing an email address and returns a Boolean indicating whether the email has a valid format or not.
We first construct an std::regex
object to encapsulate the regular expression indicated with the raw string literal. Using raw string literals is helpful because it avoids escaping backslashes, which are used for escape characters in regular expressions too. The function then calls std::regex_match()
, passing the input text and the regular expression:
bool is_valid_email_format(std::string const & email)
{
auto pattern {R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s};
auto rx = std::regex{pattern, std::regex_constants::icase};
return std::regex_match(email, rx);
}
The std::regex_match()
method tries to match the regular expression against the entire string. If successful, it returns true
; otherwise, it returns false
:
auto ltest = [](std::string const & email)
{
std::cout << std::setw(30) << std::left
<< email << " : "
<< (is_valid_email_format(email) ?
"valid format" : "invalid format")
<< '\n';
};
ltest("JOHN.DOE@DOMAIN.COM"s); // valid format
ltest("JOHNDOE@DOMAIL.CO.UK"s); // valid format
ltest("JOHNDOE@DOMAIL.INFO"s); // valid format
ltest("J.O.H.N_D.O.E@DOMAIN.INFO"s); // valid format
ltest("ROOT@LOCALHOST"s); // invalid format
ltest("john.doe@domain.com"s); // invalid format
In this simple test, the only emails that do not match the regular expression are ROOT@LOCALHOST
and john.doe@domain.com
. The first contains a domain name without a dot-prefixed DNS label, and that case is not covered in the regular expression. The second contains only lowercase letters, and in the regular expression, the valid set of characters for both the local part and the domain name was uppercase letters, A to Z.
Instead of complicating the regular expression with additional valid characters (such as [A-Za-z0-9._%+-]
), we can specify that the match can ignore this case. This can be done with an additional parameter to the constructor of the std::basic_regex
class. The available constants for this purpose are defined in the regex_constants
namespace. The following slight change to is_valid_email_format()
will make it ignore the case and allow emails with both lowercase and uppercase letters to correctly match the regular expression:
bool is_valid_email_format(std::string const & email)
{
auto rx = std::regex{
R"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s,
std::regex_constants::icase};
return std::regex_match(email, rx);
}
This is_valid_email_format()
function is pretty simple, and if the regular expression was provided as a parameter, along with the text to match, it could be used for matching anything. However, it would be nice to be able to handle not only multi-byte strings (std::string
), but also wide strings (std::wstring
), with a single function. This can be achieved by creating a function template where the character type is provided as a template parameter:
template <typename CharT>
using tstring = std::basic_string<CharT, std::char_traits<CharT>,
std::allocator<CharT>>;
template <typename CharT>
bool is_valid_format(tstring<CharT> const & pattern,
tstring<CharT> const & text)
{
auto rx = std::basic_regex<CharT>{
pattern, std::regex_constants::icase };
return std::regex_match(text, rx);
}
We start by creating an alias template for std::basic_string
in order to simplify its use. The new is_valid_format()
function is a function template very similar to our implementation of is_valid_email()
. However, we now use std::basic_regex<CharT>
instead of the typedef std::regex
, which is std::basic_regex<char>
, and the pattern is provided as the first argument. We now implement a new function called is_valid_email_format_w()
for wide strings that relies on this function template. The function template, however, can be reused for implementing other validations, such as if a license plate has a particular format:
bool is_valid_email_format_w(std::wstring const & text)
{
return is_valid_format(
LR"(^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$)"s,
text);
}
auto ltest2 = [](auto const & email)
{
std::wcout << std::setw(30) << std::left
<< email << L" : "
<< (is_valid_email_format_w(email) ? L"valid" : L"invalid")
<< '\n';
};
ltest2(L"JOHN.DOE@DOMAIN.COM"s); // valid
ltest2(L"JOHNDOE@DOMAIL.CO.UK"s); // valid
ltest2(L"JOHNDOE@DOMAIL.INFO"s); // valid
ltest2(L"J.O.H.N_D.O.E@DOMAIN.INFO"s); // valid
ltest2(L"ROOT@LOCALHOST"s); // invalid
ltest2(L"john.doe@domain.com"s); // valid
Of all the examples shown here, the only one that does not match is ROOT@LOCALHOST
, as expected.
The std::regex_match()
method has, in fact, several overloads, and some of them have a parameter that is a reference to an std::match_results
object to store the result of the match. If there is no match, then std::match_results
is empty and its size is 0. Otherwise, if there is a match, the std::match_results
object is not empty and its size is 1, plus the number of matched subexpressions.
The following version of the function uses the mentioned overloads and returns the matched subexpressions in an std::smatch
object. Note that the regular expression is changed as three caption groups are defined—one for the local part, one for the hostname part of the domain, and one for the DNS label. If the match is successful, then the std::smatch
object will contain four submatch objects: the first to match the entire string, the second for the first capture group (the local part), the third for the second capture group (the hostname), and the fourth for the third and last capture group (the DNS label). The result is returned in a tuple, where the first item actually indicates success or failure:
std::tuple<bool, std::string, std::string, std::string>
is_valid_email_format_with_result(std::string const & email)
{
auto rx = std::regex{
R"(^([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,})$)"s,
std::regex_constants::icase };
auto result = std::smatch{};
auto success = std::regex_match(email, result, rx);
return std::make_tuple(
success,
success ? result[1].str() : ""s,
success ? result[2].str() : ""s,
success ? result[3].str() : ""s);
}
Following the preceding code, we use C++17 structured bindings to unpack the content of the tuple into named variables:
auto ltest3 = [](std::string const & email)
{
auto [valid, localpart, hostname, dnslabel] =
is_valid_email_format_with_result(email);
std::cout << std::setw(30) << std::left
<< email << " : "
<< std::setw(10) << (valid ? "valid" : "invalid")
<< "local=" << localpart
<< ";domain=" << hostname
<< ";dns=" << dnslabel
<< '\n';
};
ltest3("JOHN.DOE@DOMAIN.COM"s);
ltest3("JOHNDOE@DOMAIL.CO.UK"s);
ltest3("JOHNDOE@DOMAIL.INFO"s);
ltest3("J.O.H.N_D.O.E@DOMAIN.INFO"s);
ltest3("ROOT@LOCALHOST"s);
ltest3("john.doe@domain.com"s);
The output of the program will be as follows:
Figure 2.3: Output of tests
There's more...
There are multiple versions of regular expressions, and the C++ standard library supports six of them: ECMAScript, basic POSIX, extended POSIX, awk, grep, and egrep (grep with the option -E
). The default grammar used is ECMAScript, and in order to use another, you have to explicitly specify the grammar when defining the regular expression. In addition to specifying the grammar, you can also specify parsing options, such as matching by ignoring the case.
The standard library provides more classes and algorithms than what we have seen so far. The main classes available in the library are as follows (all of them are class templates and, for convenience, typedefs are provided for different character types):
- The class template
std::basic_regex
defines the regular expression object:typedef basic_regex<char> regex; typedef basic_regex<wchar_t> wregex;
- The class template
std::sub_match
represents a sequence of characters that matches a capture group; this class is actually derived fromstd::pair
, and itsfirst
andsecond
members represent iterators to the first and the one-past-end characters in the match sequence. If there is no match sequence, the two iterators are equal:typedef sub_match<const char *> csub_match; typedef sub_match<const wchar_t *> wcsub_match; typedef sub_match<string::const_iterator> ssub_match; typedef sub_match<wstring::const_iterator> wssub_match;
- The class template
std::match_results
is a collection of matches; the first element is always a full match in the target, while the other elements are matches of subexpressions:typedef match_results<const char *> cmatch; typedef match_results<const wchar_t *> wcmatch; typedef match_results<string::const_iterator> smatch; typedef match_results<wstring::const_iterator> wsmatch;
The algorithms available in the regular expressions standard library are as follows:
std::regex_match()
: This tries to match a regular expression (represented by anstd::basic_regex
instance) to an entire string.std::regex_search()
: This tries to match a regular expression (represented by anÂstd::basic_regex
instance) to a part of a string (including the entire string).std::regex_replace()
: This replaces matches from a regular expression according to a specified format.
The iterators available in the regular expressions standard library are as follows:
std::regex_interator
: A constant forward iterator used to iterate through the occurrences of a pattern in a string. It has a pointer to anstd::basic_regex
that must live until the iterator is destroyed. Upon creation and when incremented, the iterator callsstd::regex_search()
and stores a copy of thestd::match_results
object returned by the algorithm.std::regex_token_iterator
: A constant forward iterator used to iterate through the submatches of every match of a regular expression in a string. Internally, it uses astd::regex_iterator
to step through the submatches. Since it stores a pointer to anstd::basic_regex
instance, the regular expression object must live until the iterator is destroyed.
It should be mentioned that the standard regex library has poorer performance compared to other implementations (such as Boost.Regex) and does not support Unicode. Moreover, it could be argued that the API itself is cumbersome to use.
See also
- Parsing the content of a string using regular expressions to learn how to perform multiple matches of a pattern in a text
- Replacing the content of a string using regular expressions to see how to perform text replacements with the help of regular expressions
- Using structured bindings to handle multi-return values in Chapter 1, Learning Modern Core Language Features, to learn how to bind variables to subobjects or elements from the initializing expressions