You're reading from Modern C++ Programming Cookbook Master C++ core language and standard library features, with over 100 recipes, updated to C++20

Product type Paperback

Published in Sep 2020

Publisher Packt

ISBN-13 9781800208988

Length 750 pages

Edition 2nd Edition

Languages

C++

Concepts

Programming Language

Author (1):

Marius Bancila

Parsing the content of a string using regular expressions

In the previous recipe, we looked at how to use std::regex_match() to verify that the content of a string matches a particular format. The library provides another algorithm called std::regex_search() that matches a regular expression against any part of a string, and not only the entire string, as regex_match() does. This function, however, does not allow us to search through all the occurrences of a regular expression in an input string. For this purpose, we need to use one of the iterator classes available in the library.

In this recipe, you will learn how to parse the content of a string using regular expressions. For this purpose, we will consider the problem of parsing a text file containing name-value pairs. Each such pair is defined on a different line and has the format name = value, but lines starting with a # represent comments and must be ignored. The following is an example:

#remove # to uncomment a line
timeout=120
server = 127.0.0.1
#retrycount=3

Before looking at the implementation details, let's consider some prerequisites.

Getting ready

For general information about regular expression support in C++11, refer to the Verifying the format of a string using regular expressions recipe, earlier in this chapter. Basic knowledge of regular expressions is required to proceed with this recipe.

In the following examples, text is a variable that's defined as follows:

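```
// The lines inside the raw string are flush left: the pattern used in
// this recipe expects each name to begin at the start of its line.
auto text {
R"(
#remove # to uncomment a line
timeout=120
server = 127.0.0.1
#retrycount=3
)"s};
```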

The sole purpose of this is to simplify our snippets, although in a real-world example, you will probably be reading the text from a file or other source.

How to do it...

In order to search for occurrences of a regular expression through a string, you should do the following:

1. Include the headers <regex> and <string>, plus <iostream> for the output in the examples, and the namespace std::string_literals for the C++14 standard user-defined literals for strings:
```
#include <iostream>
#include <regex>
#include <string>
using namespace std::string_literals;
```
2. Use raw string literals to specify a regular expression in order to avoid escaping backslashes (which can occur frequently). The following regular expression validates the file format proposed earlier:
```
auto pattern {R"(^(?!#)(\w+)\s*=\s*([\w\d]+[\w\d._,\-:]*)$)"s};
```
3. Create a std::regex/std::wregex object (depending on the character set that is used) to encapsulate the regular expression:
```
auto rx = std::regex{pattern};
```
4. To search for the first occurrence of a regular expression in a given text, use the general-purpose algorithm std::regex_search() (example 1):
```
auto match = std::smatch{};
if (std::regex_search(text, match, rx))
{
  std::cout << match[1] << '=' << match[2] << '\n';
}
```
5. To find all the occurrences of a regular expression in a given text, use the iterator std::regex_iterator (example 2):
```
auto end = std::sregex_iterator{};
for (auto it=std::sregex_iterator{ std::begin(text),
                                   std::end(text), rx };
     it != end; ++it)
{
  std::cout << '\'' << (*it)[1] << "'='"
            << (*it)[2] << '\'' << '\n';
}
```
6. To iterate through all the subexpressions of a match, use the iterator std::regex_token_iterator (example 3):
```
auto end = std::sregex_token_iterator{};
for (auto it = std::sregex_token_iterator{
                  std::begin(text), std::end(text), rx };
     it != end; ++it)
{
  std::cout << *it << '\n';
}
```

How it works...

A simple regular expression that can parse the input file shown earlier may look like this:

^(?!#)(\w+)\s*=\s*([\w\d]+[\w\d._,\-:]*)$

This regular expression ignores all lines that start with a #. On lines that do not start with #, it matches a name followed by the equals sign and then a value that can be composed of alphanumeric characters and several other characters (underscore, dot, comma, and so on). The exact meaning of each part of this regular expression is explained as follows:

Part	Description
`^`	Start of line.
`(?!#)`	A negative lookahead that makes sure the next character is not `#`.
`(\w+)`	A capturing group representing an identifier of at least one word character (letter, digit, or underscore).
`\s*`	Zero or more whitespace characters.
`=`	Equals sign.
`\s*`	Zero or more whitespace characters.
`([\w\d]+[\w\d._,\-:]*)`	A capturing group representing a value that starts with a word character (alphanumeric or underscore), but can also contain a dot, comma, hyphen, or colon. Inside the character class, `\-` is an escaped hyphen, not a backslash.
`$`	End of line.
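To see how the parts of the pattern work together, it can help to test the expression against one line at a time. The following is a minimal sketch, not part of the recipe itself: the helper name is_name_value_line is hypothetical, and the function assumes it receives a single line with no embedded newline.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns true if a single line has the form
// name = value and does not start with '#'.
bool is_name_value_line(std::string const & line)
{
  // Assumption of this sketch: the caller passes one line at a time.
  assert(line.find('\n') == std::string::npos);

  // Same pattern as in the recipe; static so that it is compiled only once.
  static std::regex const rx{
    R"(^(?!#)(\w+)\s*=\s*([\w\d]+[\w\d._,\-:]*)$)"};

  // regex_match succeeds only if the whole line matches the pattern.
  return std::regex_match(line, rx);
}
```

With this helper, timeout=120 and server = 127.0.0.1 are accepted, #retrycount=3 is rejected by the negative lookahead, and a value such as C:\temp is rejected because the backslash is not part of the character class.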

We can use std::regex_search() to search for a match anywhere in the input text. This algorithm has several overloads, but in general, they work in the same way. You must specify the range of characters to work through (as a pair of iterators, a C string, or a std::string), an output std::match_results object that will contain the result of the match (some overloads omit it), and a std::basic_regex object representing the regular expression. Optionally, you can also pass matching flags, which define the way the search is done. The function returns true if a match was found or false otherwise.

In the first example from the previous section (see the fourth list item), match is an instance of std::smatch, which is a typedef of std::match_results with std::string::const_iterator as the template type. If a match was found, this object will contain the matching information in a sequence of values for all matched subexpressions. The submatch at index 0 is always the entire match. The submatch at index 1 is the first subexpression that was matched, the submatch at index 2 is the second subexpression that was matched, and so on. Since we have two capturing groups (which are subexpressions) in our regular expression, the std::match_results will have three submatches in the event of success. The identifier representing the name is at index 1, and the value after the equals sign is at index 2. Therefore, this code prints only the first name-value pair that is found:

Figure 2.4: Output of first example
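The layout of the submatches can be made explicit with a small helper. This is only a sketch under stated assumptions: parse_pair is a hypothetical name, the input is a single line, and std::optional (C++17) is used to signal that no match was found.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: extracts the name and the value from one line,
// or returns std::nullopt if the line does not match.
std::optional<std::pair<std::string, std::string>>
parse_pair(std::string const & line)
{
  static std::regex const rx{
    R"(^(?!#)(\w+)\s*=\s*([\w\d]+[\w\d._,\-:]*)$)"};

  std::smatch match;
  if (!std::regex_search(line, match, rx))
    return std::nullopt;

  // Index 0 is the entire match; indexes 1 and 2 are the capturing groups.
  assert(match.size() == 3);
  return std::make_pair(match[1].str(), match[2].str());
}
```

For the line server = 127.0.0.1, the returned pair holds server and 127.0.0.1, while a comment line such as #retrycount=3 yields an empty optional.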

The std::regex_search() algorithm is not able to iterate through all the possible matches in a piece of text. To do that, we need to use an iterator. std::regex_iterator is intended for this purpose. It allows not only iterating through all the matches, but also accessing all the submatches of a match.

The iterator actually calls std::regex_search() upon construction and on each increment, and it remembers the resulting std::match_results from the call. The default constructor creates an iterator that represents the end of the sequence and can be used to test when the loop through the matches should stop.

In the second example from the previous section (see the fifth list item), we first create an end-of-sequence iterator, and then we start iterating through all the possible matches. When constructed, the iterator calls std::regex_search(), and if a match is found, we can access its results through the current iterator. Each increment repeats the search from the end of the previous match, and this continues until no match is found (the end of the sequence). This code will print the following output:

Figure 2.5: Output of second example
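In practice, the matches are usually collected into a container rather than printed. The sketch below shows one way to do that with std::sregex_iterator. The function name collect_pairs is hypothetical, and the pattern is deliberately simplified: it drops the ^ and $ anchors and the comment handling, because the way these anchors treat embedded newlines can differ between standard library implementations.

```cpp
#include <cassert>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: collects every name = value pair found in text.
// Assumption: simplified pattern without anchors or comment handling.
std::map<std::string, std::string> collect_pairs(std::string const & text)
{
  static std::regex const rx{R"((\w+)\s*=\s*([\w.]+))"};

  std::map<std::string, std::string> pairs;
  auto const end = std::sregex_iterator{};
  for (auto it = std::sregex_iterator{text.begin(), text.end(), rx};
       it != end; ++it)
  {
    // Dereferencing the iterator yields the std::smatch of the current match.
    std::smatch const & m = *it;
    assert(m.size() == 3);
    pairs[m[1].str()] = m[2].str();
  }
  return pairs;
}
```

Called with the two lines timeout=120 and server = 127.0.0.1, the function returns a map with two entries. Note that the regular expression object must outlive the iterator, which is why it is not created as a temporary.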

An alternative to std::regex_iterator is std::regex_token_iterator. It works similarly to std::regex_iterator and, in fact, contains such an iterator internally, except that it enables us to access a particular subexpression from a match. This is shown in the third example in the How to do it... section (see the sixth list item). We start by creating an end-of-sequence iterator and then loop through the matches until the end-of-sequence is reached. In the constructor we used, we did not specify the index of the subexpression to access through the iterator; therefore, the default value of 0 is used. This means this program will print the entire text of each match:

Figure 2.6: Output of third example

If we wanted to access only the first subexpression (the names, in our case), all we would have to do is specify the index of the subexpression in the constructor of the token iterator, as shown here:

```
auto end = std::sregex_token_iterator{};
for (auto it = std::sregex_token_iterator{ std::begin(text),
               std::end(text), rx, 1 };
     it != end; ++it)
{
  std::cout << *it << '\n';
}
```

This time, the output that we get contains only the names. This is shown in the following image:

Figure 2.7: Output containing only the names
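The token iterator also has constructor overloads that accept several subexpression indexes at once, in which case it yields the requested submatches of each match in turn. The following sketch illustrates this with the indexes 1 and 2. The function name names_and_values is hypothetical, and the pattern is a simplified, anchor-free variant of the one used in this recipe.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns name1, value1, name2, value2, ...
// Assumption: simplified pattern without anchors or comment handling.
std::vector<std::string> names_and_values(std::string const & text)
{
  static std::regex const rx{R"((\w+)\s*=\s*([\w.]+))"};

  std::vector<std::string> tokens;
  auto const end = std::sregex_token_iterator{};
  // {1, 2} asks for the first and the second capturing group of every match.
  for (auto it = std::sregex_token_iterator{
                    text.begin(), text.end(), rx, {1, 2}};
       it != end; ++it)
  {
    tokens.push_back(it->str());
  }

  // Each match contributes exactly one name and one value.
  assert(tokens.size() % 2 == 0);
  return tokens;
}
```

For the two lines timeout=120 and server = 127.0.0.1, the result is the sequence timeout, 120, server, 127.0.0.1.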

An interesting thing about the token iterator is that it can return the unmatched parts of the string if the index of the subexpression is -1. In that case, each dereference returns a std::sub_match object that corresponds to the sequence of characters between the previous match (or the beginning of the text) and the current match, followed at the end by the characters between the last match and the end of the sequence:

```
auto end = std::sregex_token_iterator{};
for (auto it = std::sregex_token_iterator{ std::begin(text),
               std::end(text), rx, -1 };
     it != end; ++it)
{
  std::cout << *it << '\n';
}
```
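A common use of the -1 index is to split a string, with the regular expression describing the separator rather than the content. The sketch below is an illustration only: the function name split_on_commas and the comma-separated input are assumptions that are not part of this recipe.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits text on commas, discarding the whitespace
// around each comma.
std::vector<std::string> split_on_commas(std::string const & text)
{
  // The pattern describes the separator, not the tokens.
  static std::regex const separator{R"(\s*,\s*)"};

  std::vector<std::string> parts;
  auto const end = std::sregex_token_iterator{};
  // Index -1 selects the text between matches instead of the matches.
  for (auto it = std::sregex_token_iterator{
                    text.begin(), text.end(), separator, -1};
       it != end; ++it)
  {
    parts.push_back(it->str());
  }
  return parts;
}
```

For the input alpha, beta ,gamma, the function returns the three parts alpha, beta, and gamma.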