Regular expressions in AWK programming: What, Why, and How

AWK is a pattern-matching language. It searches for a pattern in a file and, upon finding a match, performs the associated action on the input line. The pattern can be a fixed string or a variable pattern of text, and variable patterns are generally described with regular expressions. Hence, regular expressions form an important part of the AWK programming language.

Today we will introduce you to regular expressions in AWK programming, getting started with string-matching patterns and the basic constructs to use with AWK.

This article is an excerpt from a book written by Shiwang Kalkhanda, titled Learning AWK Programming.

What is a regular expression?

A regular expression, or regexpr, is a sequence of characters used to describe a pattern. A regular expression is generally used to match lines in a file that contain a particular pattern. Many Unix utilities, such as grep, sed, and awk, operate on plain text files line by line, so a regular expression searches for a pattern on a single line in a file.

A regular expression doesn't search for a pattern that begins on one line and ends on another. Other programming languages may support this, notably Perl.
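The line-by-line behavior can be seen in a minimal sketch like the following. The sample text is made up for illustration and is not part of the book's data set.

```shell
# Sample text (invented): "foo" ends line 1 and "bar" starts line 2,
# while line 3 has both words next to each other.
text='alpha foo
bar beta
foo bar together'

# awk tests each line (record) separately, so "foo bar" is only found
# where both words sit on the same line.
matches=$(printf '%s\n' "$text" | awk '/foo bar/')
printf '%s\n' "$matches"
```

Only the third line is printed, because the first two lines never contain the whole pattern on their own.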

Why use regular expressions?

Generally, all editors have the ability to perform search-and-replace operations. Some editors can only search for patterns, others can also replace them, and others can also print the line containing that pattern. Regular expressions go many steps beyond this simple search, replace, and print functionality, and hence are more powerful and flexible. We can search for a word of a certain size, such as a word that has four characters or numbers. We can search for a word that ends with a particular character, let's say e. We can search for phone numbers, email IDs, and so on, and can also perform validation using regular expressions. They simplify complex pattern-matching tasks and hence form an important part of AWK programming. Other regular expression variations also exist, notably those for Perl.
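As a rough illustration of the searches just mentioned, the sketch below picks out four-letter words and words ending in e. The word list is invented, and the patterns rely on anchors and character classes, which are described in the construct tables later in this article.

```shell
# Sample words, one per line (invented for illustration).
words='tree
apple
code
sky
stone'

# Lines that consist of exactly four lowercase letters.
four=$(printf '%s\n' "$words" | awk '/^[a-z][a-z][a-z][a-z]$/')

# Lines that end with the letter e.
ends_e=$(printf '%s\n' "$words" | awk '/e$/')

printf 'four letters:\n%s\n' "$four"
printf 'ends with e:\n%s\n' "$ends_e"
```

The character class is repeated four times instead of using a repetition count, on the assumption that not every awk implementation enables interval expressions by default.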

Using regular expressions with AWK

There are mainly two types of regular expressions in Linux:

Basic regular expressions that are used by vi, sed, grep, and so on

Extended regular expressions that are used by awk, nawk, gawk, and egrep

Here, we will refer to extended regular expressions as regular expressions in the context of AWK.

In AWK, regular expressions are enclosed in forward slashes, '/', (forming the AWK pattern) and match every input record that contains text matching the expression.

The simplest regular expression is a string of letters, numbers, or both that matches itself. For example, here we use the ly regular expression string to print all lines that contain the ly pattern in them. We just need to enclose the regular expression in forward slashes in AWK:

$ awk '/ly/' emp.dat

The output on execution of this code is as follows:

Billy   Chabra  9911664321  bily@yahoo.com      M   lgs     1900
Emily   Kaur    8826175812  emily@gmail.com     F   Ops     2100

In this example, the /ly/ pattern matches when the current input line contains the ly sub-string, either as ly itself or as some part of a bigger word, such as Billy or Emily, and prints the corresponding line.
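The full emp.dat file is not reproduced in this excerpt. To try the command, one option is to build a small sample from rows that appear in the article's outputs, as in this sketch. The temporary file and the variable names emp and ly_lines are assumptions made for the example.

```shell
# Create a temporary stand-in for emp.dat with three rows copied from
# the outputs shown in this article (not the complete file).
emp=$(mktemp)
cat > "$emp" <<'EOF'
Jack    Singh   9857532312  jack@gmail.com      M   hr      2000
Billy   Chabra  9911664321  bily@yahoo.com      M   lgs     1900
Emily   Kaur    8826175812  emily@gmail.com     F   Ops     2100
EOF

# Print only the lines that contain "ly" anywhere in the line.
ly_lines=$(awk '/ly/' "$emp")
printf '%s\n' "$ly_lines"
```

With this sample, the Billy and Emily rows are printed and the Jack row is skipped.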

Regular expressions as string-matching patterns with AWK

Regular expressions are used as string-matching patterns with AWK in the following three ways. The second and third forms use the '~' and '!~' match operators to perform regular expression comparisons:

/regexpr/: This matches when the current input line contains a sub-string matched by regexpr. It is the most basic form of regular expression, which matches itself as a string or sub-string. For example, /mail/ matches when the current input line contains the string mail, either as a word by itself or as part of a larger word. So, we will get lines with Gmail as well as Hotmail in the email ID field of the employee database as follows:

$ awk '/mail/' emp.dat

The output on execution of this code is as follows:

Jack    Singh   9857532312  jack@gmail.com      M   hr      2000
Jane    Kaur    9837432312  jane@gmail.com      F   hr      1800
Eva     Chabra  8827232115  eva@gmail.com       F   lgs     2100
Ana     Khanna  9856422312  anak@hotmail.com    F   Ops     2700
Victor  Sharma  8826567898  vics@hotmail.com    M   Ops     2500
John    Kapur   9911556789  john@gmail.com      M   hr      2200
Sam     khanna  8856345512  sam@hotmail.com     F   lgs     2300
Emily   Kaur    8826175812  emily@gmail.com     F   Ops     2100
Amy     Sharma  9857536898  amys@hotmail.com    F   Ops     2500

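In the preceding example, we did not specify an expression to match against, so AWK automatically matched the pattern against the whole input line ($0). The same command can be written explicitly as follows: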

$ awk '$0 ~ /mail/' emp.dat

The output on execution of this code is as follows:

Jack    Singh   9857532312  jack@gmail.com      M   hr      2000
Jane    Kaur    9837432312  jane@gmail.com      F   hr      1800
Eva     Chabra  8827232115  eva@gmail.com       F   lgs     2100
Ana     Khanna  9856422312  anak@hotmail.com    F   Ops     2700
Victor  Sharma  8826567898  vics@hotmail.com    M   Ops     2500
John    Kapur   9911556789  john@gmail.com      M   hr      2200
Sam     khanna  8856345512  sam@hotmail.com     F   lgs     2300
Emily   Kaur    8826175812  emily@gmail.com     F   Ops     2100
Amy     Sharma  9857536898  amys@hotmail.com    F   Ops     2500
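A quick way to check that the two forms are equivalent is to run both on the same input and compare the results. This is a minimal sketch using three rows from the article's sample data. The variable names are assumptions made for the example.

```shell
# Three sample rows taken from the article's employee data.
rows='Jack    Singh   9857532312  jack@gmail.com      M   hr      2000
Hari    Singh   8827255666  hari@yahoo.com      M   Ops     2350
Ana     Khanna  9856422312  anak@hotmail.com    F   Ops     2700'

# Bare pattern: implicitly matched against the whole record ($0).
implicit=$(printf '%s\n' "$rows" | awk '/mail/')

# The same test written out with the match operator.
explicit=$(printf '%s\n' "$rows" | awk '$0 ~ /mail/')

# Both forms select the same lines (the Jack and Ana rows here).
[ "$implicit" = "$explicit" ] && echo "same output"
printf '%s\n' "$implicit"
```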

expression ~ /regexpr/: This matches if the string value of the expression contains a sub-string matched by regexpr. Generally, the left-hand operand of the matching operator is a field. For example, in the following command, we print all the lines in which the value in the second field contains the Singh string:

$ awk '$2 ~ /Singh/{ print }' emp.dat

We can also use the expression as follows:

$ awk '{ if($2 ~ /Singh/) print}' emp.dat

The output on execution of the preceding code is as follows:

Jack    Singh   9857532312  jack@gmail.com      M   hr      2000
Hari    Singh   8827255666  hari@yahoo.com      M   Ops     2350
Ginny   Singh   9857123466  ginny@yahoo.com     F   hr      2250
Vina    Singh   8811776612  vina@yahoo.com      F   lgs     2300

expression !~ /regexpr/: This matches if the string value of the expression does not contain a sub-string matched by regexpr. Generally, this expression is also a field variable. For example, in the following command, we print all the lines that don't contain the Singh sub-string in the second field:

$ awk '$2 !~ /Singh/{ print }' emp.dat

The output on execution of the preceding code is as follows:

Jane    Kaur    9837432312  jane@gmail.com      F   hr      1800
Eva     Chabra  8827232115  eva@gmail.com       F   lgs     2100
Amit    Sharma  9911887766  amit@yahoo.com      M   lgs     2350
Julie   Kapur   8826234556  julie@yahoo.com     F   Ops     2500
Ana     Khanna  9856422312  anak@hotmail.com    F   Ops     2700
Victor  Sharma  8826567898  vics@hotmail.com    M   Ops     2500
John    Kapur   9911556789  john@gmail.com      M   hr      2200
Billy   Chabra  9911664321  bily@yahoo.com      M   lgs     1900
Sam     khanna  8856345512  sam@hotmail.com     F   lgs     2300
Emily   Kaur    8826175812  emily@gmail.com     F   Ops     2100
Amy     Sharma  9857536898  amys@hotmail.com    F   Ops     2500
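Because ~ and !~ are complements of each other, they split the input into two groups with no overlap. The sketch below shows this on three rows from the article's sample data, printing only the first field to keep the output short. The variable names are assumptions made for the example.

```shell
# Three sample rows taken from the article's employee data.
rows='Jack    Singh   9857532312  jack@gmail.com      M   hr      2000
Jane    Kaur    9837432312  jane@gmail.com      F   hr      1800
Hari    Singh   8827255666  hari@yahoo.com      M   Ops     2350'

# First names of employees whose second field contains "Singh".
singh=$(printf '%s\n' "$rows" | awk '$2 ~ /Singh/ { print $1 }')

# First names of everyone else.
others=$(printf '%s\n' "$rows" | awk '$2 !~ /Singh/ { print $1 }')

printf 'Singh: %s\n' "$(printf '%s' "$singh" | tr '\n' ' ')"
printf 'Others: %s\n' "$others"
```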

Any expression may be used in place of /regexpr/ in the context of ~ and !~; its string value is then interpreted as the regular expression. These match expressions can also be used as conditions in if, while, for, and do statements.
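Both points can be sketched briefly. In the first command the right-hand side of ~ is a variable rather than a /regexpr/ literal, and in the second a match expression is the condition of an if statement. The variable name pat and the labels gmail and other are assumptions made for the example.

```shell
# Three sample rows taken from the article's employee data.
rows='Jack    Singh   9857532312  jack@gmail.com      M   hr      2000
Jane    Kaur    9837432312  jane@gmail.com      F   hr      1800
Hari    Singh   8827255666  hari@yahoo.com      M   Ops     2350'

# The string value of the variable pat is used as the regular expression.
dynamic=$(printf '%s\n' "$rows" | awk -v pat='Kaur' '$2 ~ pat { print $1 }')

# A match expression used as the condition of an if statement.
labels=$(printf '%s\n' "$rows" |
  awk '{ if ($4 ~ /gmail/) print $1, "gmail"; else print $1, "other" }')

printf '%s\n' "$dynamic"
printf '%s\n' "$labels"
```

Passing the pattern with -v makes it easy to change the search term without editing the AWK program itself.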

Basic regular expression construct

Regular expressions are made up of two types of characters: normal text characters, called literals, and special characters, such as *, +, ?, and ., called metacharacters. There are times when you want to match a metacharacter as a literal character. In such cases, we prefix that metacharacter with a backslash (\), a combination known as an escape sequence.
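The difference between a metacharacter and its escaped form can be seen in a short sketch. The input lines are invented for illustration.

```shell
# Sample lines (invented for illustration).
lines='3.14
3514
price$'

# An unescaped dot is a metacharacter and matches any single character,
# so both 3.14 and 3514 are selected.
any=$(printf '%s\n' "$lines" | awk '/3.14/')

# An escaped dot matches only a literal period, so only 3.14 is selected.
literal=$(printf '%s\n' "$lines" | awk '/3\.14/')

# An escaped dollar sign matches a literal "$" rather than the end of the string.
dollar=$(printf '%s\n' "$lines" | awk '/\$/')

printf '%s\n' "$any" "$literal" "$dollar"
```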

The basic regular expression construct can be summarized as follows:

Here is the list of metacharacters, also known as special characters, that are used in building regular expressions:

\ ^ $ . [ ] | ( ) * + ?

The following table lists the remaining elements that are used in building a basic regular expression, apart from the metacharacters mentioned before:

Literal	A literal character (non-metacharacter), such as A, that matches itself.
Escape sequence	An escape sequence that matches a special symbol: for example, \t matches a tab.
Quoted metacharacter (\)	In quoted metacharacters, we prefix the metacharacter with a backslash, such as \$, which matches the metacharacter $ literally.
Anchor (`^`)	Matches the beginning of a string.
Anchor (`$`)	Matches the end of a string.
Dot (`.`)	Matches any single character.
Character classes (`[...]`)	A character class [ABC] matches any one of the A, B, or C characters. Character classes may include ranges, such as [A-Za-z], which matches any single letter.
Complemented character classes (`[^...]`)	A complemented character class such as [^0-9] matches any single character except a digit.
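The anchors, the dot, and the character classes from the table can be tried out with a sketch like the following. The word list is invented for illustration.

```shell
# Sample words, one per line (invented for illustration).
words='cat
concat
cot
c9t
Cat'

# ^ anchors the match at the beginning of the line: cat
starts=$(printf '%s\n' "$words" | awk '/^cat/')

# $ anchors the match at the end of the line: cat, concat
ends=$(printf '%s\n' "$words" | awk '/cat$/')

# . matches any single character between c and t: cat, cot, c9t
dot=$(printf '%s\n' "$words" | awk '/^c.t$/')

# [Cc] matches either an uppercase or a lowercase c: cat, Cat
class=$(printf '%s\n' "$words" | awk '/^[Cc]at$/')

# [^0-9] matches any single character that is not a digit: cat, cot
nodigit=$(printf '%s\n' "$words" | awk '/^c[^0-9]t$/')

printf '%s\n' "$starts" "$ends" "$dot" "$class" "$nodigit"
```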

These operators combine regular expressions into larger ones:

Alternation (|)	A|B matches A or B.
Concatenation	AB matches A immediately followed by B.
Closure (*)	A* matches zero or more As.
Positive closure (+)	A+ matches one or more As.
Zero or one (?)	A? matches the null string or A.
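Parentheses ()	Used for grouping regular expressions, as in (r). AWK regular expressions themselves do not support back-references; in gawk, the text matched by a group can be referenced as \n (where n is a digit) in the replacement text of gensub().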
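The combining operators can be compared side by side in a short sketch. The input strings are invented, and each pattern is anchored with ^ and $ so that the whole line has to match.

```shell
# Sample strings, one per line (invented for illustration).
words='ab
aab
b
abab
cd'

# Alternation: the line is either ab or cd.
alt=$(printf '%s\n' "$words" | awk '/^(ab|cd)$/')

# Closure: zero or more a characters followed by b (ab, aab, b).
star=$(printf '%s\n' "$words" | awk '/^a*b$/')

# Positive closure: one or more a characters followed by b (ab, aab).
plus=$(printf '%s\n' "$words" | awk '/^a+b$/')

# Zero or one: an optional a followed by b (ab, b).
opt=$(printf '%s\n' "$words" | awk '/^a?b$/')

# Grouping: parentheses make + apply to the whole sequence ab (ab, abab).
group=$(printf '%s\n' "$words" | awk '/^(ab)+$/')

printf '%s\n' "$alt" "$star" "$plus" "$opt" "$group"
```

Without the parentheses, /^ab+$/ would repeat only the b, so grouping changes which part of the expression an operator applies to.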