AWK is a pattern-matching language. It searches a file for a pattern and, on finding a match, performs the associated action on the input line. The pattern can be a fixed string or a variable pattern of text, which is generally described with the help of regular expressions. Hence, regular expressions form an important part of the AWK programming language.
Today we will introduce you to the regular expressions in AWK programming and will get started with string-matching patterns and basic constructs to use with AWK.
This article is an excerpt from a book written by Shiwang Kalkhanda, titled Learning AWK Programming.
A regular expression, or regexpr, is a set of characters used to describe a pattern. A regular expression is generally used to match lines in a file that contain a particular pattern. Many Unix utilities operate on plain text files line by line, such as grep, sed, and awk. Regular expressions search for a pattern on a single line in a file.
Generally, all editors have the ability to perform search-and-replace operations. Some editors can only search for patterns, others can also replace them, and others can also print the line containing that pattern. A regular expression goes many steps beyond this simple search, replace, and printing functionality, and hence it is more powerful and flexible. We can search for a word of a certain size, such as a word that has four characters or numbers. We can search for a word that ends with a particular character, let's say e. You can search for phone numbers, email IDs, and so on, and can also perform validation using regular expressions. They simplify complex pattern-matching tasks and hence form an important part of AWK programming. Other regular expression variations also exist, notably those for Perl.
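There are mainly two types of regular expressions in Linux: basic regular expressions (BRE), used by default by utilities such as grep and sed, and extended regular expressions (ERE), used by awk and egrep (grep -E).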
Here, we will refer to extended regular expressions as regular expressions in the context of AWK.
In AWK, regular expressions are enclosed in forward slashes, '/', (forming the AWK pattern) and match every input record whose text belongs to that set.
The simplest regular expression is a string of letters, numbers, or both that matches itself. For example, here we use the ly regular expression string to print all lines that contain the ly pattern in them. We just need to enclose the regular expression in forward slashes in AWK:
$ awk '/ly/' emp.dat
The output on execution of this code is as follows:
Billy Chabra 9911664321 bily@yahoo.com M lgs 1900
Emily Kaur 8826175812 emily@gmail.com F Ops 2100
In this example, the /ly/ pattern matches when the current input line contains the ly sub-string, either as ly itself or as some part of a bigger word, such as Billy or Emily, and prints the corresponding line.
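Regular expressions are used as string-matching patterns with AWK in the following three ways: /regexpr/ matches when the current input line contains a substring matched by regexpr; expression ~ /regexpr/ matches when the string value of expression contains such a substring; and expression !~ /regexpr/ matches when it does not. The '~' and '!~' match operators perform regular expression comparisons. The first form looks like this: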
$ awk '/mail/' emp.dat
The output on execution of this code is as follows:
Jack Singh 9857532312 jack@gmail.com M hr 2000
Jane Kaur 9837432312 jane@gmail.com F hr 1800
Eva Chabra 8827232115 eva@gmail.com F lgs 2100
Ana Khanna 9856422312 anak@hotmail.com F Ops 2700
Victor Sharma 8826567898 vics@hotmail.com M Ops 2500
John Kapur 9911556789 john@gmail.com M hr 2200
Sam khanna 8856345512 sam@hotmail.com F lgs 2300
Emily Kaur 8826175812 emily@gmail.com F Ops 2100
Amy Sharma 9857536898 amys@hotmail.com F Ops 2500
In this example, we do not specify any expression, so the pattern is matched against the whole line ($0). It is equivalent to the following:
$ awk '$0 ~ /mail/' emp.dat
The output on execution of this code is as follows:
Jack Singh 9857532312 jack@gmail.com M hr 2000
Jane Kaur 9837432312 jane@gmail.com F hr 1800
Eva Chabra 8827232115 eva@gmail.com F lgs 2100
Ana Khanna 9856422312 anak@hotmail.com F Ops 2700
Victor Sharma 8826567898 vics@hotmail.com M Ops 2500
John Kapur 9911556789 john@gmail.com M hr 2200
Sam khanna 8856345512 sam@hotmail.com F lgs 2300
Emily Kaur 8826175812 emily@gmail.com F Ops 2100
Amy Sharma 9857536898 amys@hotmail.com F Ops 2500
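The equivalence of the two forms can be checked with a short sketch. It assumes a small sample file, here called emp_sample.dat (a hypothetical name), built from four of the records shown in this article. The file names implicit.out and explicit.out are likewise only illustrative.

```shell
# Build a small sample file from records shown in this article.
# emp_sample.dat is a hypothetical name, not the full emp.dat.
cat > emp_sample.dat <<'EOF'
Jack Singh 9857532312 jack@gmail.com M hr 2000
Amit Sharma 9911887766 amit@yahoo.com M lgs 2350
Ana Khanna 9856422312 anak@hotmail.com F Ops 2700
Billy Chabra 9911664321 bily@yahoo.com M lgs 1900
EOF

# A bare /regexpr/ pattern is tested against the whole record, $0.
awk '/mail/' emp_sample.dat > implicit.out

# The same match written out explicitly with the ~ operator.
awk '$0 ~ /mail/' emp_sample.dat > explicit.out

# cmp exits with status 0 when the two outputs are identical.
cmp implicit.out explicit.out && echo "same output"
```

Both commands select the Jack Singh and Ana Khanna records, because only their lines contain the substring mail.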
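To match against a single field instead of the whole line, we use the ~ operator. The following command prints the lines whose second field contains Singh:
$ awk '$2 ~ /Singh/{ print }' emp.dat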
We can also use the expression as follows:
$ awk '{ if($2 ~ /Singh/) print}' emp.dat
The output on execution of the preceding code is as follows:
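Jack Singh 9857532312 jack@gmail.com M hr 2000
Hari Singh 8827255666 hari@yahoo.com M Ops 2350
Ginny Singh 9857123466 ginny@yahoo.com F hr 2250
Vina Singh 8811776612 vina@yahoo.com F lgs 2300
The !~ operator inverts the match. The following command prints the lines whose second field does not contain Singh:
$ awk '$2 !~ /Singh/{ print }' emp.dat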
The output on execution of the preceding code is as follows:
Jane Kaur 9837432312 jane@gmail.com F hr 1800
Eva Chabra 8827232115 eva@gmail.com F lgs 2100
Amit Sharma 9911887766 amit@yahoo.com M lgs 2350
Julie Kapur 8826234556 julie@yahoo.com F Ops 2500
Ana Khanna 9856422312 anak@hotmail.com F Ops 2700
Victor Sharma 8826567898 vics@hotmail.com M Ops 2500
John Kapur 9911556789 john@gmail.com M hr 2200
Billy Chabra 9911664321 bily@yahoo.com M lgs 1900
Sam khanna 8856345512 sam@hotmail.com F lgs 2300
Emily Kaur 8826175812 emily@gmail.com F Ops 2100
Amy Sharma 9857536898 amys@hotmail.com F Ops 2500
Any expression may be used in place of /regexpr/ in the context of ~ and !~. Such match expressions can also be used as conditions in if, while, for, and do statements.
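As a minimal sketch of this point, the right-hand side of ~ below is a variable rather than a /regexpr/ constant, and a !~ comparison serves as an if condition. The variable name pat and the file name emp_sample.dat are assumptions made for illustration. The records are taken from the outputs shown earlier.

```shell
# Sample file built from records shown in this article
# (emp_sample.dat is a hypothetical name).
cat > emp_sample.dat <<'EOF'
Jack Singh 9857532312 jack@gmail.com M hr 2000
Jane Kaur 9837432312 jane@gmail.com F hr 1800
Hari Singh 8827255666 hari@yahoo.com M Ops 2350
Eva Chabra 8827232115 eva@gmail.com F lgs 2100
EOF

# The right-hand side of ~ is an expression: here the variable pat
# (a hypothetical name), whose string value is used as the regex.
awk -v pat='Singh' '$2 ~ pat { print $1 }' emp_sample.dat
# prints:
# Jack
# Hari

# A match expression used as the condition of an if statement.
awk '{ if ($2 !~ /Singh/) print $1 }' emp_sample.dat
# prints:
# Jane
# Eva
```

Passing the pattern with -v keeps the AWK program itself unchanged when the search term changes.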
Regular expressions are made up of two types of characters: normal text characters, called literals, and special characters, such as *, +, ?, and ., called metacharacters. There are times when you want to match a metacharacter as a literal character. In such cases, we prefix that metacharacter with a backslash (\), which is called an escape sequence.
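The effect of escaping can be seen with a small sketch. The file name dots.txt is hypothetical. The first test line uses an address that appears in this article, and the second is made up purely to show the difference.

```shell
# Two test lines: the first address appears in this article,
# the second is invented for illustration.
printf 'jack@gmail.com\njack@gmailxcom\n' > dots.txt

# Unescaped dot: a metacharacter that matches any single character,
# so both lines are printed.
awk '/gmail.com/' dots.txt

# Escaped dot: matches only a literal ".", so only the first line
# is printed.
awk '/gmail\.com/' dots.txt
```

Forgetting the backslash rarely causes missed matches, but it can let unintended lines through, as the second test line shows.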
The basic regular expression construct can be summarized as follows:
Here is the list of metacharacters, also known as special characters, that are used in building regular expressions:
\ ^ $ . [ ] | ( ) * + ?
The following table lists the remaining elements that are used in building a basic regular expression, apart from the metacharacters mentioned before:
Literal | A literal character (non-metacharacter), such as A, that matches itself. |
Escape sequence | An escape sequence that matches a special symbol: for example, \t matches a tab. |
Quoted metacharacter (\) | A metacharacter prefixed with a backslash, such as \$, which matches the metacharacter $ literally. |
Anchor (^) | Matches the beginning of a string. |
Anchor ($) | Matches the end of a string. |
Dot (.) | Matches any single character. |
Character classes [...] | A character class such as [ABC] matches any one of the characters A, B, or C. Character classes may include ranges, such as [A-Za-z], which matches any single letter. |
Complemented character classes | Complemented character classes [^0-9] match any character except a digit. |
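The anchors, the dot, and the character classes from the table can be tried out with a sketch like the following. It assumes a sample file named emp_sample.dat (a hypothetical name) that holds five of the records shown in this article. Each command prints only the first field of the matching records.

```shell
# Sample file built from records shown in this article
# (emp_sample.dat is a hypothetical name).
cat > emp_sample.dat <<'EOF'
Jack Singh 9857532312 jack@gmail.com M hr 2000
Amit Sharma 9911887766 amit@yahoo.com M lgs 2350
Ana Khanna 9856422312 anak@hotmail.com F Ops 2700
Sam khanna 8856345512 sam@hotmail.com F lgs 2300
Emily Kaur 8826175812 emily@gmail.com F Ops 2100
EOF

# Anchor ^ : first names that begin with A (Amit, Ana).
awk '$1 ~ /^A/ { print $1 }' emp_sample.dat

# Anchor $ : first names that end with y (Emily).
awk '$1 ~ /y$/ { print $1 }' emp_sample.dat

# Dot: first names of exactly three characters (Ana, Sam).
awk '$1 ~ /^...$/ { print $1 }' emp_sample.dat

# Character class: surname Khanna with either K or k (Ana, Sam).
awk '$2 ~ /^[Kk]hanna$/ { print $1 }' emp_sample.dat

# Complemented class: surnames that do not start with S (Ana, Sam, Emily).
awk '$2 ~ /^[^S]/ { print $1 }' emp_sample.dat
```

Note that ^ means the beginning of the string outside brackets but negates the class when it is the first character inside them.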
These operators combine regular expressions into larger ones:
Alternation (|) | A|B matches A or B. |
Concatenation | AB matches A immediately followed by B. |
Closure (*) | A* matches zero or more As. |
Positive closure (+) | A+ matches one or more As. |
Zero or one (?) | A? matches the null string or A. |
Parentheses () | Used for grouping regular expressions and for back-referencing: a parenthesized group (r) can later be referred to as \n, where n is a digit (in gawk, for example, within the replacement text of gensub()). |
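The combining operators can be illustrated in the same way. This is a sketch under the assumption of a sample file named emp_sample.dat (a hypothetical name) built from five records shown in this article. Each command prints the first field of the records it matches.

```shell
# Sample file built from records shown in this article
# (emp_sample.dat is a hypothetical name).
cat > emp_sample.dat <<'EOF'
Jack Singh 9857532312 jack@gmail.com M hr 2000
Amit Sharma 9911887766 amit@yahoo.com M lgs 2350
Ana Khanna 9856422312 anak@hotmail.com F Ops 2700
Billy Chabra 9911664321 bily@yahoo.com M lgs 1900
Emily Kaur 8826175812 emily@gmail.com F Ops 2100
EOF

# Alternation with grouping: surname is Singh or Kaur (Jack, Emily).
awk '$2 ~ /^(Singh|Kaur)$/ { print $1 }' emp_sample.dat

# Closure (*): Bi, then zero or more l, then y (Billy).
awk '$1 ~ /^Bil*y$/ { print $1 }' emp_sample.dat

# Positive closure (+): a 2 followed by one or more digits in the
# last field (Jack, Amit, Ana, Emily).
awk '$7 ~ /^2[0-9]+$/ { print $1 }' emp_sample.dat

# Zero or one (?): the second l is optional, so the address spelled
# bily@ still matches (Billy).
awk '$4 ~ /^bill?y@/ { print $1 }' emp_sample.dat

# Grouping plus concatenation: gmail.com or hotmail.com addresses
# (Jack, Ana, Emily).
awk '$4 ~ /@(g|hot)mail\.com$/ { print $1 }' emp_sample.dat
```

The parentheses limit the scope of the alternation. Without them, /^Singh|Kaur$/ would mean "begins with Singh" or "ends with Kaur".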
Do check out the book Learning AWK Programming to learn more about the intricacies of the AWK programming language for text processing.