This section describes various useful AWK commands and their usage. We will be using the two sample files, cars.dat and emp.dat, for illustrating various useful AWK examples to kick-start your journey with AWK. Most of these examples will be short one-liners that you can include in your daily task automation. You will get the most out of this section if you practice the examples with us in your system while going through them.
Printing without pattern: The simplest AWK program can be as basic as the following:
awk { print } filename
This program consists of only one line, which is an action. In the absence of a pattern, all input lines are printed on the stdout. Also, if you don't specify any field with the print statement, it takes $0, so print $0 will do the same thing, as $0 represents the entire input line:
$ awk '{ print }' cars.dat
This can also be performed as follows:
$ awk '{ print $0 }' cars.dat
This program is equivalent of the cat command implemented on Linux as cat cars.dat. The output on execution of this code is as follows:
maruti swift 2007 50000 5
honda city 2005 60000 3
maruti dezire 2009 3100 6
chevy beat 2005 33000 2
honda city 2010 33000 6
chevy tavera 1999 10000 4
toyota corolla 1995 95000 2
maruti swift 2009 4100 5
maruti esteem 1997 98000 1
ford ikon 1995 80000 1
honda accord 2000 60000 2
fiat punto 2007 45000 3
Printing without action statements: In this example, the program has a pattern, but we don't specify any action statements. The pattern is given between forward slashes, which indicates that it is a regular expression:
$ awk '/honda/' cars.dat
The output on execution of this code is as follows:
honda city 2005 60000 3
honda city 2010 33000 6
honda accord 2000 60000 2
In this case, AWK selects only those input lines that contain the honda pattern/string in them. When we don't specify any action, AWK assumes the action is to print the whole line.
Printing columns or fields: In this section, we will print fields without patterns, with patterns, in a different printing order, and with regular expression patterns:
- Printing fields without specifying any pattern: In this example, we will not include any pattern. The given AWK command prints the first field ($1) and third field ($3) of each input line that is separated by a space (the output field separator, indicated by a comma):
$ awk '{ print $1, $3 }' cars.dat
The output on execution of this code is as follows:
maruti 2007
honda 2005
maruti 2009
chevy 2005
honda 2010
chevy 1999
toyota 1995
maruti 2009
maruti 1997
ford 1995
honda 2000
fiat 2007
- Printing fields with matching patterns: In this example, we will include both actions and patterns. The given AWK command prints the first field ($1) separated by tab (specifying \t as the output separator) with the third field ($3) of input lines, which contain the maruti string in them:
$ awk '/maruti/{ print $1 "\t" $3 }' cars.dat
The output on execution of this code is as follows:
maruti 2007
maruti 2009
maruti 2009
maruti 1997
- Printing fields for matching regular expressions: In this example, AWK selects the lines containing matches for the ith regular expression in them and prints the first ($1), second ($2), and third ($3) field, separated by tab:
$ awk '/i/{ print $1 "\t" $2 "\t" $3 }' cars.dat
The output on execution of this code is as follows:
maruti swift 2007
honda city 2005
maruti dezire 2009
honda city 2010
maruti swift 2009
maruti esteem 1997
ford ikon 1995
fiat punto 2007
- Printing fields in any order with custom text: In this example, we will print fields in different orders. Here, in the action statement, we put the "Mileage in kms is : " text before the $4 field and the " for car model -> " text before the $1 field in the output:
$ awk '{ print "Mileage in kms is : " $4 ", for car model -> " $1,$2 }' cars.dat
The output on execution of this code is as follows:
Mileage in kms is : 50000, for car model -> maruti swift
Mileage in kms is : 60000, for car model -> honda city
Mileage in kms is : 3100, for car model -> maruti dezire
Mileage in kms is : 33000, for car model -> chevy beat
Mileage in kms is : 33000, for car model -> honda city
Mileage in kms is : 10000, for car model -> chevy tavera
Mileage in kms is : 95000, for car model -> toyota corolla
Mileage in kms is : 4100, for car model -> maruti swift
Mileage in kms is : 98000, for car model -> maruti esteem
Mileage in kms is : 80000, for car model -> ford ikon
Mileage in kms is : 60000, for car model -> honda accord
Mileage in kms is : 45000, for car model -> fiat punto
Printing the number of fields in a line: You can print any number of fields, such as $1 and $2. In fact, you can use any expression after $ and the numeric outcome of the expression will print the corresponding field. AWK has built-in variables to count and store the number of fields in the current input line, for example, NF. So, in the given example, we will print the number of the field for each input line, followed by the first field and the last field (accessed using NF):
$ awk '{ print NF, $1, $NF }' cars.dat
The output on execution of this code is as follows:
5 maruti 5
5 honda 3
5 maruti 6
5 chevy 2
5 honda 6
5 chevy 4
5 toyota 2
5 maruti 5
5 maruti 1
5 ford 1
5 honda 2
5 fiat 3
Deleting empty lines using NF: We can print all the lines with at least 1 field using NF > 0. This is the easiest method to remove empty lines from the file using AWK:
$ awk 'NF > 0 { print }' /etc/hosts
On execution of the preceding command, only non-empty lines from the /etc/hosts file will be displayed on the Terminal as output:
#
# hosts This file describes a number of hostname-to-address
# mappings for the TCP/IP subsystem. It is mostly
# used at boot time, when no name servers are running.
# On small systems, this file can be used instead of a
# "named" name server.
# Syntax:
#
# IP-Address Full-Qualified-Hostname Short-Hostname
#
127.0.0.1 localhost
# special IPv6 addresses
::1 localhost ipv6-localhost ipv6-loopback
fe00::0 ipv6-localnet
ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
Printing line numbers in the output: AWK has a built-in variable known as NR. It counts the number of input lines read so far. We can use NR to prefix $0 in the print statement to display the line numbers of each line that has been processed:
$ awk '{ print NR, $0 }' cars.dat
The output on execution of this code is as follows:
1 maruti swift 2007 50000 5
2 honda city 2005 60000 3
3 maruti dezire 2009 3100 6
4 chevy beat 2005 33000 2
5 honda city 2010 33000 6
6 chevy tavera 1999 10000 4
7 toyota corolla 1995 95000 2
8 maruti swift 2009 4100 5
9 maruti esteem 1997 98000 1
10 ford ikon 1995 80000 1
11 honda accord 2000 60000 2
12 fiat punto 2007 45000 3
Count the numbers of lines in a file using NR: In our next example, we will count the number of lines in a file using NR. As NR stores the current input line number, we need to process all the lines in a file, so we will not specify any pattern. We also don't want to print the line numbers for each line, as our requirement is to just fetch the total lines in a file. Since the END block is executed after processing the input line is done, we will print NR in the END block to print the total number of lines in the file:
$ awk ' END { print "The total number of lines in file are : " NR } ' cars.dat
>>>
The total number of lines in file are : 12
Printing numbered lines exclusively from the file: We know NR contains the line number of the current input line. You can easily print any line selectively, by matching the line number with the current input line number stored in NR, as follows:
$ awk 'NR==2 { print NR, $0 }' cars.dat
>>>
2 honda city 2005 60000 3
Printing the even-numbered lines in a file: Using NR, we can easily print even-numbered files by specifying expressions (divide each line number by 2 and find the remainder) in pattern space, as shown in the following example:
$ awk 'NR % 2 == 0 { print NR, $0 }' cars.dat
The output on execution of this code is as follows:
2 honda city 2005 60000 3
4 chevy beat 2005 33000 2
6 chevy tavera 1999 10000 4
8 maruti swift 2009 4100 5
10 ford ikon 1995 80000 1
12 fiat punto 2007 45000 3
Printing odd-numbered lines in a file: Similarly, we can print odd-numbered lines in a file using NR, by performing basic arithmetic operations in the pattern space:
$ awk ' NR % 2 == 1 { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
1 maruti swift 2007 50000 5
3 maruti dezire 2009 3100 6
5 honda city 2010 33000 6
7 toyota corolla 1995 95000 2
9 maruti esteem 1997 98000 1
11 honda accord 2000 60000 2
Printing a group of lines using the range operator (,) and NR: We can combine the range operator (,) and NR to print a group of lines from a file based on their line numbers. The next example displays the lines 4 to 6 from the cars.dat file:
$ awk ' NR==4, NR==6 { print NR, $0 }' cars.dat
The output on execution of this code is as follows :
4 chevy beat 2005 33000 2
5 honda city 2010 33000 6
6 chevy tavera 1999 10000 4
Printing a group of lines using the range operator and patterns: We can also combine the range operator (,) and string in pattern space to print a group of lines in a file starting from the first pattern, up to the second pattern. The following example displays the line starting from the first appearance of the /ford/ pattern to the occurrence of the second /fiat/ pattern in the cars.dat file:
$ awk ' /ford/,/fiat/ { print NR, $0 }' cars.dat
The output on execution of this code is as follows:
10 ford ikon 1995 80000 1
11 honda accord 2000 60000 2
12 fiat punto 2007 45000 3
Printing by selection: AWK patterns allows the selection of desired input lines for further processing. As patterns without actions print all the matching lines, on most occasions, AWK programs consist of a single pattern. The following are a few examples of useful patterns:
- Selection using the match operator (~): The match operator (~) is used for matching a pattern in a specified field in the input line of a file. In the next example, we will select and print all lines containing 'c' in the second field of the input line, as follows:
$ awk ' $2 ~ /c/ { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
2 honda city 2005 60000 3
5 honda city 2010 33000 6
7 toyota corolla 1995 95000 2
11 honda accord 2000 60000 2
- Selection using the match operator (~) and anchor (^): The caret (^) in regular expressions (also known as anchor) is used to match at the beginning of a line. In the next example, we combine it with the match operator (~) to print all the lines in which the second field begins with the 'c' character, as follows:
$ awk ' $2 ~ /^c/ { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
2 honda city 2005 60000 3
5 honda city 2010 33000 6
7 toyota corolla 1995 95000 2
- Selection using the match operator (~) and character classes ([ ]): The character classes, [ ], in regular expressions are used to match a single character out of those specified within square brackets. Here, we combine the match operator (~) with character classes (/^[cp]/) to print all the lines in which the second field begins with the 'c' or 'p' character, as follows:
$ awk ' $2 ~ /^[cp]/ { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
2 honda city 2005 60000 3
5 honda city 2010 33000 6
7 toyota corolla 1995 95000 2
12 fiat punto 2007 45000 3
- Selection using the match operator (~) and anchor ($): The dollar sign ($) in regular expression (also known as anchor) is used to match at the end of a line. In the next example, we combine it with the match operator (~) to print all the lines in the second field end with the 'a' character, as follows:
$ awk ' $2 ~ /a$/ { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
6 chevy tavera 1999 10000 4
7 toyota corolla 1995 95000 2
- Selection by numeric comparison: You can use relation operators (==, =>, <=, >, <, !=) for performing numeric comparison. Here, we perform a numeric match (==) to print the lines that have the 2005 value in the third field, as follows:
$ awk ' $3 == 2005 { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
2 honda city 2005 60000 3
4 chevy beat 2005 33000 2
- Selection by text content/string matching in a field: Besides numeric matches, we can use string matches to find the lines containing a particular string in a field. String content for matches should be given in double quotes as a string. In our next example, we print all the lines that contain "swift" in the second field ($2), as follows:
$ awk ' $2 == "swift" { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
1 maruti swift 2007 50000 5
8 maruti swift 2009 4100 5
- Selection by combining patterns: You can combine patterns with parentheses and logical operators, &&, ||, and !, which stand for AND, OR, and NOT. Here, we print the lines containing a value greater than or equal to 2005 in the third field and a value less than or equal to 2010 in the third field. This will print the cars that were manufactured between 2005 and 2010 from the cars.dat file:
$ awk ' $3 >= 2005 && $3 <= 2010 { print NR, $0 } ' cars.dat
The output on execution of this code is as follows:
1 maruti swift 2007 50000 5
2 honda city 2005 60000 3
3 maruti dezire 2009 3100 6
4 chevy beat 2005 33000 2
5 honda city 2010 33000 6
8 maruti swift 2009 4100 5
12 fiat punto 2007 45000 3
Data validation: Human error is difficult to eliminate from gathered data. In this situation, AWK is a reliable tool for checking that data has reasonable values and is in the right format. This process is generally known as data validation. Data validation is the reverse process of printing the lines that have undesirable properties. In data validation, we print the lines with errors or those that we suspect to have errors.
In the following example, we use the validation method while printing the selected records. First, we check whether any of the records in the input file don't have 5 fields, that is, a record with incomplete information, by using the AWK NF built-in variable. Then, we find the cars whose manufacture year is older than 2000 and suffix these rows with the car fitness expired text. Next, we print those records where the car's manufacture year is newer than 2009, and suffix these rows with the Better car for resale text, shown as follows :
$ vi validate.awk
NF !=5 { print $0, "number of fields is not equal to 5" }
$3 < 2000 { print $0, "car fitness expired" }
$3 > 2009 { print $0, "Better car for resale" }
$ awk -f validate.awk cars.dat
The output on execution of this code is as follows :
honda city 2010 33000 6 Better car for resale
chevy tavera 1999 10000 4 car fitness expired
toyota corolla 1995 95000 2 car fitness expired
maruti esteem 1997 98000 1 car fitness expired
ford ikon 1995 80000 1 car fitness expired
BEGIN and END pattern examples: BEGIN is a special pattern in which actions are performed before the processing of the first line of the first input file. END is a pattern in which actions are performed after the last line of the last file has been processed.
Using BEGIN to print headings: The BEGIN block can be used for printing headings, initializing variables, performing calculations, or any other task that you want to be executed before AWK starts processing the lines in the input file.
In the following AWK program, BEGIN is used to print a heading for each column for the cars.dat input file. Here, the first column contains the make of each car followed by the model, year of manufacture, mileage in kilometers, and price. So, we print the heading for the first field as Make, for the second field as Model, for the third field as Year, for the fourth field as Kms, and for the fifth field as Price. The heading is separated from the body by a blank line. The second action statement, { print }, has no pattern and displays all lines from the input as follows:
$ vi header.awk
BEGIN { print "Make Model Year Kms Price" ; print "" }
{ print }
$ awk -f header.awk cars.dat
The output on execution of this code is as follows:
Make Model Year Kms Price
maruti swift 2007 50000 5
honda city 2005 60000 3
maruti dezire 2009 3100 6
chevy beat 2005 33000 2
honda city 2010 33000 6
chevy tavera 1999 10000 4
toyota corolla 1995 95000 2
maruti swift 2009 4100 5
maruti esteem 1997 98000 1
ford ikon 1995 80000 1
honda accord 2000 60000 2
fiat punto 2007 45000 3
In the preceding example, we have given multiple action statements on a single line by separating them with a semicolon. The print " " prints a blank line; it is different from plain print, which prints the current input line.
Using END to print the last input line: The END block is executed after the processing of the last line of the last file is completed, and $0 stores the value of each input line processed, but its value is not retained in the END block. The following is one way to print the last input line:
$ awk '{ last = $0 } END { print last }' cars.dat
The output on execution of this code is as follows:
fiat punto 2007 45000 3
And to print the total number of lines in a file we use NR, because it retains its value in the END block, as follows:
$ awk 'END { print "Total no of lines in file : ", NR }' cars.dat
The output on execution of this code is as follows:
Total no of lines in file : 12
Length function: By default, the length function stores the count of the number of characters in the input line. In the next example, we will prefix each line with the number of characters in it using the length function, as follows:
$ awk '{ print length, $0 }' /etc/passwd
The output on execution of this code is as follows:
56 at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
59 avahi:x:481:480:User for Avahi:/run/avahi-daemon:/bin/false
79 avahi-autoipd:x:493:493:User for Avahi IPv4LL:/var/lib/avahi-autoipd:/bin/false
28 bin:x:1:1:bin:/bin:/bin/false
35 daemon:x:2:2:Daemon:/sbin:/bin/fale
53 dnsmasq:x:486:65534:dnsmasq:/var/lib/empty:/bin/false
42 ftp:x:40:49:FTP account:/srv/ftp:/bin/false
49 games:x:12:100:Games account:/var/games:/bin/false
49 lp:x:4:7:Printing daemon:/var/spool/lpd:/bin/false
60 mail:x:8:12:Mailer daemon:/var/spool/clientmqueue:/bin/false
56 man:x:13:62:Manual pages viewer:/var/cache/man:/bin/false
56 messagebus:x:499:499:User for D-Bus:/run/dbus:/bin/false
....................................
....................................
Changing the field separator using FS: The fields in the examples we have discussed so far have been separated by space characters. The default behavior of FS is any number of space or tab characters; we can change it to regular expressions or any single or multiple characters using the FS variable or the -F option. The value of the field separator is contained in the FS variable and it can be changed multiple times in an AWK program. Generally, it is good to redefine FS in a BEGIN statement.
In the following example, we demonstrate the use of FS. In this, we use the /etc/passwd file of Linux, which delimits fields with colons (:). So, we change the input of FS to a colon before reading any data from the file, and print the list of usernames, which is stored in the first field of the file, as follows:
$ awk 'BEGIN { FS = ":"} { print $1 }' /etc/passwd
Alternatively, we could use the -F option:
$ awk -F: '{ print $1 }' /etc/passwd
The output on execution of the code is as follows:
at
avahi
avahi-autoipd
bin
daemon
dnsmasq</strong>
ftp
.........
.........
Control structures: AWK supports control (flow) statements, which can be used to change the order of the execution of commands within an AWK program. Different constructs, such as the if...else, while, and for control structures are supported by AWK. In addition, the break and continue statements work in combination with the control structures to modify the order of execution of commands. We will look at these in detail in future chapters.
Let's try a basic example of a while loop to print a list of numbers under 10:
$ awk 'BEGIN{ n=1; while (n < 10 ){ print n; n++; } }'
Alternatively, we can create a script, such as the following:
$ vi while1.awk
BEGIN { n=1
while ( n < 10)
{
print n;
n++;
}
}
$ awk -f while1.awk
The output on execution of both of these commands is as follows:
1
2
3
4
5
6
7
8
9