AWK is an interpreted programming language designed for text processing and report generation. It is typically used for data manipulation, such as searching for items within data, performing arithmetic operations, and restructuring raw data to generate reports, and it is available on most Unix-like operating systems.
Today, we will explore the AWK philosophy and the different types of AWK that exist, starting from its original implementation in 1977 at AT&T Bell Laboratories. We will also look at the various application areas of AWK in data science today.
Using AWK programs, one can handle repetitive text-editing problems with very simple and short programs. It is a pattern-action language; it searches for patterns in a given input and, when a match is found, it performs the corresponding action. The pattern can be made of strings, regular expressions, comparison operations on numbers, fields, variables, and so on. It reads the input files and splits each input line of the file into fields automatically.
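The pattern-action model and the automatic field splitting described above can be sketched with a couple of one-liners. This is only an illustration: the sample records and the variable names (sample, sales_rows, eng_rows) are hypothetical and not taken from the book.

```shell
# Hypothetical sample records: name, department, salary (whitespace-separated).
sample='alice sales 5200
bob engineering 6100
carol sales 4800'

# Pattern: the second field equals "sales".
# Action: print the first and third fields of each matching line.
sales_rows=$(printf '%s\n' "$sample" | awk '$2 == "sales" { print $1, $3 }')
printf '%s\n' "$sales_rows"

# A regular-expression pattern with no action prints the whole matching line.
eng_rows=$(printf '%s\n' "$sample" | awk '/engineering/')
printf '%s\n' "$eng_rows"
```

AWK splits each line on whitespace by default, so $1, $2, and $3 refer to the name, department, and salary without any explicit parsing.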
AWK has most of the features you would expect from a well-designed programming language, and its syntax closely resembles that of the C programming language. It is named after the initials of its three original authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan.
AWK is a powerful, elegant, and simple tool that everyone who deals with text processing should be familiar with.
This article is an excerpt from a book written by Shiwang Kalkhanda, titled Learning AWK Programming. This book will introduce you to the AWK programming language and get you hands-on with practical implementations of AWK.
The AWK language was originally implemented as the awk utility on Unix. Today, most Linux distributions provide the GNU implementation of AWK (GAWK), and awk is a symlink to the gawk binary. The AWK utility can be categorized into three types, depending on the interpreter used to execute AWK programs: the original AWK from AT&T, NAWK (new AWK), and GAWK (GNU AWK).
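To see which binary the awk command actually points to on a given system, a minimal sketch such as the following may help. It assumes that readlink -f is available (GNU coreutils and BusyBox provide it), and the variable names are illustrative only.

```shell
# Locate the awk command on PATH, then follow any symlinks to the real binary.
# readlink -f is assumed to be available (GNU coreutils and BusyBox provide it).
awk_path=$(command -v awk)
awk_real=$(readlink -f "$awk_path")
printf 'awk command: %s\nresolves to: %s\n' "$awk_path" "$awk_real"
```

The resolved name varies by distribution, so the output is best treated as a hint about which implementation is installed rather than as a guarantee.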
AWK is simpler than most other text-processing utilities and is available by default on Unix-like operating systems. Some people consider Perl a superior choice for text processing, since AWK is functionally a subset of Perl, but Perl has a steeper learning curve than AWK. AWK programs are typically small and quick to run. Anybody who knows the Linux command line can start writing AWK programs in no time. Typical use cases include searching for items within data, performing arithmetic on fields, and restructuring raw data into reports.
Programs written in AWK are usually smaller than equivalent programs in other high-level languages for similar text-processing tasks. AWK programs are interpreted, so they can be run directly from a GNU/Linux terminal without the separate compile step that other languages require.
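As a rough illustration of how compact such a program can be, the sketch below totals and averages a numeric column in a single awk invocation. The price list and the variable names (prices, report) are hypothetical.

```shell
# Hypothetical input: item name and price per line.
prices='pen 1.50
notebook 3.25
bag 20.00'

# Add up the second field of every line, then report the total and the average.
report=$(printf '%s\n' "$prices" |
  awk '{ total += $2 } END { printf "total=%.2f avg=%.2f\n", total, total / NR }')
printf '%s\n' "$report"
```

There is no explicit loop, file handling, or type declaration here: AWK iterates over the lines itself, and NR holds the number of records read by the time the END block runs.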
This section describes how to set up the AWK environment on your GNU/Linux system, and we'll also discuss the workflow of AWK. Then, we'll look at different methods for executing AWK programs.
Generally, AWK is installed by default on most GNU/Linux distributions. Using the which command, you can check whether it is installed on your system. If AWK is not installed, you can install it in one of two ways: through your distribution's package manager, or by compiling it from source code.
Let's take a look at each method in detail in the following sections.
Different flavors of GNU/Linux distribution have different package-management utilities. If you are using a Debian-based GNU/Linux distribution, such as Ubuntu, Mint, or Debian, then you can install AWK using the Advanced Package Tool (APT) package manager, as follows:
[ shiwang@linux ~ ] $ sudo apt-get update -y
[ shiwang@linux ~ ] $ sudo apt-get install gawk -y
Similarly, to install AWK on an RPM-based GNU/Linux distribution, such as Fedora, CentOS, or RHEL, you can use the Yellowdog Updater, Modified (YUM) package manager, as follows:
[ root@linux ~ ] # yum update -y
[ root@linux ~ ] # yum install gawk -y
To install AWK on openSUSE, you can use the zypper command-line package-management utility, as follows:
[ root@linux ~ ] # zypper update -y
[ root@linux ~ ] # zypper install gawk -y
Once the installation is finished, make sure AWK is accessible through the command line. We can check that using the which command, which will return the absolute path of AWK on our system:
[ root@linux ~ ] # which awk
/usr/bin/awk
You can also use awk --version to find the AWK version on your system:
[ root@linux ~ ] # awk --version
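Not every AWK implementation accepts --version, which is a GNU-style option. The following sketch tries it first, then falls back to the -W version form used by mawk. The function name awk_version and the fallback message are illustrative and not part of any AWK distribution.

```shell
# Print a one-line description of the awk implementation found on PATH.
# --version is a GNU-style option; mawk uses -W version instead, and some
# minimal implementations support neither, hence the fallback message.
# stdin is redirected from /dev/null so an unrecognized option can never
# leave awk waiting for input.
awk_version() {
  if awk --version </dev/null >/dev/null 2>&1; then
    awk --version </dev/null 2>/dev/null | head -n 1
  elif awk -W version </dev/null >/dev/null 2>&1; then
    awk -W version </dev/null 2>/dev/null | head -n 1
  else
    echo "awk found, but it does not report a version"
  fi
}

awk_version
```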
Like most open source utilities, the GNU AWK source code is freely available for download as part of the GNU project. Previously, you saw how to install AWK using a package manager; now, you will see how to install AWK by compiling it from source on a GNU/Linux distribution. The following steps apply to installing most GNU/Linux software from source:
[ shiwang@linux ~ ] $ wget http://ftp.gnu.org/gnu/gawk/gawk-4.1.3.tar.xz
[ shiwang@linux ~ ] $ tar xvf gawk-4.1.3.tar.xz
[ shiwang@linux ~ ] $ cd gawk-4.1.3 && ./configure
[ shiwang@linux ~ ] $ make
[ shiwang@linux ~ ] $ sudo make install
[ root@linux ~ ] # which awk
/usr/bin/awk
You now have a working AWK/GAWK installation and are ready to begin AWK programming, but before that, the next section describes the workflow of the AWK interpreter.
Having a basic knowledge of the AWK interpreter workflow will help you to better understand AWK and will result in more efficient AWK program development. Hence, before getting your hands dirty with AWK programming, you need to understand its internals. The AWK workflow can be summarized as shown in the following figure:
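Let's take a look at each operation: AWK reads a line from the input, executes the program's commands against that line, and repeats the process until the end of the input is reached.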
The following flowchart depicts the workflow:
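The read, execute, and repeat cycle can also be traced with a small program. This is a minimal sketch assuming the usual AWK execution model, in which an optional BEGIN block runs before any input is read and an optional END block runs after the last line. The input text and the variable names (input, trace) are hypothetical.

```shell
# Hypothetical three-line input.
input='first line
second line
third line'

# BEGIN runs once before input, the unlabeled block runs for every line,
# and END runs once after the last line has been processed.
trace=$(printf '%s\n' "$input" | awk '
  BEGIN { print "BEGIN: runs once, before any input is read" }
        { print "record " NR ": " $0 " (" NF " fields)" }
  END   { print "END: runs once, after " NR " records" }
')
printf '%s\n' "$trace"
```

NR counts the records read so far and NF holds the number of fields in the current record, so the middle block shows the state AWK maintains on each pass through the cycle.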
In this post, we introduced the AWK programming language and went through a quick primer on installing it and understanding its workflow.
If you found this post useful, do check out the book Learning AWK Programming to learn more about the intricacies of the AWK programming language for text processing.