The shell scripting language is packed with all the essential problem-solving components for Unix/Linux systems. Text processing is one of the key areas where shell scripting is used, and there are beautiful utilities such as sed, awk, grep, and cut, which can be combined to solve problems related to text processing.
Various utilities help us process a file at the level of a character, line, word, column, or row, allowing us to manipulate a text file in many ways. Regular expressions are the core of pattern-matching techniques, and most of the text-processing utilities support them. By using suitable regular expression strings, we can perform operations such as filtering, stripping, replacing, and searching.
Using regular expressions
Regular expressions are the heart of text-processing techniques based on pattern matching. For fluency in writing text-processing tools, one must have a basic understanding of regular expressions. Wildcard techniques alone offer a very limited scope for matching text with patterns. Regular expressions are a form of tiny, highly specialized programming language used to match text. A typical regular expression for matching an e-mail address might look like [a-z0-9_]+@[a-z0-9]+\.[a-z]+.
If this looks weird, don't worry, it is really simple once you understand the concepts through this recipe.
How to do it...
Regular expressions are composed of text fragments and symbols, which have special meanings. Using these, we can construct any suitable regular expression string to match any text according to the context. As regex is a generic language to match texts, we are not introducing any tools in this recipe.
Let's see a few examples of text matching:
To match all words in a given text, we can write the regex as follows:
( ?[a-zA-Z]+ ?)
? is the notation for zero or one occurrence of the previous expression, which in this case is the space character. The [a-zA-Z]+ notation represents one or more alphabet characters (a-z and A-Z).
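As a minimal sketch of the word-matching idea, assuming a grep that supports the -E and -o options, the following prints each word of a sentence on its own line. The sample sentence and the variable name words are invented for illustration, and the optional spaces are left out of the pattern because -o already separates the matches.

```shell
# Sample text; invented for this illustration.
text="regex makes text processing fun"

# -E enables extended regex, -o prints each match on its own line.
words=$(printf '%s\n' "$text" | grep -oE '[a-zA-Z]+')

printf '%s\n' "$words"
```

Each of the five words in the sample sentence appears on a separate output line.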
To match an IP address, we can write the regex as follows:
[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
Or:
[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}
An IP address is written as four integers (each from 0 to 255) separated by dots, for example, 192.168.0.2.
[0-9] or [[:digit:]] represents a match for digits from 0 to 9. {1,3} matches one to three digits and \. matches the dot character (.).
This regex will match an IP address in the text being processed. However, it doesn't check for the validity of the address. For example, 123.300.1.1 will be matched by the regex despite being an invalid IP. This is usually acceptable, because when parsing text streams the aim is only to detect IP-like strings, not to validate them.
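If validation is needed as well, one possible approach is to detect candidates with the regex and then check each field's range in a second step. The sketch below assumes grep with -o and -E plus a standard awk; the helper name is_valid_ipv4 and the sample text are hypothetical.

```shell
# Return success if $1 has four dot-separated numeric fields, each 0-255.
# is_valid_ipv4 is a hypothetical helper name, not a standard command.
is_valid_ipv4() {
    printf '%s\n' "$1" |
        awk -F. 'NF != 4 { exit 1 }
                 { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }'
}

# Step 1: detect candidates with the regex from the text.
candidates=$(printf 'host 192.168.0.2 and bogus 123.300.1.1\n' |
    grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}')

# Step 2: keep only the candidates whose fields are in range.
valid=""
for ip in $candidates; do
    if is_valid_ipv4 "$ip"; then
        valid="$valid$ip "
    fi
done
printf 'valid: %s\n' "$valid"
```

Both addresses are detected by the regex, but only 192.168.0.2 survives the range check.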
How it works...
Let's first go through the basic components of regular expressions (regex):
regex    Description    Example
^    This specifies the start of the line marker.    ^tux matches a line that starts with tux.
$    This specifies the end of the line marker.    tux$ matches a line that ends with tux.
.    This matches any one character.    Hack. matches Hack1, Hacki, but not Hack12 or Hackil; only one additional character matches.
[]    This matches any one of the characters enclosed in [chars].    coo[kl] matches cook or cool.
[^]    This matches any one of the characters except those that are enclosed in [^chars].    9[^01] matches 92 and 93, but not 91 and 90.
[-]    This matches any character within the range specified in [].    [1-5] matches any digit from 1 to 5.
?    This means that the preceding item must match one or zero times.    colou?r matches color or colour, but not colouur.
+    This means that the preceding item must match one or more times.    Rollno-9+ matches Rollno-99 and Rollno-9, but not Rollno-.
*    This means that the preceding item must match zero or more times.    co*l matches cl, col, and coool.
()    This treats the terms enclosed as one entity.    ma(tri)?x matches max or matrix.
{n}    This means that the preceding item must match n times.    [0-9]{3} matches any three-digit number. [0-9]{3} can be expanded as [0-9][0-9][0-9].
{n,}    This specifies the minimum number of times the preceding item should match.    [0-9]{2,} matches any number that is two digits or longer.
{n,m}    This specifies the minimum and maximum number of times the preceding item should match.    [0-9]{2,5} matches any number that has two to five digits.
|    This specifies alternation: one of the items on either side of | should match.    Oct (1st|2nd) matches Oct 1st or Oct 2nd.
\    This is the escape character for escaping any of the special characters mentioned previously.    a\.b matches a.b, but not ajb. It ignores the special meaning of . because of \.
For more details on the regular expression components available, you can refer to the following URL:
http://www.linuxforu.com/2011/04/sed-explained-part-1/
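A few of the components described above can be tried out with a small helper. This is only a sketch: full_match is a hypothetical function name, and it assumes a grep that supports -E (extended regex), -q (quiet), and -x (the whole line must match).

```shell
# full_match is a hypothetical helper: it succeeds when the whole
# string $2 matches the extended regex $1 (-x anchors to the full line).
full_match() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

full_match 'colou?r' 'color'    && printf '%s\n' 'colou?r matches color'
full_match 'colou?r' 'colouur'  || printf '%s\n' 'colou?r rejects colouur'
full_match 'ma(tri)?x' 'matrix' && printf '%s\n' 'ma(tri)?x matches matrix'
full_match 'co*l' 'cl'          && printf '%s\n' 'co*l matches cl'
full_match 'a\.b' 'ajb'         || printf '%s\n' 'a\.b rejects ajb'
```

Because -x requires the entire string to match, the helper behaves like the "matches" and "does not match" statements in the examples rather than like a substring search.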
There's more...
Let's see how the special meanings of certain characters are specified in the regular expressions.
Treatment of special characters
Regular expressions use some characters, such as $, ^, ., *, +, {, and }, as special characters. But, what if we want to use these characters as normal text characters? Let's see an example of a regex, a.txt.
This will match the character a, followed by any character (due to the '.' character), which is then followed by the string txt. However, we want '.' to match a literal '.' instead of any character. In order to achieve this, we precede the character with a backslash \ (doing this is called escaping the character). This indicates that the regex should match the literal character rather than apply its special meaning. Hence, the final regex becomes a\.txt.
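The difference between the escaped and unescaped dot can be checked with a short sketch. The two sample names and the variable names are invented for illustration; grep -c counts the matching lines.

```shell
# Two sample names, invented for illustration: only the first
# contains a literal dot between "a" and "txt".
names=$(printf 'a.txt\naXtxt\n')

# Unescaped dot: matches any character, so both lines count.
unescaped=$(printf '%s\n' "$names" | grep -c 'a.txt')

# Escaped dot: matches only a literal ".", so one line counts.
escaped=$(printf '%s\n' "$names" | grep -c 'a\.txt')

echo "unescaped: $unescaped, escaped: $escaped"
```

When the entire pattern is meant literally, grep -F is an alternative that treats every character in the pattern as plain text.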
Visualizing regular expressions
Regular expressions can be tough to understand at times, but for people who are good at understanding things with diagrams, there are utilities available to help in visualizing regex. Here is one such tool that you can use by browsing to http://www.regexper.com; it basically lets you enter a regular expression and creates a nice graph to help understand it. Here is a screenshot showing the regular expression we saw in the previous section:
Searching and mining a text inside a file with grep
Searching inside a file is an important use case in text processing. We may need to search through thousands of lines in a file to find out some required data, by using certain specifications. This recipe will help you learn how to locate data items of a given specification from a pool of data.
How to do it...
The grep command is the magic Unix utility for searching in text. It accepts regular expressions, and can produce output in various formats. Additionally, it has numerous interesting options. Let's see how to use them:
To search for lines of text that contain the given pattern:
$ grep pattern filename
this is the line containing pattern
Or:
$ grep "pattern" filename
this is the line containing pattern
We can also read from stdin as follows:
$ echo -e "this is a word\nnext line" | grep word
this is a word
Perform a search in multiple files by using a single grep invocation, as follows:
$ grep "match_text" file1 file2 file3 ...
We can highlight the word in the line by using the --color option as follows:
$ grep word filename --color=auto
this is the line containing word
By default, grep treats match_text as a basic regular expression, in which only some of the special characters are active. To use the full set of regular expressions described earlier, add the -E option, which selects extended regular expressions. The egrep command is an older spelling of grep -E; recent GNU grep releases mark it as obsolescent, so grep -E is the safer choice in new scripts. For example:
$ grep -E "[a-z]+" filename
Or:
$ egrep "[a-z]+" filename
In order to output only the matching portion of a text in a file, use the -o option as follows:
$ echo this is a line. | egrep -o "[a-z]+\."
line.
In order to print all of the lines, except the line containing match_pattern, use:
$ grep -v match_pattern file
The -v option added to grep inverts the match results.
Count the number of lines in which a matching string or regex match appears in a file or text, as follows:
$ grep -c "text" filename
10
It should be noted that -c counts only the number of matching lines, not the number of times a match is made. For example:
$ echo -e "1 2 3 4\nhello\n5 6" | egrep -c "[0-9]"
2
Even though there are six matching items, it prints 2, since there are only two matching lines. Multiple matches in a single line are counted only once.
To count the number of matching items in a file, use the following trick:
$ echo -e "1 2 3 4\nhello\n5 6" | egrep -o "[0-9]" | wc -l
6
Print the line number of the match string as follows:
$ cat sample1.txt
gnu is not unix
linux is fun
bash is art
$ cat sample2.txt
planetlinux
$ grep linux -n sample1.txt
2:linux is fun
or
$ cat sample1.txt | grep linux -n
If multiple files are used, it will also print the filename with the result as follows:
$ grep linux -n sample1.txt sample2.txt
sample1.txt:2:linux is fun
sample2.txt:1:planetlinux
Print the character or byte offset at which a pattern matches, as follows:
$ echo gnu is not unix | grep -b -o "not"
7:not
The offset is counted in bytes from 0, starting at the first byte of the input. In the preceding example, not is at offset 7; that is, it begins at the eighth character of gnu is not unix.
The -b option is usually combined with -o. Without -o, GNU grep reports the offset of the start of each matching line rather than of the match itself.
To search over multiple files, and list which files contain the pattern, we use the following:
$ grep -l linux sample1.txt sample2.txt
sample1.txt
sample2.txt
The inverse of the -l argument is -L. The -L argument returns a list of non-matching files.
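The relationship between -l and -L can be sketched with two throwaway files. The directory comes from mktemp, and sample3.txt is a hypothetical file added here so that one file has no match.

```shell
# Work in a throwaway directory so no real files are touched.
dir=$(mktemp -d)
printf 'gnu is not unix\nlinux is fun\n' > "$dir/sample1.txt"
printf 'bash is art\n' > "$dir/sample3.txt"

# -l lists files with at least one match; -L lists files with none.
with_match=$(cd "$dir" && grep -l linux sample1.txt sample3.txt)
without_match=$(cd "$dir" && grep -L linux sample1.txt sample3.txt)

echo "contains linux: $with_match"
echo "lacks linux:    $without_match"
rm -r "$dir"
```

Every file named on the command line ends up in exactly one of the two lists.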
There's more...
We have seen the basic usages of the grep command, but that's not it; the grep command comes with even more features. Let's go through those.
Recursively search many files
To recursively search for a text over many directories of descendants, use the following command:
$ grep "text" . -R -n
In this command, "." specifies the current directory.
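The options -R and -r both search recursively. In recent versions of GNU grep, the difference is that -R follows every symbolic link it meets, whereas -r follows only the links named on the command line.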
For example:
$ cd src_dir
$ grep "test_function()" . -R -n
./miscutils/test.c:16:test_function();
test_function() exists in line number 16 of miscutils/test.c.
This is one of the most frequently used commands by developers. It is used to find files in the source code where a certain text exists.
Ignoring case of pattern
The -i argument makes the match case-insensitive, so uppercase and lowercase letters are treated as equal. For example:
$ echo hello world | grep -i "HELLO"
hello world
grep by matching multiple patterns
Usually, we specify a single pattern for matching. However, we can use the -e argument repeatedly to specify multiple patterns, as follows:
$ grep -e "pattern1" -e "pattern2" filename
This will print the lines that contain either of the patterns and output one line for each match. For example:
$ echo this is a line of text | grep -e "this" -e "line" -o
this
line
There is also another way to specify multiple patterns. We can use a pattern file for reading patterns. Write patterns to match line-by-line, and execute grep with a -f argument as follows:
$ grep -f pattern_file source_filename
For example:
$ cat pat_file
hello
cool
$ echo hello this is cool | grep -f pat_file
hello this is cool
Including and excluding files in a grep search
grep can include or exclude files in which to search. We can specify include files or exclude files by using wildcard patterns.
To search only for .c and .cpp files recursively in a directory by excluding all other file types, use the following command:
$ grep "main()" . -r --include=*.{c,cpp}
Note that some{string1,string2,string3} expands as somestring1 somestring2 somestring3, so the option above becomes --include=*.c --include=*.cpp. Writing --include *.{c,cpp} with a space instead would pass *.cpp to grep as a path to search.
Exclude all README files in the search, as follows:
$ grep "main()" . -r --exclude "README"
To exclude directories, use the --exclude-dir option.
To read a list of files to exclude from a file, use --exclude-from FILE.
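The include and exclude options can be combined. The following is a sketch only: it builds a small throwaway tree whose directory and file names are invented, and it assumes a grep that supports --include and --exclude-dir (GNU grep does).

```shell
# Build a small throwaway tree; all names here are invented.
dir=$(mktemp -d)
mkdir -p "$dir/src" "$dir/build"
printf 'int main() { return 0; }\n' > "$dir/src/app.c"
printf 'int main() { return 0; }\n' > "$dir/build/app.c"
printf 'main() is described here\n' > "$dir/src/README"

# Search only *.c files and skip the build directory entirely.
# Quoting the glob keeps the shell from expanding it first.
hits=$(cd "$dir" && grep -rl --include='*.c' --exclude-dir='build' 'main()' .)

echo "$hits"
rm -r "$dir"
```

Only ./src/app.c is reported: the README is filtered out by --include, and the copy under build is skipped by --exclude-dir.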
Using grep with xargs with zero-byte suffix
The xargs command is often used to provide a list of file names as a command-line argument to another command. When filenames are used as command-line arguments, it is recommended to use a zero-byte terminator for the filenames instead of a space terminator. Some of the filenames can contain a space character, and it will be misinterpreted as a terminator, and a single filename may be broken into two file names (for example, New file.txt can be interpreted as two filenames New and file.txt). This problem can be avoided by using a zero-byte suffix. We use xargs so as to accept a stdin text from commands such as grep and find. Such commands can output text to stdout with a zero-byte suffix. In order to specify that the input terminator for filenames is zero byte (\0), we should use -0 with xargs.
Create some test files as follows:
$ echo "test" > file1
$ echo "cool" > file2
$ echo "test" > file3
In the following command sequence, grep outputs filenames with a zero-byte terminator (\0), because of the -Z option with grep. xargs -0 reads the input and separates filenames with a zero-byte terminator:
$ grep "test" file* -lZ | xargs -0 rm
Usually, -Z is used along with -l.
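To see why the zero-byte terminator matters, the sketch below repeats the idea with a filename that contains a space. It runs in a throwaway directory created by mktemp, the file names are invented, and it assumes GNU grep for the -Z option.

```shell
# Throwaway directory with a filename that contains a space.
dir=$(mktemp -d)
echo "test" > "$dir/New file.txt"
echo "cool" > "$dir/other.txt"

# -l prints matching filenames, -Z ends each with a zero byte,
# and xargs -0 splits on that byte, so "New file.txt" stays whole.
grep -lZ "test" "$dir"/* | xargs -0 rm

remaining=$(ls "$dir")
echo "remaining: $remaining"
rm -r "$dir"
```

Only other.txt remains, which shows that the name with a space was passed to rm as a single argument.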
Silent output for grep
Sometimes, instead of actually looking at the matched strings, we are only interested in whether there was a match or not. For this, we can use the quiet option (-q), where the grep command does not write any output to the standard output. Instead, it runs the command and returns an exit status based on success or failure.
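We know that a command returns 0 on success, and non-zero on failure. For grep, the exit status is 0 when at least one match is found, 1 when there is no match, and greater than 1 when an error occurs (for example, an unreadable file).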
Let's go through a script that makes use of grep in a quiet mode, for testing whether a match text appears in a file or not.
#!/bin/bash
#Filename: silent_grep.sh
#Desc: Testing whether a file contains a text or not
if [ $# -ne 2 ]; then
    echo "Usage: $0 match_text filename"
    exit 1
fi
match_text=$1
filename=$2
# -q suppresses output; -- ends option parsing so match_text may begin with "-"
grep -q -- "$match_text" "$filename"
if [ $? -eq 0 ]; then
    echo "The text exists in the file"
else
    echo "Text does not exist in the file"
fi
The silent_grep.sh script can be run as follows, by providing a match word (Student) and a file name (student_data.txt) as the command argument:
$ ./silent_grep.sh Student student_data.txt
The text exists in the file
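Because if tests a command's exit status directly, the same check can be written without inspecting $?. The following is a minimal sketch; contains_text is a hypothetical helper name, and the temporary file and its contents are invented.

```shell
# Throwaway input file; the name and contents are invented.
file=$(mktemp)
printf 'Student: Alex\n' > "$file"

# grep's exit status can drive "if" directly, without testing $?.
# contains_text is a hypothetical helper name.
contains_text() {
    if grep -q -- "$1" "$2"; then
        echo "found"
    else
        echo "missing"
    fi
}

first=$(contains_text "Student" "$file")
second=$(contains_text "Teacher" "$file")
echo "$first $second"
rm "$file"
```

This form avoids the risk of another command overwriting $? between the grep call and the test.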
Printing lines before and after text matches
Context-based printing is one of the nice features of grep. When grep finds a match, it usually prints only the matching line. But, we may need "n" lines after the matching line, or "n" lines before the matching line, or both. This can be performed by using context-line control in grep. Let's see how to do it.
In order to print three lines after a match, use the -A option:
$ seq 10 | grep 5 -A 3
5
6
7
8
In order to print three lines before the match, use the -B option:
$ seq 10 | grep 5 -B 3
2
3
4
5
To print three lines both before and after the match, use the -C option as follows:
$ seq 10 | grep 5 -C 3
2
3
4
5
6
7
8
If there are multiple matches, then each section is delimited by a line "--":
$ echo -e "a\nb\nc\na\nb\nc" | grep a -A 1
a
b
--
a
b
Cutting a file column-wise with cut
We may need to cut the text by a column rather than a row. Let's assume that we have a text file containing student reports with columns, such as Roll, Name, Mark, and Percentage. We may need to extract only the names of the students to another file, extract any nth column, or extract two or more columns. This recipe will illustrate how to perform this task.
How to do it...
cut is a small utility that often comes to our help for cutting in column fashion. It also lets us specify the delimiter that separates each column. In cut terminology, each column is known as a field.
To extract particular fields or columns, use the following syntax:
cut -f FIELD_LIST filename
FIELD_LIST is a list of columns that are to be displayed. The list consists of column numbers delimited by commas. For example:
$ cut -f 2,3 filename
Here, the second and the third columns are displayed.
cut can also read input text from stdin.
Tab is the default delimiter for fields or columns. If lines without delimiters are found, they are also printed. To avoid printing lines that do not have delimiter characters, attach the -s option along with cut. An example of using the cut command for columns is as follows:
$ cat student_data.txt
No Name Mark Percent
1 Sarath 45 90
2 Alex 49 98
3 Anu 45 90
$ cut -f1 student_data.txt
No
1
2
3
Extract multiple fields as follows:
$ cut -f2,4 student_data.txt
Name Percent
Sarath 90
Alex 98
Anu 90
To print multiple columns, provide a list of column numbers separated by commas as arguments to -f.
We can also complement the extracted fields by using the --complement option. Suppose you have many fields and you want to print all the columns except the third column, then use the following command:
$ cut -f3 --complement student_data.txt
No Name Percent
1 Sarath 90
2 Alex 98
3 Anu 90
To specify the delimiter character for the fields, use the -d option as follows:
$ cat delimited_data.txt
No;Name;Mark;Percent
1;Sarath;45;90
2;Alex;49;98
3;Anu;45;90
$ cut -f2 -d";" delimited_data.txt
Name
Sarath
Alex
Anu
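The effect of the -s option mentioned earlier can be sketched as follows. The sample data is invented: two tab-separated records and one line that contains no tab at all.

```shell
# Two tab-separated records plus one line without any tab.
# The data is invented for illustration.
data=$(printf '1\tSarath\n# comment without tabs\n2\tAlex\n')

# Without -s, the tab-less line is passed through unchanged.
plain=$(printf '%s\n' "$data" | cut -f2)

# With -s, lines that contain no delimiter are suppressed.
strict=$(printf '%s\n' "$data" | cut -s -f2)

printf 'without -s:\n%s\n\nwith -s:\n%s\n' "$plain" "$strict"
```

The comment line appears in the first result but not in the second, which contains only Sarath and Alex.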
There's more...
The cut command has more options to specify the character sequences to be displayed as columns. Let's go through the additional options available with cut.
Specifying the range of characters or bytes as fields
Suppose that we don't rely on delimiters, but instead need to define a field as a range of characters, counting from 1 at the start of the line. Such extractions are possible with cut.
Let's see what notations are possible:
N-    from the Nth byte, character, or field, to the end of line
N-M    from the Nth to Mth (included) byte, character, or field
-M    from first to Mth (included) byte, character, or field
We use the preceding notations to specify fields as a range of bytes or characters with the following options:
-b for bytes
-c for characters
-f for defining fields
For example:
$ cat range_fields.txt
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxy
You can print the first to fifth characters as follows:
$ cut -c1-5 range_fields.txt
abcde
abcde
abcde
abcde
The first two characters can be printed as follows:
$ cut range_fields.txt -c -2
ab
ab
ab
ab
Replace -c with -b to count in bytes.
We can specify the output delimiter while using -c, -f, and -b, as follows:
--output-delimiter "delimiter string"
When multiple ranges are extracted with -b or -c, the selected pieces are printed back to back by default, so you cannot tell where one ends and the next begins. Supplying --output-delimiter makes the boundaries visible. For example:
$ cut range_fields.txt -c1-3,6-9 --output-delimiter ","
abc,fghi
abc,fghi
abc,fghi
abc,fghi
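The same option also works with -f, where it replaces the input delimiter on output. The sketch below assumes GNU cut, since --output-delimiter is not available in every cut implementation; the sample data follows the style of delimited_data.txt.

```shell
# Semicolon-separated sample in the style of delimited_data.txt.
data=$(printf 'No;Name;Mark;Percent\n1;Sarath;45;90\n')

# Select fields 2 and 4, then join them with a comma on output.
# --output-delimiter is a GNU cut option (an assumption about your system).
csv=$(printf '%s\n' "$data" | cut -d';' -f2,4 --output-delimiter=',')

echo "$csv"
```

This is a convenient way to convert a subset of columns from one delimiter to another in a single step.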