Parsing data from a website
It is often useful to parse data from web pages by eliminating unnecessary details. sed and awk are the main tools that we will use for this task. You might have come across a list of actress rankings in a grep recipe in Chapter 4, Texting and driving; it was generated by parsing the web page http://www.johntorres.net/BoxOfficefemaleList.html.
Let us see how we can parse the same data by using text-processing tools.
How to do it...
Let's go through the commands used to parse details of actresses from the website:
$ lynx -dump -nolist http://www.johntorres.net/BoxOfficefemaleList.html | \
  grep -o "Rank-.*" | \
  sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' | \
  sort -nk 1 > actresslist.txt
The output will be as follows:
# Only 3 entries shown. All others omitted due to space limits
1	Keira Knightley
2	Natalie Portman
3	Monica Bellucci
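The text-processing stages of the pipeline can be tried without network access by feeding them a few sample lines. The following is a minimal sketch: the contents of `sample` are a hypothetical guess at what the lynx dump looks like (inferred from the patterns in the command, not copied from the page), and `parse_ranks` is a helper name introduced only for this illustration. It also assumes GNU sed, which expands `\t` in the replacement to a tab.

```shell
#!/bin/sh
# Hypothetical sample input: an assumed layout of the lynx text dump,
# deliberately out of order and with one unrelated line mixed in.
sample='   Rank-3 Monica Bellucci
   Some unrelated line
   Rank-1 Keira Knightley
   Rank-2 Natalie Portman'

# parse_ranks (illustrative helper): reads dump text on stdin and
# writes "rank<TAB>name" lines sorted by rank.
parse_ranks() {
  # Keep only the part of each line that starts at "Rank-".
  grep -o "Rank-.*" |
    # Capture the number and the name, and join them with a tab (GNU sed).
    sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' |
    # Sort numerically on the first field (the rank).
    sort -nk 1
}

printf '%s\n' "$sample" | parse_ranks
```

With GNU sed, this should print the three sample names in rank order, each preceded by its rank and a tab; the unrelated line is dropped by grep because it does not contain "Rank-".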
How it works...
Lynx is a command-line web browser—it can dump a text version of a website as we would see in a web...