The lynx, sed, and awk commands can be used to mine data from websites. You might have come across a list of actress rankings in the Searching and mining text inside a file with grep recipe in Chapter 4, Texting and Driving; that list was generated by parsing the http://www.johntorres.net/BoxOfficefemaleList.html web page.
Parsing data from a website
How to do it...
Let's go through the commands used to parse details of actresses from the website:
$ lynx -dump -nolist \
    http://www.johntorres.net/BoxOfficefemaleList.html | \
    grep -o "Rank-.*" | \
    sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' | \
    sort -nk 1 > actresslist.txt
The output is as follows:
# Only 3 entries shown. All others omitted due to space limits
1 Keira Knightley...
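To see what each stage of the pipeline contributes without fetching the page, the same grep, sed, and sort steps can be run on a few made-up lines. This is only a sketch: the sample_dump function, the placeholder names, and the sample_ranks.txt file are hypothetical stand-ins, not data from the real page, and the \t in the sed replacement assumes GNU sed.

```shell
#!/bin/bash
# Hypothetical stand-in for the text that `lynx -dump -nolist` prints.
# The lines and names are invented for illustration only.
sample_dump() {
    printf '%s\n' \
        'Some page heading' \
        '   Rank-10 Placeholder Name C' \
        '   Rank-2 Placeholder Name B' \
        'an unrelated line' \
        '   Rank-1 Placeholder Name A'
}

# grep -o prints only the matched part of each line, from "Rank-" onward,
#   so lines without "Rank-" disappear.
# sed captures the rank number and the rest of the line, then rewrites
#   them as "<number><TAB><name>" (\t is a GNU sed extension).
# sort -nk 1 orders the lines numerically by the first field.
sample_dump |
    grep -o "Rank-.*" |
    sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' |
    sort -nk 1 > sample_ranks.txt

cat sample_ranks.txt
```

Because sort is given -n, rank 10 is listed after rank 2 rather than between 1 and 2, which is where a plain lexical sort would place it.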