Tangled Web? Not At All!

  • 20 min read
  • 22 Jun 2017


In this article by Clif Flynt, the author of the book Linux Shell Scripting Cookbook - Third Edition, we look at a collection of shell-scripting recipes that talk to services on the Internet. This article is intended to help readers understand how to interact with the Web using shell scripts to automate tasks such as collecting and parsing data from web pages, sending data to web pages with POST and GET, and writing clients for web services.


In this article, we will cover the following recipes:

  • Downloading a web page as plain text
  • Parsing data from a website
  • Image crawler and downloader
  • Web photo album generator
  • Twitter command-line client
  • Tracking changes to a website
  • Posting to a web page and reading response
  • Downloading a video from the Internet

The Web has become the face of technology and the central access point for data processing.

The primary interface to the web is via a browser that's designed for interactive use. That's great for searching and reading articles on the web, but you can also do a lot to automate your interactions with shell scripts.

For instance, instead of checking a website daily to see if your favorite blogger has added a new blog, you can automate the check and be informed when there's new information.

Similarly, Twitter is the current hot technology for getting up-to-the-minute information. But if I subscribe to my local newspaper's Twitter account because I want the local news, Twitter will send me all of its news, including high-school sports that I don't care about.

With a shell script, I can grab the tweets and customize my filters to match my desires, not rely on their filters.

Downloading a web page as plain text

Web pages are simply text with HTML tags, JavaScript, and CSS. The HTML tags define the content of a web page, which we can parse for specific pieces of information, and Bash scripts can do that parsing. An HTML file can be viewed in a web browser to see it properly formatted.

Parsing a text document is simpler than parsing HTML data because we aren't required to strip off the HTML tags. Lynx is a command-line web browser that can download a web page as plain text.

Getting ready

Lynx is not installed in all distributions, but is available via the package manager.

# yum install lynx

or

# apt-get install lynx

How to do it...

Let's download the web page view, as an ASCII text representation, into a text file by using the -dump flag with the lynx command:

$ lynx URL -dump > webpage_as_text.txt

This command lists all the hyperlinks (<a href="link">) separately under a References heading at the foot of the text output. This lets us parse the links separately with regular expressions.

For example:

$ lynx -dump http://google.com > plain_text_page.txt

You can see the plain-text version of the page by using the cat command:

$ cat plain_text_page.txt
   Search [1]Images [2]Maps [3]Play [4]YouTube [5]News [6]Gmail [7]Drive
   [8]More »
   [9]Web History | [10]Settings | [11]Sign in

   [12]St. Patrick's Day 2017

     _______________________________________________________
   Google Search  I'm Feeling Lucky    [13]Advanced search
      [14]Language tools

   [15]Advertising Programs     [16]Business Solutions     [17]+Google
    [18]About Google

                      © 2017 - [19]Privacy - [20]Terms

References
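
Since all the links end up under the References heading at the bottom of the dump, a short pipeline can pull just the URLs back out of it. This is a minimal sketch, and the exact output depends on the page:

$ lynx -dump http://google.com | sed -n '/^References$/,$p' | grep -Eo 'https?://[^ ]+' > links.txt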

Parsing data from a website

The lynx, sed, and awk commands can be used to mine data from websites.

How to do it...

Let's go through the commands used to parse details of actresses from the website:

$ lynx -dump -nolist \
  http://www.johntorres.net/BoxOfficefemaleList.html |
  grep -o "Rank-.*" |
  sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/' |
  sort -nk 1 > actresslist.txt

The output is:

# Only 3 entries shown. All others omitted due to space limits
1   Keira Knightley 
2   Natalie Portman 
3   Monica Bellucci

How it works...

Lynx is a command-line web browser: it can dump a text version of a website as we would see it in a web browser, instead of returning the raw HTML as wget or cURL do. This saves the step of removing HTML tags. The -nolist option shows the links without numbers. Parsing and formatting the lines that contain Rank is done with sed:

sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/'

These lines are then sorted according to the ranks.
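
To see what the sed substitution does on its own, you can feed it a single sample line (the text here is made up purely for illustration):

$ echo "   Rank-2 Natalie Portman" | sed -e 's/ *Rank-\([0-9]*\) *\(.*\)/\1\t\2/'
2   Natalie Portman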

See also

The Downloading a web page as plain text recipe in this article explains the lynx command.

Image crawler and downloader

Image crawlers download all the images that appear in a web page. Instead of going through the HTML page by hand to pick the images, we can use a script to identify the images and download them automatically.

How to do it...

This Bash script will identify and download the images from a web page:

#!/bin/bash
#Desc: Images downloader
#Filename: img_downloader.sh

if [ $# -ne 3 ];
then
  echo "Usage: $0 URL -d DIRECTORY"
  exit -1
fi

while [ -n "$1" ]
do
  case $1 in
  -d) shift; directory=$1; shift ;;
   *) url=${url:-$1}; shift;;
esac
done

mkdir -p $directory;
baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

echo Downloading $url
curl -s $url | egrep -o "<img src=[^>]*>" |
sed 's/<img src="\([^"]*\).*/\1/g' |
sed "s,^/,$baseurl/," > /tmp/$$.list

cd $directory;

while read filename;
do
  echo Downloading $filename
  curl --silent -O "$filename"

done < /tmp/$$.list

An example usage is:

$ ./img_downloader.sh http://www.flickr.com/search/?q=linux -d images

How it works...

The image downloader script reads an HTML page, strips out all tags except <img>, parses the src="URL" values from the <img> tags, and downloads the images to the specified directory. This script accepts a web page URL and the destination directory as command-line arguments.

The [ $# -ne 3 ] statement checks whether the total number of arguments to the script is three; if not, it exits after printing a usage message. Otherwise, this code parses the URL and destination directory:

while [ -n "$1" ]
do 
  case $1 in
  -d) shift; directory=$1; shift ;;
   *) url=${url:-$1}; shift;;
esac
done

The while loop runs until all the arguments are processed. The shift command shifts arguments to the left so that $1 will take the next argument's value; that is, $2, and so on. Hence, we can evaluate all arguments through $1 itself.

The case statement checks the first argument ($1). If that matches -d, the next argument must be a directory name, so the arguments are shifted and the directory name is saved. If the argument is any other string it is a URL.

The advantage of parsing arguments in this way is that we can place the -d argument anywhere in the command line:

$ ./img_downloader.sh -d DIR URL

Or:

$ ./img_downloader.sh URL -d DIR

The egrep -o "<img src=[^>]*>" code prints only the matching strings, which are the <img> tags including their attributes. The pattern [^>]* matches all the characters except the closing >, so a complete tag such as <img src="image.jpg"> is captured.

sed 's/<img src="\([^"]*\).*/\1/g' extracts the URL from the string src="url".

There are two types of image source paths: relative and absolute. Absolute paths contain full URLs that start with http:// or https://. Relative URLs start with / or the image name itself. An example of an absolute URL is http://example.com/image.jpg. An example of a relative URL is /image.jpg.

For relative URLs, the starting / should be replaced with the base URL to transform it to http://example.com/image.jpg. The script initializes the baseurl by extracting it from the initial url with the command:

baseurl=$(echo $url | egrep -o "https?://[a-z.-]+")

The output of the previously described sed command is piped into another sed command to replace a leading / with the baseurl, and the results are saved in a file named for the script's PID: /tmp/$$.list.

sed"s,^/,$baseurl/,"> /tmp/$$.list

The final while loop iterates through each line of the list and uses curl to download the images. The --silent argument is used with curl to avoid extra progress messages from being printed on the screen.

Web photo album generator

Web developers frequently create photo albums of full-sized and thumbnail images. When a thumbnail is clicked, a large version of the picture is displayed. This requires resizing and placing many images. These actions can be automated with a simple Bash script. The script creates thumbnails, places them in a separate directory, and generates the code fragment for the <img> tags automatically.

Getting ready

This script uses a for loop to iterate over every image in the current directory. The usual Bash utilities such as cat are used, along with the convert command from the ImageMagick package. These will generate an HTML album, using all the images, in index.html.

How to do it...

This Bash script will generate an HTML album page:

#!/bin/bash
#Filename: generate_album.sh
#Description: Create a photo album using images in current directory

echo "Creating album.."
mkdir -p thumbs
cat <<EOF1 > index.html
<html>
<head>
<style>

body 
{ 
  width:470px;
margin:auto;
  border: 1px dashed grey;
  padding:10px; 
} 

img
{ 
  margin:5px;
  border: 1px solid black;

} 
</style>
</head>
<body>
<center><h1> #Album title </h1></center>
<p>
EOF1

for img in *.jpg;
do
  convert "$img" -resize "100x" "thumbs/$img"
  echo "<a href=\"$img\" >" >> index.html
  echo "<img src=\"thumbs/$img\" title=\"$img\" /></a>" >> index.html
done

cat <<EOF2 >> index.html

</p>
</body>
</html>
EOF2 

echo Album generated to index.html

Run the script as follows:

$ ./generate_album.sh
Creating album..
Album generated to index.html

How it works...

The initial part of the script is used to write the header part of the HTML page.

The following script redirects all the contents up to EOF1 to index.html:

cat <<EOF1 > index.html
contents...
EOF1

The header includes the HTML and CSS styling.

for img in *.jpg; iterates over each JPEG file name and evaluates the body of the loop.

convert "$img" -resize "100x""thumbs/$img"

creates images of 100 px width as thumbnails.

The following statements generate the required <img> tag and append it to index.html:

echo "<a href="$img">"
echo "<imgsrc="thumbs/$img" title="$img" /></a>">> index.html

Finally, the footer HTML tags are appended with cat as done in the first part of the script.

Twitter command-line client

Twitter is the hottest micro-blogging platform and the latest buzz in online social media. We can use the Twitter API to read the tweets on our timeline from the command line!

Let's see how to do it.

Getting ready

Recently, Twitter stopped allowing people to log in by using plain HTTP Authentication, so we must use OAuth to authenticate ourselves.  Perform the following steps:

  1. Download the bash-oauth library from https://github.com/livibetter/bash-oauth/archive/master.zip, and unzip it to any directory.
  2. Go to that directory and then, inside the subdirectory bash-oauth-master, run make install-all as root.
  3. Go to https://apps.twitter.com/ and register a new app. This will make it possible to use OAuth.
  4. After registering the new app, go to your app's settings and change Access type to Read and Write.
  5. Now, go to the Details section of the app and note two things, the Consumer Key and the Consumer Secret, so that you can substitute these in the script we are going to write.

Great, now let's write the script that uses this.

How to do it...

This Bash script uses the OAuth library to read tweets or send your own updates.

#!/bin/bash
#Filename: twitter.sh
#Description: Basic twitter client

oauth_consumer_key=YOUR_CONSUMER_KEY
oauth_consumer_secret=YOUR_CONSUMER_SECRET

config_file=~/.$oauth_consumer_key-$oauth_consumer_secret-rc

if [[ "$1" != "read" ]] && [[ "$1" != "tweet" ]];
then
  echo -e "Usage: $0 tweet status_message\n   OR\n      $0 read\n"
  exit -1;
fi

#source /usr/local/bin/TwitterOAuth.sh
source bash-oauth-master/TwitterOAuth.sh
TO_init

if [ ! -e $config_file ]; then
  TO_access_token_helper
  if (( $? == 0 )); then
    echo oauth_token=${TO_ret[0]} > $config_file
    echo oauth_token_secret=${TO_ret[1]} >> $config_file
  fi
fi

source $config_file

if [[ "$1" = "read" ]];
then
  TO_statuses_home_timeline '' 'YOUR_TWEET_NAME' '10'
  echo $TO_ret | sed 's/,"/\n/g' | sed 's/":/~/' |
    awk -F~ '{
      if ($1 == "text")
        {txt=$2;}
      else if ($1 == "screen_name")
        printf("From: %s\n Tweet: %s\n\n", $2, txt);
    }' | tr '"' ' '

elif [[ "$1" = "tweet" ]];
then
  shift
  TO_statuses_update '' "$@"
  echo 'Tweeted :)'
fi

Run the script as follows:

$ ./twitter.sh read
Please go to the following link to get the PIN: https://api.twitter.com/oauth/authorize?oauth_token=LONG_TOKEN_STRING
PIN: PIN_FROM_WEBSITE
Now you can create, edit and present Slides offline.
 - by A Googler
$ ./twitter.sh tweet "I am reading Packt Shell Scripting Cookbook"
Tweeted :)
$ ./twitter.sh read | head -2
From: Clif Flynt
 Tweet: I am reading Packt Shell Scripting Cookbook

How it works...

First of all, we use the source command to include the TwitterOAuth.sh library, so we can use its functions to access Twitter. The TO_init function initializes the library.

Every app needs to get an OAuth token and token secret the first time it is used. If these are not present, we use the library function TO_access_token_helper to acquire them. Once we have the tokens, we save them to a config file so we can simply source it the next time the script is run.
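
The cached config file is just a two-line shell fragment that later runs can source; with placeholder values it looks like this:

oauth_token=SAVED_OAUTH_TOKEN
oauth_token_secret=SAVED_OAUTH_TOKEN_SECRET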

The library function TO_statuses_home_timeline fetches the tweets from Twitter. This data is returned as a single long string in JSON format, which starts like this:

[{"created_at":"Thu Nov 10 14:45:20 +0000 "016","id":7...9,"id_str":"7...9","text":"Dining...

Each tweet starts with the created_at tag and includes a text and a screen_name tag. The script will extract the text and screen name data and display only those fields.

The script assigns the long string to the variable TO_ret.

The JSON format uses quoted strings for the key and may or may not quote the value. The key/value pairs are separated by commas, and the key and value are separated by a colon :.

The first sed replaces each ," character set with a newline, making each key/value pair a separate line. These lines are piped to another sed command that replaces each occurrence of ": with a tilde ~, which creates a line like this:

screen_name~"Clif_Flynt"

The final awk script reads each line. The -F~ option splits the line into fields at the tilde, so $1 is the key and $2 is the value. The if command checks for text or screen_name. The text is first in the tweet, but it's easier to read if we report the sender first, so the script saves a text return until it sees a screen_name, then prints the current value of $2 and the saved value of the text.
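
You can watch the whole sed/awk pipeline work on a small, made-up JSON fragment (this is not real Twitter output, just enough structure to exercise the parsing):

$ echo '{"created_at":"Thu Nov 10","text":"Reading the cookbook","screen_name":"Clif_Flynt","id_str":"1"}' |
  sed 's/,"/\n/g' | sed 's/":/~/' |
  awk -F~ '{if ($1 == "text") {txt=$2} else if ($1 == "screen_name") printf("From: %s\n Tweet: %s\n\n", $2, txt)}' |
  tr '"' ' '
From:  Clif_Flynt
 Tweet:  Reading the cookbook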

The TO_statuses_update library function generates a tweet. The empty first parameter defines our message as being in the default format, and the message is a part of the second parameter.

Tracking changes to a website

Tracking website changes is useful to both web developers and users. Checking a website manually is impractical, but a change-tracking script can be run at regular intervals. When a change occurs, it generates a notification.

Getting ready

Tracking changes in terms of Bash scripting means fetching websites at different times and taking the difference by using the diff command. We can use curl and diff to do this.
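
Before writing the full script, you can check the basic idea with a one-off comparison (process substitution is a Bash feature; replace the URL with a page you control):

$ curl --silent http://www.MyWebSite.org -o last.html
$ diff -u last.html <(curl --silent http://www.MyWebSite.org)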

How to do it...

This Bash script combines different commands to track changes in a web page:

#!/bin/bash
#Filename: track_changes.sh
#Desc: Script to track changes to webpage

if [ $# -ne 1 ];
then 
  echo -e "$Usage: $0 URLn"
  exit 1;
fi

first_time=0
# Not first time

if [ ! -e "last.html" ];
then
first_time=1
  # This is the first run
fi

curl --silent $1 -o recent.html

if [ $first_time -ne 1 ];
then
  changes=$(diff -u last.html recent.html)
  if [ -n "$changes" ];
  then
    echo -e "Changes:n"
    echo "$changes"
  else
    echo -e "nWebsite has no changes"
  fi
else
  echo "[First run] Archiving.."

fi


cp recent.html last.html

Let's look at the output of the track_changes.sh script on a website you control. First we'll see the output when a web page is unchanged, and then after making changes.

Note that you should change MyWebSite.org to your website name.

  • First, run the following command:
    $ ./track_changes.sh http://www.MyWebSite.org
    [First run] Archiving..
  • Second, run the command again.
    $ ./track_changes.sh http://www.MyWebSite.org	
    Website has no changes
  • Third, run the following command after making changes to the web page:
    
    $ ./track_changes.sh http://www.MyWebSite.org
    
    Changes: 
    
    --- last.html	2010-08-01 07:29:15.000000000 +0200 
    +++ recent.html	2010-08-01 07:29:43.000000000 +0200 
    @@ -1,3 +1,4 @@
    +added line :)

    data

How it works...

The script checks whether it is running for the first time by using [ ! -e "last.html" ]. If last.html doesn't exist, this is the first run and the web page must be downloaded and saved as last.html.

If it is not the first time, it downloads the new copy as recent.html and checks the difference with the diff utility. Any changes will be displayed as diff output. Finally, recent.html is copied to last.html.

Note that changing the website you're checking will generate a huge diff file the first time you examine it. If you need to track multiple pages, you can create a folder for each website you intend to watch.
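
For example, a small wrapper along these lines keeps each site's last.html in its own folder. This is only a sketch; it assumes a sites.txt file with one URL per line and track_changes.sh in the directory you run it from:

#!/bin/bash
#Filename: track_all.sh (a hypothetical wrapper, not part of the original recipe)
#Desc: Run track_changes.sh once per URL listed in sites.txt

while read -r url; do
  dir=$(echo "$url" | sed -E 's,https?://,,; s,/.*,,')  # use the hostname as the folder name
  mkdir -p "$dir"
  (cd "$dir" && ../track_changes.sh "$url")
done < sites.txt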

Posting to a web page and reading the response

POST and GET are two types of requests in HTTP to send information to, or retrieve information from, a website. In a GET request, we send parameters (name-value pairs) through the web page URL itself. A POST request places the key/value pairs in the message body instead of the URL. POST is commonly used when submitting long forms or to conceal the information submitted from a casual glance.
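
For a quick feel for the difference, here is roughly how the same name-value pairs travel with curl (example.com is just a placeholder):

# GET: the parameters are appended to the URL as a query string
$ curl "http://example.com/guestbook?name=Clif&url=www.noucorp.com"

# POST: the same pairs are sent in the request body instead
$ curl http://example.com/guestbook -d "name=Clif&url=www.noucorp.com"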

Getting ready

For this recipe, we will use the sample guestbook website included in the tclhttpd package. You can download tclhttpd from http://sourceforge.net/projects/tclhttpd and then run it on your local system to create a local web server. The guestbook page requests a name and URL, which it adds to a guestbook to show who has visited the site when the user clicks the Add me to your guestbook button.

This process can be automated with a single curl (or wget) command.

How to do it...

Download the tclhttpd package and cd to the bin folder. Start the tclhttpd daemon with this command:

tclsh httpd.tcl

The format to POST data and read the HTML response from a generic website resembles this:

$ curl URL -d "postvar=postdata1&postvar2=postdata2"

Consider the following example:

$ curl http://127.0.0.1:8015/guestbook/newguest.html \
  -d "name=Clif&url=www.noucorp.com&http=www.noucorp.com"

curl prints a response page like this:

<HTML>
<Head>
<title>Guestbook Registration Confirmed</title>
</Head>
<Body BGCOLOR=white TEXT=black>
<a href="www.noucorp.com">www.noucorp.com</a>

<DL>
<DT>Name
<DD>Clif
<DT>URL
<DD>
</DL>
www.noucorp.com

</Body>

-d is the argument used for posting. The string argument for -d is similar to the GET request semantics. var=value pairs are to be delimited by &.

You can POST the data using wget by using --post-data "string". For example:

$ wget http://127.0.0.1:8015/guestbook/newguest.cgi \
  --post-data "name=Clif&url=www.noucorp.com&http=www.noucorp.com" \
  -O output.html

Use the same format as cURL for name-value pairs. The text in output.html is the same as that returned by the cURL command.

The string to the post arguments (for example, to -d or --post-data) should always be given in quotes. If quotes are not used, & is interpreted by the shell to indicate that this should be a background process.
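
For example, this unquoted command does not do what it appears to (the URL is the local guestbook page from earlier):

$ curl http://127.0.0.1:8015/guestbook/newguest.html -d name=Clif&url=www.noucorp.com

The shell splits the line at the &, so curl runs in the background with only name=Clif as its data, and url=www.noucorp.com is treated as a separate shell variable assignment.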

How it works...

If you look at the website source (use the View Source option from the web browser), you will see an HTML form defined, similar to the following code:

<form action="newguest.cgi"" method="post">
<ul>
<li> Name: <input type="text" name="name" size="40">
<li> Url: <input type="text" name="url" size="40">
<input type="submit">
</ul>
</form>

Here, newguest.cgi is the target URL. When the user enters the details and clicks on the Submit button, the name and url inputs are sent to newguest.cgi as a POST request, and the response page is returned to the browser.

Downloading a video from the internet

There are many reasons for downloading a video. If you are on a metered service, you might want to download videos during off-hours when the rates are cheaper. You might want to watch videos where the bandwidth doesn't support streaming, or you might just want to make certain that you always have that video of cute cats to show your friends.

Getting ready

One program for downloading videos is youtube-dl. This is not included in most distributions and the repositories may not be up to date, so it's best to go to the youtube-dl main site: http://yt-dl.org

You'll find links and information on that page for downloading and installing youtube-dl.

How to do it…

Using youtube-dl is easy. Open your browser and find a video you like. Then copy/paste that URL to the youtube-dl command line.

youtube-dl  https://www.youtube.com/watch?v=AJrsl3fHQ74

While youtube-dl is downloading the file, it will generate a status line on your terminal.

How it works…

The youtube-dl program works by sending a GET message to the server, just as a browser would do. It masquerades as a browser so that YouTube or other video providers will deliver the video as if the device were streaming it.

The --list-formats (-F) option lists the formats in which a video is available, and the --format (-f) option specifies which format to download. This is useful if you want to download a higher-resolution video than your Internet connection can reliably stream.
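
A typical session might look like this (the format code 22 is only an example; run -F first to see which formats the video actually offers):

$ youtube-dl -F https://www.youtube.com/watch?v=AJrsl3fHQ74
$ youtube-dl -f 22 https://www.youtube.com/watch?v=AJrsl3fHQ74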

Summary

In this article, we learned how to download and parse website data, send data to forms, and automate website-usage tasks. We can automate many activities that we perform interactively through a browser with a few lines of scripting.
