Scraping the web and collecting files
In this recipe, we will learn how to collect data by web scraping, and we will write a shell script that automates the task.
Getting ready
Besides having a Terminal open, you need to have basic knowledge of the grep and wget commands.
How to do it…
Now, we will write a script to scrape the contents of imdb.com. We will use the grep and wget commands in the script to get the contents. Create a scrap_contents.sh script and write the following code in it:
#!/bin/bash
# Download the site recursively (up to 5 levels deep) into a local data/ directory
mkdir -p data
cd data
wget -q -r -l5 -x https://imdb.com
cd ..
# Extract the value of every href="..." attribute from the downloaded files
grep -r -Po -h '(?<=href=")[^"]*' data/ > links.csv
# Keep only absolute links that start with http
grep "^http" links.csv > links_filtered.csv
# Sort the links and remove duplicates
sort -u links_filtered.csv > links_final.csv
# Remove the downloaded files and the intermediate lists
rm -rf data links.csv links_filtered.csv
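One way to try the script is to make it executable, run it, and then peek at the output file; when it finishes, links_final.csv holds the de-duplicated list of links:
$ chmod +x scrap_contents.sh
$ ./scrap_contents.sh
$ head links_final.csv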
How it works…
In the preceding script, we have written code to get the contents of a website. The wget utility retrieves files from the web using the http, https, and ftp protocols. In this example, we are getting data from imdb.com and therefore we specified...
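The link-extraction step is easier to follow in isolation. Here is a small, self-contained demonstration of the same grep pattern, run against a made-up HTML snippet rather than the downloaded files:
$ echo '<a href="https://imdb.com/title/tt0111161/">Top</a> <a href="/help">Help</a>' \
    | grep -Po '(?<=href=")[^"]*'
https://imdb.com/title/tt0111161/
/help
The -P option enables Perl-compatible regular expressions, where the lookbehind (?<=href=") matches text that is preceded by href=" without including it in the match, and -o prints only the matched part, one match per line. The subsequent grep "^http" step then discards relative links such as /help, keeping only absolute URLs.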