One million web pages
To test the performance of concurrent downloading, it would be preferable to have a larger target than a single website. For this reason, we will use the Alexa list in this chapter, which ranks the 1 million most popular websites according to users who have installed the Alexa Toolbar. Only a small percentage of people use this browser plugin, so the data is not authoritative, but it is fine for our purposes.
These top 1 million web pages can be browsed on the Alexa website at http://www.alexa.com/topsites. Additionally, a compressed spreadsheet of this list is available at http://s3.amazonaws.com/alexa-static/top-1m.csv.zip, so scraping Alexa is not necessary.
Parsing the Alexa list
The Alexa list is provided as a spreadsheet with two columns, the rank and the domain.

Extracting this data requires a number of steps, as follows:
Download the .zip file.
Extract the CSV file from this .zip file.
Parse the CSV file.
Iterate each row of the CSV file to extract the domain.
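As a rough illustration of the last three steps, the following minimal sketch extracts, parses, and iterates a zipped CSV using only the Python 3 standard library. The download step is left out so that the example runs offline: a tiny stand-in archive is built in memory instead, and in practice its bytes would come from fetching the .zip URL given earlier. The function name extract_domains, the max_urls parameter, and the sample rows are hypothetical, and the sketch assumes that the archive contains a single CSV file whose rows are rank,domain pairs.

```python
import csv
import io
import zipfile


def extract_domains(zipped_bytes, max_urls=None):
    """Return the domains listed in a zipped CSV of rank,domain rows.

    Hypothetical helper for illustration; max_urls optionally caps
    how many domains are returned.
    """
    domains = []
    # Step 2: open the .zip archive from the downloaded bytes.
    with zipfile.ZipFile(io.BytesIO(zipped_bytes)) as archive:
        # Assumption: the archive holds exactly one CSV file.
        csv_name = archive.namelist()[0]
        with archive.open(csv_name) as raw:
            # Step 3: parse the CSV; csv.reader needs text, not bytes.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Step 4: iterate the rows and keep only the domain column.
            for _rank, domain in csv.reader(text):
                domains.append(domain)
                if max_urls is not None and len(domains) >= max_urls:
                    break
    return domains


# Stand-in for step 1: build a small archive in memory.
# These rows are made-up sample data, not real Alexa rankings.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("top-1m.csv", "1,example.com\n2,example.org\n3,example.net\n")

print(extract_domains(buffer.getvalue()))
# prints ['example.com', 'example.org', 'example.net']
```

Passing max_urls=2 to the same call would return only the first two domains, which is convenient when testing against a short prefix of the full list.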
Here is an implementation...