Downloading a complete dump of Wikipedia for a real-life Spark ML project
In this recipe, we will download and explore a dump of Wikipedia so that we can work with a real-life example. The dataset we will download is a dump of Wikipedia articles. You will need either the curl command-line tool or a browser to retrieve the compressed file, which is about 13.6 GB at the time of writing. Due to its size, we recommend curl.
How to do it...
- You can start with downloading the dataset using the following command:
curl -L -O http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Now, decompress the bzip2 file:
bunzip2 enwiki-latest-pages-articles-multistream.xml.bz2
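Once the file is decompressed, a quick sanity check is to load a few lines of the raw XML into Spark. The following is a minimal Scala sketch, not part of the recipe's own code, assuming Spark 2.x is available and that the decompressed dump sits in the current working directory (adjust the path as needed):

import org.apache.spark.sql.SparkSession

object WikiDumpSanityCheck {
  def main(args: Array[String]): Unit = {
    // Local mode is an assumption for a quick check; use your cluster's master in practice.
    val spark = SparkSession.builder()
      .appName("WikiDumpSanityCheck")
      .master("local[*]")
      .getOrCreate()

    // The path assumes the decompressed dump is in the working directory.
    val rawXml = spark.read.textFile("enwiki-latest-pages-articles-multistream.xml")

    // Print the first few lines of raw XML to confirm the file is readable.
    rawXml.take(10).foreach(println)

    spark.stop()
  }
}

Note that Spark can also read the .bz2 file directly through Hadoop's bzip2 codec, but bzip2 decompression is CPU-intensive, so decompressing once up front tends to make repeated exploration faster.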