Downloading a complete dump of Wikipedia for a real-life Spark ML project
In this recipe, we will download and explore a dump of Wikipedia articles so that we have a real-life dataset to work with. You will need either the curl command-line tool or a browser to retrieve the compressed file, which is about 13.6 GB at the time of writing. Because of its size, we recommend using curl.
How to do it...
- Start by downloading the dataset using the following command:
curl -L -O http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
- Now decompress the bzip2 archive:
bunzip2 enwiki-latest-pages-articles-multistream.xml.bz2
This should create an uncompressed file named enwiki-latest-pages-articles-multistream.xml, which is about 56 GB.
- Let us take a look at the Wikipedia XML file:
head -n50 enwiki-latest-pages-articles-multistream.xml
The output begins with the MediaWiki root element:
<mediawiki xmlns="http://www.mediawiki.org...
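Once the XML is on disk, a quick way to sanity-check it from Spark is to read it as plain text and count the lines; if the spark-xml package (com.databricks:spark-xml) is on the classpath, each <page> element can also be parsed into a DataFrame row. The following is a minimal sketch under those assumptions: the local file path is illustrative, the spark-xml dependency is not part of this recipe's download, and the parsing step is shown commented out.

import org.apache.spark.sql.SparkSession

object WikipediaPeek {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WikipediaPeek")
      .master("local[*]")   // local mode is enough for a quick exploration
      .getOrCreate()

    // Assumed path to the decompressed dump from the previous step
    val dumpPath = "enwiki-latest-pages-articles-multistream.xml"

    // Cheapest sanity check: treat the dump as plain text and count lines
    val lineCount = spark.read.textFile(dumpPath).count()
    println(s"Line count: $lineCount")

    // Optional: if the spark-xml package is available, parse <page> elements
    // val pages = spark.read
    //   .format("com.databricks.spark.xml")
    //   .option("rowTag", "page")
    //   .load(dumpPath)
    // pages.select("title").show(10, truncate = false)

    spark.stop()
  }
}

Counting lines forces a full pass over the 56 GB file, so it doubles as a rough check that the download and decompression completed without corruption.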