Downloading the data
This chapter makes use of the Reuters-21578 dataset: a venerable collection of articles that were published on the Reuters newswire in 1987. It is one of the most widely used datasets for testing the categorization and classification of text. The copyright for the text of articles and annotations in the Reuters-21578 collection resides with Reuters Ltd. Reuters Ltd. and Carnegie Group, Inc. have agreed to allow the free distribution of this data for research purposes only.
Note
You can download the example code for this chapter from Packt Publishing's website or from https://github.com/clojuredatascience/ch6-clustering.
As usual, within the sample code is a script to download and unzip the files to the data directory. You can run it from within the project directory with the following command:
script/download-data.sh
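The script itself is not reproduced here. As a rough idea of what a download-and-unpack step like this might involve, the following is a minimal sketch and not the actual contents of script/download-data.sh. It assumes that curl and tar are installed, that the archive is still hosted at the UCI address quoted in this section, and that the files belong in a directory named data. The function name fetch_reuters and both variable names are hypothetical.

```shell
# Hypothetical sketch only; the real script/download-data.sh may differ.

# Archive location quoted in this section (valid at the time of writing).
REUTERS_URL="http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz"

# Assumed target directory, relative to the project directory.
DATA_DIR="data"

# Download the archive and unpack it into DATA_DIR.
fetch_reuters() {
  archive="$DATA_DIR/$(basename "$REUTERS_URL")"
  mkdir -p "$DATA_DIR" || return 1
  # -f: fail on HTTP errors, -L: follow redirects, -o: write to this file
  curl -fL -o "$archive" "$REUTERS_URL" || return 1
  # -x: extract, -z: gunzip, -f: archive file, -C: extract into this directory
  tar -xzf "$archive" -C "$DATA_DIR"
}
```

Nothing is downloaded until the function is called: after loading the definitions into a shell started in the project directory, running fetch_reuters would fetch the archive and extract it under data.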
Alternatively, at the time of writing, the Reuters dataset can be downloaded from http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz. The rest...