It is possible to download the data manually from the original website or from one of many online repositories. However, several versions of the dataset exist: some are cleaned in a particular way, and others are in raw form. To avoid confusion, it is best to use a consistent acquisition method. The scikit-learn library provides a utility function for loading the dataset. Once the dataset is downloaded, it is automatically cached, so we won't need to download the same dataset twice. In most cases, caching a dataset, especially a relatively small one, is considered good practice. Other Python libraries also provide download utilities, but not all of them implement automatic caching. This is another reason why we love scikit-learn.
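If you are curious where the cached files end up, the following minimal sketch may help. It relies on scikit-learn's `get_data_home` helper, which returns the cache directory (creating it if needed). The variable names `cache_dir` and `cached_entries` are our own and purely illustrative, and the actual location will differ from machine to machine.

```python
from pathlib import Path

from sklearn.datasets import get_data_home

# get_data_home() returns the folder scikit-learn uses to cache downloaded
# datasets. By default this is "scikit_learn_data" in the user's home
# directory, unless the SCIKIT_LEARN_DATA environment variable is set.
cache_dir = Path(get_data_home())
print(cache_dir)

# List whatever has been cached so far; the list is empty if no dataset
# has been downloaded yet.
cached_entries = sorted(p.name for p in cache_dir.iterdir())
print(cached_entries)
```

Deleting a dataset's folder inside this directory should simply cause it to be downloaded again the next time it is requested.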
To load the data, we can import the loader function for the 20 newsgroups data as follows:
>>> from sklearn.datasets import fetch_20newsgroups
Then we can download...