Choosing between libraries, APIs, and source data
As part of this demonstration, I showed several ways to pull useful data off the internet. Several libraries can load data directly, but they are limited in what they have available. NLTK offered only a small portion of the complete Gutenberg book archive, so we had to use the Requests library to load The Metamorphosis. I also demonstrated that Requests, accompanied by BeautifulSoup, can easily harvest links and raw text.
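As a reminder of what that pairing looks like, here is a minimal sketch, not the exact code from the demonstration. The helper names extract_links_and_text and fetch_page are hypothetical, as is the inline sample page, which stands in for a downloaded one so the example runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def extract_links_and_text(html):
    """Return (links, text) parsed from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Collect the href of every anchor tag that actually has one.
    links = [a["href"] for a in soup.find_all("a", href=True)]
    # Flatten the page to plain text, one stripped chunk per line.
    text = soup.get_text(separator="\n", strip=True)
    return links, text


def fetch_page(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Hypothetical sample page used in place of a live download.
sample_html = """
<html><body>
  <h1>Books</h1>
  <p>Read <a href="/ebooks/1">one</a> or <a href="/ebooks/2">two</a>.</p>
</body></html>
"""
links, text = extract_links_and_text(sample_html)
```

For a live page, the two helpers chain together as extract_links_and_text(fetch_page(url)), where url is whatever page you want to harvest. The raw text that comes back will usually still need cleanup.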
Python libraries with built-in data loaders make loading data very easy, but you are limited to what those libraries make available. If you just want some data to play with, with minimal cleanup, this may be ideal, but there will still be cleanup. You will not get away from that when working with text.
Other web resources expose their own APIs, which makes it pretty simple to load data after sending a request to them...