Getting started with Newspaper3k
Before you can use Newspaper3k, you must install it. This is as simple as running the following command:
pip install newspaper3k
During a previous installation, I received an error stating that an NLTK component had not been downloaded, so keep an eye out for similar errors. The fix was as simple as running an NLTK download command. Other than that, the library has worked very well for me. Once the installation is complete, you can import it into your Python code and use it immediately.
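If you want to confirm the installation and head off the NLTK error mentioned above, a minimal sketch along the following lines may help. The helper names check_install and ensure_nltk_resource are hypothetical, and the default resource name punkt is an assumption: it is the tokenizer data that is commonly reported missing, but you should use whichever resource name your own error message reports.

```python
import importlib.util


def check_install(modules=("newspaper", "nltk")):
    """Return {module_name: bool} showing whether each module can be imported.

    Note that the newspaper3k package is imported under the name "newspaper".
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def ensure_nltk_resource(path="tokenizers/punkt", package="punkt"):
    """Download an NLTK resource only if it is not already available locally.

    The default values are an assumption; substitute the resource named
    in the error message you actually receive.
    """
    import nltk  # imported here so check_install() works without NLTK

    try:
        nltk.data.find(path)      # raises LookupError when the data is missing
    except LookupError:
        nltk.download(package)    # fetches the data (requires network access)


# Report which of the two libraries are importable in this environment.
print(check_install())
```

If the report shows that nltk is available and you do hit the missing-component error, calling ensure_nltk_resource() once should be enough, because the downloaded data is cached on disk for later runs.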
In the previous chapter, I showed flexible but more manual approaches to scraping websites. A lot of junk text snuck through, and cleaning the data was quite involved and difficult to standardize. Newspaper3k takes scraping to another level, making it easier than I have ever seen anywhere else. I recommend that you use Newspaper3k for your news scraping whenever you can.
Scraping all news URLs from a website
Harvesting URLs from...