Web content mining
This type of mining focuses on extracting information from the content of web pages. Each page is typically gathered and organized using a parsing technique, processed to strip the unimportant parts of the text (natural language processing), and then analyzed by an information retrieval system that matches the relevant documents to a given query. These three components are discussed in the following paragraphs.
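To make the three-step pipeline concrete, here is a minimal, hypothetical sketch in plain Python. It assumes the "parsing" step has already produced plain-text documents, models preprocessing as lowercasing, tokenization, and stopword removal, and stands in for the retrieval system with a simple term-overlap ranking (real systems would use TF-IDF or similar weighting):

```python
import re
from collections import Counter

# A small, illustrative stopword list (real NLP pipelines use larger ones).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def preprocess(text):
    # Lowercase, tokenize into alphabetic words, drop stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def rank(documents, query):
    # Score each document by how often the query terms occur in it,
    # then return document indices ordered from best to worst match.
    query_terms = set(preprocess(query))
    scores = []
    for i, doc in enumerate(documents):
        counts = Counter(preprocess(doc))
        scores.append((sum(counts[t] for t in query_terms), i))
    return [i for score, i in sorted(scores, reverse=True) if score > 0]

docs = [
    "Web mining extracts information from web pages.",
    "Cooking recipes for a quick dinner.",
]
print(rank(docs, "mining web content"))  # only the first document matches
```

Only the first document shares terms with the query, so it is the single result returned; the unrelated document is filtered out entirely.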
Parsing
A web page is written in HTML, so the first operation is to extract the relevant pieces of information from the markup. An HTML parser builds a tree of tags from which the content can be extracted. Nowadays, many parsers are available, but as an example we use the Scrapy library (see Chapter 7, Movie Recommendation System Web Application), which provides a command-line parser. Let's say we want to parse the main page of Wikipedia, https://en.wikipedia.org/wiki/Main_Page. We simply type this in a terminal:
scrapy shell 'https://en.wikipedia.org/wiki/Main_Page'
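To illustrate the tag-tree idea independently of Scrapy, here is a small sketch using Python's built-in `html.parser` module. It walks the tags of a tiny hypothetical page and collects the text found inside `<title>` and `<h1>` elements, which is the same extract-by-tag pattern a Scrapy selector applies to a downloaded page:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of <title> and <h1> tags."""

    def __init__(self):
        super().__init__()
        self.capture = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self.capture = True

    def handle_endtag(self, tag):
        if tag in ("title", "h1"):
            self.capture = False

    def handle_data(self, data):
        # Only keep data seen while inside a tag of interest.
        if self.capture and data.strip():
            self.texts.append(data.strip())

# A hypothetical page standing in for downloaded HTML.
page = ("<html><head><title>Main Page</title></head>"
        "<body><h1>Welcome</h1><p>Body text.</p></body></html>")
parser = TextExtractor()
parser.feed(page)
print(parser.texts)  # ['Main Page', 'Welcome']
```

The paragraph text is skipped because the parser only captures data while inside one of the target tags; swapping the tag set changes what is extracted without touching the traversal logic.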