Chapter 4. Web Mining Techniques
Web data mining techniques are used to explore the data available online and extract the relevant information from it. Searching the web is a complex process that requires several different algorithms, which are the main focus of this chapter. Given a search query, the relevant pages are obtained using the data available on each web page, which is typically divided into the page content and the page's hyperlinks to other pages. A search engine usually has multiple components:
- A web crawler or spider for collecting web pages
- A parser that extracts content and preprocesses web pages
- An indexer to organize the web pages in data structures
- An information retrieval system to score the documents most relevant to a query
- A ranking algorithm to order the web pages in a meaningful manner
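As a rough illustration of how these components fit together, the following is a minimal sketch rather than a real search engine. It assumes a tiny hypothetical in-memory "web" (the `TOY_WEB` dictionary and its `*.example` URLs are invented for this example), and it scores pages by raw term frequency, a deliberately simple stand-in for the retrieval and ranking methods discussed in this chapter.

```python
import re
from collections import Counter, defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("web mining extracts information from web pages",
                  ["b.example", "c.example"]),
    "b.example": ("a crawler collects pages by following hyperlinks",
                  ["c.example"]),
    "c.example": ("ranking orders pages returned for a query",
                  ["a.example"]),
}

def crawl(seed, web):
    """Crawler: breadth-first traversal of hyperlinks from a seed URL."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        yield url, text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)

def parse(text):
    """Parser: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: inverted index mapping term -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in pages:
        for term, count in Counter(parse(text)).items():
            index[term][url] = count
    return index

def retrieve(query, index):
    """Retrieval: score each page by the summed frequency of query terms."""
    scores = Counter()
    for term in parse(query):
        for url, tf in index.get(term, {}).items():
            scores[url] += tf
    return scores

def rank(scores):
    """Ranking: order pages by descending score, breaking ties by URL."""
    return sorted(scores, key=lambda url: (-scores[url], url))

index = build_index(crawl("a.example", TOY_WEB))
results = rank(retrieve("web pages", index))
print(results)
```

Each function corresponds to one component in the list above, so any of them can be swapped for a more realistic implementation (for example, a link-based ranking) without changing the others.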
These parts can be divided into web structure mining techniques and web content mining techniques.
The web crawler, indexer, and ranking procedures refer to...