Data harvesting through web scraping
Web scraping is the technique of extracting data from web pages using software. It is an important component of data harvesting and is typically implemented through programs called web crawlers. Data harvesting, sometimes loosely called data mining, is often used in data science workflows to collect information from the internet, usually from websites (as opposed to APIs), and then to process that data for different purposes using various algorithms.
At a very high level, the process involves making a request for a web page, fetching its content, parsing its structure, and then extracting the desired information. The extracted information can be images, paragraphs of text, or tabular data containing stock information and prices, for example; pretty much anything that is present on a web page can be targeted. If the content is spread across multiple web pages, the crawler will also extract the links and will automatically follow them to pull the rest of the pages, repeatedly applying the same...
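
The request, fetch, parse, extract, and follow-links loop described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a production crawler: the names `PageParser`, `fetch`, `crawl`, and `max_pages` are hypothetical, the sketch extracts only paragraph text, and it assumes pages decode as UTF-8.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects paragraph text and link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.paragraphs = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Keep only text that appears inside a <p> element.
        if self._in_paragraph:
            self.paragraphs[-1] += data


def fetch(url):
    """Hypothetical default fetcher: download a page and decode it as UTF-8."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl returning {url: [paragraph text, ...]}."""
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is fetched twice
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        parser = PageParser(url)
        parser.feed(fetch_page(url))  # request + fetch, then parse
        results[url] = [p.strip() for p in parser.paragraphs]  # extract
        for link in parser.links:  # follow links to the remaining pages
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Passing the fetcher in as a parameter is a design choice made for this sketch: it lets the crawl logic be exercised against an in-memory dictionary of pages, for example `crawl(start, fetch_page=pages.__getitem__)`, without touching the network. The `max_pages` cap is there so that the loop always terminates, even on sites with an unbounded number of links.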