Web scraping
Even though some sites offer APIs, most websites are designed mainly for human eyes and only provide HTML pages formatted for humans. If we want a program to fetch data from such a website, we have to parse the markup to extract the information we need. Web scraping is the technique of using a computer program to analyze a web page and pull out the data needed.
There are many methods to fetch the content from the site with Python modules:
- Use urllib/urllib2 to create an HTTP request that fetches the web page, and use BeautifulSoup to parse the HTML
- Use Scrapy (http://scrapy.org) to parse an entire website; it helps to create web spiders
- Use the requests module to fetch and lxml to parse
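Whichever fetching method we choose, the parsing step is the same: walk the HTML and pull out the pieces we care about. As a minimal sketch, the snippet below extracts link targets from a page using only the standard library's html.parser, so it runs without BeautifulSoup or lxml installed; the static HTML string stands in for a response body that urllib or requests would normally fetch over the network.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Static HTML standing in for a fetched page (illustrative only).
html_doc = ('<html><body>'
            '<a href="/about">About</a>'
            '<a href="/contact">Contact</a>'
            '</body></html>')

parser = LinkExtractor()
parser.feed(html_doc)
print(parser.links)  # ['/about', '/contact']
```

BeautifulSoup or lxml would do the same job with far less code and much more tolerance for malformed markup, which is why they are the usual choice for real scrapers.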
urllib / urllib2 module
Urllib is a high-level module that allows us to write scripts that talk to services over different protocols, such as HTTP, HTTPS, and FTP.
Useful methods of urllib/urllib2
Urllib/urllib2 provide methods that can be used for getting resources from URLs, which include opening web pages, encoding...
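A few of these helpers can be shown without touching the network. The sketch below uses the Python 3 locations of these functions (urllib.parse; in Python 2 they lived in urllib/urlparse), and the URL and parameters are made up for illustration.

```python
from urllib.parse import urlencode, urljoin, urlparse

# Encode query parameters into a URL query string
# (spaces become '+', special characters are percent-encoded).
params = {"q": "web scraping", "page": 2}
query = urlencode(params)
print(query)  # q=web+scraping&page=2

# Resolve a relative link against a base URL, as a scraper must
# when following links found inside a fetched page.
print(urljoin("http://example.com/docs/", "../index.html"))
# http://example.com/index.html

# Split a URL into its components.
parts = urlparse("http://example.com/docs/page?x=1")
print(parts.netloc, parts.path)  # example.com /docs/page
```

Fetching itself is done with `urllib.request.urlopen(url)` (urllib2.urlopen in Python 2), which returns a file-like object whose `read()` method yields the page body.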