Automated scraping with Scrapely
To scrape the annotated fields, Portia uses a library called Scrapely, an open-source tool developed independently of Portia and available at https://github.com/scrapy/scrapely. Scrapely uses training data to build a model of what to scrape from a web page; this model can then be applied to other web pages with the same structure. Here is an example to show how it works:
(portia_example)$ python
>>> from scrapely import Scraper
>>> s = Scraper()
>>> train_url = 'http://example.webscraping.com/view/Afghanistan-1'
>>> s.train(train_url, {'name': 'Afghanistan', 'population': '29,121,286'})
>>> test_url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> s.scrape(test_url)
[{u'name': [u'United Kingdom'], u'population': [u'62,348,447']}]
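The result in the session above is a list of records in which every field maps to a list of strings. As a minimal sketch of how that output might be tidied for later use, the helpers below flatten one record and convert the population to an integer. The names flatten_record and parse_population are hypothetical and are not part of Scrapely, and the sample data is copied from the session rather than scraped live.

```python
def flatten_record(record):
    """Collapse each list of matches in a Scrapely-style record to its first value.

    Hypothetical helper, not part of Scrapely. Empty lists become None.
    """
    return {field: (values[0] if values else None)
            for field, values in record.items()}


def parse_population(text):
    """Convert a string such as '62,348,447' to an integer (hypothetical helper)."""
    return int(text.replace(',', ''))


# Output copied from the session above; the u'' prefixes there are Python 2
# notation for unicode strings and are not needed in Python 3.
results = [{'name': ['United Kingdom'], 'population': ['62,348,447']}]

country = flatten_record(results[0])
country['population'] = parse_population(country['population'])
print(country)  # {'name': 'United Kingdom', 'population': 62348447}
```

Taking only the first match is an assumption that suits pages with one value per field; a page with repeated fields would need the full lists.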
First, Scrapely is given the data we want to scrape from the Afghanistan
web page to train the model, being the country...