What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Scraping the Data

In the previous chapter, we built a crawler which follows links to download the web pages we want. This is interesting but not useful-the crawler downloads a web page, and then discards the result. Now, we need to make this crawler achieve something by extracting data from each web page, which is known as scraping.

We will first cover browser tools to examine a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup and lxml. Finally, the chapter will conclude with a comparison of these three scraping alternatives.

In this chapter, we will cover the following topics:

Analyzing a web page
Approaches to scrape a web page
Using the console
xpath selectors
Scraping results

...

Key benefits

A hands-on guide to web scraping using Python with solutions to real-world problems

Create a number of different web scrapers in Python to extract information

This book includes practical examples on using the popular and well-maintained libraries in Python for your web scraping needs

Description

The Internet contains the most useful set of data ever assembled, most of which is publicly accessible for free. However, this data is not easily usable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online. This book is the ultimate guide to using the latest features of Python 3.x to scrape data from websites. In the early chapters, you'll see how to extract data from static web pages. You'll learn to use caching with databases and files to save time and manage the load on servers. After covering the basics, you'll get hands-on practice building a more sophisticated crawler using browsers, crawlers, and concurrent scrapers. You'll determine when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA. You'll find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with Scrapy libraries and implement your learning on real websites. By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics.

What you will learn

• Extract data from web pages with simple Python programming

• Build a concurrent crawler to process web pages in parallel

• Follow links to crawl a website

• Extract features from the HTML

• Cache downloaded HTML for reuse

• Compare concurrent models to determine the fastest crawler

• Find out how to parse JavaScript-dependent websites

• Interact with forms and sessions

What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Frequently bought together

€32.99

€41.99

€29.99

Total € 104.97

Gerry Aug 26, 2017

Finally a book that covers more than just the basics of webscraping. Packt needs better proof readers though. Language errors.

Amazon Verified review

Anonymous Feb 17, 2018

I would not recommend this book for any beginners in Python Web Scraping. Why? The website example they use in the book HAS NOT BEEN maintained and the code used in the book to reference the example website DOES NOT MATCH. I also found multiple complaints on the Internet from others. You will be so frustrated figuring out if you typed the code wrong, where in fact, the website links of the actual site don't match what's typed in the book. I'm glad I have some prior programming experience where I can fix some of the issues I experienced on the fly, but this takes additional time and testing. Overall, the book does go in depth and I think will be good for those with prior Python Web Scraping experience.

Python Web Scraping: Hands-on data scraping and crawling using PyQT, Selnium, HTML and Python , Second Edition

What do you get with eBook?

Python Web Scraping

Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Regular expressions

CSS selectors and your Browser Console

XPath Selectors

LXML and Family Trees

Comparing performance

Scraping results

Summary

Page 1 of 9

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with eBook?

Product Details

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs

Python Web Scraping: Hands-on data scraping and crawling using PyQT, Selnium, HTML and Python , Second Edition

What do you get with eBook?

Contact Details

Billing Address

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with eBook?

Contact Details

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs