You're reading from Python Web Scraping Cookbook: Over 90 proven recipes to get you scraping with Python, microservices, Docker, and AWS

Product type: Paperback
Published: Feb 2018
Publisher: Packt
ISBN-13: 9781787285217
Length: 364 pages
Edition: 1st Edition
Languages: Python
Tools: AWS
Concepts: Data Mining
Author: Michael Heydt

Respecting robots.txt

Many sites want to be crawled. It is in the nature of the web: publishers put content on their sites to be seen by humans, but it is also important that other computers see that content. A great example is search engine optimization (SEO). SEO is the process of designing your site so that it is crawled well by spiders such as Google's, so you are actively encouraging scraping. At the same time, a publisher may want only specific parts of their site crawled and may tell crawlers to keep their spiders off certain portions, either because that content is not for sharing or because it is not important enough to justify crawling it and wasting the web server's resources.

The rules for what you are and are not allowed to crawl are usually contained in a file, found on most sites, known as robots.txt. The robots.txt file is human-readable but parsable, and can be used to identify...
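As a minimal sketch of how such a file can be parsed and consulted, the following uses the standard library's urllib.robotparser module. This is an illustration only and not necessarily the approach the recipe itself takes. The robots.txt content, the crawler name MyCrawler, and the example.com URLs are all hypothetical, and the rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration).
# It asks all crawlers to stay out of /private/ and allows everything else.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "MyCrawler" is a made-up user agent; it falls under the "*" rules above.
allowed = parser.can_fetch("MyCrawler", "https://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "https://example.com/private/data.html")

print(allowed)  # True: only the Allow rule matches this path
print(blocked)  # False: the Disallow rule for /private/ matches first
```

Against a live site, the same parser can instead be pointed at the site's robots.txt with set_url() followed by read(), and can_fetch() can then be checked before each request.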

Authors (1)

Michael Heydt

Michael Heydt is an independent consultant, programmer, educator, and trainer. He has a passion for learning and sharing his knowledge of new technologies. Michael has worked in multiple industry verticals, including media, finance, energy, and healthcare. Over the last decade, he has worked extensively with web, cloud, and mobile technologies and has managed user experiences, interface design, and data visualization for major consulting firms and their clients. Michael's current company, Seamless Thingies, focuses on IoT development and connecting everything with everything. Michael is the author of numerous articles, papers, and books, such as D3.js By Example, Instant Lucene.NET, Learning Pandas, and Mastering Pandas for Finance, all by Packt. Michael is also a frequent speaker at .NET user groups and various mobile, cloud, and IoT conferences, and delivers webinars on advanced technologies.