Getting Started with Beautiful Soup

Getting Started with Beautiful Soup: Learn how to extract information from websites using Beautiful Soup and the Python urllib2 module. This practical, hands-on guide covers everything you need to know to get a head start in website scraping.

Chapter 1. Installing Beautiful Soup

Before we begin using Beautiful Soup, we should ensure that it is properly installed on our machine. The steps required are so simple that any user can install this in no time. In this chapter, we will be covering the following topics:

  • Installing Beautiful Soup
  • Verifying the installation of Beautiful Soup

Installing Beautiful Soup

Python supports the installation of third-party modules such as Beautiful Soup. In the best case, the module developer provides a platform-specific installer: an executable installer for Windows, an rpm package for RPM-based Linux distributions (Red Hat, openSUSE, and so on), or a Debian package for Debian-based distributions (Debian, Ubuntu, and so on). This is not always the case, so we should know the alternatives for when a platform-specific installer is not available. We will discuss the installation options available for Beautiful Soup on Linux, Windows, and Mac OS X. The examples that follow use Python 2.7.5; the instructions for Python 3 may differ. You can go directly to the installation section for your operating system.

Installing Beautiful Soup in Linux

Installing Beautiful Soup is pretty simple and straightforward in Linux machines. For recent versions of Debian or Ubuntu, Beautiful Soup is available as a package and we can install this using the system package manager. For other versions of Debian or Ubuntu, where Beautiful Soup is not available as a package, we can use alternative methods for installation.

Normally, there are three ways to install Beautiful Soup on Linux machines:

  • Using package manager
  • Using pip
  • Using easy_install

The options are ordered by complexity, simplest first, so that we can avoid trial and error. Using the package manager requires the least effort from the user, so we will cover it first. If one method succeeds, we can skip the others, because all three methods install the same library.
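The ordering above can be sketched in a few lines of Python. This is only an illustration, not part of the original chapter: the function name choose_install_command and its is_available hook are hypothetical, and the sketch assumes Python 3, where shutil.which reports whether a tool is on the PATH.

```python
import shutil

# Install commands from this chapter, in the order the text recommends.
INSTALL_OPTIONS = [
    ("apt-get", ["sudo", "apt-get", "install", "python-bs4"]),
    ("pip", ["sudo", "pip", "install", "beautifulsoup4"]),
    ("easy_install", ["sudo", "easy_install", "beautifulsoup4"]),
]


def choose_install_command(is_available=shutil.which):
    """Return the first install command whose tool is found, or None.

    is_available is a hypothetical hook that makes the function easy to
    test; by default it is shutil.which, which returns None when the
    tool is not on the PATH.
    """
    for tool, command in INSTALL_OPTIONS:
        if is_available(tool):
            return command
    return None


if __name__ == "__main__":
    command = choose_install_command()
    # When none of the tools is present, setup.py remains as the last option.
    print(" ".join(command) if command else "Fall back to setup.py")
```

The sketch only picks a command; it does not run anything, so it is safe to try on any machine.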

Installing Beautiful Soup using package manager

Linux machines normally come with a package manager to install various packages. In the recent version of Debian or Ubuntu, since Beautiful Soup is available as a package, we will be using the system package manager for installation. In Linux machines such as Ubuntu and Debian, the default package manager is based on apt-get and hence we will use apt-get to do the task.

Just open up a terminal and type in the following command:

sudo apt-get install python-bs4

The preceding command will install Beautiful Soup Version 4 on our Linux operating system. Installing new packages in the system normally requires root user privileges, which is why we prefix the apt-get command with sudo. Without sudo, we will end up with a permission denied error. If the package list is already up to date, we will see a success message in the command line itself, as in the following screenshot:

[Screenshot: Installing Beautiful Soup using package manager]

Since we are using a recent version of Ubuntu or Debian, python-bs4 will be listed in the apt repository. But if the preceding command fails with a package not found error, the local package list is probably not up to date. This normally happens if we have just installed our operating system and the package list has not yet been downloaded from the package repository. In this case, we need to first update the package list using the following command:

sudo apt-get update

The preceding command will update the package list from the online package repositories. After this, we need to run the install command again to install Beautiful Soup.
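The install, update, and retry sequence just described could be automated roughly as follows. This is a sketch under assumptions: the function name install_with_refresh and its run parameter are hypothetical, and the code treats any non-zero exit status as a possibly stale package list, which is a simplification.

```python
import subprocess

# The two commands from the text, written as argument lists.
INSTALL = ["sudo", "apt-get", "install", "python-bs4"]
UPDATE = ["sudo", "apt-get", "update"]


def install_with_refresh(run=subprocess.call):
    """Try the install; on failure refresh the package list and retry once.

    run is a hypothetical hook that takes a command list and returns its
    exit status (subprocess.call by default). Returns True when an
    install attempt exits with status 0.
    """
    if run(INSTALL) == 0:
        return True
    # Assumption: a failed install may just mean the package list is stale.
    if run(UPDATE) != 0:
        return False
    return run(INSTALL) == 0
```

Calling install_with_refresh() with no arguments runs the real commands and may prompt for the sudo password, so it only makes sense on a Debian or Ubuntu machine.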

In the older versions of the Linux operating system, even after running the apt-get update command, we might not be able to install Beautiful Soup because it might not be available in the repositories. In these scenarios, we can rely on the other methods of installation using either pip or easy_install.

Installing Beautiful Soup using pip or easy_install

pip and easy_install are tools for installing and managing Python packages. Either of them can be used to install Beautiful Soup.

Installing Beautiful Soup using pip

From the terminal, type the following command:

sudo pip install beautifulsoup4

The preceding command will install Beautiful Soup Version 4 in the system after downloading the necessary packages from http://pypi.python.org/.
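When several Python versions are installed, it is not always obvious which interpreter a bare pip command belongs to. One common way around this is to run pip as a module of a specific interpreter. The helper below is a minimal sketch, not part of the original chapter; the name pip_install_command and its user_only parameter are hypothetical.

```python
import sys


def pip_install_command(package, user_only=False):
    """Build a pip command bound to the interpreter running this script."""
    # "python -m pip" runs the pip that belongs to this exact interpreter.
    command = [sys.executable, "-m", "pip", "install", package]
    if user_only:
        # --user installs into the per-user site directory, so sudo is not needed.
        command.append("--user")
    return command
```

The returned list can be passed to subprocess.call to perform the installation, for example subprocess.call(pip_install_command("beautifulsoup4", user_only=True)).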

Installing Beautiful Soup using easy_install

The easy_install tool installs the package from Python Package Index (PyPI). So, in the terminal, type the following command:

sudo easy_install beautifulsoup4

All the previous methods to install Beautiful Soup in Linux download the package during installation, so they will not work without an active network connection. Even if all of them fail, we can still install Beautiful Soup. The last option is to use the setup.py script that comes with every Python package downloaded from pypi.python.org. This is also the recommended method to install Beautiful Soup on Windows and Mac OS X machines, so we will discuss it in the Installing Beautiful Soup in Windows section.

Installing Beautiful Soup in Windows

In Windows, we will make use of the recent Python package for Beautiful Soup available from https://pypi.python.org/packages/source/b/beautifulsoup4/ and use the setup.py script to install Beautiful Soup. But before doing this, it will be easier for us if we add the path of Python in the system path. The next section discusses setting up the path to Python on a Windows machine.

Verifying Python path in Windows

Often, the path to python.exe is not added to the Path environment variable by default in Windows. To check this, type the following command at the Windows command-line prompt:

python

The preceding command will work without any errors if the path to Python is already added in the environment path variable or we are already within the Python installed directory. But, it would be good to check the path variable for the Python directory entry.

If it doesn't exist in the path variable, we have to find out the actual path, which depends entirely on where Python was installed. For Python 2.x, the default is C:\Python2x, and for Python 3.x, the default is C:\Python3x.

We have to add this to the Path environment variable in the Windows machine. For this, right-click on My Computer | Properties | Environment Variables | System Variable.

Pick the Path variable and add the following section to the Path variable:

;C:\PythonXY (for example, ;C:\Python27)

This is shown in the following screenshot:

[Screenshot: Adding Python path in Windows (Python 2.7 is used in this example)]

After the Python path is ready, we can follow the steps for installing Beautiful Soup on a Windows machine.
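The check and the edit described in this section amount to simple string handling on the Path value. The following sketch shows that logic only; it does not change any system setting. The function names is_on_path and append_to_path are hypothetical, and the sample values used with them are made up for illustration.

```python
import ntpath  # Windows path rules, importable on any operating system


def is_on_path(python_dir, path_value, sep=";"):
    """Return True if python_dir is one of the entries of a Windows-style Path."""
    wanted = ntpath.normcase(ntpath.normpath(python_dir))
    entries = [entry.strip() for entry in path_value.split(sep) if entry.strip()]
    # normcase/normpath make the comparison ignore case and trailing backslashes.
    return any(ntpath.normcase(ntpath.normpath(entry)) == wanted for entry in entries)


def append_to_path(python_dir, path_value, sep=";"):
    """Return path_value with python_dir appended, unless it is already listed."""
    if not path_value:
        return python_dir
    if is_on_path(python_dir, path_value, sep):
        return path_value
    return path_value.rstrip(sep) + sep + python_dir
```

For example, append_to_path(r"C:\Python27", r"C:\Windows\system32;C:\Windows") gives the original value followed by ;C:\Python27, which matches the text we add by hand in the Environment Variables dialog.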

Note

The method of installing Beautiful Soup using setup.py, which is explained in the next section, is the same for Linux, Windows, and Mac OS X operating systems.

Installing Beautiful Soup using setup.py

We can install Python packages using the setup.py script that comes with every Python package downloaded from the Python Package Index website: https://pypi.python.org/. The following steps are used to install Beautiful Soup using setup.py:

  1. Download the latest tarball from https://pypi.python.org/packages/source/b/beautifulsoup4/.
  2. Extract it to a folder (for example, BeautifulSoup).
  3. Open up the command-line prompt, navigate to the folder where you extracted the tarball, and run the setup script as follows:
    cd BeautifulSoup
    python setup.py install
    
  4. The python setup.py install line will install Beautiful Soup on our system.
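Steps 2 and 3 can also be scripted. The sketch below assumes Python 3, a tarball that has already been downloaded, and an archive with a single top-level folder that contains setup.py; the function names extract_source and setup_install_command are hypothetical and are not part of Beautiful Soup or of the original chapter.

```python
import os
import sys
import tarfile


def extract_source(tarball_path, dest_dir):
    """Extract a source tarball and return the path of its top-level folder."""
    with tarfile.open(tarball_path) as archive:
        # Assumption: every member lives under one top-level folder.
        top_level = archive.getnames()[0].split("/")[0]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can reject unsafe archive members.
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return os.path.join(dest_dir, top_level)


def setup_install_command(source_dir):
    """Build the 'python setup.py install' command after checking setup.py exists."""
    if not os.path.isfile(os.path.join(source_dir, "setup.py")):
        raise FileNotFoundError("no setup.py found in " + source_dir)
    return [sys.executable, "setup.py", "install"]
```

The command returned by setup_install_command has to be run from inside the extracted folder, for example with subprocess.call(command, cwd=source_dir), which corresponds to the cd step above.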

Note

We are not done with the list of possible options to use Beautiful Soup. We can use Beautiful Soup in our applications even if all of the options outlined until now fail.

Using Beautiful Soup without installation

The installation processes that we have discussed so far normally copy the module contents to a chosen installation directory. This varies from operating system to operating system; the path is normally /usr/local/lib/pythonX.Y/site-packages in Linux operating systems such as Debian and C:\PythonXY\Lib\site-packages in Windows (where X and Y represent the corresponding versions, such as Python 2.7). When we use an import statement in the Python interpreter or in a Python script, the interpreter searches the directories listed in the predefined Python path (sys.path) for the module. So, installing actually means copying the module contents into one of these predefined directories, or copying them to some other location and adding that location to the Python path. The following method of using Beautiful Soup without going through the installation can be used in any operating system, such as Windows, Linux, or Mac OS X:

  1. Download the latest version of Beautiful Soup package from https://pypi.python.org/packages/source/b/beautifulsoup4/.
  2. Unzip the package.
  3. Copy the bs4 directory into the directory where we want to place all our Python Beautiful Soup scripts.

After we perform all the preceding steps, we are ready to use Beautiful Soup. To import Beautiful Soup in this case, we either need to start Python from the directory that contains the bs4 directory or add that directory to the Python path; otherwise, the import will fail with a module not found error. This extra step is required because this method makes Beautiful Soup available only to the project that includes the bs4 directory. With the installation methods we have seen previously, Beautiful Soup is available globally and can be used in any project, so the additional step is not required.
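Adding a directory to the Python path can be done from within a script by editing sys.path before the import. The following is a minimal sketch; the function name make_importable is hypothetical, and the directory passed to it is whichever folder we copied bs4 into.

```python
import os
import sys


def make_importable(parent_dir):
    """Put parent_dir at the front of sys.path so packages inside it can be imported."""
    parent_dir = os.path.abspath(parent_dir)
    if parent_dir not in sys.path:
        # Entries at the front of sys.path are searched first.
        sys.path.insert(0, parent_dir)
    return parent_dir
```

After calling make_importable with the folder that contains the bs4 directory, the statement from bs4 import BeautifulSoup should work from any working directory, for the lifetime of that Python process only.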

Verifying the installation

To verify the installation, perform the following steps:

  1. Open up the Python interpreter in a terminal by using the following command:
    python
    
  2. Now, we can issue a simple import statement to see whether we have successfully installed Beautiful Soup or not by using the following command:
    from bs4 import BeautifulSoup
    

If we did not install Beautiful Soup and instead copied the bs4 directory in the workspace, we have to change to the directory where we have placed the bs4 directory before using the preceding commands.
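The same verification can be scripted so that a failed import produces a readable message instead of a traceback. This sketch assumes Python 3.4 or later, where importlib.util.find_spec is available (the chapter itself uses Python 2.7.5); the function names is_installed and describe are hypothetical.

```python
import importlib
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


def describe(module_name):
    """Return a one-line report for the module, including its version if it exposes one."""
    if not is_installed(module_name):
        return module_name + " is not importable"
    module = importlib.import_module(module_name)
    return module_name + " " + getattr(module, "__version__", "(version unknown)")


if __name__ == "__main__":
    # bs4 is the package name that the import statement above uses.
    print(describe("bs4"))
```

Like the interactive check, this reports bs4 as importable when the bs4 directory was merely copied into the current working directory, because the working directory is normally on the Python path.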

Quick reference

The following table is an overview of commands and their implications:

Command                             Purpose
sudo apt-get install python-bs4     Installs Beautiful Soup using a package manager in Linux.
sudo pip install beautifulsoup4     Installs Beautiful Soup using pip.
sudo easy_install beautifulsoup4    Installs Beautiful Soup using easy_install.
python setup.py install             Installs Beautiful Soup using setup.py.
from bs4 import BeautifulSoup       Verifies the installation.

Summary

In this chapter, we covered the various options to install Beautiful Soup in Linux machines. We also discussed a way of installing Beautiful Soup in Windows, Linux, and Mac OS X using the Python setup.py script itself. We also discussed the method to use Beautiful Soup without even installing it. The verification of the Beautiful Soup installation was also covered.

In the next chapter, we are going to have a first look at Beautiful Soup by learning the different methods of converting HTML/XML content to different Beautiful Soup objects and thereby understanding the properties of Beautiful Soup.


Description

Beautiful Soup is a Python library designed for quick-turnaround projects such as screen scraping. It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need without writing excess code. It doesn't take much code to write an application using Beautiful Soup. Getting Started with Beautiful Soup is a practical guide to Beautiful Soup using Python. The book starts by walking you through installation and then through each feature of Beautiful Soup, using simple examples that include sample Python code as well as diagrams and screenshots wherever required for better understanding. It discusses the problem of how exactly you can get data out of a website and provides an easy solution with the help of a real website and sample code. The book goes over the different methods to install Beautiful Soup on both Linux and Windows systems. You will then learn about searching, navigating, content modification, encoding support, and output formatting, with sample Python code for each example so that you can try it out to get a better understanding. This book is a practical guide for scraping information from any website. If you want to learn how to efficiently scrape pages from websites, then this book is for you.

Who is this book for?

If you are a budding forensic analyst, consultant, engineer, or a forensic professional wanting to expand your skillset, this is the book for you. The book will also be beneficial to those with an interest in mobile forensics or wanting to find data lost on mobile devices. It will be helpful to be familiar with forensics in general but no prior experience is required to follow this book.

What you will learn

  • Learn how to scrape HTML pages from websites
  • Implement a simple method to scrape any website with the help of developer tools, the Python urllib2 module, and Beautiful Soup
  • Learn how to search for information within an HTML/XML page
  • Modify the contents of an HTML tree
  • Understand encoding support in Beautiful Soup
  • Learn about the different types of output formatting

Product Details

Publication date: Jan 24, 2014
Length: 130 pages
Edition: 1st
Language: English
ISBN-13: 9781783289554

Table of Contents

9 Chapters
1. Installing Beautiful Soup
2. Creating a BeautifulSoup Object
3. Search Using Beautiful Soup
4. Navigation Using Beautiful Soup
5. Modifying Content Using Beautiful Soup
6. Encoding Support in Beautiful Soup
7. Output in Beautiful Soup
8. Creating a Web Scraper
Index

Customer reviews

Rating: 3.5 out of 5 (11 ratings)
5 star 27.3%
4 star 36.4%
3 star 9.1%
2 star 9.1%
1 star 18.2%

Top Reviews

Lance Hermes, Jul 03, 2015
Rating: 5 out of 5
This is a good introduction to a difficult subject. I have to give the author credit for writing this as the shelf life is likely not that long. Hopefully, he will update it as Python evolves.
Amazon verified review

Victor Taylor, Jan 25, 2016
Rating: 5 out of 5
Exceeded all expectations.
Amazon verified review

Doug Duncan, Feb 20, 2014
Rating: 5 out of 5
Getting Started with Beautiful Soup by Vineeth G. Nair is a book that was easy to read and fun to follow along with. The book basically has four parts to it. The first part covers chapters 1 and 2 where you install Beautiful Soup and learn how to create objects. The second part covers chapters 3, 4 and 5 where you learn about searching through, navigating over and modifying the contents of the object. The third part covers chapters 6 and 7 where you learn about encoding and output formatters. The final part is chapter 8 where you use all the techniques you've learned in the earlier chapters to build a web scraper that gets price information on Packt Pub books from the publisher itself, Amazon and Barnes & Noble. The book is well written and is recommended to anyone who wants to learn how to work with Beautiful Soup.
Amazon verified review

Arkantos, Mar 14, 2014
Rating: 4 out of 5
I was asked to review a book on Python's web-scraping tool beautiful soup. The book is quite well packaged and organized in terms of content. Assumes no prior knowledge of web scraping. Starts off with simple, easy to follow instructions on how to install beautiful soup, followed by creating beautiful soup objects and proceeds to introducing many features like searching and navigation within a webpage. The final chapter on creating a web scraper is quite informative for the end user. I particularly like the quick reference part, works like a bsoup doc. Overall this book is decent and if you have not worked on python at all and no idea about web scraping and want to learn the same, then this book is for you. Not to forget there is an API doc available for beautiful soup that would be helpful too.
Amazon verified review

Merryl, Feb 25, 2014
Rating: 4 out of 5
I was quite impressed with the details given in this book. The context is well defined and clear. Most of the key topics are taken care of. Any professional or novice could do good by reading this book on web scraping. Being a regular user of BeautifulSoup library, I found that this book encompasses everything that I had to learn from scratch and would have saved me a lot of time, had I had this book when I was studying the BeautifulSoup. I would definitely recommend this book to anyone who would be interested to get into web scraping.
Amazon verified review