Getting Started with Beautiful Soup

Getting Started with Beautiful Soup: Learn how to extract information from websites using Beautiful Soup and the Python urllib2 module. This practical, hands-on guide covers everything you need to know to get a head start in website scraping.

By Vineeth G Nair
3.5 (11 Ratings)
eBook Jan 2014 130 pages 1st Edition
eBook
Can$27.98 Can$39.99
Paperback
Can$49.99
What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want

Chapter 2. Creating a BeautifulSoup Object

We saw how to install Beautiful Soup in Linux, Windows, and Mac OS X machines in Chapter 1, Installing Beautiful Soup.

Beautiful Soup is widely used for getting data from web pages. We can use Beautiful Soup to extract any data in an HTML/XML document, for example, to get all links in a page or to get text inside tags on the page. In order to achieve this, Beautiful Soup offers us different objects, and simple searching and navigation methods.
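The two tasks mentioned above (collecting all links and reading the text inside a tag) can be sketched as follows. This is a minimal illustration, not code from the book: the sample markup is hypothetical, Python 3 is assumed, and the parser name "html.parser" (the parser bundled with Python) is chosen only so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """<html><body>
<p>Welcome to the sample page</p>
<a href="http://www.packtpub.com">Home</a>
<a href="http://www.packtpub.com/books">Books</a>
</body></html>"""

# "html.parser" ships with Python, so nothing extra has to be installed.
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute of every <a> tag in the page.
links = [a["href"] for a in soup.find_all("a")]

# Get the text inside the first <p> tag.
paragraph_text = soup.p.get_text()

print(links)
print(paragraph_text)
```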

Any input HTML/XML document is converted to different Beautiful Soup objects, and based on the different properties and methods of these objects, we can extract the required data. The list of objects in Beautiful Soup includes the following:

  • BeautifulSoup
  • Tag
  • NavigableString

Creating a BeautifulSoup object

Creating a BeautifulSoup object is the starting point of any Beautiful Soup project. A BeautifulSoup object represents the input HTML/XML document used for its creation.

A BeautifulSoup object is created by passing a string or a file-like object to the constructor. The file-like object can be an open handle to a file stored locally on our machine or to a web page.
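As a rough sketch of both input types, the snippet below builds one object from a string and one from a file-like object. It assumes Python 3 and the built-in "html.parser"; io.StringIO is used here only as a stand-in for a real file handle, so the example runs without any file on disk.

```python
import io

from bs4 import BeautifulSoup

markup = "<p>Hello World</p>"

# 1. From a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2. From a file-like object. io.StringIO stands in for an open file
#    handle; any object with a read() method is handled the same way.
file_like = io.StringIO(markup)
soup_from_file = BeautifulSoup(file_like, "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Both objects represent the same document, so the rest of the code does not need to know which kind of input was used.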

Creating a BeautifulSoup object from a string

A string can be passed to the BeautifulSoup constructor to create an object as follows:

from bs4 import BeautifulSoup
helloworld = "<p>Hello World</p>"
soup_string = BeautifulSoup(helloworld)

The previous code will create the BeautifulSoup object based on the input string helloworld. We can see that the input has...

Tag

The Tag object represents different tags of HTML and XML documents. The creation of Tag objects is done when parsing the documents. The different HTML/XML tags identified during parsing are represented as corresponding Tag objects and these objects will have attributes and contents of the HTML/XML tag. The Tag objects can be used for searching and navigation within the HTML/XML document.
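To make the "attributes and contents" of a Tag concrete, here is a small sketch under the same assumptions as before (Python 3, the built-in "html.parser"). The markup and the id value are hypothetical and exist only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical one-tag document for illustration.
soup = BeautifulSoup(
    '<a href="http://www.packtpub.com" id="home">Home</a>', "html.parser"
)
link = soup.a

print(isinstance(link, Tag))  # the parsed <a> element is a Tag object
print(link.name)              # name of the tag
print(link["href"])           # a single attribute, accessed like a dict key
print(link.attrs)             # all attributes as a dictionary
print(link.contents)          # list of the tag's children
```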

Accessing the Tag object from BeautifulSoup

BeautifulSoup allows us to access any Tag object. For example, we can access the first occurrence of the <a> tag in the next example by using the tag name, a, as an attribute of the BeautifulSoup object.

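from bs4 import BeautifulSoup

html_atag = """<html><body><p>Test html a tag example</p>
<a href="http://www.packtpub.com">Home</a>
<a href="http://www.packtpub.com/books">Books</a>
</body>
</html>"""
# Parse with the lxml parser (requires the lxml package)
soup = BeautifulSoup(html_atag, 'lxml')
# soup.a returns the first <a> tag in the document
atag = soup.a
print(atag)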

The previous script will...

The NavigableString object

A NavigableString object holds the text within an HTML or an XML tag. This is a Python Unicode string with methods for searching and navigation. Sometimes we may need to navigate to other tags or text within an HTML/XML document based on the current text. With a normal Python Unicode string, the searching and navigation methods will not work. The NavigableString object will give us the text within a tag as a Unicode string, together with the different methods for searching and navigating the tree.

We can get the text stored inside a particular tag by using ".string".

first_a_string = atag.string

In the previous code, the NavigableString object (first_a_string) is created and this holds the string inside the first <a> tag, u'Home'.
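The difference between a NavigableString and a plain string can be sketched as follows. This is an illustrative example rather than the book's own code; it assumes Python 3 (where the value prints as 'Home' without the u prefix), the built-in "html.parser", and a hypothetical two-tag document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup for illustration.
soup = BeautifulSoup(
    '<p>Intro</p><a href="http://www.packtpub.com">Home</a>', "html.parser"
)
text = soup.a.string

# A NavigableString is still a string...
print(isinstance(text, NavigableString), isinstance(text, str))

# ...but it also knows where it sits in the tree.
print(text.parent.name)          # the tag that contains this text
print(text.find_previous("p"))   # search backwards from this text

# Converting to a plain str drops the navigation methods.
plain = str(text)
print(hasattr(plain, "parent"))
```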

Quick reference

The following reference summarizes how to create each of these objects:

  • BeautifulSoup
    • soup = BeautifulSoup(string)
    • soup = BeautifulSoup(string,features="xml") #for xml
  • Tag
    • tag = soup.tag #accessing a tag
    • tag.name #Tag name
    • tag['attribute'] #Tag attribute
  • NavigableString
    • soup.tag.string #get Tag's string
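The features="xml" entry in the reference above can be tried with a sketch like the one below. It assumes Python 3 and a hypothetical XML snippet. The XML parser depends on the lxml package, so the example falls back gracefully when lxml is not installed instead of assuming it is present.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical XML document for illustration.
xml_doc = "<catalog><book id='1'>Soup</book></catalog>"

try:
    # features="xml" asks Beautiful Soup for an XML parser (provided by lxml).
    soup_xml = BeautifulSoup(xml_doc, features="xml")
    print(soup_xml.book.name)    # Tag name
    print(soup_xml.book["id"])   # Tag attribute
    print(soup_xml.book.string)  # Tag's string
except FeatureNotFound:
    # Raised when no parser with the requested feature is installed.
    soup_xml = None
    print("lxml is not installed, so the XML parser is unavailable")
```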

Summary

In this chapter, we learned the different objects in the Beautiful Soup module. We understood how the HTML/XML document is converted to a BeautifulSoup object with the help of underlying TreeBuilders. We also had a look at the creation of BeautifulSoup by passing a string and a file object (for a local file and URL). Creating BeautifulSoup for XML parsing and the use of the features argument in the constructor were also explained. We saw how the different tags and texts within the HTML/XML document are represented as a Tag and NavigableString object in Beautiful Soup.

In the next chapter, we will learn the different searching methods, such as find(), find_all(), and find_next(), provided by Beautiful Soup. With the help of these searching methods, we will be able to get data out of the HTML/XML document, which is indeed the most powerful feature of Beautiful Soup.


Description

Beautiful Soup is a Python library designed for quick-turnaround projects such as screen scraping. It provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need without writing excess code. It doesn't take much code to write an application using Beautiful Soup.

Getting Started with Beautiful Soup is a practical guide to Beautiful Soup using Python. The book starts with installation and then walks you through each feature of Beautiful Soup using simple examples, with sample Python code and, where needed, diagrams and screenshots. It discusses how to get data out of a website and works through a solution using a real website and sample code.

The book covers the different ways to install Beautiful Soup on both Linux and Windows systems. You will then learn about searching, navigating, content modification, encoding support, and output formatting, with examples and sample Python code that you can try out yourself. This book is a practical guide for scraping information from websites. If you want to learn how to efficiently scrape pages from websites, then this book is for you.

Who is this book for?

If you are a budding forensic analyst, consultant, engineer, or a forensic professional wanting to expand your skillset, this is the book for you. The book will also be beneficial to those with an interest in mobile forensics or wanting to find data lost on mobile devices. It will be helpful to be familiar with forensics in general but no prior experience is required to follow this book.

What you will learn

  • Learn how to scrape HTML pages from websites
  • Implement a simple method to scrape any website with the help of developer tools, the Python urllib2 module, and Beautiful Soup
  • Learn how to search for information within an HTML/XML page
  • Modify the contents of an HTML tree
  • Understand encoding support in Beautiful Soup
  • Learn about the different types of output formatting

Product Details

Publication date : Jan 24, 2014
Length: 130 pages
Edition : 1st
Language : English
ISBN-13 : 9781783289561


Table of Contents

9 Chapters
1. Installing Beautiful Soup
2. Creating a BeautifulSoup Object
3. Search Using Beautiful Soup
4. Navigation Using Beautiful Soup
5. Modifying Content Using Beautiful Soup
6. Encoding Support in Beautiful Soup
7. Output in Beautiful Soup
8. Creating a Web Scraper
Index

Customer reviews

Rating distribution
3.5 (11 Ratings)
5 star 27.3%
4 star 36.4%
3 star 9.1%
2 star 9.1%
1 star 18.2%
Top Reviews

Lance Hermes, Jul 03, 2015
Rating: 5 out of 5
This is a good introduction to a difficult subject. I have to give the author credit for writing this as the shelf life is likely not that long. Hopefully, he will update it as Python evolves.
Amazon Verified review

Victor Taylor, Jan 25, 2016
Rating: 5 out of 5
Exceeded all expectations.
Amazon Verified review

Doug Duncan, Feb 20, 2014
Rating: 5 out of 5
Getting Started with Beautiful Soup by Vineeth G. Nair is a book that was easy to read and fun to follow along with. The book basically has four parts to it. The first part covers chapters 1 and 2 where you install Beautiful Soup and learn how to create objects. The second part covers chapters 3, 4 and 5 where you learn about searching through, navigating over and modifying the contents of the object. The third part covers chapters 6 and 7 where you learn about encoding and output formatters. The final part is chapter 8 where you use all the techniques you've learned in the earlier chapters to build a web scraper that gets price information on Packt Pub books from the publisher itself, Amazon and Barnes & Noble. The book is well written and is recommended to anyone who wants to learn how to work with Beautiful Soup.
Amazon Verified review

Arkantos, Mar 14, 2014
Rating: 4 out of 5
I was asked to review a book on Python's web-scraping tool beautiful soup. The book is quite well packaged and organized in terms of content. Assumes no prior knowledge of web scraping. Starts off with simple, easy to follow instructions on how to install beautiful soup, followed by creating beautiful soup objects and proceeds to introducing many features like searching and navigation within a webpage. The final chapter on creating a web scraper is quite informative for the end user. I particularly like the quick reference part, works like a bsoup doc. Overall this book is decent and if you have not worked on python at all and no idea about web scraping and want to learn the same, then this book is for you. Not to forget there is an API doc available for beautiful soup that would be helpful too.
Amazon Verified review

Merryl, Feb 25, 2014
Rating: 4 out of 5
I was quite impressed with the details given in this book. The context is well defined and clear. Most of the key topics are taken care of. Any professional or novice could do good by reading this book on web scraping. Being a regular user of BeautifulSoup library, I found that this book encompasses everything that I had to learn from scratch and would have saved me a lot of time, had I had this book when I was studying the BeautifulSoup. I would definitely recommend this book to anyone who would be interested to get into web scraping.
Amazon Verified review