Java Data Science Cookbook

Product type Book
Published in Mar 2017
Publisher Packt
ISBN-13 9781787122536
Pages 372 pages
Edition 1st Edition
Author: Rushdi Shams

Extracting web data from a URL using JSoup


A large amount of data can be found on the Web today. This data may be structured, semi-structured, or unstructured, so different techniques are needed to extract it. One of the easiest and handiest ways to extract web data is to use an external Java library named JSoup. This recipe uses a handful of JSoup methods to extract data from a web page.

Getting ready

In order to perform this recipe, we will require the following:

  1. Go to https://jsoup.org/download, and download the jsoup-1.9.2.jar file. Add the JAR file to your Eclipse project as an external library.

  2. If you are a Maven fan, please follow the instructions on the download page to add JSoup as a dependency of your Eclipse project.

How to do it...

  1. Create a method named extractDataWithJsoup(String href). The parameter is the URL of the web page that you pass when calling the method. We will be extracting web data from this URL:

            public void extractDataWithJsoup(String href){  
    
  2. Call the connect() method with the URL that we want to connect to (and extract data from). Then, we will chain a few more methods to it. First, we chain the timeout() method, which takes the timeout in milliseconds as its parameter. The next two methods, userAgent() and ignoreHttpErrors(), set the user-agent name for this connection and specify whether HTTP error responses are ignored rather than raised as exceptions. The last method in the chain is get(), which returns a Document object. We hold this returned object in doc, a variable of the Document class:

            doc = Jsoup.connect(href).timeout(10*1000).userAgent("Mozilla")
                .ignoreHttpErrors(true).get();
  3. As this code throws IOException, we will be using a try...catch block as follows:

            Document doc = null; 
            try { 
                doc = Jsoup.connect(href).timeout(10*1000).userAgent("Mozilla")
                    .ignoreHttpErrors(true).get(); 
            } catch (IOException e) { 
                //Your exception handling here 
            } 
    

    Tip

    We are not used to reading times in milliseconds. Therefore, it is good practice to write 10*1000 to denote 10 seconds when the time unit is milliseconds. This makes the code more readable.

  4. A large number of methods are available on a Document object. If you want to extract the title of the web page, you can use the title() method as follows:

             if(doc != null){ 
              String title = doc.title(); 
    
  5. To only extract the textual part of the web page, we can chain the body() method with the text() method of a Document object, as follows:

            String text = doc.body().text();
    
  6. If you want to extract all the hyperlinks in a web page, you can use the select() method of a Document object with the a[href] parameter. This gives you all the links at once:

            Elements links = doc.select("a[href]"); 
    
  7. Perhaps you want to process the links in a web page individually? That is easy, too. Iterate over the Elements object to get each link:

            for (Element link : links) { 
                String linkHref = link.attr("href"); 
                String linkText = link.text(); 
                String linkOuterHtml = link.outerHtml(); 
                String linkInnerHtml = link.html();  
                // Print the four values separated by tab characters 
                System.out.println(linkHref + "\t" + linkText + "\t" +  
                    linkOuterHtml + "\t" + linkInnerHtml);       
            }  
    
  8. Finally, close the if-condition with a brace. Close the method with a brace:

        } 
        }  
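
Steps 4 to 7 can also be tried without a network connection. The following is a minimal sketch, assuming JSoup is on the classpath: it parses an in-memory HTML string with Jsoup.parse() instead of fetching a URL. The class name JsoupOfflineSketch, the sample HTML, and the base URI http://example.com/ are hypothetical and chosen only for illustration. The sketch also shows absUrl("href"), which resolves a relative link against the base URI, whereas attr("href") returns the attribute exactly as written in the page.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupOfflineSketch {
    // Hypothetical sample page and base URI, used only for illustration.
    static final String HTML =
        "<html><head><title>Sample page</title></head>"
        + "<body><p>Hello <b>world</b>.</p>"
        + "<a href=\"/about\">About us</a>"
        + "<a href=\"http://example.org/docs\">Docs</a>"
        + "<a>No target</a></body></html>";
    static final String BASE_URI = "http://example.com/";

    // Parse the in-memory HTML; the base URI lets JSoup resolve relative links.
    static Document parse() {
        return Jsoup.parse(HTML, BASE_URI);
    }

    // One tab-separated row per link: raw href, absolute URL, link text.
    static List<String> linkRows(Document doc) {
        List<String> rows = new ArrayList<>();
        // a[href] matches only anchors that carry an href attribute,
        // so the third anchor in the sample page is skipped.
        for (Element link : doc.select("a[href]")) {
            rows.add(link.attr("href") + "\t"
                + link.absUrl("href") + "\t"
                + link.text());
        }
        return rows;
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(doc.title());        // page title
        System.out.println(doc.body().text());  // visible text without markup
        for (String row : linkRows(doc)) {
            System.out.println(row);
        }
    }
}
```

Parsing a fixed string makes it easy to check the selectors before pointing the method at a live web page.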

The complete method, its class, and the driver method are as follows:

import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 
 
public class JsoupTesting { 
   public static void main(String[] args){ 
      JsoupTesting test = new JsoupTesting(); 
      test.extractDataWithJsoup("Website address preceded by http://"); 
   } 
 
   public void extractDataWithJsoup(String href){ 
      Document doc = null; 
      try { 
         doc = Jsoup.connect(href).timeout(10*1000).userAgent("Mozilla")
             .ignoreHttpErrors(true).get(); 
      } catch (IOException e) { 
         //Your exception handling here 
      } 
      if(doc != null){ 
         String title = doc.title(); 
         String text = doc.body().text(); 
         Elements links = doc.select("a[href]"); 
         for (Element link : links) { 
            String linkHref = link.attr("href"); 
            String linkText = link.text(); 
            String linkOuterHtml = link.outerHtml(); 
            String linkInnerHtml = link.html(); 
            System.out.println(linkHref + "\t" + linkText + "\t" + 
                linkOuterHtml + "\t" + linkInnerHtml); 
         } 
      } 
   } 
} 
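
The tip in step 3 writes the timeout as 10*1000 so that the 10 seconds are visible in the code. As a possible alternative, the sketch below names the value or lets the standard java.util.concurrent.TimeUnit class do the conversion. The class name TimeoutSketch, the constant TIMEOUT_MILLIS, and the helper secondsToMillis() are hypothetical names, not part of the recipe or of JSoup.

```java
import java.util.concurrent.TimeUnit;

public class TimeoutSketch {
    // Same value as the recipe's 10*1000, with the unit spelled out in the name.
    static final int TIMEOUT_MILLIS = 10 * 1000;

    // Alternative: convert seconds to milliseconds with TimeUnit.
    // Math.toIntExact throws ArithmeticException if the result does not fit
    // in an int, which is the type that JSoup's timeout() expects.
    static int secondsToMillis(long seconds) {
        return Math.toIntExact(TimeUnit.SECONDS.toMillis(seconds));
    }

    public static void main(String[] args) {
        System.out.println(TIMEOUT_MILLIS);       // prints 10000
        System.out.println(secondsToMillis(10));  // prints 10000
    }
}
```

Either value could then be passed to timeout() in place of the inline 10*1000 expression.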